### abstract ###
We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation
We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis
In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection against noise while also controlling overfitting
On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs
We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well
### introduction ###
Support Vector Machines (SVMs for short) originated in  CITATION  and can be traced back to as early as  CITATION  and  CITATION
They continue to be one of the most successful algorithms for classification
SVMs address the classification problem by finding the hyperplane in the feature space that achieves maximum sample margin when the training samples are separable, which leads to minimizing the norm of the classifier
When the samples are not separable, a penalty term that approximates the total training error is considered  CITATION
It is well known that minimizing the training error itself can lead to poor classification performance for new unlabeled data; that is, such an approach may have poor generalization error because of, essentially, overfitting  CITATION
A variety of modifications have been proposed to combat this problem, one of the most popular methods being that of minimizing a combination of the training-error and a regularization term
The latter is typically chosen as a norm of the classifier
The resulting regularized classifier  performs better on new data
This phenomenon is often interpreted from a statistical learning theory view: the regularization term restricts the complexity of the classifier, hence the deviation of the testing error and the training error is controlled  CITATION
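As a point of reference, a regularized classifier of the kind described above can be sketched as the following optimization problem. The notation is an assumption made for illustration: $\mathbf{w}$ and $b$ are the hyperplane parameters, $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$ are the $m$ training samples, $c > 0$ is the regularization coefficient, and the hinge loss stands in for the training-error penalty. Many textbook formulations use the squared Euclidean norm in place of the generic norm shown here.

```latex
\min_{\mathbf{w},\, b} \;\; c\,\|\mathbf{w}\| \;+\; \sum_{i=1}^{m} \max\bigl[\, 1 - y_i\bigl(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\bigr),\; 0 \,\bigr]
```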
In this paper we consider a different setup, assuming that the training data are generated by the true underlying distribution, but some non-iid (potentially adversarial) disturbance is then added to the samples we observe
We follow a robust optimization  CITATION  approach, i.e., minimizing the worst possible empirical error under such disturbances
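In generic form, such a min-max problem might be written as below. This is a sketch under stated assumptions rather than the exact formulation studied later: the hinge loss is used as the surrogate for the empirical error, $\boldsymbol{\delta}_i$ denotes the disturbance on the $i$-th of $m$ samples $(\mathbf{x}_i, y_i)$, $\mathcal{N}$ is a placeholder for the set of admissible joint disturbances, and the sign convention $\mathbf{x}_i - \boldsymbol{\delta}_i$ is chosen only for illustration.

```latex
\min_{\mathbf{w},\, b} \;\; \max_{(\boldsymbol{\delta}_1, \dots, \boldsymbol{\delta}_m) \in \mathcal{N}} \;\; \sum_{i=1}^{m} \max\bigl[\, 1 - y_i\bigl(\langle \mathbf{w}, \mathbf{x}_i - \boldsymbol{\delta}_i \rangle + b\bigr),\; 0 \,\bigr]
```

The choice of $\mathcal{N}$ determines how much freedom the adversary has, for instance whether each sample is bounded separately or the disturbances share an aggregate budget.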
The use of robust optimization in classification is not new  CITATION
Robust classification models studied in the past have considered only box-type uncertainty sets, which allow the possibility that the data have all been skewed in some non-neutral manner by a correlated disturbance
This has made it difficult to obtain non-conservative generalization bounds
Moreover, there has not been an explicit connection to the regularized classifier, although at a high level it is known that regularization and robust optimization are related  CITATION
The main contribution of this paper is solving the robust classification problem for a class of non-box-type uncertainty sets, and providing a linkage between robust classification and the standard regularization scheme of SVMs
In particular, our contributions include the following:
We solve the robust SVM formulation for a class of non-box-type uncertainty sets
This permits finer control of the adversarial disturbance, restricting it to satisfy aggregate constraints across data points, therefore reducing the possibility of highly correlated disturbance
We show that the standard regularized SVM classifier is a special case of our robust classification, thus explicitly relating robustness and regularization
This provides an alternative explanation for the success of regularization, and also suggests new physically motivated ways to construct regularization terms
We relate our robust formulation to several probabilistic formulations
We consider a chance-constrained classifier (i.e., a classifier with probabilistic constraints on misclassification) and show that our robust formulation can approximate it far less conservatively than previous robust formulations
We also consider a Bayesian setup, and show that this can be used to provide a principled means of selecting the regularization coefficient without cross-validation
We show that the robustness perspective, stemming from a non-iid analysis, can be useful in the standard learning (iid) setup, by using it to prove consistency for standard SVM classification, without using VC-dimension or stability arguments
This result implies that generalization ability is a direct result of robustness to local disturbances; it therefore suggests a new justification for good performance, and consequently allows us to construct learning algorithms that generalize well by robustifying non-consistent algorithms
\subsubsection*{Robustness and Regularization}
We comment here on the explicit equivalence of robustness and regularization
We briefly explain how this observation is different from previous work and why it is interesting
Certain equivalence relationships between robustness and regularization have been established for problems other than classification  CITATION , but these results do not directly apply to the classification problem
Indeed, research on classifier regularization mainly discusses its effect on bounding the complexity of the function class  CITATION
Meanwhile, research on robust classification has not attempted to relate robustness and regularization  CITATION , in part due to the robustness formulations used in those papers
In fact, they all consider robustified versions of  regularized  classifications
CITATION  considers a robust formulation for box-type uncertainty, and relates this robust formulation to the regularized SVM
However, this formulation involves a non-standard loss function that does not bound the  SYMBOL  loss, and hence its physical interpretation is not clear
The connection of robustness and regularization in the SVM context is important for the following reasons
First, it gives an alternative and potentially powerful explanation of the generalization ability of the regularization term
In the classical machine learning literature, the regularization term bounds the complexity of the class of classifiers
The robust view of regularization regards the testing samples as a perturbed copy of the training samples
We show that when the total perturbation is given or bounded, the regularization term bounds the gap between the classification errors of the SVM on these two sets of samples
In contrast to the standard PAC approach, this bound depends neither on how rich the class of candidate classifiers is, nor on an assumption that all samples are picked in an iid manner
In addition, this suggests novel approaches to designing good classification algorithms, in particular, designing the regularization term
In the PAC structural-risk minimization approach, regularization is chosen to minimize a bound on the generalization error based on the training error and a complexity term
This complexity term typically leads to overemphasizing the regularizer, and indeed this approach is known to often be too pessimistic  CITATION  for problems with more structure
The robust approach offers another avenue
Since both noise and robustness are physical processes, a close investigation of the application and noise characteristics at hand can provide insights into how to properly robustify, and therefore regularize, the classifier
For example, it is known that normalizing the samples so that the variance among all features is roughly the same (a process commonly used to eliminate the scaling freedom of individual features) often leads to good generalization performance
From the robustness perspective, this simply says that the noise is anisotropic (ellipsoidal) rather than spherical, and hence an appropriate robustification must be designed to fit this anisotropy
We also show that using the robust optimization viewpoint, we obtain some probabilistic results outside the PAC setup
In Section~ we bound the probability that a noisy training sample is correctly labeled
Such a bound considers the behavior of  corrupted  samples and is hence different from the known PAC bounds
This is helpful when the training samples and the testing samples are drawn from different distributions, or when some adversary manipulates the samples to prevent them from being correctly labeled (e.g., spam senders change their patterns from time to time to avoid being labeled and filtered)
Finally, this connection of robustification and regularization also provides us with new proof techniques (see Section~)
We need to point out that there are several different definitions of robustness in the literature
In this paper, as well as the aforementioned robust classification papers, robustness is mainly understood from a Robust Optimization perspective, where a min-max optimization is performed over all possible disturbances
An alternative interpretation of robustness stems from the rich literature on Robust Statistics  CITATION , which studies how an estimator or algorithm behaves under a small perturbation of the statistical model
For example, the Influence Function approach, proposed in  CITATION  and  CITATION , measures the impact  of an infinitesimal amount of contamination of the original distribution on the quantity of interest
Based on this notion of robustness,  CITATION  showed that many kernel classification algorithms, including SVM, are robust in the sense of having a finite Influence Function
A similar result for regression algorithms is shown in  CITATION  for smooth loss functions, and in  CITATION  for non-smooth loss functions where a relaxed version of the Influence Function is applied
In the machine learning literature, another widely used notion closely related to robustness is  stability , where an algorithm is required to be robust (in the sense that the output function does not change significantly) under a specific perturbation: deleting one sample from the training set
It is now well known that a stable algorithm such as SVM has desirable generalization properties, and is statistically consistent under mild technical conditions; see for example  CITATION  for details
One main difference between Robust Optimization and other robustness notions is that the former is constructive rather than analytical
That is, in contrast to robust statistics or the stability approach, which measure the robustness of a  given  algorithm, Robust Optimization can  robustify  an algorithm: it converts a given algorithm to a robust one
For example, as we show in this paper, the RO version of a naive empirical-error minimization is the well-known SVM
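For a fixed linear classifier, this correspondence can be checked numerically. The sketch below is illustrative only and is not the paper's construction: it assumes the hinge loss, the Euclidean norm, an aggregate disturbance budget of the form sum_i ||delta_i||_2 <= c, and hand-picked toy data; all function and variable names are hypothetical. It also assumes that at least one sample has a non-negative hinge argument, which is the case for the toy data used here.

```python
import numpy as np


def hinge_sum(w, b, X, y):
    """Total hinge loss: sum_i max(1 - y_i * (<w, x_i> + b), 0)."""
    return float(np.maximum(1.0 - y * (X @ w + b), 0.0).sum())


def single_sample_attack(w, b, X, y, c):
    """Feasible disturbance that spends the whole budget c on one sample.

    The observed sample is x_i - delta_i, so shifting the most-violated
    sample by delta_i = c * y_i * w / ||w||_2 raises its hinge argument
    by exactly c * ||w||_2.
    """
    margins = 1.0 - y * (X @ w + b)
    i = int(np.argmax(margins))
    deltas = np.zeros_like(X)
    deltas[i] = c * y[i] * w / np.linalg.norm(w)
    return deltas


def random_feasible_disturbance(rng, shape, c):
    """Random disturbance rescaled so that sum_i ||delta_i||_2 equals c."""
    deltas = rng.normal(size=shape)
    total = np.linalg.norm(deltas, axis=1).sum()
    return deltas * (c / total)


# Toy data and a fixed linear classifier (all values are illustrative).
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, c = np.array([1.0, 0.5]), 0.0, 0.8

# Regularized objective: hinge loss plus c times the norm of w.
regularized = hinge_sum(w, b, X, y) + c * np.linalg.norm(w)

# Hinge loss on the disturbed samples under the single-sample attack.
attacked = hinge_sum(w, b, X - single_sample_attack(w, b, X, y, c), y)

# Hinge loss under many random disturbances with the same total budget.
rng = np.random.default_rng(0)
random_losses = [
    hinge_sum(w, b, X - random_feasible_disturbance(rng, X.shape, c), y)
    for _ in range(1000)
]

print("regularized objective:", regularized)
print("loss under single-sample attack:", attacked)
print("largest loss over random disturbances:", max(random_losses))
```

The single-sample attack attains the regularized objective, while no random disturbance within the same budget exceeds it, which is consistent with the worst-case loss coinciding with the regularized loss for this classifier. The upper bound follows from the Cauchy-Schwarz inequality applied to each sample.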
As a constructive process, the RO approach also leads to additional flexibility in algorithm design, especially when the nature of the perturbation is known or can be well estimated
{Structure of the Paper:}
This paper is organized as follows
In Section~ we  investigate the correlated disturbance case, and  show the equivalence between the robust classification and the regularization process
We develop the connections to probabilistic formulations in Section~, and prove a consistency result based on robustness analysis in Section~
The kernelized version is investigated in Section~
Some concluding remarks are given in Section~
{Notation:}
Capital letters are used to denote matrices, and boldface letters are used to denote column vectors
For a given norm  SYMBOL , we use  SYMBOL  to denote its dual norm, i.e.,  SYMBOL
For a vector  SYMBOL  and a positive semi-definite matrix  SYMBOL  of the same dimension,  SYMBOL  denotes  SYMBOL
We use  SYMBOL  to denote disturbance affecting the samples
We use superscript  SYMBOL  to denote the true value for an uncertain variable, so that  SYMBOL  is the true (but unknown) noise of the  SYMBOL  sample
The set of non-negative scalars is denoted by  SYMBOL
The set of integers from  SYMBOL  to  SYMBOL  is denoted by  SYMBOL
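To make the dual norm mentioned in the notation concrete, the following sketch evaluates it for two familiar cases. It assumes the standard definition, in which the dual norm of z is the supremum of <z, x> over all x with norm at most one; the function names are hypothetical and the vector is arbitrary.

```python
import numpy as np


def dual_norm_l1(z):
    """sup { <z, x> : ||x||_1 <= 1 }, evaluated at the vertices +/- e_i.

    A linear function over the l1 unit ball attains its maximum at a vertex,
    so checking the 2n signed basis vectors is enough.
    """
    n = z.shape[0]
    vertices = np.vstack([np.eye(n), -np.eye(n)])
    return float((vertices @ z).max())


def dual_norm_l2(z):
    """sup { <z, x> : ||x||_2 <= 1 }, attained at x = z / ||z||_2 (z nonzero)."""
    x_star = z / np.linalg.norm(z)
    return float(z @ x_star)


z = np.array([3.0, -4.0, 1.0])

# The dual of the l1 norm is the l-infinity norm.
print(dual_norm_l1(z), np.linalg.norm(z, np.inf))

# The l2 norm is its own dual.
print(dual_norm_l2(z), np.linalg.norm(z, 2))
```

Each printed pair should agree: the first illustrates that the l-infinity norm is dual to the l1 norm, the second that the Euclidean norm is self-dual.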
