### abstract ###
The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model. A central issue is learning these models in high dimensions, such as when there is some sparsity pattern of the optimal parameter. This work characterizes a certain strong convexity property of general exponential families, which allows their generalization ability to be quantified. In particular, we show how this property can be used to analyze generic exponential families under SYMBOL regularization.
### introduction ###
Exponential family models are perhaps the most versatile and pragmatic statistical models, for a variety of reasons: modelling flexibility (encompassing discrete variables, continuous variables, covariance matrices, time series, graphical models, etc.); convexity properties allowing ease of optimization; and robust generalization ability. A principal issue for applicability to large-scale problems is estimating these models when the ambient dimension of the parameters, SYMBOL, is much larger than the sample size SYMBOL --- the ``SYMBOL'' regime. Much recent work has focused on this problem in the special case of linear regression in high dimensions, where it is assumed that the optimal parameter vector is sparse (e.g., CITATION). This body of prior work has focused on sharply characterizing convergence rates for the prediction loss, consistent model selection, and obtaining sparse models.
As we tackle more challenging problems, there is a growing need for model selection in more general exponential families. Recent work here includes learning Gaussian graphs (CITATION) and Ising models (CITATION). Classical results established that consistent estimation in general exponential families is possible in the asymptotic limit where the number of dimensions is held constant (though some work establishes rates under certain conditions as SYMBOL is allowed to grow slowly with SYMBOL CITATION). However, in modern problems, we typically grow SYMBOL rapidly with SYMBOL (so even asymptotically we are often interested in the regime where SYMBOL, as in the case of sparse estimation). While we have a handle on this question for a variety of special cases, a pressing question is understanding how fast SYMBOL can scale as a function of SYMBOL in general exponential families; such an analysis must quantify the relevant aspects of the particular family that govern its convergence rate. This is the focus of this work.
We should emphasize that throughout this paper, while we are interested in modelling with an exponential family, we are agnostic about the true underlying distribution (e.g., we do not necessarily assume that the data-generating process is an exponential family). \paragraph{Our Contributions and Related Work} The key issue in analyzing the convergence rates of exponential families in terms of their prediction loss (which we take to be the log loss) is characterizing the manner in which they are strictly convex. Roughly speaking, in the asymptotic regime of a large sample size SYMBOL (with SYMBOL kept fixed), a central limit theorem effect takes hold: the log loss of any exponential family approaches the log loss of a Gaussian, with a covariance matrix corresponding to the Fisher information matrix. Our first main contribution is quantifying the rate at which this effect occurs in general exponential families. In particular, we show that every exponential family satisfies a certain rather natural growth-rate condition on its standardized moments and standardized cumulants (recall that the SYMBOL-th standardized moment is the unitless ratio of the SYMBOL-th central moment to the SYMBOL-th power of the standard deviation; for SYMBOL these are the skew and kurtosis).
This condition is rather mild: these moments may grow as fast as SYMBOL. Interestingly, similar conditions have been well studied for obtaining exponential tail bounds on the convergence of a random variable to its mean~ CITATION. We show that this growth rate characterizes the rate at which the prediction loss of the exponential family behaves as a strongly convex loss function. In particular, our analysis draws many parallels to the analysis of Newton's method, where there is a ``burn-in'' phase in which a number of iterations must occur before the function behaves as a locally quadratic function. In our statistical setting, we instead require a (quantified) ``burn-in'' sample size; beyond this threshold sample size, the prediction loss inherits the desired strong convexity properties (i.e., it is locally quadratic).
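As a concrete illustration (not taken from the paper), the standardized moments referred to above can be computed directly from a sample: the SYMBOL-th standardized moment is the SYMBOL-th central moment divided by the SYMBOL-th power of the standard deviation, so the third and fourth are the skew and kurtosis. A minimal pure-Python sketch:

```python
def standardized_moment(xs, k):
    """k-th standardized moment of a sample: the k-th central moment
    divided by the standard deviation raised to the k-th power.
    k = 3 gives the skew, k = 4 the kurtosis."""
    n = len(xs)
    mean = sum(xs) / n
    central = sum((x - mean) ** k for x in xs) / n
    sigma = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return central / sigma ** k
```

Because the ratio is unitless, rescaling the data leaves every standardized moment unchanged, which is what makes the growth-rate condition a property of the family rather than of any particular parameterization.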
Our second contribution is an analysis of SYMBOL regularization in generic families, in terms of both the prediction loss and the sparsity level of the selected model. Under a particular sparse eigenvalue condition on the design matrix (the Restricted Eigenvalue (RE) condition of CITATION), we show that SYMBOL regularization in general exponential families enjoys a convergence rate of SYMBOL (where SYMBOL is the number of relevant features). This RE condition is among the least stringent conditions that permit this optimal convergence rate in the linear regression case (see CITATION); the stronger mutual incoherence/irrepresentable conditions considered in CITATION also provide this rate. We show that an essentially identical convergence rate can be achieved for general exponential families; our results are non-asymptotic and precisely relate SYMBOL and SYMBOL.
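To make the regularized estimation concrete, the sketch below (not the paper's algorithm) fits an l1-penalized logistic model --- logistic regression being a standard exponential family example --- with a hand-rolled proximal gradient (ISTA) loop: a gradient step on the log loss followed by coordinate-wise soft-thresholding. The function names and step-size choice are illustrative assumptions, not anything specified in the text.

```python
import math

def soft_threshold(z, t):
    # Proximal operator of t * |.|: shrink z toward zero by t.
    return math.copysign(max(abs(z) - t, 0.0), z)

def l1_logistic(X, y, lam, step=0.1, iters=500):
    """ISTA for l1-regularized logistic log loss.
    X: list of feature rows, y: labels in {0, 1}, lam: penalty weight."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            margin = sum(wj * xj for wj, xj in zip(w, xi))
            mu = 1.0 / (1.0 + math.exp(-margin))  # model mean (sigmoid)
            for j in range(p):
                grad[j] += (mu - yi) * xi[j] / n
        # Gradient step, then soft-threshold each coordinate.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad)]
    return w
```

The soft-thresholding step is what drives irrelevant coordinates exactly to zero, which is the mechanism behind the sparsity-level statements discussed above.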
Our final contribution concerns approximate sparse model selection, i.e., where our goal is to obtain a sparse model with low prediction loss. A drawback of the RE condition in comparison to the mutual incoherence condition is that the latter permits perfect recovery of the true features (at the price of a more stringent condition). However, for the case of linear regression, CITATION show that, under a sparse eigenvalue or RE condition, the SYMBOL solution is itself sparse (with a multiplicative increase in the sparsity level that depends on a certain condition number of the design matrix); thus, while the SYMBOL solution may not precisely recover the true model, it is still sparse (up to some multiplicative increase) and does recover those features with large true weights. For general exponential families, we do not have a characterization of the sparsity level of the SYMBOL-regularized solution (an interesting open question); we do, however, provide a simple two-stage procedure (thresholding and refitting) that yields a sparse model, with support on no more than SYMBOL features, whose performance is nearly as good (with a rather mild increase in the risk). This result is novel even for the square-loss case. Hence, even under the rather mild RE condition, we can obtain both a favorable convergence rate and a sparse model for generic families.
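The two-stage procedure just described can be sketched in a few lines. This is an illustrative skeleton only: the threshold `tau` and the `refit` callback (any routine that re-estimates an unregularized model restricted to a feature subset) are hypothetical placeholders, not quantities specified in the text.

```python
def threshold_and_refit(w, tau, refit):
    """Two-stage sparse model selection.
    Stage 1: keep only coordinates of w whose magnitude exceeds tau.
    Stage 2: refit an unregularized model on that support;
    `refit` maps a list of kept feature indices to new coefficients."""
    support = [j for j, wj in enumerate(w) if abs(wj) > tau]
    return support, refit(support)
```

Thresholding caps the support size, while the refitting stage removes the shrinkage bias that the l1 penalty introduced on the surviving coordinates.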
