### abstract ###
The problem is sequence prediction in the following setting
A sequence  SYMBOL  of discrete-valued observations is generated  according to some unknown probabilistic law (measure)  SYMBOL
After observing each outcome,  it is required to give the conditional probabilities of the next observation
The measure   SYMBOL  belongs to an arbitrary but known class  SYMBOL   of stochastic process measures
We are interested in predictors  SYMBOL  whose conditional probabilities converge (in some sense) to the ``true''  SYMBOL -conditional probabilities if any  SYMBOL  is chosen to generate the sequence
The contribution of this work is in characterizing the families  SYMBOL  for which such predictors exist,  and in providing a specific and simple form in which to look for a solution
We show that if any predictor works, then  there exists a Bayesian predictor,  whose   prior is discrete, and which works too
We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family  SYMBOL , as well as in terms of local behaviour of the measures in  SYMBOL ,  which in some cases lead to procedures for constructing such predictors
It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d. or stationary, or to belong to any parametric or countable family
### introduction ###
Given a sequence SYMBOL of observations SYMBOL, where SYMBOL is a finite set, we want to predict the probabilities of observing SYMBOL for each SYMBOL, or, more generally, the probabilities of observing different SYMBOL, before SYMBOL is revealed, after which the process continues
It is assumed that the sequence is generated by some unknown stochastic process  SYMBOL , a probability measure on the space of one-way infinite sequences  SYMBOL
The goal is to have a predictor whose predicted probabilities converge (in a certain sense) to the correct ones (that is, to  SYMBOL -conditional probabilities)
In general this goal is impossible to achieve if  nothing is known about the measure  SYMBOL  generating the sequence
In other words, one cannot have a predictor whose error goes to zero for any measure  SYMBOL
The problem becomes tractable if we assume that the measure  SYMBOL  generating the data belongs to some known class  SYMBOL
The  questions addressed in this work are a part of the following general problem: given an arbitrary set   SYMBOL  of measures, how can we find  a predictor that performs well when the data is generated by any   SYMBOL , and whether it is possible to find such a predictor at all
An example of a generic property of a class SYMBOL that allows for the construction of a predictor is that SYMBOL is countable
Clearly, this condition is very strong
An example,  important from the applications point of view,  of a class  SYMBOL  of measures   for which  predictors are known,  is the class of all stationary measures
The general question, however, is very far from being answered
The contribution of this work to solving this question is, first, in that we  provide a specific form in which to look for a predictor
More precisely, we show that if a predictor that predicts every  SYMBOL  exists,  then such a predictor  can also be obtained as a weighted sum of  countably many elements of  SYMBOL
This result can also be viewed as a justification of the Bayesian approach to sequence prediction: if there exists  a predictor which predicts well every measure in the class, then there exists a Bayesian predictor (with a rather simple prior) that has this property too
In this respect it is important to note that the result obtained about such a  Bayesian predictor is pointwise (holds for every  SYMBOL  in  SYMBOL ), and stretches far beyond the set its prior is concentrated on
Next, we derive some characterizations of families SYMBOL for which a predictor exists
We first analyze what is furnished by the notion of separability, when a suitable topology can be found: we find that  it is a sufficient but not always a necessary condition
We then derive some sufficient conditions for the existence of a predictor which are based on local (truncated to the first SYMBOL observations) behaviour of measures in the class SYMBOL
Necessary conditions cannot be obtained in this way (as we demonstrate), but  sufficient conditions, along with rates  of convergence and construction of predictors, can be found
The motivation for studying predictors for arbitrary classes SYMBOL of processes is two-fold
First of all, prediction is a basic ingredient for  constructing intelligent systems
Indeed, in order to be able to find optimal behaviour in an unknown environment, an intelligent agent must be able, at the very least, to predict how the environment is going to behave (or, to be more precise, how relevant  parts of the environment are going to behave)
Since the response of the environment may in general depend on the actions of the agent, this response is necessarily non-stationary for explorative agents
Therefore, one cannot readily use prediction methods developed for stationary  environments, but rather has to find predictors for the classes of processes that can appear as a possible response of the environment
Apart from this, the problem of prediction itself has numerous applications in such diverse fields as data compression, market analysis,  bioinformatics, and many others
It seems clear that prediction methods constructed for one application cannot be expected to be optimal when applied to another
Therefore, an important question is how to develop specific prediction algorithms for each of the domains
Prior work
As mentioned above, if the class SYMBOL of measures is countable (that is, if SYMBOL can be represented as SYMBOL), then there exists a predictor which performs well for any SYMBOL
Such a predictor can be obtained as a  Bayesian mixture   SYMBOL , where  SYMBOL  are summable positive real weights, and it has very strong predictive properties; in particular,   SYMBOL  predicts every  SYMBOL  in total variation distance, as follows from the result of   CITATION
Total variation distance measures the difference in (predicted and true) conditional probabilities of all future events, that is, not only the probabilities of the next observations, but also of observations that are arbitrarily far off in the future (see formal definitions below)
In the context of sequence prediction the measure  SYMBOL   was first studied  by   CITATION
Since then, the idea of taking a convex combination of a finite or countable class of measures (or predictors) to obtain a predictor has permeated most of the research on sequential prediction (see, for example, CITATION) and on more general learning problems CITATION
In practice it is clear that, on the one hand, countable models are not sufficient, since already the class SYMBOL of Bernoulli i.i.d. processes, where SYMBOL is the probability of 0, is not countable
On the other hand, prediction in total variation can be too strong a requirement; predicting the probabilities of the next observation may be sufficient, perhaps not even at every step but only in the Cesaro sense
A key observation here is that a predictor  SYMBOL  may be a good predictor not only when the data is generated by one of the processes  SYMBOL ,  SYMBOL , but when it comes from a much larger class
Let us consider this point in more detail
Fix for simplicity  SYMBOL
The Laplace predictor SYMBOL predicts any Bernoulli i.i.d. process: although convergence in total variation distance of conditional probabilities does not hold, predicted probabilities of the next outcome converge to the correct ones
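As an illustration of the last point, the following is a minimal sketch of the Laplace (add-one) predictor for a binary alphabet. The alphabet {0, 1}, the function names, and the simulation parameters are assumptions made for this example only; the sketch shows next-outcome convergence on simulated data and is not a proof.

```python
import random


def laplace_prob_of_zero(history):
    """Laplace (add-one) estimate of P(next symbol = 0 | history), binary alphabet."""
    zeros = sum(1 for x in history if x == 0)
    return (zeros + 1) / (len(history) + 2)


def simulate_bernoulli(p_zero, n, seed=0):
    """Draw n i.i.d. binary symbols; each one equals 0 with probability p_zero."""
    rng = random.Random(seed)
    return [0 if rng.random() < p_zero else 1 for _ in range(n)]


# Hypothetical experiment: the true probability of 0 is 0.3.
history = simulate_bernoulli(0.3, 10_000)
for n in (10, 100, 1_000, 10_000):
    # The gap to the true conditional probability should shrink as n grows.
    print(n, abs(laplace_prob_of_zero(history[:n]) - 0.3))
```

On an empty history the predictor returns 1/2, and after n observations containing k zeros it returns (k+1)/(n+2).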
Moreover, generalizing the Laplace predictor,  a predictor   SYMBOL  can be constructed for  the class  SYMBOL  of all  SYMBOL -order Markov measures, for any given  SYMBOL
As was found by  CITATION , the combination  SYMBOL  is a good predictor not only for the set  SYMBOL  of all finite-memory processes, but also for any measure  SYMBOL  coming from a much larger class: that of all stationary measures on  SYMBOL
Here prediction is possible only in the Cesaro sense (more precisely,  SYMBOL  predicts every stationary process in expected time-average  Kullback-Leibler divergence, see definitions below)
The Laplace predictor itself can be obtained as a Bayes mixture over all Bernoulli i.i.d. measures with uniform prior on the parameter SYMBOL (the probability of 0)
However, as was observed in CITATION (and as is easy to see), the same (asymptotic) predictive properties are possessed by a Bayes mixture with a countably supported prior which is dense in SYMBOL (e.g., taking SYMBOL where SYMBOL ranges over all Bernoulli i.i.d. measures with rational probability of 0)
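A discrete-prior mixture of this kind can be sketched as follows. The sketch assumes a binary alphabet, truncates the countable set of rational parameters to denominators of at most 50, and uses uniform weights; all three choices, as well as the function names, are assumptions made only so that the example is finite and runnable.

```python
import math
from fractions import Fraction


def rational_grid(max_denominator):
    """Distinct rationals in (0, 1) whose denominator is at most max_denominator."""
    return sorted({Fraction(p, q)
                   for q in range(2, max_denominator + 1)
                   for p in range(1, q)})


def mixture_prob_of_zero(zeros, ones, params, weights):
    """P(next = 0 | counts) under the mixture sum_k weights[k] * Bernoulli(params[k]).

    params[k] is the probability of 0 under the k-th component.
    The computation is done in log space to avoid underflow on long histories.
    """
    log_joint = [math.log(w) + zeros * math.log(p) + ones * math.log(1.0 - p)
                 for p, w in zip(params, weights)]
    top = max(log_joint)
    posterior = [math.exp(v - top) for v in log_joint]  # unnormalized posterior
    total = sum(posterior)
    return sum(post * p for post, p in zip(posterior, params)) / total


# Finite truncation of the countable, dense set of parameters (assumption).
params = [float(p) for p in rational_grid(50)]
weights = [1.0 / len(params)] * len(params)

# After 3000 zeros and 7000 ones the prediction is close to the frequency 0.3.
print(mixture_prob_of_zero(3000, 7000, params, weights))
```

The posterior concentrates on the grid points closest to the empirical frequency, which is why a dense countable support is enough for next-outcome prediction in this example.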
For a given  SYMBOL , the set of  SYMBOL -order Markov processes is parametrized by finitely many  SYMBOL -valued parameters
Taking a dense  subset of the values of these parameters, and a mixture of the corresponding measures, results in a predictor for the class of  SYMBOL -order Markov processes
Mixing over these (for all  SYMBOL ) yields, as in  CITATION , a predictor for the class of all  stationary processes
Thus, for the mentioned classes of processes, a predictor can be obtained as a Bayes mixture of  countably many measures in the class
An additional reason why this kind of analysis is interesting is the difficulty of constructing Bayesian predictors for classes of processes that cannot be easily parametrized
Indeed, a natural way to obtain  a predictor for a class  SYMBOL  of stochastic processes is to take a Bayesian mixture of the class
To do this, one needs to define the structure of a probability space on  SYMBOL
If the class SYMBOL is well parametrized, as is the case with the set of all Bernoulli i.i.d. processes, then one can integrate with respect to the parametrization
In general, when the problem lacks a natural parametrization, although one can define the structure of the probability  space on the set of (all) stochastic process measures in many different ways, the results one can obtain will then be with probability 1 with respect to the prior distribution (see, for example,  CITATION )
Pointwise consistency cannot be assured (see, e.g., CITATION) in this case, meaning that some (well-defined) Bayesian predictors are not consistent on some (large) subset of SYMBOL
Results with prior probability 1 can be hard to interpret if one is not sure that the structure of the probability space defined on the set SYMBOL is indeed a natural one for the problem at hand (whereas if one does have a natural parametrization, then usually results for every value of the parameter can be obtained, as in the case with Bernoulli i.i.d. processes mentioned above)
The results of the present work show that when a predictor exists it can indeed be given as  a Bayesian  predictor, which predicts  every (and not almost every) measure in the class, while its support is only a countable set
A related question is formulated as a question about two individual measures, rather than about a class of measures and a predictor
Namely, one can ask under which conditions one stochastic process  predicts another
In CITATION it was shown that if one measure is absolutely continuous with respect to another, then the latter predicts the former (the conditional probabilities converge in a very strong sense)
In CITATION a weaker form of convergence of probabilities (in particular, convergence of expected average KL divergence) is obtained under weaker assumptions
The results
First, we show that if there is a predictor that performs well for every measure coming from a class SYMBOL of processes, then a predictor can also be obtained as a convex combination SYMBOL for some SYMBOL and some SYMBOL, SYMBOL
This holds if the prediction quality is measured by either total variation distance, or expected average KL divergence:  one measure of performance that is very strong, the other rather weak
The analysis for the total variation case  relies on the fact that if  SYMBOL  predicts  SYMBOL  in total variation distance, then  SYMBOL  is absolutely continuous with respect to  SYMBOL , so that  SYMBOL  converges to a positive number with  SYMBOL -probability 1 and with a positive  SYMBOL -probability
However, if we settle for a weaker measure of performance, such as  expected average KL divergence, measures  SYMBOL  are typically singular with  respect to a predictor  SYMBOL
Nevertheless, since  SYMBOL  predicts  SYMBOL  we can show that   SYMBOL  decreases subexponentially with  SYMBOL  (with high probability or in expectation); then we can use this ratio as an analogue of the density for each time step  SYMBOL , and  find a convex combination of countably many measures from  SYMBOL  that has  desired predictive properties for each  SYMBOL
Combining these predictors for all  SYMBOL   results in a predictor that predicts every  SYMBOL  in average KL divergence
The proof techniques developed have the potential to be used in solving other questions concerning sequence prediction, in particular, the general question of how to find a predictor for an arbitrary class SYMBOL of measures
We then  exhibit some sufficient conditions on the class  SYMBOL , under which a predictor for all measures in  SYMBOL  exists
It is important to note that none of these conditions relies on a parametrization of any kind
The conditions presented are of two types: conditions on the asymptotic behaviour of measures in SYMBOL, and conditions on their local (restricted to the first SYMBOL observations) behaviour
Conditions of the first type concern separability of  SYMBOL  with respect to the total variation distance and the expected average KL divergence
We show that, in the case of total variation, separability is a necessary and sufficient condition for the existence of a predictor, whereas in the case of expected average KL divergence it is sufficient but not necessary
The conditions of the second kind concern the ``capacity'' of the sets  SYMBOL ,  SYMBOL , where  SYMBOL  is the measure  SYMBOL  restricted to the first  SYMBOL  observations
Intuitively, if  SYMBOL  is small (in some sense), then prediction is possible
We measure the capacity of  SYMBOL  in two ways
The first way is  to find the maximum probability given to each sequence  SYMBOL  by some measure in the class, and then take a sum over  SYMBOL
Denoting the obtained quantity SYMBOL, one can show that it grows polynomially in SYMBOL for some important classes of processes, such as i.i.d. or Markov processes
We show that, in general, if  SYMBOL  grows subexponentially then a predictor exists that predicts any measure in   SYMBOL  in expected average KL divergence
On the other hand, exponentially growing  SYMBOL  are not sufficient for prediction
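For the class of Bernoulli i.i.d. processes over a binary alphabet, this first capacity measure can be computed directly, since the maximum probability of a sequence with k zeros among n symbols is attained at parameter k/n. The sketch below rests on that assumption about the class and alphabet; the function names are hypothetical and chosen for this example.

```python
import math


def log_max_likelihood(k, n):
    """log of max over p of p^k (1-p)^(n-k), attained at p = k/n (0 * log 0 = 0)."""
    out = 0.0
    if k > 0:
        out += k * math.log(k / n)
    if k < n:
        out += (n - k) * math.log((n - k) / n)
    return out


def bernoulli_capacity(n):
    """Sum over all binary sequences of length n of their maximum Bernoulli probability.

    Sequences with the same number of zeros share the same maximum, so they are
    grouped by k; the binomial coefficient is evaluated in log space via lgamma.
    """
    total = 0.0
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += math.exp(log_binom + log_max_likelihood(k, n))
    return total


for n in (1, 10, 100, 1000):
    c = bernoulli_capacity(n)
    # log(c) / n tending to 0 is what subexponential growth means here.
    print(n, c, math.log(c) / n)
```

Each of the n+1 grouped terms is at most 1, so the sum is bounded by n+1 for this class, which is consistent with the polynomial growth stated above.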
A more refined way to measure the capacity of  SYMBOL  is using a concept of channel capacity from information  theory, which was developed for a closely related problem of finding optimal  codes for a class of sources
We extend corresponding results from information theory to show that sublinear growth of channel capacity  is sufficient for the existence of a predictor, in the sense of expected average divergence
Moreover, the obtained bounds on the divergence are optimal up to an additive logarithmic term
The rest of the paper is organized as follows
Section~  introduces the notation and definitions
In Section~ we show that if any predictor works then there is a Bayesian one that works,  while in Section~ we provide several characterizations of predictable classes of processes
Section~ is concerned with separability, while Section~ analyzes conditions based on local behaviour of measures
Finally, Section~ provides outlook and discussion
As running examples that illustrate the results of each section we use countable classes of measures, the family of all Bernoulli i.i.d. processes and that of all stationary processes
