### abstract ###
Suppose we are given two probability measures on the set of one-way infinite finite-alphabet sequences, and consider the question of when one of the measures predicts the other, that is, when conditional probabilities converge (in a certain sense) if one of the measures is chosen to generate the sequence
This question may be considered a refinement of the problem of sequence prediction in its most general formulation: for a given  class of probability measures, does there exist a measure which predicts all of the measures in the class
To address this problem, we find some conditions on local absolute continuity which are sufficient for prediction and which generalize several different notions which are known to be sufficient for prediction
We also formulate some open questions to outline a direction for finding the conditions on classes of measures for which prediction is possible
### introduction ###
Let a sequence  SYMBOL ,  SYMBOL  of letters from some finite alphabet  SYMBOL  be generated by some probability measure  SYMBOL
Having observed the first SYMBOL letters SYMBOL we want to predict the probability of the next letter being SYMBOL , for each SYMBOL
This task is motivated by numerous applications --- from weather forecasting and stock market prediction to data compression
If the measure SYMBOL is known completely then the best forecasts one can make for the SYMBOL st outcome of a sequence SYMBOL are the SYMBOL -conditional probabilities of SYMBOL given SYMBOL
On the other hand, it is immediately apparent that if nothing is known about the distribution  SYMBOL  generating the sequence then no prediction is possible, since for any predictor there is a measure on which it errs (gives inadequate probability forecasts) on every step
Thus one has to restrict attention to some class of measures
Laplace was perhaps the first to address the question of sequence prediction, his motivation being as follows: Suppose that we know that the Sun has risen every day for 5000 years, what is the probability that it will rise tomorrow
He suggested assuming that the probability that the Sun rises is the same every day and that the trials are independent of each other
Thus Laplace considered the task of sequence prediction when the true generating measure belongs to the family of Bernoulli iid measures with binary alphabet SYMBOL
The predicting measure suggested by Laplace was  SYMBOL  where  SYMBOL  is the number of 1s in  SYMBOL
The conditional probabilities of Laplace's measure  SYMBOL  converge to the true conditional probabilities  SYMBOL -almost surely under any Bernoulli  iid  measure  SYMBOL
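As a minimal illustration of the convergence just described, the sketch below assumes that Laplace's predictor takes its standard rule-of-succession form, that is, probability (k + 1)/(n + 2) for the next symbol being 1 after k ones have been seen among n symbols. The formula itself is elided in the text above, so this form is an assumption, and the function names and the sampling helper are hypothetical.

```python
import random


def laplace_next_one_probability(observed):
    """Rule-of-succession forecast that the next symbol is 1.

    `observed` is a sequence of 0s and 1s; k is the number of 1s, n the length.
    """
    k = sum(observed)
    n = len(observed)
    return (k + 1) / (n + 2)


def sample_bernoulli(p, n, seed=0):
    """Draw n independent binary symbols, each equal to 1 with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


if __name__ == "__main__":
    p = 0.3  # illustrative parameter of the generating iid measure
    xs = sample_bernoulli(p, 20000, seed=1)
    for n in (10, 100, 1000, 20000):
        forecast = laplace_next_one_probability(xs[:n])
        print(f"n={n:6d}  forecast={forecast:.4f}  true={p:.4f}")
```

Under an iid source the true conditional probability of a 1 is the parameter itself at every step, so the printed forecasts can be compared with it directly.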
This approach generalizes to the problem of predicting any finite-memory (e.g. Markovian) measure
Moreover, in  CITATION  a measure  SYMBOL  was constructed for predicting an arbitrary stationary measure
The conditional probabilities of  SYMBOL  converge to the true ones  on average , where average is taken over time steps (that is, in Cesaro sense),  SYMBOL -almost surely for any stationary measure  SYMBOL
However, as was shown in the same work, there is no measure for which conditional probabilities converge to the true ones SYMBOL -a.s. for every stationary SYMBOL
Thus we can see that already for the problem of predicting outcomes of a stationary measure two criteria of prediction arise: prediction in the average (or in Cesaro sense) and prediction on each step, and the solution exists only for the former problem
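The gap between the two criteria can be seen on a toy numerical sequence. The sketch below uses a hypothetical error sequence (not produced by any actual predictor) that equals 1 at perfect squares and 0 elsewhere: its Cesaro averages tend to zero, yet the sequence itself does not converge to zero because it returns to 1 infinitely often.

```python
from math import isqrt


def toy_error(t):
    """Hypothetical per-step error: 1.0 when t is a perfect square, else 0.0."""
    r = isqrt(t)
    return 1.0 if r * r == t else 0.0


def cesaro_average(errors):
    """Arithmetic mean of the errors over the time steps seen so far."""
    return sum(errors) / len(errors)


if __name__ == "__main__":
    for horizon in (100, 10000, 1000000):
        errors = [toy_error(t) for t in range(1, horizon + 1)]
        # The average shrinks, while the last step (a perfect square) still errs.
        print(horizon, cesaro_average(errors), errors[-1])
```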
But what if the measure generating the sequence is not stationary
A different assumption one can make is that the measure  SYMBOL  generating the sequence is computable
Solomonoff  CITATION  suggested a measure  SYMBOL  for predicting any computable probability measure
The key observation here is that the class of all computable probability measures is countable; let us denote it by  SYMBOL
A Bayesian predictor  SYMBOL  for a countable class of measures  SYMBOL   is constructed as follows:  SYMBOL  for any measurable set A, where the weights  SYMBOL  are positive  and sum to one
The best predictor for a measure  SYMBOL  is the measure  SYMBOL  itself
The Bayesian predictor simply takes the weighted average of the predictors for all measures in the class --- for countable classes this is possible
It was shown by Solomonoff  CITATION  that  SYMBOL -conditional probabilities converge to  SYMBOL -conditional probabilities almost surely for any computable measure  SYMBOL
In fact this is a special case of a more general (though without convergence rate) result of Blackwell and Dubins  CITATION  which states that if a measure  SYMBOL  is absolutely continuous with respect to a measure  SYMBOL  then  SYMBOL  converges to  SYMBOL  in total variation  SYMBOL -almost surely
Convergence in total variation means prediction in a very strong sense: convergence of conditional probabilities of arbitrary events (not just the next outcome), or prediction with an arbitrarily fast growing horizon
Since for  SYMBOL  we have  SYMBOL  for every measurable set  SYMBOL  and for every  SYMBOL , each  SYMBOL  is absolutely continuous with respect to  SYMBOL
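The mixture construction and the resulting domination by a constant can be sketched on a small finite class. The example below assumes a class of three Bernoulli iid measures with illustrative parameters and weights (these values, and all function names, are hypothetical); it computes the mixture probability of a finite word, the mixture's forecast for the next symbol, and checks that the mixture is at least each weight times the corresponding component.

```python
from itertools import product


def bernoulli_word_probability(p, word):
    """Probability of a finite binary word under a Bernoulli iid measure with parameter p."""
    prob = 1.0
    for x in word:
        prob *= p if x == 1 else 1.0 - p
    return prob


def mixture_word_probability(params, weights, word):
    """Weighted sum of the component probabilities of the word."""
    return sum(w * bernoulli_word_probability(p, word)
               for p, w in zip(params, weights))


def mixture_next_one_probability(params, weights, word):
    """Mixture-conditional probability that the symbol following `word` is 1.

    Assumes every parameter lies strictly between 0 and 1, so the
    denominator is never zero.
    """
    word = tuple(word)
    return (mixture_word_probability(params, weights, word + (1,))
            / mixture_word_probability(params, weights, word))


if __name__ == "__main__":
    params = (0.2, 0.5, 0.8)      # illustrative class of three measures
    weights = (0.25, 0.5, 0.25)   # positive weights summing to one
    for word in product((0, 1), repeat=4):
        rho = mixture_word_probability(params, weights, word)
        # Domination by a constant: the mixture is at least w_i times component i.
        assert all(rho >= w * bernoulli_word_probability(p, word) - 1e-15
                   for p, w in zip(params, weights))
    print(mixture_next_one_probability(params, weights, (1, 1, 0, 1, 1, 1)))
```

The inequality holds simply because the mixture is a sum of nonnegative terms, one of which is the weighted component itself.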
Thus the problem of sequence prediction for certain classes of measures (such as the class of all stationary measures or the class of all computable measures) has often been addressed in the literature
Although the mentioned classes of measures are sufficiently interesting, it is often hard to decide in applications which assumptions a problem at hand complies with, not to mention such practical issues as the fact that a predicting measure for all computable measures is necessarily non-computable itself
Moreover, to be able to generalize the solutions of the sequence prediction problem to such problems as active learning, where outcomes of a sequence may depend on actions of the predictor, one has to understand better under which conditions  the problem of sequence prediction is solvable
In particular, in active learning, the stationarity assumption does not seem to be applicable (since the predictions are non-stationary), although, say, the Markov assumption is often applicable and is extensively studied
Thus, we formulate the following general questions which we start to address in the present work:   For which classes of measures is sequence prediction possible
Under which conditions does a measure  SYMBOL  predict a measure  SYMBOL
As we have seen, these questions have many facets, and in particular there are many criteria of prediction to be considered, such as almost sure convergence of conditional probabilities, convergence in average, etc
Extensive as the literature on sequence prediction is, these questions in their full generality have not received much attention
One line of research which exhibits this kind of generality consists in extending the result of Blackwell and Dubins mentioned above, which states that if  SYMBOL  is absolutely continuous with respect to  SYMBOL , then  SYMBOL  predicts  SYMBOL  in total variation distance
In CITATION the following question was addressed: given a class of measures SYMBOL and a prior (``meta''-measure) SYMBOL over this class of measures, do the conditional probabilities of a Bayesian mixture of the class SYMBOL w.r.t. SYMBOL converge to the true SYMBOL -probabilities (weakly merge, in the terminology of CITATION ) for SYMBOL -almost any measure SYMBOL in SYMBOL
This question can be considered solved, since the authors provide necessary and sufficient conditions on the measure given by the mixture of the class SYMBOL w.r.t. SYMBOL under which prediction is possible
The major difference from the general  questions we posed above is that we do not wish to assume that we have a measure on our class of measures
For large (non-parametric) classes of measures it may not be intuitive which measure over it is natural; rather, the question is  whether a ``natural'' measure which can be used for prediction exists
To address the general questions posed, we start with the following observation
As was mentioned, for a Bayesian mixture SYMBOL of a countable class of measures SYMBOL , SYMBOL , we have SYMBOL for any SYMBOL and any measurable set SYMBOL , where SYMBOL is a constant
This condition is stronger than the assumption of absolute continuity and is sufficient for prediction in a very strong sense
Since we are willing to be satisfied with prediction in a weaker sense (e.g. convergence of conditional probabilities), let us make a weaker assumption: say that a measure SYMBOL dominates a measure SYMBOL with coefficients SYMBOL if
\[ \rho(x_1,\dots,x_n) \geq c_n \mu(x_1,\dots,x_n) \]
for all SYMBOL
The first concrete question we pose is: under what conditions on SYMBOL does this dominance inequality imply that SYMBOL predicts SYMBOL
Observe that if SYMBOL for any SYMBOL then any measure SYMBOL is locally absolutely continuous with respect to SYMBOL (that is, the measure SYMBOL restricted to the first SYMBOL trials SYMBOL is absolutely continuous w.r.t. SYMBOL for each SYMBOL ), and moreover, for any measure SYMBOL some constants SYMBOL can be found that satisfy the dominance inequality
For example, if SYMBOL is a Bernoulli iid measure with parameter SYMBOL and SYMBOL is any other measure, then the dominance inequality is (trivially) satisfied with SYMBOL
Thus we know that if  SYMBOL  then  SYMBOL  predicts  SYMBOL  in a very strong sense, whereas exponentially decreasing  SYMBOL  are not enough for prediction
Perhaps somewhat surprisingly, we will show that dominance with any subexponentially decreasing coefficients is sufficient for prediction, in a weak sense of convergence of expected averages
Dominance with any polynomially decreasing coefficients, and also with coefficients decreasing (for example) as SYMBOL , is sufficient for (almost sure) prediction on average (i.e. in the Cesaro sense)
However, for prediction on every step we have a negative result: for any dominance coefficients that go to zero there exists a pair of measures SYMBOL and SYMBOL which satisfy the dominance inequality but SYMBOL does not predict SYMBOL in the sense of almost sure convergence of probabilities
Thus the situation is similar to that for predicting any stationary measure: prediction is possible in the average but not on every step
Note also that for Laplace's measure SYMBOL it can be shown that SYMBOL dominates any iid measure SYMBOL with linearly decreasing coefficients SYMBOL ; a generalization of SYMBOL for predicting all measures with memory SYMBOL (for a given SYMBOL ) dominates them with polynomially decreasing coefficients
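The linear dominance claim for Laplace's measure can be checked by brute force on short words. The sketch below assumes the standard closed form of Laplace's measure on a binary word with k ones among n symbols, namely 1/((n + 1) C(n, k)), and takes 1/(n + 1) as the candidate coefficients; both are assumptions, since the corresponding formulas are elided in the text, and the function names are hypothetical. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product
from math import comb


def laplace_word_probability(word):
    """Laplace's measure of a binary word: 1 / ((n + 1) * C(n, k))."""
    n = len(word)
    k = sum(word)
    return Fraction(1, (n + 1) * comb(n, k))


def bernoulli_word_probability(p, word):
    """Probability of the word under a Bernoulli iid measure with parameter p."""
    n = len(word)
    k = sum(word)
    return p ** k * (1 - p) ** (n - k)


def dominance_coefficient(p, n):
    """Largest c with laplace(x) >= c * bernoulli_p(x) for every word x of length n.

    Assumes 0 < p < 1 so that no word has zero probability.
    """
    return min(laplace_word_probability(x) / bernoulli_word_probability(p, x)
               for x in product((0, 1), repeat=n))


if __name__ == "__main__":
    p = Fraction(1, 3)  # illustrative parameter
    for n in range(1, 11):
        c = dominance_coefficient(p, n)
        # The best coefficient is never smaller than 1 / (n + 1).
        print(n, float(c), float(Fraction(1, n + 1)), c >= Fraction(1, n + 1))
```

The bound follows because C(n, k) p^k (1 - p)^(n - k) is a binomial probability and hence at most 1, so the ratio of the two word probabilities is at least 1/(n + 1).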
Thus dominance with decreasing coefficients generalizes (in a sense) predicting countable classes of measures (where we have dominance with a constant), absolute continuity (via local absolute continuity), and predicting iid and finite-memory measures
Another way to look for generalizations  is as follows
The Bayes mixture  SYMBOL , being a sum of countably many measures (predictors), possesses some of their predicting properties
In general, which predictive properties are preserved under summation
In particular, if we have two predictors  SYMBOL  and  SYMBOL  for two classes of measures, we are interested in the question whether  SYMBOL  is a predictor for the union of the two classes
An answer to this question would improve our understanding of how far a class of measures for which a predicting measure exists can be extended without losing this property
Thus, the second question we consider is the following: suppose that a measure SYMBOL predicts SYMBOL (in some weak sense), and let SYMBOL be some other measure (e.g. a predictor for a different class of measures)
Does the measure  SYMBOL  still predict  SYMBOL
That is, we ask to which prediction quality criteria the idea of taking a Bayesian sum generalizes
Absolute continuity is preserved under summation along with its (strong) prediction ability
It was mentioned in  CITATION  that prediction in the (weak) sense of convergence of expected averages of conditional probabilities is preserved under summation
Here we find that several stronger notions of prediction are not preserved under summation
Thus we address the following two questions
Is dominance with decreasing coefficients sufficient for prediction in some sense, under some  conditions on the coefficients
And, if a measure  SYMBOL  predicts a measure  SYMBOL  in some sense, does the measure  SYMBOL  also predict  SYMBOL  in the same sense, where  SYMBOL  is an arbitrary measure
Considering different criteria of prediction (a.s. convergence of conditional probabilities, a.s. convergence of averages, etc.) in the above two questions we obtain not two but many different questions, some of which we answer in the positive and some in the negative, while some are left open
The paper is organized as follows
Section~ introduces necessary notation and measures of divergence of probability measures
Section~ addresses the question of whether dominance with decreasing coefficients is sufficient for prediction, while in Section~ we consider the question of summing a predictor with an arbitrary measure
Both sections~ and~ also propose some open questions and directions for future research
In Section~ we discuss some interesting special cases of the questions considered, and also some related problems
