### abstract ###
often research in judgment and decision making requires comparison of multiple competing models
researchers invoke global measures such as the rate of correct predictions or the sum of squared or absolute deviations of the various models as part of this evaluation process
reliance on such measures hides the often very high level of agreement between the predictions of the various models and does not highlight properly the relative performance of the competing models in those critical cases where they make distinct predictions
to address this problem, we propose the use of pair-wise comparisons of models to produce more informative and targeted comparisons of their performance, and we illustrate this procedure with data from two recently published papers
we use multidimensional scaling of these comparisons to map the competing models
we also demonstrate how intransitive cycles of pair-wise model performance can signal that certain models perform better for a given subset of decision problems
### introduction ###
the field of behavioral decision making is, to a large degree, phenomenon driven
after a certain empirical regularity is discovered and validated, researchers test multiple models (some old and some new) to explain the result
for example, every model of decision making under risk is expected to account for the classical allais paradox
when new models are proposed, researchers often justify them by a series of comparisons against the older models in the field
there are several approaches for testing decision models
in the context of axiomatic models, there is a focus on small subsets of problems judiciously chosen to be diagnostic and to differentiate optimally between certain models CITATION
others seek data from multiple published studies involving decision problems selected by different researchers according to various, often unspecified, criteria and compare how well the models predict them CITATION
an alternative approach is to compare the models' ability to predict decision behavior in a set of problems sampled randomly from a well-defined universe of problems CITATION
in all these methods the researcher assembles a data set consisting of an array of n decision problems and m models
for each problem there is one empirical response (decision), D_i (i = NUMBER, ..., n), which can take one of many forms, such as a binary choice, a probability of a choice pattern, or a numerical value such as a probability estimate, a certainty equivalent, etc., and a set of predictions, d_i (i = NUMBER, ..., n), generated by the various models
there are numerous ways to evaluate the fit of the models CITATION, and a full review is well beyond the scope of this note
for our purposes it is sufficient to say that most of them are based on some discrepancy function f(D, d) between the responses D and the predictions d that summarizes discrepancies across all n decisions, and can be formulated to take its optimal (desirable) value in the case of n perfect predictions
thus, one can always rank (and in some cases also scale) models according to how close to, or distant from, a perfect fit they are
some simple examples of such functions, with D denoting a response and d a prediction, are (a) the proportion of correct predictions, or of predictions corrected for chance; (b) the mean or median of f(D, d), where f could rely on squared deviations (D - d)^2, absolute deviations |D - d|, ratios D/d, or their logarithms log(D/d); (c) relative measures such as relative squared deviations (D - d)^2 / d; (d) measures based on the likelihood function of the data under a certain model, etc.
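as a minimal sketch of measures (a) through (c), the following python functions compute a few of these discrepancy summaries from paired lists of responses and predictions
the function names, the kappa-style chance correction, and the choice to divide relative squared deviations by the prediction are assumptions made for illustration, not definitions taken from the studies cited here

```python
import math
from statistics import mean, median


def proportion_correct(observed, predicted):
    """(a) share of problems where the predicted choice equals the observed one."""
    return mean(1.0 if o == p else 0.0 for o, p in zip(observed, predicted))


def chance_corrected(prop_correct, chance=0.5):
    """(a) one common correction (assumed here): rescale so chance maps to 0 and perfect fit to 1."""
    return (prop_correct - chance) / (1.0 - chance)


def mean_squared_deviation(observed, predicted):
    """(b) mean of (D - d)^2."""
    return mean((o - p) ** 2 for o, p in zip(observed, predicted))


def mean_absolute_deviation(observed, predicted):
    """(b) mean of |D - d|."""
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def median_absolute_deviation(observed, predicted):
    """(b) median of |D - d|, less sensitive to a few badly predicted problems."""
    return median(abs(o - p) for o, p in zip(observed, predicted))


def mean_log_ratio(observed, predicted):
    """(b) mean of log(D / d); requires strictly positive responses and predictions."""
    return mean(math.log(o / p) for o, p in zip(observed, predicted))


def mean_relative_squared_deviation(observed, predicted):
    """(c) mean of (D - d)^2 / d; dividing by the prediction is an assumption."""
    return mean((o - p) ** 2 / p for o, p in zip(observed, predicted))
```

each function returns a single number per model, so m models evaluated on the same n problems can be ranked by sorting on any one of these values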
to illustrate the approach, we review in some detail a few such studies
brandstatter, gigerenzer, and hertwig CITATION report results of four model contests using different data sets with a total of n = NUMBER decision problems (n = NUMBER, n = NUMBER, n = NUMBER, n = NUMBER)
they compared m = NUMBER models and used several sets of parameters for models with free parameters, such as cumulative prospect theory CITATION
their measure of fit was the percent of correct predictions of the choice made by a majority of subjects (henceforth majority choice), averaged across all the n decision problems in each data set
they also report the percent of correct predictions of majority choice and the percent agreement between all model pairs across the n = NUMBER decision problems
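the two summaries described here, percent correct prediction of majority choice and percent agreement between model pairs, can be sketched as follows
the model names and the binary choice vectors are hypothetical and serve only to show the computation; they are not data from any of the studies discussed

```python
from itertools import combinations


def percent_correct(majority, predictions):
    """percent of problems on which a model predicts the majority choice."""
    hits = sum(1 for m, p in zip(majority, predictions) if m == p)
    return 100.0 * hits / len(majority)


def pairwise_agreement(models):
    """percent of problems on which each pair of models makes the same prediction."""
    result = {}
    for (name_a, pred_a), (name_b, pred_b) in combinations(models.items(), 2):
        same = sum(1 for a, b in zip(pred_a, pred_b) if a == b)
        result[(name_a, name_b)] = 100.0 * same / len(pred_a)
    return result


# hypothetical data: 1 = option A chosen, 0 = option B chosen, six decision problems
majority = [1, 0, 1, 1, 0, 1]
models = {
    "model_a": [1, 0, 1, 0, 0, 1],
    "model_b": [1, 0, 1, 1, 1, 1],
    "model_c": [0, 0, 1, 0, 0, 1],
}

fit = {name: percent_correct(majority, pred) for name, pred in models.items()}
agreement = pairwise_agreement(models)
```

in this made-up example model_a and model_b have the same percent correct yet disagree on two of the six problems, which a single global fit value does not reveal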
a similar approach is used in several chapters in gigerenzer  todd  and the abc research group  CITATION
hau, pleskac, kiefer, and hertwig CITATION considered n = NUMBER decision problems involving NUMBER subjects from three experiments and compared m = NUMBER models (see figure NUMBER in their paper)
their measure of fit was overall percent correct predictions
erev et al. CITATION analyzed three model contests, each using a different decision paradigm, with two problem sets (n = NUMBER in each set)
