### abstract ###
A collaborative filtering system recommends to users products that similar users like
Collaborative filtering systems influence purchase decisions, and hence have become targets of manipulation by unscrupulous vendors
We provide theoretical and empirical results demonstrating that while common nearest neighbor algorithms, which are widely used in commercial systems, can be highly susceptible to manipulation, two classes of collaborative filtering algorithms which we refer to as  linear  and  asymptotically linear  are relatively robust
These results provide guidance for the design of future collaborative filtering systems
### introduction ###
While the expanding universe of products available via Internet commerce provides consumers with valuable options, sifting through the numerous alternatives to identify desirable choices can be challenging
Collaborative filtering (CF) systems aid this process by recommending to users products desired by similar individuals
At the heart of a CF system is an algorithm that predicts whether a given user will like various products based on his past behavior and that of other users
Nearest neighbor (NN) algorithms, for example, have enjoyed wide use in commercial CF systems, including those of Amazon, Netflix, and Youtube  CITATION
A prototypical NN algorithm stores each user's history, which may include, for instance, his product ratings and purchase decisions
To predict whether a particular user will like a particular product, the algorithm identifies a number of other users with similar histories
A prediction is then generated based on how these so-called neighbors have responded to the product
This prediction could be, for example, a weighted average of past ratings supplied by neighbors
Because purchase decisions are influenced by CF systems, they have become targets of manipulation by unscrupulous vendors
For instance, a vendor can create multiple online identities and use each to rate his own product highly and competitors' products poorly
As an example, Amazon's CF system was manipulated so that users who viewed a spiritual guide written by a well-known Christian evangelist were subsequently recommended a sex manual for gay men  CITATION
Although this incident may not have been driven by commercial motives, it highlights the vulnerability of CF systems
The research literature offers further empirical evidence that NN algorithms are susceptible to manipulation  CITATION
In order to curb manipulation, one might consider authenticating each user by asking for, say, a credit card number to limit the number of fake identities
This may be effective in some situations
However, in web services that do not facilitate financial transactions, such as Youtube, requiring authentication would intrude privacy and drive users away
One might also consider using only customer purchase data, when they are available, as a basis for recommendations because they are likely generated by honest users
Recommendation quality may be improved, however, if higher-volume data such as page views are also properly utilized
In this paper, we seek to understand the extent to which manipulators can hurt the performance of CF systems and how CF algorithms should be designed to abate their influence
We find that, while NN algorithms can be quite sensitive to manipulation, CF algorithms that carry out predictions based on a particular class of probabilistic models are surprisingly robust
For reasons that we will explain in the paper, we will refer to algorithms of this kind as  linear CF algorithms
We find that as a user rates an increasing number of products, the average accuracy of predictions made by a linear CF algorithm becomes insensitive to manipulated data
For instance, even if half of all ratings are provided by manipulators who try to promote half of the products, predictions for users with long histories will barely be distorted, on average
To provide some intuition for why our results should hold, we now offer an informal argument
A robust CF algorithm should learn from its mistakes
In particular, differences between its predictions and actual ratings should help improve predictions on future ratings
A linear CF algorithm generates predictions based on a probability distribution that is a convex combination of two distributions: one that it would learn given only data generated by honest users and one that it would learn given only manipulated data
As a user whose ratings we wish to predict provides more ratings, it becomes increasingly clear which of these two distributions better represents his preferences
As a result, the weight placed on manipulated data diminishes and distortion vanishes
The main theoretical result of this paper formalizes the above argument
In particular, we will define a notion of distortion induced by manipulators and establish an upper bound on distortion, which takes a particularly simple form:  SYMBOL  SYMBOL r SYMBOL n SYMBOL n SYMBOL 80\% SYMBOL 10\% SYMBOL 75\% SYMBOL 21$ products before receiving recommendations
To broaden the scope of our analysis, we will also study CF algorithms that behave like linear CF algorithms asymptotically as the size of the training set grows
This class of algorithms, which we refer to as  asymptotically linear , is more flexible in accommodating modeling assumptions that may improve prediction accuracy
We will establish that a relaxed version of our distortion bound for linear CF algorithms applies to asymptotically linear CF algorithms
We will also show that our distortion bound does not generally hold for NN algorithms
Intuitively, this is because prediction errors do not always improve the selection of neighbors
In particular, as a user provides more ratings, manipulated data that contribute to inaccurate predictions of his future ratings may remain in the set of neighbors while data generated by honest users may be eliminated from it
As a result, distortion of predictions may not decrease
We will later provide an example to illustrate this
In addition to theoretical results, this paper provides an empirical analysis using a publicly available set of movie ratings generated by users of Netflix's recommendation system
We produce a distorted version of this data set by injecting manipulated ratings generated using a manipulation technique studied in prior literature
We then compare results from application of three CF algorithms: an NN algorithm, a linear CF algorithm called the kernel density estimation algorithm, and an asymptotically linear CF algorithm called the naive Bayes algorithm
Results demonstrate that while performance of the NN algorithm is highly susceptible to manipulation, those of kernel density estimation and naive Bayes algorithms are relatively robust
In particular, the latter two experience distortion lower than the theoretical bound we provide, whereas the distortion for the former exceeds it by far
One might also wonder whether manipulation robustness of a CF algorithm comes at the expense of its prediction accuracy
As an example, consider an algorithm that fixes predictions for all ratings to be a constant, without regard to the training data
This algorithm is uninfluenced by manipulation but is likely to yield poor predictions, and is therefore not useful
In our experiments, the accuracy demonstrated by the three algorithms all seems reasonable
This suggests that accuracy of a CF algorithm may be achieved alongside robustness
Our theoretical and empirical results together suggest that commercial recommendation systems using NN algorithms can be made more robust by adopting approaches that we describe
Note that we are not proposing that real-world systems should implement the specific algorithms we present in this paper
Rather, our analysis highlights properties of CF algorithms that lead to robustness and practitioners may benefit from taking these properties into consideration when designing CF systems
This paper is organized as follows
In the next section, we discuss some related work
In Section , we formulate a simplified model that serves as a context for studying alternative CF algorithms
We then establish results concerning the manipulation robustness of NN, linear, and asymptotically linear CF algorithms in Section
In Section , we present our empirical study
We make some closing remarks in a final section
