K nearest neighbour vs User based nearest neighbour _问答_开发者

I was reading up on recommender systems on wikipedia and the section o开发者_运维百科n "Algorithms" seems to suggest that K nearest neighbour and collaborative filtering based user based algorithm are two different things. Is this correct? Given my understanding, aren't they both the same? If not, what are the differences between them? Thanks.

Not exactly. They are similar (they share same ideas), but there are several major differences between them. In fact, article on Wikipedia describes only 2 most distinct ways to implement recommender systems, but there are much more of them that use idea from both these ways.

So here's how I understood Wikipedia article.

1st approach (KNN/profiles similarity)

First of all, KNN is not the main feature of the first approach. It's just an algorithm to find nearest items among the whole collection, so it can be used in collaborative filtering as well. The most important idea lies in a term "similarity". To recommend something to the user in question, you find people from his neighborhood that have similar profile. For example, you want to make recommendation for user John on the Facebook. You look at his Fb profile and then at profiles of his friends. You find 10 people with similar profiles and check what they like. If 8 of 10 people with similar profiles like new film, most probably John will like it too.

So, there are 2 important points here:

you look at user's neighborhood
you measure similarity of their profiles

Wikipedia article doesn't cover question of how to find similarity measure, but there are many ways, including searching for common terms in profile's text, finding best friends (my number of messages between them, connection graph analysis, etc.) and many others.

2nd approach (collaborative filtering)

In the second approach you don't need to analyze neighborhood and find similar profiles, but you need to collect users choices. Let's recall an example with Facebook user John. Imagine, that we can get all "likes" of all Fb users, including those of John. With them you can build very large correlation matrix, where rows are user ids and columns are all possible items they may "like". If the actually "liked" an item, cell for current user and current item is set to 1, otherwise it is 0.

With such a matrix (built or abstract) you can use association mining to find the most strong associations. For example, 10000 people who liked "Pirates of the Caribbean 2" also liked "Pirates of the Caribbean 3", but only 500 of them liked "Saw". So we can suppose that association between 2 episodes of "Pirates" is much stronger. Note, that we haven't analyzed neither users, nor films themselves (we didn't take into account film names, plots, actors or something like that - only "likes"). This is major advantage of collaborative filtering over methods based on similarity.

Finally, to recommend film to out user John you just iterate over his "likes" and find other items that have strongest associations with the current one.

So, important points here are:

you don't use neighborhood, but instead complete database of all users
you use people's choices and find associations

Both approaches have their strong and weak points. 1st approach is based on some kind of connections between people (e.g. friends on Facebook) and hardly can be used for services like Amazon. At the same time 2nd approach is based on the averaged preferences of all users and thus isn't good option for systems with highly different favors.

I'm going to give an illustration of two methods:

The difference between the 2 methods is pretty much the same as asking your neigbhbours for an advice compared to asking your friends :

The similarity between 2 persons for collaborative filtering is define by preference you share
The similarity for K-nearest neighbour is defined by a distance (but the distance can be the same as for collaborative filtering)

Moreover in the first case you look K neighbourgh (that is a fix number) and in the second you look at all your dataset.

In some cases on can give better advice than the other :

For example:

If I want to buy a TV and I ask my friend that live in a big house, He will advice me a large screen. But my neighbour who lives in the appartment next to me will advice me a small one because the big won't fit in his appartment. (similar to mine). So my neighbour advice is better in this case.

If I want to buy a movie, my best friend will abviously give me a better advice than my neighbour.

The k-Nearest Neighbors algorithm is a more general algorithm and domain-independent, whereas User-based Methods are domain specific and can be seen as an instance of a k-Nearest Neighbors method.

In k-Nearest Neighbors methods you can use a specific similarity measure to determine the k-closest data-points to a certain data-point x. Then you can use this local-neighbourhood to make some predictions about some (unknown) properties of x, for example it's class-label (classification) or a certain attribute value (regression). The closeness between the data-points and x can be measured with any similarity function you define (e.g. Euclidean distance, cosine similarity). In addition, points that are further away from x can have less influence on the prediction of the value of interest of x. For example, by some weighting function you define you can weight the data-points in the neighborhood of x inversersely proportional to their distance from x. The intuition is that the further away a point of x is the less influence it has on the prediction for x as it is less similar. Lastly, the parameter k determines how many data-points you wish to consider for the construction of this local-neighborhood.

User-based nearest neighbours are a type of collaborative filtering methods coming from the field of Information Retrieval (IR). The fact that you used "User-based" in your question means that you refer to a specific domain, usually based on some user-behaviour like which movies/products did the user rate highly/buy and what other movies could that person like based on this. In this domain it is much more frequent that multiple values are missing, e.g. almost no-one rated or bought most products/services. However, you can implement the User-based nearest neighbourhood algorithm in such a way that it equivalent to the k-Nearest Neighbors method. Once again you can calculate the similarity between x (user) and all other data-points (user) based on some similarity function and weight it inversely to the similarity with a weighting function. The similarity of x and other data-points is only defined over the non-missing values, and k can be set to include all data-points (but is not required). In order to speed-up calculations in finding the local neighbourhood of x you can use techniques such as Locality Sensitive Hashing and use only a subset of the neighborhood where k << total data-set size. In practice, you could want to predict the missing values in x indicating that the user has not bought/used a service before. If based on the local neighborhood of x, a high value is predicted for a missing value you can use this to advertise that item to the user.

In short, User-Based methods are just an instance of the k-Nearest Neighbor algorithm most used in the specific domain of Information Retrieval.