🎬 Movie recommendation using Content-Based and Collaborative Filtering


1. Recommendation Systems

Recommendation Systems (RS) are a subclass of information filtering systems that suggest the items most pertinent to a particular user. Typically, the suggestions support decisions such as what product to purchase, what music to listen to, or what online news to read. Recommender systems are particularly useful when an individual needs to choose from a potentially overwhelming number of items that a service may offer. There are many types of RS:

  1. Content-Based Filtering: Recommendations based on product attributes and their similarities;
  2. Collaborative Filtering (CF): Uses the 'wisdom of the crowd' to match recommendations to users;
    • Memory-based CF: Relies on historical data to produce recommendations
      • User-based filtering: Based on users with similar tastes.
      • Item-based filtering: Based on items that users rate similarly.
    • Model-based CF: Finds underlying patterns inside the data to predict best recommendations.

2. Importing Data

We begin the project by importing the libraries and files to work with. The files used here contain many user ratings of movie titles, from 1 (bad) to 5 (excellent) stars. Before looking into the data, we have to combine two files to translate each movie_id into the movie title. Then, we can perform an exploratory analysis on the DataFrame.
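As a rough illustration of the combining step, the sketch below joins two tiny in-memory tables on movie_id. The table contents and the column names other than movie_id (user_id, rating, title) are assumptions made for this example, not the actual files.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two files; column names are assumed.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 1],
})
titles = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Attach the title to each rating row by joining on the shared movie_id key.
df = ratings.merge(titles, on="movie_id", how="left")

# Quick check of size and missing values before the exploratory analysis.
print(df.shape)
print(df.isnull().sum())
```

A left join keeps every rating row even if a title were missing, which makes unmatched ids visible as nulls instead of silently dropping them.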

A quick look shows us that our data file contains ~100,000 non-null records of user-movie ratings. Let's continue with the EDA:

3. EDA

Here we search for insights within the data. Let's check the data distributions, aggregate information, and engineer features to analyze before building the first recommendation system.
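One possible way to compute such aggregates is sketched below on a toy DataFrame. The column names (user_id, title, rating) and the values are assumptions for illustration only.

```python
import pandas as pd

# Toy ratings table; the real data has roughly 100,000 rows.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
    "rating": [5, 4, 3, 2, 4, 1],
})

# Per-movie aggregates: average rating and number of ratings received.
movie_stats = (
    df.groupby("title")["rating"]
      .agg(mean_rating="mean", n_ratings="count")
      .sort_values("n_ratings", ascending=False)
)
print(movie_stats)

# Distribution of the star values over the whole sample.
print(df["rating"].value_counts().sort_index())
```

The number of ratings per movie is a useful engineered feature, since an average built on very few ratings is not very reliable.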

To summarize:

4. Content-based systems

For this model, we build a user-item matrix holding every user's rating for every movie (NaN where a user has not rated a movie). We then choose a movie that was probably a blockbuster (it has many reviews) and build recommendations for it.
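A minimal sketch of the user-item matrix, assuming a ratings table with user_id, title and rating columns (hypothetical names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie C", "Movie B", "Movie A"],
    "rating": [5, 3, 4, 2, 1, 4],
})

# Rows are users, columns are movies; unrated pairs are left as NaN.
user_item = df.pivot_table(index="user_id", columns="title", values="rating")

# Pick the movie with the most ratings as the reference title.
most_rated = df["title"].value_counts().idxmax()

print(user_item)
print(most_rated)
```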

This is a list of the 10 movies most similar to Toy Story. We can simplify this process with a function that prints the most similar movies for a given title:
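Such a helper could look like the sketch below. It assumes that similarity is measured as the Pearson correlation between rating columns of the user-item matrix; the function name similar_movies and the min_overlap parameter are hypothetical and not taken from the original notebook.

```python
import pandas as pd

def similar_movies(user_item, title, top_n=10, min_overlap=2):
    """Return the movies whose rating columns correlate most with `title`."""
    target = user_item[title]
    # Pearson correlation of each movie column with the target column,
    # computed over the users who rated both movies.
    corr = user_item.corrwith(target)
    # Number of users who rated both the target and each other movie.
    both = user_item.notna().astype(int).mul(target.notna().astype(int), axis=0)
    result = pd.DataFrame({"correlation": corr, "overlap": both.sum()})
    # Ignore pairs with too few common raters, and the target itself.
    result = result[result["overlap"] >= min_overlap]
    result = result.drop(index=title, errors="ignore").dropna()
    return result.sort_values("correlation", ascending=False).head(top_n)

# Toy user-item matrix (rows: users, columns: movies, NaN: not rated).
user_item = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0],
        "Movie B": [4.0, 5.0, 2.0, float("nan")],
        "Movie C": [1.0, 2.0, 5.0, 4.0],
    },
    index=[1, 2, 3, 4],
)
result = similar_movies(user_item, "Movie A")
print(result)
```

The overlap filter is a design choice: correlations computed from only a handful of common raters can look perfect by chance, so a minimum number of shared ratings keeps obscure titles from dominating the list.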

5. Collaborative Filtering


5.1 Model Based CF

This system finds underlying patterns in data as latent variables, and uses them to predict how any user is expected to rate an item.

To prepare the model, we need to create user-item matrices for training and testing. First, we fill both matrices with zeros, with one row per user and one column per movie. Then, we use the pandas itertuples() method to replace the corresponding entries with the ratings from the DataFrame sample (for both the training and the test data).
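The construction described above can be sketched as follows. The column names, the assumption that ids start at 1 and are contiguous, and the fixed row split are all illustrative choices rather than the notebook's actual setup.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "movie_id": [1, 2, 1, 3, 2, 3],
    "rating": [5, 3, 4, 2, 1, 4],
})

n_users = int(df["user_id"].max())
n_items = int(df["movie_id"].max())

# Hypothetical fixed split, only to keep the example deterministic.
train_df = df.iloc[:4]
test_df = df.iloc[4:]

def to_matrix(frame, n_users, n_items):
    """Fill a zero matrix with ratings; ids are assumed to start at 1."""
    matrix = np.zeros((n_users, n_items))
    for row in frame.itertuples(index=False):
        matrix[row.user_id - 1, row.movie_id - 1] = row.rating
    return matrix

train_matrix = to_matrix(train_df, n_users, n_items)
test_matrix = to_matrix(test_df, n_users, n_items)
print(train_matrix)
print(test_matrix)
```

Both matrices share the same shape, so a prediction made from the training matrix can be compared entry by entry with the test matrix.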

After building the train and test matrices, we rely on a technique called matrix factorization (MF), which approximates the user-item matrix as the product of two low-rank matrices whose rows contain the latent vectors. We fit these matrices so that their product approximates the original matrix as closely as possible; the product also fills in the entries missing from the original matrix.

5.1.1 SVD

A well-known matrix factorization method is singular value decomposition (SVD). Collaborative filtering can be formulated as approximating a matrix X with its singular value decomposition.

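The general equation can be expressed as follows:

Given an m x n matrix X:

$$X = U S V^T$$

where U is an m x r matrix with orthonormal columns, S is an r x r diagonal matrix with non-negative real numbers on its diagonal, and $V^T$ is an r x n matrix with orthonormal rows.

The elements on the diagonal of S are known as the singular values of X.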

Matrix X can be factorized into U, S and V. The U matrix represents the feature vectors corresponding to the users in the hidden feature space, and the V matrix represents the feature vectors corresponding to the items in the hidden feature space.

Now we can make predictions by taking the matrix product of U, S and $V^T$.
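A minimal sketch of this prediction step is shown below, using numpy.linalg.svd and keeping only the k largest singular values. The toy matrix and k = 2 are arbitrary assumptions; for a large sparse matrix, a truncated solver such as scipy.sparse.linalg.svds is a common alternative.

```python
import numpy as np

# Toy training matrix (rows: users, columns: movies, 0: not rated).
train_matrix = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

k = 2  # number of latent factors kept (an arbitrary choice for this example)

# Full SVD, then keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(train_matrix, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Rank-k reconstruction: every entry, including the former zeros, gets a score.
pred = U_k @ S_k @ Vt_k
print(np.round(pred, 2))
```

Keeping fewer factors than the full rank is what makes the reconstruction differ from the input and lets it assign scores to unrated entries.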

To evaluate this model, we flatten the prediction matrix and the test matrix into 1D arrays and compare how similar they are. To do so, we first drop the entries that have no rating in the test matrix, that is, the zeros.
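The evaluation described above can be sketched like this. The helper name rmse and the toy matrices are assumptions for illustration.

```python
import numpy as np

def rmse(pred, truth):
    """RMSE computed only over the entries that are rated (nonzero) in `truth`."""
    mask = truth.nonzero()
    pred_rated = pred[mask].flatten()
    true_rated = truth[mask].flatten()
    return np.sqrt(np.mean((pred_rated - true_rated) ** 2))

# Toy example: only two entries of the test matrix carry a rating.
test_matrix = np.array([[0.0, 4.0], [2.0, 0.0]])
pred = np.array([[3.0, 3.0], [4.0, 1.0]])
score = rmse(pred, test_matrix)
print(score)
```

Masking matters here: without it, every unrated pair would count as a true rating of zero and dominate the error.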

Considering the results above, the RMSE is about 2.68: on a root-mean-square basis, predictions deviate from the actual rating by roughly 2.68 stars. We might also try grouping movies by genre or by decade of production to improve the model's accuracy.

5.2 Memory-Based CF

Memory-Based Collaborative Filtering approaches can be divided into two main types: user-based (user-item) filtering and item-based (item-item) filtering.

Here we choose user-item filtering to predict movie ratings.

To build the model, we use a distance metric to calculate the similarity between users.

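A distance metric commonly used in recommender systems is cosine similarity, where the ratings are seen as vectors in n-dimensional space and the similarity is calculated from the angle between these vectors. Cosine similarity for users k and a can be calculated with the formula below, where you take the dot product of the user vector $u_k$ and the user vector $u_a$ and divide it by the product of the Euclidean lengths of the vectors:

$$\mathrm{sim}(u_k, u_a) = \frac{u_k \cdot u_a}{\lVert u_k \rVert \, \lVert u_a \rVert} = \frac{\sum_m x_{k,m} \, x_{a,m}}{\sqrt{\sum_m x_{k,m}^2 \, \sum_m x_{a,m}^2}}$$

Here $x_{k,m}$ is the rating that user k gave to movie m.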
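As a small illustration, the pairwise cosine similarity between all users can be computed directly with NumPy. The function name and the eps guard against division by zero are assumptions of this sketch; scikit-learn's sklearn.metrics.pairwise.cosine_similarity offers a ready-made alternative.

```python
import numpy as np

def cosine_similarity_matrix(ratings, eps=1e-9):
    """Pairwise cosine similarity between the rows (users) of `ratings`."""
    dot = ratings @ ratings.T          # dot products between all user pairs
    norms = np.sqrt(np.diag(dot))      # Euclidean length of each user vector
    return dot / (np.outer(norms, norms) + eps)

# Toy ratings: users 0 and 1 point in the same direction, user 2 is orthogonal.
ratings = np.array([
    [4.0, 2.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 4.0],
])
user_sim = cosine_similarity_matrix(ratings)
print(np.round(user_sim, 3))
```

Because all ratings are non-negative, the similarities fall between 0 (no movies in common) and 1 (identical rating direction).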

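Next, we can use the metric to make predictions. We can make a prediction by applying the following formula for user-based CF:

$$\hat{x}_{k,m} = \bar{x}_k + \frac{\sum_{u_a} \mathrm{sim}(u_k, u_a) \, (x_{a,m} - \bar{x}_a)}{\sum_{u_a} \lvert \mathrm{sim}(u_k, u_a) \rvert}$$

where $\hat{x}_{k,m}$ is the predicted rating of user k for movie m, $\bar{x}_k$ and $\bar{x}_a$ are the average ratings of users k and a, and $\mathrm{sim}(u_k, u_a)$ is the similarity between the two users.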

Think of the similarity between users k and a as a weight that multiplies the ratings of the similar user a (corrected for that user's average rating). The weighted sum is normalized by the sum of the similarities so that the ratings stay on the 1-to-5 scale and, as a final step, the average rating of the user being predicted is added back.

The idea here is that some users tend to always give high or low ratings to all movies. The relative differences between the ratings these users give are more important than the absolute values.
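The prediction step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the function name is hypothetical, the similarity matrix is given by hand, and each user's average is taken over all columns of the matrix.

```python
import numpy as np

def predict_user_based(ratings, similarity):
    """Mean-centred, similarity-weighted prediction for every user-item pair."""
    # Average rating per user (simplification: taken over all columns).
    user_mean = ratings.mean(axis=1, keepdims=True)
    # Remove each user's rating bias before combining users.
    centred = ratings - user_mean
    # Weight the centred ratings of every user a by sim(k, a) and sum over a.
    weighted = similarity @ centred
    # Normalize by the total absolute similarity of each user k.
    norm = np.abs(similarity).sum(axis=1, keepdims=True)
    # Add back the average rating of the user being predicted.
    return user_mean + weighted / norm

ratings = np.array([[5.0, 3.0, 1.0], [3.0, 1.0, 2.0]])
similarity = np.array([[1.0, 0.5], [0.5, 1.0]])
pred = predict_user_based(ratings, similarity)
print(np.round(pred, 3))
```

On a sparse matrix where zeros stand for missing ratings, averaging over all columns pulls each user's mean towards zero; computing the mean over rated entries only would be a natural refinement.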

6. Conclusion


To conclude, every recommender system has its pros and cons, and each suits some problems and data better than others. Collaborative filtering usually performs better, but requires more computational capacity (and sometimes more data). In this case, the model-based approach with SVD matrix factorization performed better in terms of RMSE than the memory-based approach with cosine similarity. The content-based approach is simpler, though its performance cannot be measured in the same way, and it is less likely to outperform the other two.