Deploying a recommender system for the movie-lens dataset – Part 1

15.7.2019 | 10 minutes reading time

Introduction

In this post I will discuss building a simple recommender system for a movie database which will be able to:
– suggest top N movies similar to a given movie title to users, and
– predict user votes for the movies they have not voted for.
In the next part of this article I will show how to deploy this model using a REST API in Python Flask, to make this recommendation system easy to use in production.

A recommender system for a movie database

Recommender systems are so prevalent on the web these days that we have all come across them in one form or another. Have you ever received suggestions on Amazon on what to buy next? Or suggestions on what websites you may like on Facebook?
Aside from the naturally disconcerting feeling of being chased and traced, they can sometimes be helpful in steering us in the right direction. Let’s look at an appealing example of recommendation systems in the movie industry.
I will be using the data provided by the Movie-lens 20M dataset to describe different methods and systems one could build. With a bit of fine-tuning, the same algorithms should be applicable to other datasets as well.

I find the above diagram the best way of categorising different methodologies for building a recommender system. I will briefly explain some of these entries in the context of movie-lens data with some code in Python. Full scripts for this article are accessible on my GitHub page. Suppose someone has watched “Inception (2010)” and loved it! What can my recommender system suggest to them to watch next?
Well, I could suggest different movies on the basis of content similarity to the selected movie, using genres, cast and crew names, keywords and any other metadata from the movie. In that case I would be using item-content filtering. I could also compare user metadata such as age and gender to that of other users and suggest items that similar users have liked. In that case I would be using user-content filtering. The movie-lens dataset used here does not contain any user content data, so as a first step we will build an item-content (here a movie-content) filter.

Memory-based content filtering

In memory-based methods we don’t have a model that learns from the data to predict, but rather we form a pre-computed matrix of similarities that can be predictive. Please read on and you’ll see what I mean!
The data sets I have used for item-content filtering are movies.csv and tags.csv.
I skip the data wrangling and filtering part, which you can find in the well-commented scripts on my GitHub page. We collect all the tags given to each movie by various users, add the movie’s genre keywords and form a final data frame with a metadata column for each movie.

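import pandas as pd
movies = pd.read_csv('movies.csv')  # titles and genres
tags = pd.read_csv('tags.csv')      # user-supplied tags
# create a mixed dataframe of movie titles, genres
# and all user tags given to each movie
mixed = pd.merge(movies, tags, on='movieId', how='left')
mixed.head(3)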

# create metadata from tags and genres
mixed.fillna("", inplace=True)
# join all tags of a movie into one space-separated string
mixed = pd.DataFrame(mixed.groupby('movieId')['tag'].apply(
                             lambda x: ' '.join(x)))
Final = pd.merge(movies, mixed, on='movieId', how='left')
Final['metadata'] = Final[['tag', 'genres']].apply(
                             lambda x: ' '.join(x), axis=1)
Final[['movieId', 'title', 'metadata']].head(3)

We then transform these metadata texts into feature vectors using the TfidfVectorizer of the scikit-learn package. Each movie is transformed into a vector of length ~23000! But we don’t really need such large feature vectors to describe movies. Truncated singular value decomposition (SVD) is a good tool for reducing the dimensionality of our feature matrix, especially when applied to Tf-idf vectors. As you can see from the explained variance graph below, with 200 latent components (reduced from ~23000) we can explain more than 50% of the variance in the data, which suffices for our purpose in this work. So we will keep a latent matrix of 200 components as opposed to 23704, which expedites our analysis greatly.
We name this latent matrix content_latent and use it a few steps later to find the top N most similar movies to a given movie title. But first let’s learn a bit about the ratings data.

from sklearn.feature_extraction.text import TfidfVectorizer
# turn each movie's metadata string into a tf-idf feature vector
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(Final['metadata'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=Final.index.tolist())
print(tfidf_df.shape)

(26694, 23704)

# Compress with SVD
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=200)
latent_matrix = svd.fit_transform(tfidf_df)

# plot variance explained to see what latent dimensions to use
explained = svd.explained_variance_ratio_.cumsum()
plt.plot(explained, '.-', ms=16, color='red')
plt.xlabel('Singular value components', fontsize=12)
plt.ylabel('Cumulative percent of variance', fontsize=12)
plt.show()
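
Instead of reading the number of components off the plot, one could also look it up from the cumulative explained-variance ratios. The following is only a minimal sketch with made-up ratios; the helper name `components_for_variance` is hypothetical and not part of scikit-learn or of the original scripts.

```python
import numpy as np

def components_for_variance(explained_ratio, target):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`, or None if it never does."""
    cumulative = np.cumsum(explained_ratio)
    if cumulative[-1] < target:
        return None
    # searchsorted finds the first index where cumulative >= target
    return int(np.searchsorted(cumulative, target, side="left")) + 1

# made-up ratios for illustration only (not taken from the movie data)
toy_ratio = np.array([0.30, 0.15, 0.10, 0.05, 0.02])
n_needed = components_for_variance(toy_ratio, 0.50)
print(n_needed)
```

With a fitted TruncatedSVD, the same helper could be applied to `svd.explained_variance_ratio_`.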

Memory-based collaborative filtering

Aside from the movie metadata we have another valuable source of information at our disposal: the user rating data. Our recommender system can recommend a movie that is similar to “Inception (2010)” on the basis of user ratings. In other words, which other movies have received similar ratings from other users? This would be an example of item-item collaborative filtering. You might have heard of it as “The users who liked this item also liked these other ones.” The data set of interest is ratings.csv, which we reshape so that each item becomes a vector of the ratings given by the users. As many user votes are missing, we have imputed the NaNs with 0, which suffices for the purpose of our collaborative filtering.
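
To make the reshaping step concrete, here is a small sketch on a toy stand-in for ratings.csv. The values are made up, and using a pandas pivot is my assumption about one way to do it; the full scripts may handle this differently, and at the scale of the real data a sparse matrix may be the more practical choice.

```python
import pandas as pd

# toy stand-in for ratings.csv (values are made up)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# one row per movie, one column per user; unrated cells become 0
item_user = toy_ratings.pivot(index="movieId", columns="userId",
                              values="rating").fillna(0)
print(item_user)
```

Each row of `item_user` is then the rating vector of one movie.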

Here we have movies as vectors of length ~80000. As before, we can apply a truncated SVD to this rating matrix and keep only the first 200 latent components, which we will name the collab_latent matrix. The next step is to use a similarity measure and find the top N most similar movies to “Inception (2010)” on the basis of each of the filtering methods we introduced. Cosine similarity is one of the similarity measures we can use. For a summary of other similarity criteria, see Ref [2], page 93.
In the following, you will see how the similarity of an input movie title can be calculated with both the content and the collaborative latent matrices. I have also added a hybrid filter, which is the average of the content and collaborative similarity scores. If I list the top 10 most similar movies to “Inception (2010)” on the basis of the hybrid measure, you will see the following list in the data frame. For me personally, the hybrid measure predicts more reasonable titles than either of the other filters.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# Content_df and Collab_df hold the content and collaborative latent
# matrices as data frames indexed by movie title
# take the latent vectors for a selected movie from both content
# and collaborative matrices
a_1 = np.array(Content_df.loc['Inception (2010)']).reshape(1, -1)
a_2 = np.array(Collab_df.loc['Inception (2010)']).reshape(1, -1)

# calculate the similarity of this movie with the others in the list
score_1 = cosine_similarity(Content_df, a_1).reshape(-1)
score_2 = cosine_similarity(Collab_df, a_2).reshape(-1)

# an average measure of both content and collaborative
hybrid = ((score_1 + score_2)/2.0)

# form a data frame of similar movies
dictDf = {'content': score_1, 'collaborative': score_2, 'hybrid': hybrid}
similar = pd.DataFrame(dictDf, index=Content_df.index)

# sort it on the basis of either: content, collaborative or hybrid
similar.sort_values('content', ascending=False, inplace=True)
similar[['content']][1:].head(11)
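
The same idea can be wrapped into a small reusable function. The sketch below uses plain NumPy and toy latent vectors. The function names and the `weight` parameter are my own illustrative assumptions, not part of the original scripts; `weight=0.5` corresponds to the plain average used above.

```python
import numpy as np

def cosine_to_all(matrix, query_row):
    """Cosine similarity between one row of `matrix` and every row."""
    matrix = np.asarray(matrix, dtype=float)
    query = matrix[query_row]
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # guard against all-zero rows, whose similarity is set to 0 here
    return np.divide(matrix @ query, norms,
                     out=np.zeros(len(matrix)), where=norms > 0)

def top_n_hybrid(content, collab, titles, query_title, n=10, weight=0.5):
    """Rank titles by a weighted mean of content and collaborative scores."""
    row = titles.index(query_title)
    hybrid = (weight * cosine_to_all(content, row)
              + (1.0 - weight) * cosine_to_all(collab, row))
    order = np.argsort(-hybrid)
    # drop the query movie itself before taking the top n
    return [(titles[i], float(hybrid[i])) for i in order if i != row][:n]

# toy latent vectors (made up): three movies in a 2-d latent space
titles = ["A", "B", "C"]
content = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
collab = [[1.0, 0.0], [0.8, 0.6], [0.1, 1.0]]
ranked = top_n_hybrid(content, collab, titles, "A", n=2)
print(ranked)
```

Excluding the query movie by its row index, rather than by slicing off the first result, avoids dropping the wrong title when two movies tie at a similarity of 1.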

We could use the similarity information we gained from item-item collaborative filtering to compute a rating prediction,
\(r_{ui}\), for an item \((i)\) by a user \((u)\) where the rating is missing, namely by taking a weighted average of the ratings the user gave to the top K nearest neighbours of item \((i)\). Ref [2], page 97, discusses the parameters that can refine this prediction.
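
As a rough illustration of such a weighted average, the sketch below predicts \(r_{ui}\) as \(\sum_j s_{ij} r_{uj} / \sum_j |s_{ij}|\), where \(j\) runs over the K items rated by the user that are most similar to item \(i\) and \(s_{ij}\) is the item-item similarity. This is the plain similarity-weighted mean under my own assumptions (the function name and toy numbers are hypothetical), without any of the refinements discussed in Ref [2].

```python
import numpy as np

def predict_rating(user_ratings, item_sims, k=2):
    """Predict a user's rating for one item.

    user_ratings: the user's ratings for all items, np.nan where missing
    item_sims:    similarity of the target item to every item
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    item_sims = np.asarray(item_sims, dtype=float)
    rated = np.flatnonzero(~np.isnan(user_ratings))
    if rated.size == 0:
        return None
    # keep the k rated items most similar to the target item
    nearest = rated[np.argsort(-item_sims[rated])[:k]]
    weights = np.abs(item_sims[nearest])
    if weights.sum() == 0:
        return None
    return float(np.dot(item_sims[nearest], user_ratings[nearest])
                 / weights.sum())

# toy example: the user has not rated item 1 (made-up numbers)
toy_user = [5.0, np.nan, 3.0, 1.0]
sims_to_item_1 = [0.8, 1.0, 0.2, 0.5]
pred = predict_rating(toy_user, sims_to_item_1, k=2)
print(pred)
```

Here the two nearest rated neighbours are items 0 and 3, so the prediction is (0.8·5 + 0.5·1) / (0.8 + 0.5).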

As mentioned right at the beginning of this article, there are model-based methods that use statistical learning rather than ad hoc heuristics to predict the missing ratings. In the next section, we show how one can use a matrix factorisation model to predict a user’s unknown votes.

Model-based collaborative filtering

Previously we used truncated SVD as a means to reduce the dimensionality of our matrices. To that end, we imputed the missing rating data with zero to compute the SVD of a sparse matrix. However, one could also compute an estimate of the SVD in an iterative learning process. For this purpose we use only the known ratings and try to minimise the error of reproducing them via gradient descent. This algorithm was popularised during the Netflix prize for the best recommender system. Here is a more mathematical description of what I mean for the more interested reader. Otherwise you can skip this part and jump to the implementation part.

Mathematical description

SVD factorizes our rating matrix \(M_{m \times n}\) with a rank of \(k\), according to equation (1a), into
3 matrices \(U_{m \times k}\), \(\Sigma_{k \times k}\) and \(I^T_{k \times n}\):

\(M = U \Sigma_k I^T \tag{1a}\)
\(M \approx U \Sigma_{k\prime} I^T \tag{1b}\)

where \(U\) is the matrix of user preferences and \(I\) the item preferences and \(\Sigma\) the matrix of singular values. The beauty of SVD is in this simple notion that instead of a full \(k\) vector space, we can approximate \(M\) on a much smaller \(k\prime\) latent space as in (1b). This approximation will not only reduce the dimensions of the rating matrix, but it also takes into account only the most important singular values and leaves behind the smaller singular values which could otherwise result in noise. This concept was used for the dimensionality reduction above as well.
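
The effect of truncating to \(k\prime\) singular values can be checked numerically. The sketch below uses NumPy’s full SVD on a small made-up matrix with rank-2 structure plus a little noise. It only illustrates (1a) and (1b) and is not the code used for the movie data.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy matrix: rank-2 structure plus a little noise (made-up data)
M = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
M_noisy = M + 0.01 * rng.normal(size=M.shape)

# It plays the role of I^T; s holds the singular values in descending order
U, s, It = np.linalg.svd(M_noisy, full_matrices=False)

def truncate(U, s, It, k_prime):
    """Rebuild the matrix from the k_prime largest singular values."""
    return U[:, :k_prime] @ np.diag(s[:k_prime]) @ It[:k_prime, :]

M_k2 = truncate(U, s, It, 2)
rel_error = np.linalg.norm(M_noisy - M_k2) / np.linalg.norm(M_noisy)
print(s.round(3), rel_error)
```

Keeping two components reproduces the matrix up to a small relative error, because the remaining singular values mostly carry the added noise.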
To approximate \(M\), we would like to find the \(U\) and \(I\) matrices in \(k\prime\) space using all the known ratings, which means solving an optimisation problem. According to (2), every rating entry \(r_{ui}\) in \(M\) can be written as a dot product of \(p_u\) and \(q_i\):

\(r_{ui} = p_u \cdot q_i \tag{2}\)

where \(p_u\) makes up the rows of \(U\) and \(q_i\) the columns of \(I^T\). Here we disregard the diagonal \(\Sigma\) matrix for simplicity (as it provides only a scaling factor). Graphically it would look something like this:

Finding all \(p_u\) and \(q_i\)s for all users and items will be possible via the following minimisation:

\( \min_{p_u,q_i} \sum_{r_{ui}\in M}(r_{ui} - p_u \cdot q_i)^2 \tag{3}\)

A gradient descent (GD) algorithm (or a variant of it such as stochastic gradient descent, SGD) can be used to solve the minimisation problem and to compute all \(p_u\) and \(q_i\)s. I will not describe the minimisation procedure in more detail here. You can read more about it on this blog or in Ref [2]. After we have all the entries of \(U\) and \(I\), the unknown rating \(r_{ui}\) will be computed according to eq. (2).
The minimisation process in (3) can also be regularised and fine-tuned with biases.
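
For readers who want to see the idea in code, here is a minimal NumPy sketch of SGD on the known ratings only. It is not the Surprise implementation: the function names, the learning rate, the L2 penalty `reg` and the toy ratings are all assumptions for illustration, and bias terms are left out.

```python
import numpy as np

def sgd_factorise(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02,
                  epochs=200, seed=0):
    """Fit p_u and q_i to known (user, item, rating) triples by SGD.

    Minimises the squared error of eq. (3) plus an L2 penalty `reg`
    (an assumption here; set reg=0 for the unregularised form).
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))  # rows are p_u
    Q = rng.normal(scale=0.1, size=(n_items, k))  # rows are q_i
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - P[u] @ Q[i]
            p_old = P[u].copy()
            # gradient step on both factors for this one known rating
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-squared error over the given rating triples."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# toy known ratings as (user index, item index, rating), made up
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
       (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P0, Q0 = sgd_factorise(toy, 3, 3, epochs=0)    # untrained starting point
P, Q = sgd_factorise(toy, 3, 3, epochs=500)
print(rmse(toy, P0, Q0), rmse(toy, P, Q))
```

An unknown rating is then estimated as `P[u] @ Q[i]`, which is eq. (2).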

Implementation

An SVD algorithm similar to the one described above has been implemented in the Surprise library, which I will use here. Aside from SVD, deep neural networks have also been repeatedly used to calculate rating predictions. This blog entry describes one such effort. SVD was chosen because it produces accuracy comparable to neural nets with a simpler training procedure. In the following you can see the steps to train an SVD model in Surprise.
We obtain a root-mean-squared error (RMSE) of 0.77 (the lower the better!) for our rating data, which does not sound bad at all. In fact, with a memory-based prediction from the item-item collaborative filtering described in the previous section, I could not get an RMSE lower than 1.0; that’s a 23% improvement in prediction!
Next we use this trained model to predict ratings for the movies that a given user \(u\), here e.g. with \(id\) = 7010, has not rated yet. The 10 movies with the highest predicted ratings can be recommended to user 7010, as you can see below.

import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# instantiate a reader and read in our rating data
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_f[['userId','movieId','rating']], reader)

# train SVD on 75% of known ratings
trainset, testset = train_test_split(data, test_size=.25)
algorithm = SVD()
algorithm.fit(trainset)
predictions = algorithm.test(testset)

# check the accuracy using Root Mean Square Error
accuracy.rmse(predictions)
# output: RMSE: 0.7724

# check the preferences of a particular user
# (pred_user_rating is a helper that is not shown in this post)
user_id = 7010
predicted_ratings = pred_user_rating(user_id)
pdf = pd.DataFrame(predicted_ratings, columns=['movies', 'ratings'])
pdf.sort_values('ratings', ascending=False, inplace=True)
pdf.set_index('movies', inplace=True)
pdf.head(10)
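
The helper `pred_user_rating` is not shown in this post. The following is only a guess at what such a helper could look like, with a different, more explicit signature than the one-argument call above. It relies on the `predict(uid, iid)` method of a fitted Surprise algorithm, whose result carries the estimate in its `est` attribute. The stand-in model is there only so that the sketch runs without Surprise installed.

```python
from collections import namedtuple

def pred_user_rating(user_id, algorithm, all_movie_ids, rated_movie_ids):
    """Predict ratings for every movie the user has not rated yet.

    `algorithm` is any fitted object with a Surprise-style
    predict(uid, iid) method whose result has an `est` attribute.
    Returns a list of (movie_id, estimated_rating) pairs.
    """
    already_rated = set(rated_movie_ids)
    return [(movie_id, algorithm.predict(user_id, movie_id).est)
            for movie_id in all_movie_ids
            if movie_id not in already_rated]

# stand-in model for illustration; its estimates are meaningless
FakePrediction = namedtuple("FakePrediction", "est")

class FakeModel:
    def predict(self, uid, iid):
        return FakePrediction(est=iid / 10.0)

demo = pred_user_rating(7010, FakeModel(), all_movie_ids=[10, 20, 30],
                        rated_movie_ids=[20])
print(demo)
```

With the trained model from above, `algorithm` would be passed in place of `FakeModel()`, together with the movie ids taken from `ratings_f`.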

Conclusion

As you saw in this article, there are a handful of methods one could use to build a recommendation system. The data scientist is tasked with finding and fine-tuning the methods that match the data better.
In the next part of this article I will show how the methods and models introduced here can be rearranged and categorised differently to facilitate serving and deployment. We will serve our model as a RESTful API in Flask-RESTful with multiple recommendation endpoints.

References

Ref [1] – IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, June 2005, DOI: 10.1109/TKDE.2005.99.
Ref [2] – Foundations and Trends in Human–Computer Interaction, Vol. 4, No. 2, DOI: 10.1561/1100000009.

Blog author

Sherri Hadian
