Building a content-based movie recommender system

Debadri Sengupta
5 min read · Sep 13, 2022

Why recommend?

Ever visited a brick-and-mortar clothing shop? Noticed how, when you like a particular dress, the salesperson shows you half a dozen others you might prefer? At the back of their mind, they are suggesting products either by picking clothes with similar features or by picking clothes that previous customers liked. Broadly, these are content-based and collaborative filtering methods, respectively.

In our case, the ML system is the salesperson. Whether it is an e-commerce company or a music/book/video streaming one, a highly accurate recommendation system is paramount for achieving high conversion rates (on a personal note, I feel Spotify makes the most ‘personalized’ playlists for you).

In this project, we are interested in building a content-based movie recommendation system and deploying it on the cloud. We are essentially trying to mimic Netflix.

Dataset-

The dataset was obtained from https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv.

The ‘movies’ CSV file contains information purely about the movie itself: its language, genre, budget, plot and so on.

The ‘credits’ datafile contains information about the cast and crew involved in the movie.

As we’ll see, both will be vital in capturing the content of a movie and recommending similar ones.

The following are the attributes in the movies file explained-

budget — budget of the movie in USD

genres — genre(s) the movie belongs to

homepage — official website of the movie

id — id of the movie in the TMDB website

keywords — roughly the most important words in the description of the movie

original_language — language of the movie

original_title — the title under which the movie was originally released (possibly in another language)

overview — description of the movie

popularity — a score to judge the current popularity of the movie. Read more: https://developers.themoviedb.org/3/getting-started/popularity

production_companies — movie houses involved in production

production_countries — country in which the movie was produced

release_date — date of release of the movie (to be parsed as datetime)

revenue — total revenue earned in USD

runtime — duration of the movie in minutes

spoken_languages — languages spoken inside the movie

status — current status of the movie, whether released, in post-production or rumored.

tagline — tagline of the movie

title — name of the movie

vote_average — average rating of the movie by users on the TMDB website.

vote_count — number of users who voted.

Here are the attributes of the credits file-

movie_id — foreign key referencing the id column of the movies file

title — name of the movie

cast — names of the actors along with the characters played (and other metadata)

crew — names of other members involved in production (directors, producers, sound directors etc.)

Both dataframes are merged on the ‘title’ column.
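In pandas, the merge is a one-liner. The miniature dataframes below are stand-ins for the real CSV files:

```python
import pandas as pd

# Miniature stand-ins for the two TMDB CSV files
movies = pd.DataFrame({
    "id": [19995, 285],
    "title": ["Avatar", "Spectre"],
    "genres": ["[...]", "[...]"],
})
credits = pd.DataFrame({
    "movie_id": [19995, 285],
    "title": ["Avatar", "Spectre"],
    "cast": ["[...]", "[...]"],
})

# Merge the two dataframes on the shared 'title' column
movies = movies.merge(credits, on="title")
```

After the merge, each row carries both the movie metadata and its cast/crew information.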

Data pre-processing-

Since the merged dataframe has 23 columns, we need to do some manual feature removal. The following are the features we keep-

genres — highly important in predicting if a user will like it

id — will be later used within the website

keywords, overview — descriptions of the movie, which carry the core content.

title — is the name of the movie

cast — actors of the movie

crew — people involved in production (director etc.)

Numerical features (like revenue, popularity, vote_average etc.) and release_date have been removed as they don’t relate to the ‘content’ of the movie. Taglines are often quirky and don’t provide relevant information about the movie’s content. Removal of the other features is self-explanatory.

Note — release_date can be a discerning factor in judging the content of the movie.
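The selection above boils down to a single pandas indexing operation; the one-row dataframe here is just a stand-in for the merged data:

```python
import pandas as pd

# One-row stand-in for the 23-column merged dataframe
all_cols = ["budget", "id", "title", "genres", "keywords",
            "overview", "cast", "crew", "revenue", "tagline"]
df = pd.DataFrame([{c: "..." for c in all_cols}])

# Keep only the content-bearing columns listed in the article
df = df[["id", "title", "genres", "keywords", "overview", "cast", "crew"]]
```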

After this, missing values (in ‘overview’) were imputed and duplicate rows were checked for.
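A minimal sketch of these two steps; filling missing overviews with an empty string is one common choice and an assumption here, not necessarily the article's exact method:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Avatar", "Spectre", "Spectre", "John Carter"],
    "overview": ["A marine on Pandora", "A cryptic message",
                 "A cryptic message", None],
})

# Fill the few missing overviews with an empty string,
# then drop exact duplicate rows
df["overview"] = df["overview"].fillna("")
df = df.drop_duplicates()
```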

Problem with genres, keywords, cast, crew columns-

Unfortunately, these columns do not store the names in a straightforward fashion, but as stringified lists of dictionaries. We need to pre-process these.

```python
def helper(lst, if_crew=False, if_top=False):
    ans = list()
    if if_crew:
        # For the crew column, keep only the director and the writer
        for dic in eval(lst):  # eval() turns the stringified list of dicts back into Python objects
            if dic['job'] == 'Director' or dic['job'] == 'Writer':
                ans.append(dic['name'])
        return ans
    else:
        for dic in eval(lst):
            ans.append(dic['name'])
        if if_top:
            return ans[0:10]  # for 'cast', return only the top 10 actors
        return ans
```

For the cast, we return only the top 10 actors. For the crew, we return only the director and the writer. The code is a bit involved because it handles all four columns at once.

Also, the overview column was split into a list of words.

Removing space inside words-

‘Sam Worthington’ is a very different person from ‘Sam Mendes’. However, if we don’t remove the space inside the strings, both of them will contain the word ‘Sam’ which will cause erroneous recommendations. Thus ‘Sam Worthington’ must be mapped to ‘SamWorthington’.

The spaces are removed using regex.
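A small sketch of that mapping; the helper name is my own:

```python
import re

def collapse_spaces(names):
    # 'Sam Worthington' -> 'SamWorthington', so each person
    # becomes one unambiguous token
    return [re.sub(r"\s+", "", name) for name in names]

collapse_spaces(["Sam Worthington", "Sam Mendes"])
# -> ['SamWorthington', 'SamMendes']
```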

Creating the ‘tags’ columns-

A tags column is created by combining the genres, keywords, overview, cast and crew columns, with 2x weight given to genres (other ratios were experimented with, but greater weight on ‘crew’ resulted in recommendations of movies directed ONLY by the query director).

Pre-processing the text in tags-

We create a new dataframe with just the id, title and tags columns. The text is then pre-processed: converted to lowercase, stopwords removed, special characters stripped, lemmatized, etc.
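A minimal sketch of the cleaning steps. The stopword list here is a tiny illustrative set, and lemmatization is omitted; a real pipeline would use something like NLTK's stopword list and WordNet lemmatizer:

```python
import re

STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # tiny illustrative set

def clean(text):
    text = text.lower()                       # lowercase
    text = re.sub(r"[^a-z0-9 ]", " ", text)   # strip special characters
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                    # lemmatization omitted in this sketch

clean("A Marine is dispatched to the moon Pandora.")
# -> 'marine dispatched moon pandora'
```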

Modelling and main-function-

The corpus is vectorized using a TF-IDF vectorizer with a maximum of 5,000 features. After this, a similarity matrix is computed using cosine similarity.
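With scikit-learn this is two calls; the three strings below stand in for the real tags column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tags = [
    "action adventure alien marine pandora",   # Avatar-like
    "action adventure pirate sea",             # Pirates-like
    "romance drama ship iceberg",              # Titanic-like
]

vectorizer = TfidfVectorizer(max_features=5000)
matrix = vectorizer.fit_transform(tags)    # shape: (n_movies, n_features)
similarity = cosine_similarity(matrix)     # shape: (n_movies, n_movies)
```

Each entry `similarity[i][j]` is the cosine of the angle between the TF-IDF vectors of movies i and j: 1 on the diagonal, 0 when two movies share no terms.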

We also define a function that takes a query movie name as input and returns the names of the five most similar movies. This function is executed at runtime once the model is deployed.
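A sketch of such a function, using a hand-made similarity matrix for three toy movies in place of the real one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"title": ["Avatar", "Aliens", "Titanic"]})
similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

def recommend(title, k=5):
    # Row of the query movie in the similarity matrix
    idx = df.index[df["title"] == title][0]
    # Sort every movie by similarity to the query, highest first
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    # Skip position 0 (the query itself) and return the next k titles
    return [df["title"][i] for i, _ in scores[1:k + 1]]

recommend("Avatar")  # -> ['Aliens', 'Titanic']
```

At deployment time, the real dataframe and similarity matrix replace the toy ones, and `k=5` yields the five recommendations shown on the website.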

Well, it seems our model is giving pretty decent results :).

As I tread towards joining the Medium Partner Program, a follow from your side would be a leap towards my goal: https://debadri3.medium.com/

That’s all from my side. The model was deployed with a beautiful website built with Streamlit:

https://mimicing-netflix.herokuapp.com/

Take care!
