The universal recommender in Action(ML)

18.4.2021 | 11 minutes reading time

Introduction

Recommender systems have become crucial for many different businesses. E-commerce sites use recommenders to guide their customers to the right products and to ensure they stay on the site. Newspapers or entertainment websites want to keep their users engaged by showing them the right content. Current machine learning techniques for recommender systems vary from collaborative filtering algorithms to methods based on neural networks and reinforcement learning.

In this article a new collaborative filtering technique is presented and applied to a sample dataset. The universal recommender is a recommender system based on co-occurrences between events. It can be run as an “engine” of Harness, a service provided by ActionML that offers different end-to-end machine learning solutions. We will go through its architecture, but our focus lies on the algorithm and its interpretation.

Universal recommender: architecture and main idea

Architecture

The universal recommender is configured as a machine learning engine in the Harness server. Harness provides a REST API for input data and queries. Once the input data is imported to the engine instance a model is trained and the engine becomes “queryable”: queries are sent via the Harness REST API and recommendations (for example: a list of recommended items with scores) are given in return.

For a native installation several dependencies are required (Java, MongoDB, Elasticsearch, etc.). The model is trained using Apache Spark. Harness and all needed dependencies can also be run in Docker containers. For the experiments described below, a docker-compose installation has been used.

Main idea

A usual task of recommender systems is to find personalized recommendations based on user interactions with items like rating (explicit user feedback) or purchase, view, click, etc. (implicit user feedback). In matrix factorization techniques like the alternating least squares algorithm (ALS), the user-item matrix is decomposed into the product of two matrices in a lower-dimensional space. In this way users and items can be represented as vectors which are then used to create recommendations.

The approach used in the universal recommender is different. User interactions with items are still considered but there is no matrix factorization involved: items are scored to give recommendations based on co-occurrences between events. Let’s look at this in more detail:

Assume we have n users that can buy m items on a website. Let h_p be the vector representing a user’s purchase history. We want to recommend new items to the user (i.e. items the user has not purchased yet) based on a “score”:

- First consider the n x m user-item matrix of purchases, P.
  Here the value 1 at position (i,j) means that user i has purchased item j.
- Calculate the m x m matrix of log-likelihood ratios between item purchase vectors:
  [P^t P]
  The brackets [ ] indicate that we are taking the log-likelihood ratio between purchase vectors (row of P^t compared to column of P), a sort of similarity score between items. For more details on log-likelihood ratio similarity see the next section.
- Finally the scores vector is given as the product:
  r = [P^t P] h_p

We can then give recommendations to the user based on the scores in r. More on this can be found in the universal recommender presentation slides.

The main advantage of this approach is that any type of user interaction and information can be ingested: user views, category preferences, but also profile data and contextual information. In other words, the formula above can be extended to other secondary actions as follows:

r = [P^t P] h_p + [P^t V] h_v

In this example V denotes the matrix of user-item views and h_v is the vector of the user’s views.
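To make the scoring formula more tangible, here is a minimal NumPy sketch that computes r = [P^t P] h_p + [P^t V] h_v on a made-up toy dataset. It is only an illustration, not how the universal recommender is implemented (there the model is trained with Apache Spark). The helper names `x_log_x` and `llr_matrix`, the toy matrices and the entropy-based form of the LLR are assumptions chosen for this example.

```python
import numpy as np

def x_log_x(x):
    """Element-wise x * ln(x), using the convention 0 * ln(0) = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x > 0, x, 1.0)  # ln(1) = 0, so zeros contribute nothing
    return x * np.log(safe)

def llr_matrix(A, B):
    """LLR score between every column of A and every column of B.

    A and B are binary (users x items) matrices over the same users.
    Entry (i, j) compares column i of A with column j of B.
    """
    n = A.shape[0]
    a = A.sum(axis=0)[:, None]  # users per column of A, shape (m_a, 1)
    b = B.sum(axis=0)[None, :]  # users per column of B, shape (1, m_b)
    k11 = A.T @ B               # both events occurred
    k12 = a - k11               # only the event from A occurred
    k21 = b - k11               # only the event from B occurred
    k22 = n - a - b + k11       # neither event occurred
    row = x_log_x(n) - x_log_x(a) - x_log_x(n - a)
    col = x_log_x(n) - x_log_x(b) - x_log_x(n - b)
    mat = x_log_x(n) - x_log_x(k11) - x_log_x(k12) - x_log_x(k21) - x_log_x(k22)
    return np.maximum(2.0 * (row + col - mat), 0.0)

# Toy data (made up): 5 users, 4 items
P = np.array([  # purchases
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
])
V = np.array([  # views
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])

user = 4
h_p, h_v = P[user], V[user]

# r = [P^t P] h_p + [P^t V] h_v
r = llr_matrix(P, P) @ h_p + llr_matrix(P, V) @ h_v

# Only items the user has not purchased yet are candidates
r_new = np.where(h_p > 0, -np.inf, r)
ranking = np.argsort(-r_new)  # item indices, best candidate first
print(ranking, r_new)
```

Note that [P^t V] compares purchases of one item with views of another, so the view history contributes to the score of items to purchase. A production engine would also limit the number of correlated items kept per item, which this sketch omits.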

Log-likelihood ratio score

The log-likelihood ratio (LLR) is a similarity score between two events A and B. It does not only depend on the number of times the two events have occurred together (k₁₁) but also on the number of times neither of the two events has occurred (k₂₂) and on the number of times one event has occurred and the other has not (k₁₂ and k₂₁).

LLR will be higher if there is a correlation or anti-correlation between events A and B.
For example consider the following vectors representing three different item purchase histories (this would be three columns of the user-purchase matrix (P) above):

i₁ = (1, 1, 1, 0, 0, 0, 0)

i₂ = (1, 1, 1, 1, 1, 0, 0)

i₃ = (1, 1, 1, 1, 0, 0, 0)

Even though the co-occurrence counts between i₁ and i₂ and between i₁ and i₃ are the same (the number of users who purchased i₁ AND i₂ equals the number of users who purchased i₁ AND i₃), the LLR score between i₁ and i₃ will be higher, because the anti-correlation value, i.e. the number of users who have purchased neither i₁ nor i₃, is higher.
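The comparison above can be checked numerically. The following sketch uses an entropy-based formulation of the LLR (to our knowledge the same form as in Apache Mahout's log-likelihood implementation, but treat this as an assumption). The function names and the convention that k₁₂ counts "first event only" and k₂₁ "second event only" are choices made for this example.

```python
import math

def x_log_x(x):
    # Convention: 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def contingency(a, b):
    """Count the four cases for two binary vectors of equal length."""
    k11 = sum(1 for x, y in zip(a, b) if x and y)          # both
    k12 = sum(1 for x, y in zip(a, b) if x and not y)      # only a
    k21 = sum(1 for x, y in zip(a, b) if not x and y)      # only b
    k22 = sum(1 for x, y in zip(a, b) if not x and not y)  # neither
    return k11, k12, k21, k22

i1 = (1, 1, 1, 0, 0, 0, 0)
i2 = (1, 1, 1, 1, 1, 0, 0)
i3 = (1, 1, 1, 1, 0, 0, 0)

print(contingency(i1, i2), round(llr(*contingency(i1, i2)), 2))  # (3, 0, 2, 2) 2.83
print(contingency(i1, i3), round(llr(*contingency(i1, i3)), 2))  # (3, 0, 1, 3) 5.06
```

Both pairs share k₁₁ = 3, but the pair (i₁, i₃) has the larger k₂₂ and therefore the higher score.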

Application

In this section we present an application of the universal recommender to build recommendations for a multi-category online store. The data is available on Kaggle. Jupyter notebooks with analysis and instructions on how to run the application can be found in the GitHub repository.

Data preparation

For more details on this section see the data preparation notebook.
We first load the data into a pandas dataframe:

import pandas as pd
df = pd.read_csv("../datasets/2019-Nov.csv")

This is what it looks like:

It contains information about the product type (category, brand, price) and the user’s interaction with the site (event type): purchase, view, cart.

For simplicity and to avoid sparsity we restrict the dataframe to the category of smartphones and we take only the top 10,000 users by number of purchases (counted as distinct products purchased).

df_el = df[df["category_code"] == "electronics.smartphone"]
n = 10000
# number of distinct products purchased per user
purch_by_users = df_el[df_el["event_type"] == "purchase"].groupby(
    "user_id"
)["product_id"].nunique().reset_index(name="nr_purch")

# ids of the n users with the most purchases
top_n = list(purch_by_users.sort_values("nr_purch", ascending=False).head(n)["user_id"])

We train three different recommenders based on user-item interactions:

For the first recommender we use purchase-item as the unique main action. The recommendation scores are computed using the formula above.

# for the first recommender we only consider purchases as interaction
df_el_purch = df_el[(df_el["event_type"] == "purchase") & (df_el["user_id"].isin(top_n))]

For the second recommender we add view-item as a secondary action. For the computation of the scores the LLR between purchases and views is added as described in the extended formula.
In the third recommender we add cart-item as a further action.

We prepare a train set for each of the three recommenders from the users' earlier interactions with items. We keep each user's last purchased item in a common test set and use it to compare the recommenders: we count for how many users the purchased item in the test set appears in the list of recommendations.
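The hold-out step could look roughly like the following pandas sketch. It assumes a dataframe of purchase events with the columns `user_id`, `product_id` and a sortable `event_time`; the function name `split_last_purchase` and the toy data are made up for illustration, and the actual notebooks may proceed differently.

```python
import pandas as pd

def split_last_purchase(purchases: pd.DataFrame):
    """Hold out each user's most recent purchase as the test item."""
    ordered = purchases.sort_values(["user_id", "event_time"])
    # True only for the last (most recent) row of each user
    is_last = ~ordered.duplicated(subset="user_id", keep="last")
    return ordered[~is_last], ordered[is_last]

# Toy purchase log (made up)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_id": [10, 11, 12, 10, 13],
    "event_time": pd.to_datetime([
        "2019-11-01 10:00:00",
        "2019-11-02 10:00:00",
        "2019-11-03 10:00:00",
        "2019-11-01 12:00:00",
        "2019-11-05 12:00:00",
    ]),
})

train, test = split_last_purchase(toy)
print(train)
print(test)
```

A user with a single purchase would end up in the test set only, so such users may need to be filtered out beforehand. For the recommenders with secondary actions, the view and cart events would also have to be cut off at the held-out purchase to avoid leaking information into the train set.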

Engine configuration

We need to create an engine for each of the recommenders we want to train. The engine will be specified by a json file containing information like: algorithm type (universal recommender), spark resources, name of Elasticsearch container (where the input events will be sent) and indicators, i.e. the user interactions considered in each recommender.

This is what the engine template for the first recommender looks like:

{
  "engineId": "ecommerce_electronic_purchase",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "4g",
    "spark.executor.memory": "4g",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "elasticsearch",
    "spark.es.nodes.wan.only": "true",
    "spark.kryo.referenceTracking": "false",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryoserializer.buffer": "300m",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
  },
  "algorithm": {
    "indicators": [
      {
        "name": "purchase"
      }
    ],
    "num": 4
  }
}

Purchase is the only interaction specified in the list of indicators. For the other two recommenders we just need to add view (and cart) to the list (see the engine configuration files). More details on how to set up engine templates can be found in the ActionML documentation.
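Since the three engine templates differ only in their id and their list of indicators, they could also be generated from one base configuration. The sketch below reuses the fields of the template above; the helper `make_engine_config` and the engine ids of the second and third recommender are hypothetical names chosen for this example. We also assume that the first indicator in the list is treated as the main action.

```python
import copy
import json

# Base configuration, copied from the template for the first recommender
BASE_CONFIG = {
    "engineId": "ecommerce_electronic_purchase",
    "engineFactory": "com.actionml.engines.ur.UREngine",
    "sparkConf": {
        "master": "local",
        "spark.driver.memory": "4g",
        "spark.executor.memory": "4g",
        "spark.es.index.auto.create": "true",
        "spark.es.nodes": "elasticsearch",
        "spark.es.nodes.wan.only": "true",
        "spark.kryo.referenceTracking": "false",
        "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
        "spark.kryoserializer.buffer": "300m",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    },
    "algorithm": {"indicators": [{"name": "purchase"}], "num": 4},
}

def make_engine_config(engine_id, indicator_names):
    """Return a copy of the base config with a new id and indicator list."""
    config = copy.deepcopy(BASE_CONFIG)
    config["engineId"] = engine_id
    config["algorithm"]["indicators"] = [{"name": name} for name in indicator_names]
    return config

# Hypothetical engine ids for the second and third recommender
configs = [
    make_engine_config("ecommerce_electronic_purchase", ["purchase"]),
    make_engine_config("ecommerce_electronic_purchase_view", ["purchase", "view"]),
    make_engine_config("ecommerce_electronic_purchase_view_cart", ["purchase", "view", "cart"]),
]

for config in configs:
    print(json.dumps(config["algorithm"], indent=2))
```

Each resulting dictionary can be written to its own JSON file with `json.dump` and then used as the engine template.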

Input

We need to send the train data to each model as input: this contains information about the user’s historical interactions with the items. This is what an input for the first recommender looks like:

{
  "event": "purchase",
  "entityType": "user",
  "entityId": "520088904",
  "targetEntityType": "item",
  "targetEntityId": "1003461",
  "properties": {}
}

In the first recommender the event is always purchase, entityType is always user and targetEntityType is always item. entityId and targetEntityId denote the user_id and product_id respectively. Properties can be used to specify item properties (type of category, expiration date) but we do not need this.
Recommenders 2 and 3 are trained using similar inputs, but in their case the event can be purchase, view or cart.
The input data can be sent via the Harness REST API using curl. In this application we create events via the Python SDK using the following function:

import csv
from datetime import datetime
import harness
import argparse
import pytz

def create_event(row, client):
    """Create an input event for the recommender.

    Args:
        row: list of values from one row of the pandas dataframe
        client: harness client
    """
    # Parsed timestamp; only needed if event_time is passed to the client below
    event_time = datetime.strptime(row[1], "%Y-%m-%d %H:%M:%S+00:00").replace(tzinfo=pytz.utc)
    event = row[2]      # event type (purchase, view or cart)
    entity_id = row[8]  # user_id
    target_id = row[3]  # product_id
    client.create(
        event=event,
        entity_type="user",
        entity_id=str(entity_id),
        target_entity_type="item",
        target_entity_id=str(target_id)
        # event_time=event_time
    )
    print("Event: {}, entity_id: {}, target_entity_id: {}".format(event, entity_id, target_id))

See import_ecommerce_data.py.
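Independently of the SDK, the mapping from dataframe rows to the input format shown above can be sketched with plain dictionaries. The function `row_to_event` and the toy dataframe are made up for this example; the sketch only builds the JSON payloads and does not send them to Harness.

```python
import json
import pandas as pd

def row_to_event(row):
    """Map one interaction record to the input format of the recommender."""
    return {
        "event": row["event_type"],
        "entityType": "user",
        "entityId": str(row["user_id"]),
        "targetEntityType": "item",
        "targetEntityId": str(row["product_id"]),
        "properties": {},
    }

# Toy interactions (made up, using ids in the style of the dataset)
interactions = pd.DataFrame({
    "event_type": ["view", "cart", "purchase"],
    "user_id": [520088904, 520088904, 520088904],
    "product_id": [1003461, 1003461, 1003461],
})

events = [row_to_event(record) for record in interactions.to_dict("records")]
print(json.dumps(events[-1], indent=2))
```

Ids are converted to strings because the input format above uses string ids for both users and items.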

Queries

Once the model has been trained we need to retrieve recommendations for our users. Similarly to input events, queries can be sent using curl as follows:

curl -i -X POST "http://harness-address:9090/engines/some_engine_id/queries" \
-H "Content-Type: application/json" \
-d '{
  "user": "520088904"
}'

The curl request above returns recommended items for the specified user. If the user has never been seen by the recommender (not present in the train set), the most popular items are recommended (by default the items with the most events).
Note that the Harness address (with port) and the engine id need to be specified in the curl request.

As we use Python, we send queries using the requests library with the following function:

import json
import os

import requests  # the requests module needs to be installed

def query_for_user(user_id, host_url, engine_id):
    """Send a POST request asking for recommendations."""
    h = {'Content-Type': 'application/json'}
    if user_id:
        d = {'user': user_id}
    else:
        d = {}  # user not specified; returns most popular items
    # note: os.path.join builds the URL with "/" separators on POSIX systems only
    url = os.path.join(host_url, "engines", engine_id, "queries")
    r = requests.post(url, data=json.dumps(d), headers=h)
    return r

See create_recommendation.py.

Communicate with harness server via harness-cli and put everything together

Now that we know how to send input data and queries we need to communicate this information to the Harness server and train our model. For this purpose a few harness-cli commands are needed (the operations described below are summarized in the bash file of this application).

We first create a new engine instance by specifying the path to the JSON file:

harness-cli add ${engine_json}

Now we are ready to send input data to the specified engine-id as we described in the Input section. We do this by running the Python file import_ecommerce_data.py and specifying the engine-id and the file containing the train set data:

data_folder=data
python3 ${data_folder}/python/import_ecommerce_data.py --engine_id ${engine} --input_file ${train_file} --url ${host_url}

Once all the data is imported we can train the recommender using the command:

harness-cli train ${engine}

We can then retrieve the recommendations for the specified users by running the Python file create_recommendation.py as follows:

python3 ${data_folder}/python/create_recommendation.py --engine_id ${engine} --url ${host_url} --input_file ${test_file} --output_file ${results_file}

Here we need to give the engine id, the test file (a csv file containing the unique user ids of our selected users) and an output file where we store the recommendations.

Analyse results

For more details on this section see the data analysis notebook in the GitHub repository.

By running the bash file summarized in the section above for each recommender, we create three different output files containing recommended items.
We load the result files as pandas dataframes:

import os

import pandas as pd

main_path = "../data/"
# results of the recommender with only purchase as action on electronics products
res_p = pd.read_json(os.path.join(main_path, "results/predictions-ecommerce-eletronics-p-10kusers.json"))
# results of the recommender with purchase (main action) and view on electronics products
res_pv = pd.read_json(os.path.join(main_path, "results/predictions-ecommerce-eletronics-pv-10kusers.json"))
# results of the recommender with purchase (main action), view and cart as secondary actions on electronics products
res_pvc = pd.read_json(os.path.join(main_path, "results/predictions-ecommerce-eletronics-pvc-10kusers.json"))

The dataframes contain a list of recommended items with a score for each user_id.

We also load the common test set containing user id and the last item purchased by each user:

test = pd.read_csv(os.path.join(main_path,
    "input_data/2019-Nov-sample-test-eletronics-purch-10kusers.csv"))

After extracting information from the recommended results (result_items containing item ids, result_score containing scores), we merge each result dataframe with the test set above. As a simple comparison we calculate the number of users for which the item in the test set is in the list of recommendations:

As we can see adding view-item and cart-item to the recommender as secondary actions slightly increases the average recall. On the other hand, these secondary actions also create some “noise” as the second and third recommenders “miss” user-items that were correctly recommended by the first one.
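The hit count used for this comparison could be computed along the following lines. This is a sketch under assumptions: `result_items` is taken to hold the list of recommended item ids per user (as extracted from the results), `product_id` the held-out item from the test set, and the function name `hit_rate` and the toy data are made up.

```python
import pandas as pd

def hit_rate(results, test):
    """Count users whose held-out item appears in their recommendation list."""
    merged = results.merge(test, on="user_id", how="inner")
    hits = [
        target in recommended
        for target, recommended in zip(merged["product_id"], merged["result_items"])
    ]
    merged = merged.assign(hit=hits)
    return int(merged["hit"].sum()), float(merged["hit"].mean()), merged

# Toy results and test set (made up)
results = pd.DataFrame({
    "user_id": [1, 2, 3],
    "result_items": [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]],
})
test_set = pd.DataFrame({"user_id": [1, 2, 3], "product_id": [12, 99, 30]})

n_hits, share, merged = hit_rate(results, test_set)
print(n_hits, round(share, 3))  # 2 0.667
```

Item ids must have the same type in both dataframes (for example all strings), otherwise the membership test silently fails. Comparing the `hit` columns of two recommenders user by user shows which hits one recommender gains or loses relative to the other.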

As a further analysis we consider which position the item in the test set has in the list of recommendations (when recommended):

As we can see from the plots above ~38% of the items in the test set that are correctly recommended occupy the first position in the list of recommendations. This percentage slightly decreases for the second and the third recommenders: it seems the correctly recommended items become “less” important in the list of recommendations once we add further user interactions.
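The distribution of positions can be derived in a similar way. In this sketch `result_items` is again assumed to be the ordered list of recommended item ids (best first) and `product_id` the held-out item; the function name `hit_positions` and the toy data are illustrative only.

```python
import pandas as pd

def hit_positions(results, test):
    """Share of correctly recommended items per (1-based) list position."""
    merged = results.merge(test, on="user_id", how="inner")
    positions = [
        recommended.index(target) + 1
        for target, recommended in zip(merged["product_id"], merged["result_items"])
        if target in recommended  # users without a hit are skipped
    ]
    counts = pd.Series(positions, dtype="int64").value_counts(normalize=True)
    return counts.sort_index()

# Toy data (made up): users 1 and 4 have a hit at position 1, user 3 at position 3
results = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "result_items": [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33], [40, 41, 42, 43]],
})
test_set = pd.DataFrame({"user_id": [1, 2, 3, 4], "product_id": [10, 99, 32, 40]})

shares = hit_positions(results, test_set)
print(shares)
```

The shares are computed over hits only, which matches the reading "of the correctly recommended items, how many are in first position".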

Further ideas

Here is a list of ideas for experimenting further with the universal recommender in the case of e-commerce data:

- Instead of view-item consider view-category as a second interaction. An analysis on this is contained in the repository notebooks.
- Try out “item-based” recommendations: instead of outputting the top n recommended items for a user, try to find the items with a similar user behaviour to a given item (“people who purchased item x also purchased items y, z, …”).
- Add business rules: restrict recommendations to certain categories or boost/favour certain types of items by adding a bias factor to the business rule (see business rules for queries).
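As a starting point for the last two ideas, the query payloads might look like the sketch below. The field names `item` and `rules` (with `name`, `values` and `bias`) reflect our reading of the universal recommender query documentation and should be verified there before use. The property name `brand`, its value and the bias are purely hypothetical, and a rule can only work if the items carry that property, which is not the case in the application above.

```python
import json

# Item-based query: "people who purchased this item also purchased ..."
item_query = {"item": "1003461"}

# User-based query with a business rule (assumed query format, hypothetical values).
# As we understand the documentation, a negative bias restricts the results to
# matching items, while a bias greater than 1 boosts matching items.
rule_query = {
    "user": "520088904",
    "rules": [
        {"name": "brand", "values": ["some_brand"], "bias": -1},
    ],
}

for query in (item_query, rule_query):
    print(json.dumps(query, indent=2))
```

Either payload would be sent to the same queries endpoint as the user-based queries shown earlier.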

Blog author

Francesca Diana
