Data Science for Fraud Detection

5.9.2017 | 10 minutes reading time

What is fraud and why is it interesting for Data Science?

Fraud can be defined as “the crime of getting money by deceiving people” (Cambridge Dictionary); it is as old as humanity: whenever two parties exchange goods or conduct business, there is the potential for one party scamming the other.
With an ever-increasing use of the internet for shopping, banking, filing insurance claims etc., these businesses have become targets of fraud in a whole new dimension. Fraud has become a major problem in e-commerce and a lot of resources are being invested to recognize and prevent it.

Traditional approaches to identifying fraud have been rule-based. This means that hard and fast rules for flagging a transaction as fraudulent have to be established manually and in advance. But this system isn’t flexible and inevitably results in an arms race between the seller’s fraud detection system and criminals finding ways to circumvent these rules.
The modern alternative is to leverage the large volumes of data that can be collected from online transactions and model them in a way that allows us to flag or predict fraud in future transactions. For this, Data Science and Machine Learning techniques such as Deep Neural Networks (DNNs) are a natural fit.

Here, I am going to show an example of how Data Science techniques can be used to identify fraud in financial transactions. I will offer some insights into the inner workings of fraud analysis, written so that non-experts can follow along.

Synthetic financial datasets for fraud detection

A synthetic financial dataset for fraud detection is openly accessible via Kaggle. It has been generated from a number of real datasets to resemble standard data from financial operations and contains 6,362,620 transactions over 30 days (see Kaggle for details and more information).

By plotting a few major features, we can already get a sense of the data. The two plots below, for example, show us that fraudulent transactions tend to involve larger sums of money. When we also include the transaction type in the visualization, we find that fraud only occurs with transfers and cash-out transactions, and we can adapt our input features for machine learning accordingly.

Fraudulent transactions tend to involve larger sums of money. This plot shows the distribution of transferred amounts of money (log + 1) in fraudulent (Class = 1) and regular (Class = 0) transactions.

Fraud only occurs with transfers and cash-out transactions. This plot shows the distribution of transferred amounts of money (log + 1) in different transaction types for fraudulent (Class = 1) and regular (Class = 0) transactions.
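The "(log + 1)" scaling in the captions above can be sketched in a few lines. This is a minimal illustration that reads the caption as log(amount + 1); that reading, the base-10 logarithm and the example amounts are all assumptions and are not taken from the Kaggle dataset.

```python
import math

# Hypothetical transaction amounts; not taken from the Kaggle dataset.
amounts = [0.0, 9.0, 99.0, 9999.0, 999999.0]


def log_amount(amount: float) -> float:
    """Return log10(amount + 1); the +1 keeps zero amounts defined."""
    return math.log10(amount + 1.0)


# Amounts spanning six orders of magnitude end up on a compact scale.
log_amounts = [log_amount(a) for a in amounts]
```

Compressing the scale like this makes it possible to compare the bulk of small transactions and the few very large ones in a single plot.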

Dimensionality reduction

In preparation for machine learning analysis, dimensionality reduction techniques are powerful tools for identifying hidden patterns in high-dimensional datasets. In addition, we can use them to reduce the number of features for machine learning while preserving the most important patterns of the data. Similar approaches use clustering algorithms, like k-means clustering.

The most common dimensionality reduction technique is Principal Component Analysis (PCA). PCA is good at picking up linear relationships between features in the data. The first dimension, also called the first principal component (PC), captures the largest share of variation in our data, the second PC captures the second-largest share and so on. When we plot the first two dimensions against each other in a scatterplot, we see patterns in our data: the more dissimilar two samples in our dataset, the farther apart they will be in a PCA plot. PCA will not be able to deal with more complex patterns, though.

For non-linear patterns, we can use t-Distributed Stochastic Neighbor Embedding (t-SNE). In contrast to PCA, t-SNE will not only show sample dissimilarity, it will also account for similarity by clustering similar samples close together in a plot. This might not sound like a major difference, but when we look at the plots below, we can see that it is much easier to identify clusters of fraudulent transactions with t-SNE than with PCA. The output of both PCA and t-SNE can also serve as input for machine learning.
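To make the idea of a first principal component more concrete, here is a minimal sketch in plain Python. It approximates the direction of largest variance with power iteration on the covariance matrix, which is one simple way to obtain the first PC and not necessarily how a PCA library computes it. The function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Approximate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each feature so variation is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeatedly multiply a start vector by the covariance matrix and
    # renormalize; it converges to the direction of largest variance.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered sample onto that direction (PC1 scores).
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data: both features rise together, so PC1 lies near the diagonal.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
direction, pc1_scores = first_principal_component(data)
```

Samples that differ strongly along the main direction of variation receive PC1 scores that are far apart, which is what produces the separation seen in a PCA scatterplot.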

Here, I want to use dimensionality reduction and visualization to perform a sanity check on the labelled training data. Because we can assume that some fraud cases might not have been identified as such (and are therefore mislabelled), we could now recommend taking a closer look at non-fraud samples that cluster with fraud cases.

Dimensionality reduction techniques in fraud analytics. The plots show the first two dimensions of PCA (left) and t-SNE (right) for fraudulent (Class = 1) and regular (Class = 0) transactions.

Which Machine Learning algorithms are suitable for fraud analysis?

Machine learning is a broad field. It encompasses a large collection of algorithms and techniques that are used in classification, regression, clustering or anomaly detection. Two main classes of algorithms, for supervised and unsupervised learning, can be distinguished.

Supervised learning is used to predict either the values of a response variable (regression tasks) or the labels of a set of pre-defined categories (classification tasks). Supervised learning algorithms learn how to predict unknown samples based on the data of samples with known response variables/labels.

In our fraud detection example, we are technically dealing with a classification task: For each sample (i.e. transaction), the pre-defined label tells us whether it is fraudulent (1) or not (0). However, there are two main problems when using supervised learning algorithms for fraud detection:

Data labelling: In many cases, fraud is difficult to identify. Some cases will be glaringly obvious – these are easy to recognize with rule-based techniques and usually won’t require complex models. Where it becomes interesting are the subtle cases; they are hard to recognize as we don’t usually know what to look for. Here, the power of machine learning comes into play! But because fraud is hard to detect, training data sets from past transactions are probably not classified correctly in many of these subtle cases. This means that the pre-defined labels will be wrong for some of the transactions. If this is the case, supervised machine learning algorithms won’t be able to learn to find these types of fraud in future transactions.
Unbalanced data: An important characteristic of fraud data is that it is highly unbalanced. This means that one class is much more frequent than the other; in our example, less than 1% of all transactions are fraudulent (see figure “Synthetic financial dataset for fraud detection”). Most supervised machine learning classification algorithms are sensitive to imbalance in the response classes, and special techniques have to be used to account for it.
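One simple technique for dealing with unbalanced classes is random oversampling of the rare class. The sketch below illustrates the idea on made-up data; the function name and the toy class sizes are hypothetical, and it is not meant to reproduce what any particular library option (such as the balance_classes setting mentioned later) does internally.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is padded up to the size of the largest class.
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 995 regular (0) and 5 fraudulent (1) transactions.
labels = [0] * 995 + [1] * 5
rows = [[float(i)] for i in range(len(labels))]
fraud_share = sum(labels) / len(labels)  # 0.5% of all rows are fraud
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

Oversampling should only ever be applied to training data; the test set has to keep its original class ratio so that the evaluation reflects real conditions.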

Synthetic financial dataset for fraud detection. Fraud cases are rare compared to regular transactions; in the simulated example dataset less than 1% of all transactions are fraudulent.

Unsupervised learning doesn’t require pre-defined labels or response variables; it is used to identify clusters or outliers/anomalies in data sets.

In our fraud example data set we don’t trust the pre-defined labels to be 100% correct. But we can assume that fraudulent transactions will be sufficiently different from the vast majority of regular transactions, so that unsupervised learning algorithms will flag them as anomalies or outliers.

Anomaly detection with deep learning autoencoders

Neural networks are applied to supervised and unsupervised learning tasks. Autoencoder neural networks are used for anomaly detection in unsupervised learning; they apply backpropagation to learn an approximation to the identity function, where the output values are equal to the input. They do so by minimizing the reconstruction error or loss. Because the reconstruction error is minimized according to the background signal of regular samples, anomalous samples will have a larger reconstruction error.

For modeling, I am using the open-source machine learning software H2O via the “h2o” R package. On the fraud example data set described above, an unsupervised neural network was trained using deep learning autoencoders (Gaussian distribution, quadratic loss, 209 weights/biases, 42,091,943 training samples, mini-batch size 1, 3 hidden layers with [10, 2, 10] nodes). The training set contains only non-fraud samples, so that the autoencoder model will learn the “normal” pattern in the data; test data contains a mix of non-fraud and fraud samples. We need to keep in mind, though, that autoencoder models are sensitive to outliers in the training data, which can distort the otherwise typical patterns they learn. This trained autoencoder model can now identify anomalies or outlier instances based on the reconstruction mean squared error (MSE): transactions with a high MSE are outliers compared to the global pattern of our data. The figure below shows that the majority of test cases that had been labelled as fraudulent indeed have a higher MSE. We can also see that a few regular cases have a slightly higher MSE; these might contain cases of novel fraud mechanisms that have been missed in previous analyses.

Anomalies based on reconstruction mean squared errors (MSE).

This plot shows reconstruction MSE (y-axis) for every transaction (instance) in the test data set (x-axis); points are colored according to their pre-defined label (fraud = 1, regular = 0).
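The way a reconstruction MSE turns into an anomaly flag can be sketched as follows. This is a minimal illustration that takes the autoencoder's outputs as given; the input rows, the reconstructions and the threshold of 0.02 are hypothetical values and do not come from the model described above.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared error between one input row and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)


def flag_anomalies(inputs, reconstructions, threshold):
    """Return (mse, is_anomaly) per row; rows above the threshold are flagged."""
    results = []
    for x, x_hat in zip(inputs, reconstructions):
        mse = reconstruction_mse(x, x_hat)
        results.append((mse, mse > threshold))
    return results


# Hypothetical, already scaled rows and the outputs of some trained autoencoder.
inputs = [[0.10, 0.20, 0.30], [0.90, 0.10, 0.80]]
reconstructions = [[0.11, 0.19, 0.30], [0.40, 0.30, 0.20]]
results = flag_anomalies(inputs, reconstructions, threshold=0.02)
```

The first row is reconstructed almost perfectly and stays below the threshold, while the second row is reconstructed poorly and is flagged. In practice the threshold is a judgment call, for example a high quantile of the MSE observed on regular training data.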

Pre-training supervised models with autoencoders

Autoencoder models can also be used for pre-training supervised learning models. On an independent training sample, another deep neural network was trained – this time for classification of the response variable “Class” (fraud = 1, regular = 0) using the weights from the autoencoder model for model fitting (2-class classification, Bernoulli distribution, CrossEntropy loss, 154 weights/biases, 111,836,076 training samples, mini-batch size 1, balance_classes = TRUE).
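The reported numbers of weights and biases can be checked with a little arithmetic. The sketch below counts the parameters of a fully connected network. The input size of 7 features is an assumption inferred from the reported totals and is not stated in the text, as is the assumption that the classifier keeps the [10, 2, 10] hidden layers and ends in 2 output units.

```python
def dense_parameter_count(layer_sizes):
    """Count weights plus biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


N_FEATURES = 7  # assumption: inferred from the reported counts, not stated in the text

# Autoencoder: the output layer has the same size as the input layer.
autoencoder_params = dense_parameter_count([N_FEATURES, 10, 2, 10, N_FEATURES])
# Classifier: same hidden layers (assumed), two output units for the two classes.
classifier_params = dense_parameter_count([N_FEATURES, 10, 2, 10, 2])
```

Under these assumptions the counts come out as 209 for the autoencoder and 154 for the classifier, which matches the figures quoted for the two models.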

Model performance is evaluated on the same test set that was used for showing the MSE of the autoencoder model above. The plot below shows the predicted versus actual class labels. Because we are dealing with severely unbalanced data, we need to evaluate our model based on the rare class of interest, here fraud (class 1). If we looked at overall model accuracy, a model that never identifies instances as fraud would still achieve an accuracy of more than 99%. Such a model would not serve our purpose. We are therefore interested in the evaluation parameters “sensitivity” and “precision”: We want to optimize our model so that a high percentage of all fraud cases in the test set is predicted as fraud (sensitivity), and simultaneously a high percentage of all fraud predictions is correct (precision).
An optimal outcome from training a supervised neural network for binary classification is shown in the plot below.

Results from training a supervised neural network for binary classification. The plot shows the percentage of correctly classified transactions by comparing actual class labels (x-axis) with predicted labels (color; fraud = 1, regular = 0).
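The difference between accuracy, sensitivity and precision on unbalanced data can be illustrated with a small calculation. The labels and predictions below are made up for illustration and are not the results of the model described above.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def evaluate(actual, predicted):
    """Compute accuracy, sensitivity and precision for the fraud class (1)."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        # Share of all actual fraud cases that were predicted as fraud.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all fraud predictions that were actually fraud.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical test set: 995 regular and 5 fraudulent transactions.
actual = [0] * 995 + [1] * 5
never_fraud = [0] * 1000                           # never predicts fraud
some_model = [1] * 2 + [0] * 993 + [1] * 4 + [0]   # 2 false alarms, 4 of 5 frauds found

baseline = evaluate(actual, never_fraud)
model = evaluate(actual, some_model)
```

The model that never predicts fraud reaches 99.5% accuracy with a sensitivity of zero, while the second model finds four of the five fraud cases (sensitivity 0.8) and two thirds of its fraud predictions are correct.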

Understanding and trusting machine learning models

Decisions made by machine learning models are inherently difficult – if not impossible – for us to understand. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well. But it also basically makes them a black box. This can be problematic, because executives will be less inclined to trust and act on a decision they don’t understand.

Local Interpretable Model-Agnostic Explanations (LIME) is an attempt to make these complex models at least partly understandable. With LIME, we are able to explain in more concrete terms why, for example, a transaction that was labelled as regular might have been classified as fraudulent. The method has been published in “Why Should I Trust You? Explaining the Predictions of Any Classifier” by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin from the University of Washington in Seattle. It makes use of the fact that linear models are easy to explain; LIME approximates a complex model function by locally fitting linear models to permutations of the original training set. On each permutation, a linear model is fit and weights are given so that positive weights support a decision and negative weights contradict it. In sum, this will give an approximation of how much and in which way each feature contributed to a decision made by the model.
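The core idea of a local linear approximation can be sketched in plain Python. This is a strongly simplified illustration and not the implementation of the lime package: it perturbs one instance with Gaussian noise, weights the perturbed samples by their proximity to the instance and fits a weighted linear regression. The black-box function, the kernel and all parameter values are hypothetical.

```python
import math
import random


def black_box(x):
    """Stand-in for a complex classifier: returns a fraud score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 2.0 * x[1] + x[0] * x[1])))


def solve(a, b):
    """Solve the linear system a * w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def explain_locally(model, instance, n_samples=500, scale=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear model around one instance."""
    rng = random.Random(seed)
    d = len(instance)
    xtx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xty = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Perturb the instance and query the black box at the new point.
        z = [v + rng.gauss(0.0, scale) for v in instance]
        dist2 = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        weight = math.exp(-dist2 / kernel_width ** 2)  # closer samples count more
        row = [1.0] + z  # the leading 1.0 is the intercept term
        y = model(z)
        # Accumulate the weighted normal equations (X'WX) w = X'Wy.
        for i in range(d + 1):
            xty[i] += weight * row[i] * y
            for j in range(d + 1):
                xtx[i][j] += weight * row[i] * row[j]
    coefs = solve(xtx, xty)
    return coefs[0], coefs[1:]


intercept, feature_weights = explain_locally(black_box, [0.5, 0.5])
```

For this instance the first feature receives a positive weight and the second a negative one: locally, a higher value of the first feature supports the "fraud" decision and a higher value of the second contradicts it.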

Code

A full example with code for training autoencoders and for using LIME can be found on my personal blog:

Blog author

Shirin Elsinghorst

Topic Lead Data & AI Strategy Consulting
