Simple Fraud Detection with PyMC

26.1.2023 | 7 minutes reading time

In one of my last projects, we were facing a prediction problem with very limited data. Each set of data took a specialist hours to compile, and results were not always successful. Therefore, we were looking for a tool to handle these requirements, as artificial intelligence could not be trained with the limited amount of raw data. Thus, we turned to statistical approaches, namely Bayesian statistics with the Python package PyMC. I will explain the theory it is based on and describe it with the example of fraud detection on dice.

Bayesian statistics

Bayesian statistics is a branch of statistics that utilises Bayes' theorem to update our beliefs about the probability of a hypothesis as new data becomes available. Bayes' theorem states that the probability of a hypothesis H given some data D, written p(H|D), is equal to the probability of the data given the hypothesis, p(D|H), multiplied by the prior probability of the hypothesis, p(H), and divided by the total probability of the data, p(D). This allows us to combine what we already believe about a hypothesis with each new piece of data, rather than relying solely on the data at hand.
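To make the formula concrete, here is a minimal sketch of a single Bayesian update in plain Python. The two hypotheses, the 50/50 prior and the likelihood values are assumptions chosen for illustration only; they anticipate the dice example later in this post.

```python
# Bayes' theorem for two competing hypotheses about a die.
# All numbers below are illustrative assumptions.
prior = {"fair": 0.5, "tampered": 0.5}                 # p(H)
likelihood_six = {"fair": 1 / 6, "tampered": 7 / 12}   # p(D|H) for D = "rolled a 6"

# p(D): total probability of the data, summed over all hypotheses
evidence = sum(likelihood_six[h] * prior[h] for h in prior)

# p(H|D) = p(D|H) * p(H) / p(D)
posterior = {h: likelihood_six[h] * prior[h] / evidence for h in prior}

print(posterior)
```

After a single 6, the belief in the tampered hypothesis rises from 1/2 to 7/9. Feeding the posterior back in as the next prior repeats the update for every further roll.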

One of the most popular Python packages for implementing Bayesian statistics is PyMC. PyMC is a powerful package that allows users to easily define, fit, and analyse Bayesian models. It includes a variety of built-in distributions, such as the normal, binomial, and Poisson distributions, as well as a variety of samplers, such as Metropolis-Hastings and the No-U-Turn Sampler (NUTS). PyMC also includes a variety of convenient tools for diagnosing and visualising the results of your model.

An important thing to keep in mind when using PyMC is that you should think about the problem in terms of probability distributions and not in terms of point estimates. This can take some getting used to, but it is essential for accurately modelling complex systems. Every result you obtain is a distribution rather than a single number. Credible intervals, the Bayesian counterpart to confidence intervals, are therefore a crucial part of any result from PyMC. This is an advantage over standard neural networks, which usually return a point prediction without a measure of uncertainty.
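As a rough illustration of what such an interval looks like, the following sketch uses only the Python standard library instead of PyMC. It assumes 7 sixes in 12 rolls and a uniform Beta(1, 1) prior on the probability of a six, which gives a Beta(8, 6) posterior; the 94 % level is an arbitrary choice.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumed data: 7 sixes in 12 rolls, with a uniform Beta(1, 1) prior on the
# probability of rolling a six. The posterior is then Beta(1 + 7, 1 + 5).
samples = sorted(random.betavariate(8, 6) for _ in range(10_000))

# Central 94 % credible interval: cut off 3 % of the samples on each side
lower = samples[int(0.03 * len(samples))]
upper = samples[int(0.97 * len(samples)) - 1]
mean = sum(samples) / len(samples)

print(f"posterior mean {mean:.2f}, 94% interval [{lower:.2f}, {upper:.2f}]")
```

The interval is the result, not just the mean: in this assumed setting the fair-die value of 1/6 lies below the lower bound, which says more than a single estimate could.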

Bayesian statistics uses the terms prior probability and posterior probability. The prior probability is the probability distribution before any data has been seen. Basically, it encodes our initial guess about the plausible range of every model parameter. It is written as p(H). The posterior probability, however, is the distribution that takes the data into account, i.e., p(H|D), which is proportional to p(D|H) times p(H).

Rolling a tampered dice

As an example, we use PyMC to model dice rolls. We want to find out whether a die has been tampered with or is good to use. We model the probabilities of the six faces (1-6) with a Dirichlet distribution whose concentration parameters are all equal, so that no face is favoured before any data is seen. We also assume that the observed data, i.e., the outcomes of the dice rolls, follow a categorical distribution with a parameter p, the inferred probability of each face. Without any data, p is described by the prior probability. We then perform inference on the model using Markov Chain Monte Carlo (MCMC) sampling and use the samples to infer the probabilities of each face, i.e., the posterior probability. The tampered die yields a 6 in 7 out of 12 rolls on average, so the tampering is really obvious.

Here is some sample code:

import numpy as np
import pymc as pm
from matplotlib import pyplot as plt

# Defining the dice and preparing data
tampered_dice = [1 / 12, 1 / 12, 1 / 12, 1 / 12, 1 / 12, 7 / 12]
good_dice = [1 / 6] * 6
num_rolls = 10000
# one histogram bin per die face (1-6)
face_bins = np.arange(1, 8) - 0.5

for idx, dice in enumerate([tampered_dice, good_dice]):
    title = "Tampered dice" if idx == 0 else "Good dice"

    # generate the dice rolls data (faces are encoded as 0-5)
    true_p = np.array(dice)
    dice_rolls = np.random.choice(6, size=num_rolls, p=true_p)

    # specify the dice model
    with pm.Model() as dice_model:
        # prior: no face is favoured before any data is seen
        p = pm.Dirichlet("p", a=np.ones(6))

        # specify the likelihood
        face = pm.Categorical("face", p=p, observed=dice_rolls)

        # perform inference using the data
        trace = pm.sample(draws=100, tune=100, chains=2)

        # sampling data from before and after data is available
        prior_predictive = pm.sample_prior_predictive()
        post_pred = pm.sample_posterior_predictive(trace)

    # presenting prior predictions (simulated rolls, not the observed data)
    prior_faces = prior_predictive.prior_predictive["face"].values.ravel() + 1
    fig, ax = plt.subplots()
    ax.hist(prior_faces, bins=face_bins, density=True)
    ax.set_xlabel("Die face")
    ax.set_ylabel("Relative frequency")
    ax.set_title(title)
    fig.savefig(f"prior_predictive_{idx}.png", dpi=100)
    plt.show()

    # presenting posterior predictions
    trace.extend(post_pred)

    post_faces = trace.posterior_predictive["face"].values.ravel() + 1
    fig, ax = plt.subplots()
    ax.hist(post_faces, bins=face_bins, density=True)
    ax.set_xlabel("Die face")
    ax.set_ylabel("Relative frequency")
    ax.set_title(title)
    fig.savefig(f"posterior_predictive_{idx}.png", dpi=100)
    plt.show()

    # calculate the expected probabilities of a fair dice
    expected_probs = np.ones(6) / 6

    # absolute difference between each posterior sample and the expected probabilities
    # (array shape: chain x draw x face)
    prob_diff = np.abs(trace.posterior["p"].values - expected_probs)

    # mean and standard deviation of the difference per face, over all chains and draws
    mean_diff = prob_diff.mean(axis=(0, 1))
    std_diff = prob_diff.std(axis=(0, 1))

    # set a threshold for the difference
    threshold = 0.05

    # check if the difference between the inferred probabilities and the expected probabilities is above the threshold
    tampered = mean_diff > threshold

    if tampered.any():
        print("Dice may have been tampered with.")
    else:
        print("Dice does not seem to have been tampered with.")

When run, the program checks whether the inferred probability of each face stays within 0.05 (5 percentage points) of the ideal value of 1/6. Depending on the threshold you choose, even your regular at-home die could be seen as tampered with by this code, because face 1 is heavier than face 6, making 6 appear slightly more often than 1, on average. For the posterior probability, I have chosen 100 draws per chain. The draws are the number of samples the program takes from the posterior distribution of the model parameters. It is advisable to use larger values, e.g., 1000, to have enough granularity in the probability distribution.
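For this particular model there is also a quick way to cross-check the sampler, because the Dirichlet prior is conjugate to the categorical likelihood: the posterior is again a Dirichlet distribution whose parameters are the prior parameters plus the observed counts. The sketch below uses only the standard library. The helper names `posterior_mean` and `looks_tampered` are hypothetical, and the rule compares the posterior mean with 1/6, which is a simplification of the mean absolute deviation used in the listing above.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible


def posterior_mean(rolls, prior_a=1.0):
    """Posterior mean of each face probability under a Dirichlet(prior_a, ...) prior."""
    counts = Counter(rolls)
    total = len(rolls) + 6 * prior_a
    return [(counts[face] + prior_a) / total for face in range(6)]


def looks_tampered(rolls, threshold=0.05):
    """Flag the die if any posterior mean deviates from 1/6 by more than the threshold."""
    return any(abs(p - 1 / 6) > threshold for p in posterior_mean(rolls))


faces = list(range(6))  # faces encoded as 0-5, as in the listing above
tampered_rolls = random.choices(faces, weights=[1, 1, 1, 1, 1, 7], k=10_000)
good_rolls = random.choices(faces, weights=[1] * 6, k=10_000)

print(looks_tampered(tampered_rolls))
print(looks_tampered(good_rolls))
```

With 10,000 rolls the closed-form posterior means should land very close to what the MCMC samples give, so a large disagreement between the two would hint at a problem in the sampling setup. For models without such a closed form, sampling with PyMC remains the practical route.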

Why PyMC?

Now what are the advantages and disadvantages of PyMC? PyMC is a powerful package that allows users to easily define, fit, and analyse models. It provides a variety of built-in distributions and samplers suitable for a wide range of models, and it is flexible enough to handle complex ones. However, PyMC is a probabilistic programming library, which can take some getting used to: you need to understand the concepts of probability distributions and Bayesian statistics. PyMC can also be computationally expensive, especially for large and complex models, and may not be suitable for real-time applications or large datasets, not least because it is driven from Python. Bayesian methods are a powerful tool for data analysis, and PyMC makes it easy to implement these methods in Python. In my experience, however, PyMC is more tedious to work with and not as well documented as TensorFlow, but then again, TensorFlow is the current standard.

Alternatives to PyMC with different requirements and advantages are:

Stan: Written in C++ and thus faster than PyMC for large and complex models.
JAGS: Written in C++
Edward: Built on top of TensorFlow, which allows for the use of deep learning models. Useful for Bayesian deep learning problems.

Relevant applications are found in probabilistic forecasting, a type of forecasting that provides a range of possible outcomes rather than a single point estimate. Some examples include weather forecasting, financial forecasting and, most relevant currently in Europe, energy forecasting.

Sometimes, neural networks and deep learning yield results that appear almost magical. The most prominent examples right now are ChatGPT and Midjourney. How they achieve their results is only partially understood. Bayesian statistics requires us to think about the model that underlies the observed data and thus to understand the problem at hand more deeply. While not as powerful as artificial intelligence, probabilistic approaches can help us prepare data and set requirements for AI. A better understanding of the problem is often a key benefit.

Conclusion

PyMC is an interesting tool to have in your toolbox. It uses Bayesian statistics for powerful data analysis, giving the data scientist or engineer the capability to fine-tune their model and find the best representation of the observed data. PyMC may not be as convenient and powerful as neural networks or deep learning approaches in artificial intelligence, but it helps humans to understand more about the problem at hand. Furthermore, it gives a range of results with different probabilities rather than a single point estimate.

Blog author

Robert Meißner
