Hit me baby one more time – What are cache hits and why should you care?

6.12.2019 | 11 minutes reading time

Motivation

When reasoning about algorithm performance we often look at complexity. Especially when comparing different algorithms, looking at asymptotic complexity (e.g. the big-O notation) is useful. We have to keep in mind, however, that the big-O notation “swallows” everything but the fastest-growing term: constant factors and lower-order terms disappear. A prominent example where the big-O notation can be misleading is finding a value in a collection. Hash maps are the default candidate for this use case, as the access to one particular element requires constant time. If you have only a few elements, however, using a tree or even a simple list might be faster.

While you should look out for cases like this, there is another huge factor that influences performance: the computer hardware / architecture. Your CPU might be fast but it has to wait for I/O to finish. Your distributed algorithm might be powerful but your network topology does not suffice. In this blog post we are going to look at one particular part of the computer architecture which might impact algorithm performance by orders of magnitude: the memory hierarchy – and the CPU cache in particular.

The post is structured as follows. First we are going to look at a very popular example where the runtime performance is heavily influenced by the CPU cache utilization. The second section will give some theoretical background about computer architecture, allowing the reader to understand what is happening in the example. Afterwards we are going to revisit the example from the first section with the newly acquired knowledge. We are closing the blog post by summarizing the main ideas.

The matrix multiplication example

The formula

The example we want to use is a very basic one: matrix multiplication. Matrix multiplication is used in a lot of places, e.g. image processing, AI, robotics, data compression. For multiplying the matrices A * B = C, where n is the number of columns of A (and of rows of B), the formula is as follows:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

For each element of the result matrix, it goes row-wise through A and column-wise through B, multiplying each pair of elements, adding up the results. There is also a nice visual representation of the formula:
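
To make the formula concrete, here is a small 2 × 2 instance worked by hand. The numbers are chosen purely for illustration and do not come from the experiments below.

```latex
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix}
1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\
3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8
\end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
```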

Now let’s look at two different variations of this algorithm and measure the execution time for different matrix sizes. For simplicity reasons we are using two square matrices. They are filled with randomly generated double values.
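
The input described here could be generated along the following lines. This is only a sketch: the helper name `randomMatrix` and the optional `rng` parameter are assumptions made for illustration, not necessarily what the benchmark code uses.

```scala
import scala.util.Random

// Hypothetical helper: builds a size x size matrix of random doubles in [0, 1).
// Array.fill evaluates its element expression once per cell.
def randomMatrix(size: Int, rng: Random = new Random()): Array[Array[Double]] =
  Array.fill(size, size)(rng.nextDouble())

val a = randomMatrix(4)
val b = randomMatrix(4)
```

Passing a seeded `Random` (e.g. `new Random(42)`) makes the input reproducible between runs.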

We are using ScalaMeter for measuring the runtime performance. Feel free to check out my previous blog post which contains more details about the tool.
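
If you just want to get a rough feeling for the numbers without setting up ScalaMeter, a crude stand-in based on `System.nanoTime` might look like the sketch below. It is explicitly not ScalaMeter and not how the results in this post were measured; the name `timeMillis` and its parameters are made up for this example.

```scala
// Hypothetical helper: runs `body` a few times to warm up the JIT compiler,
// then returns the average wall-clock time per run in milliseconds.
def timeMillis[A](warmups: Int, runs: Int)(body: => A): Double = {
  var i = 0
  while (i < warmups) { body; i += 1 } // results are discarded on purpose
  val start = System.nanoTime()
  i = 0
  while (i < runs) { body; i += 1 }
  (System.nanoTime() - start) / 1e6 / runs
}

// Usage: time an arbitrary block of code.
val avgMs = timeMillis(warmups = 5, runs = 10) {
  (1 to 100000).map(_.toDouble).sum
}
```

A dedicated tool handles warm-up detection, garbage collection and outliers far more carefully, so treat such hand-rolled timings as indicative only.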

The experiments

Algorithm 1

The first algorithm is a naive implementation of the formula above. For two square matrices of size n it has to perform n³ multiplications and additions, thus having O(n³) complexity.

def mult1(m1: Array[Array[Double]],
          m2: Array[Array[Double]],
          size: Int): Array[Array[Double]] = {
  // Result matrix, initialized with zeros
  val res = Array.fill(size)(new Array[Double](size))
  var i = 0
  while (i < size) {
    var j = 0
    while (j < size) {
      var k = 0
      while (k < size) {
        // Row i of m1 times column j of m2
        res(i)(j) += m1(i)(k) * m2(k)(j)
        k += 1
      }
      j += 1
    }
    i += 1
  }
  res
}

Algorithm 2

The second algorithm transposes the second matrix before applying a slight variation of the formula, which has the indices of the second matrix within the multiplication inverted. For two square matrices of size n it first has to perform n² copy operations and then again n³ multiplications and additions. This also leads to O(n³) complexity, but with more overhead.

def mult2(m1: Array[Array[Double]],
          m2: Array[Array[Double]],
          size: Int): Array[Array[Double]] = {
  // Transpose m2 first: m2t(x)(y) == m2(y)(x)
  val m2t = Array.fill(size)(new Array[Double](size))
  var x = 0
  while (x < size) {
    var y = 0
    while (y < size) {
      m2t(x)(y) = m2(y)(x)
      y += 1
    }
    x += 1
  }

  // Result matrix, initialized with zeros
  val res = Array.fill(size)(new Array[Double](size))
  var i = 0
  while (i < size) {
    var j = 0
    while (j < size) {
      var k = 0
      while (k < size) {
        // Row i of m1 times row j of the transposed matrix
        res(i)(j) += m1(i)(k) * m2t(j)(k)
        k += 1
      }
      j += 1
    }
    i += 1
  }
  res
}
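
To convince yourself that swapping the indices on a transposed matrix does not change the result, the two variants can be restated compactly and compared on a tiny input. The sketch below is a functional restatement for checking purposes only. The names `multNaive` and `multTransposed` are made up, and the code is not tuned for speed and not the code that was benchmarked.

```scala
// Naive variant: row i of a times column j of b.
def multNaive(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
  Array.tabulate(a.length, b(0).length) { (i, j) =>
    a(i).indices.map(k => a(i)(k) * b(k)(j)).sum
  }

// Transposed variant: row i of a times row j of the transposed b.
def multTransposed(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
  val bt = b.transpose
  Array.tabulate(a.length, bt.length) { (i, j) =>
    a(i).indices.map(k => a(i)(k) * bt(j)(k)).sum
  }
}

val a = Array(Array(1.0, 2.0), Array(3.0, 4.0))
val b = Array(Array(5.0, 6.0), Array(7.0, 8.0))

// Arrays compare by reference, so convert to Seq before comparing contents.
val same = multNaive(a, b).map(_.toSeq).toSeq == multTransposed(a, b).map(_.toSeq).toSeq
```

Both variants multiply and add the same numbers in the same order, so `same` is `true`; only the memory access pattern differs.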

The results

Looking at both algorithms an educated guess might be that the first one is faster than the second one, as they are almost identical except for the transpose operation and the inverted indices. The relative difference between the two should decrease as the size increases, as n³ grows faster than n², which is also reflected in both algorithms having the same asymptotic complexity. The memory requirements of mult2 are a bit higher, however, as it needs to transpose the matrix first.

Let’s look at the execution time of both implementations for different matrix sizes n. We are also adding a column indicating the speedup (mult1 / mult2 – 1) you get when choosing the second implementation.

n	Time `mult1` (ms)	Time `mult2` (ms)	Speedup
10	0.011	0.014	-20.5 %
50	0.166	0.128	29.6 %
100	1.307	1.102	18.5 %
500	242.1	147.4	64.2 %
1000	9 347	1 244	651.2 %
3000	398 619	33 846	1077.7 %
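
The speedup column can be recomputed from the two time columns. The snippet below is a small sketch of that arithmetic; the function name `speedup` is chosen for illustration. Because the table shows rounded times, recomputed values can differ slightly from the printed ones for the small matrices.

```scala
// Relative speedup of mult2 over mult1: mult1 / mult2 - 1.
// 0.0 means equally fast, 1.0 means mult2 takes half the time.
def speedup(timeMult1: Double, timeMult2: Double): Double =
  timeMult1 / timeMult2 - 1

val n500  = speedup(242.1, 147.4)   // roughly 0.642, i.e. 64.2 %
val n3000 = speedup(398619, 33846)  // roughly 10.777, i.e. 1077.7 %
```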

Impressive! Even though we are doing extra work, we are more than 10 times faster using mult2 for n = 3000! What is happening?

Before I explain, it is useful to have some basic knowledge of computer architecture. The next section is going to cover that part. In case you are already familiar with that or cannot wait for the answer, feel free to skip the next section and revisit it later.

Computer architecture

Von Neumann architecture

Most modern computers, especially in the commodity segment, are based on the Von Neumann architecture. Originally described in 1945, it evolved over time into what we have today. It describes the basic components of a computer, such as CPU, memory, and I/O devices, as well as how they work together.

The following scheme depicts the main components and how they are connected. The central processing unit (CPU) is responsible for doing the computational work, e.g. arithmetic operations. It is connected to the main random access memory (RAM) through the northbridge. The northbridge also connects other high speed interfaces but we are not going to go into details here. If data is not available within memory, it has to be loaded from persistent storage. The different interfaces for this (e.g. IDE, SATA, USB) are connected through the southbridge.

This means that there is no direct connection from the processor to the data. Data in main memory has to go through the northbridge; data that is not there also has to go through the southbridge. This storage hierarchy is necessary as there is a trade-off between storage cost, speed and size.

Storage hierarchy

Storage can be categorized into persistent and volatile storage. In order for data to survive a reboot of the computer, it needs to be stored persistently, e.g. on a hard disk drive (HDD) or a solid state drive (SSD). The disadvantage of this storage is that it is often not optimized for random access (which is fine for files but might not be for individual data or code). It is also very far from the CPU, which adds transfer latency.

When booting the operating system all required data is loaded into the volatile main memory. If the computer has no power the data will be lost. Also the main memory is typically much smaller. However, it is faster, especially in terms of latency, and it is optimized for random access.

While using the RAM for more frequently used data is a good idea, it is still not optimal as the CPU needs to go through the northbridge to read or modify the data. To overcome this problem, another layer of memory has been added to computers: the CPU cache.

Commodity main memory is typically based on dynamic RAM (DRAM). Consisting of fundamentally only one transistor and one capacitor per element, DRAM is cheap and can be packed very well, allowing for more capacity. However, due to its design it is much slower than static RAM (SRAM), which is typically used when even lower latency is required. SRAM is used in CPU caches. The following diagram summarizes the storage hierarchy we were discussing.

The CPU caches themselves typically also have different layers, ranging from the fastest but smallest level 1 (L1) up to the biggest but slowest level 3 (L3) cache. Additionally, the L1 cache is divided into one part for data (L1d) and one for instructions / code (L1i).

If you have a multi-core CPU, some of the caches are shared between cores. If hyperthreading is enabled, the cache size effectively available to your program heavily depends on what else is running, as both hyperthreads have to share the same cache. For more details on this topic I recommend taking a look at the awesome paper What Every Programmer Should Know About Memory by Ulrich Drepper.

To give the reader an impression of the actual size of the different layers, the following table contains the specifications of my 2016 laptop.

Storage	Size
L1i Cache	32 KB
L1d Cache	32 KB
L2 Cache	256 KB
L3 Cache	4 MB
Main Memory	16 GB
SSD	500 GB
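
To put these sizes into perspective, we can estimate how large a square matrix of doubles may be before it no longer fits into a given cache level. The sketch below is back-of-the-envelope arithmetic under stated assumptions: 1 KB = 1024 bytes, 8 bytes per double, no JVM object headers, and only a single matrix competing for the cache. The function names are made up for this example.

```scala
// Assumption: a Double occupies 8 bytes and nothing else lives in the cache.
def doublesThatFit(cacheBytes: Long): Long = cacheBytes / 8

// Largest n such that one n x n matrix of doubles fits into the cache.
def largestSquareMatrix(cacheBytes: Long): Int =
  math.sqrt(doublesThatFit(cacheBytes).toDouble).toInt

val l1d = largestSquareMatrix(32L * 1024)        // 64
val l2  = largestSquareMatrix(256L * 1024)       // 181
val l3  = largestSquareMatrix(4L * 1024 * 1024)  // 724
```

In the multiplication three matrices are involved and other data competes for the cache as well, so the sizes at which effects become visible in practice will be smaller than these upper bounds.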

CPU cache usage

The cache stores data copies of frequently used main memory locations. While storing something on disk or in memory is a deliberate decision of the programmer, what is stored in the cache is mostly managed behind the scenes.

Data is transferred between main memory and cache in blocks of fixed size, so called cache lines. A cache entry contains the data as well as the memory location it caches. Whenever the CPU needs to access a memory location, it first checks the cache for a corresponding entry. If the entry exists, it is called a cache hit, otherwise a cache miss.
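
The mapping from memory locations to cache lines is plain integer arithmetic. The following sketch assumes 64-byte cache lines, which is a common size on x86 CPUs but not something stated by the hardware table above, and an array of doubles whose first element starts exactly at a line boundary. All names are hypothetical.

```scala
// Assumption: 64-byte cache lines; check your own hardware.
val CacheLineBytes = 64

// Index of the cache line that contains the given byte address.
def cacheLineOf(address: Long): Long = address / CacheLineBytes

// Cache line of element `index` in an array of 8-byte doubles that
// starts at address 0 (assumption: perfectly aligned array).
def lineOfElement(index: Int): Long = cacheLineOf(index.toLong * 8)

val firstLine  = lineOfElement(0) // 0
val sameLine   = lineOfElement(7) // 0: elements 0 to 7 share one line
val secondLine = lineOfElement(8) // 1: element 8 starts the next line
```

Under these assumptions a miss on element 0 brings elements 1 to 7 into the cache as well, so the next seven sequential reads are hits.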

Cache hits are generally what you want because they avoid the (relatively) expensive main memory access. Note that you cannot avoid cache misses entirely and they are not generally a bad thing. If a memory location is accessed for the first time, a cache miss is natural.

Modern CPUs, however, have mechanisms to avoid cache misses also in this situation. The processor has a small module responsible for prefetching cache lines. One example is when working with structural types, e.g. a person, and accessing one attribute of the person, e.g. the address, it is likely that you are going to access another attribute later. So the CPU will prefetch also other attributes, e.g. the age and name. Another example is sequential access on arrays. When accessing the first element of an array, the CPU will prefetch more elements in parallel while still dealing with the first one.
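
The difference between sequential and non-sequential access can be tried out with two traversals of the same two-dimensional array. The functions below are a sketch with made-up names. They compute the same sum and differ only in the order in which memory is touched, and they assume a non-empty, rectangular array.

```scala
// Row by row: the inner loop walks one row array from start to end.
def sumRowMajor(m: Array[Array[Int]]): Long = {
  var sum = 0L
  var i = 0
  while (i < m.length) {
    var j = 0
    while (j < m(i).length) { sum += m(i)(j); j += 1 }
    i += 1
  }
  sum
}

// Column by column: consecutive accesses jump from one row array to the next.
def sumColumnMajor(m: Array[Array[Int]]): Long = {
  var sum = 0L
  var j = 0
  while (j < m(0).length) {
    var i = 0
    while (i < m.length) { sum += m(i)(j); i += 1 }
    j += 1
  }
  sum
}

val small = Array.tabulate(3, 4)((i, j) => i * 4 + j) // values 0 to 11
val rowSum = sumRowMajor(small)    // 66
val colSum = sumColumnMajor(small) // 66
```

For arrays much larger than the cache, the row-by-row version is the one that matches what the prefetcher expects. How big the difference is depends on your hardware and JVM, so measure it rather than taking it for granted.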

With this knowledge in mind, we want to go back to the original example of matrix multiplication and analyze why mult2 performed so much better for bigger matrices than mult1.

Matrix multiplication revisited

In the initial example we had two implementations for matrix multiplication: mult1 and mult2. Both were fundamentally the same, except that mult2 transposed the second matrix before entering the three nested loops performing the computation. But why does it make a difference? And why is this difference only visible for bigger matrices?

Before we analyze the difference from a more theoretical perspective, let’s look at more numbers. We are going to execute both implementations again with different matrix sizes. This time we are not measuring the runtime, however, but instead we are looking at the CPU counters for L1d cache misses using perf.

for s in 10 50 100 250 500 1000 1500 2000 2500 3000
do
  for m in 1 2
  do
    # Run mult$m with size $s and record CPU counters
    perf stat -e L1-dcache-loads,L1-dcache-load-misses \
      java -jar cpumemory.jar "$s" "$m"
  done
done

What you can see is that starting with n >= 250, the L1d cache miss percentage increases for mult1. This means that the CPU has to fall back to the bigger but slower L2 cache more often. It wastes cycles waiting for the required data to arrive from the slower levels of the memory hierarchy. But why is this the case?

As explained in the previous section, the CPU is prefetching data to the cache. That means when accessing the first element of the matrix, it is going to prefetch more elements in parallel. Recall the three nested loops with the three indices i, j, and k.

The first access of m1(i)(k) will trigger prefetching more elements, e.g. m1(i)(k + 1). The first access of m2(k)(j) as in mult1, however, is going to trigger prefetching m2(k)(j + 1). While this is not a big problem as long as the whole matrix fits into the cache, it becomes one with bigger matrices.

The inner-most loop increments k, and not j. So by the time j gets incremented, m2(k)(j + 1) might have already been evicted. This is the reason why mult2 performs so much better with bigger matrices. By transposing the matrix and then swapping the access to m2t(j)(k), even if the cache is full and parts of the matrix have to be evicted, we at least prefetch correctly for the inner loop.
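
The eviction effect can be reproduced with a toy model. The sketch below simulates a fully associative cache with least-recently-used eviction and no prefetcher, and counts the misses caused by reading the second matrix during one pass of the j and k loops. Everything in it is an assumption made for illustration: the class and function names, the 64-byte line size, the 512 lines (32 KB, like the L1d size in the table above), and the idea that matrix rows are stored back to back in memory. Real caches are organized differently, so the absolute numbers are not comparable to the perf counters.

```scala
import scala.collection.mutable

// Toy model only: fully associative, LRU eviction, no prefetching.
final class ToyCache(lineBytes: Int, numLines: Int) {
  // Insertion order doubles as recency order: head = least recently used.
  private val lines = mutable.LinkedHashSet.empty[Long]
  var misses: Long = 0L

  def access(address: Long): Unit = {
    val line = address / lineBytes
    if (!lines.remove(line)) { // not cached: count a miss
      misses += 1
      if (lines.size >= numLines) lines.remove(lines.head) // evict the LRU line
    }
    lines.add(line) // (re-)insert as most recently used
  }
}

// Misses for the second matrix during one pass of the j and k loops.
def secondMatrixMisses(n: Int, transposed: Boolean): Long = {
  val cache = new ToyCache(lineBytes = 64, numLines = 512)
  var j = 0
  while (j < n) {
    var k = 0
    while (k < n) {
      // Assumption: rows are stored back to back, 8 bytes per double.
      // transposed: m2t(j)(k), otherwise: m2(k)(j)
      val index = if (transposed) j.toLong * n + k else k.toLong * n + j
      cache.access(index * 8)
      k += 1
    }
    j += 1
  }
  cache.misses
}

val columnWise = secondMatrixMisses(1024, transposed = false) // 1048576: every access misses
val rowWise    = secondMatrixMisses(1024, transposed = true)  // 131072: one miss per cache line
```

In this model the column-wise inner loop touches 1024 different cache lines before it returns to the first one, which by then has been evicted from the 512-line cache. The row-wise loop uses all eight doubles of a line before moving on.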

Final thoughts

In this blog post I was trying to show you why it is important to know some basics about computer architecture as a software developer. While most university grade computer scientists should learn this during their studies, it is certainly a good idea for everyone else who is writing code to read a bit more about this topic.

We looked at an implementation of the naive matrix multiplication algorithm where changing only a small detail led to an execution speed-up of over 1000%. It is important to note that mult2 is far from the optimal implementation. You can check out the paper Cache oblivious matrix multiplication using an element ordering based on the Peano curve for a cache optimized implementation that does not depend on the actual cache size. Also note that other algorithms with better asymptotic complexity than O(n³) exist, e.g. the Strassen algorithm.

Have you ever encountered a slow program that turned out to have a lot of cache misses? Did you use perf before? Do you think developers should be more aware of the cache logic in order to write efficient programs? Let me know your thoughts in the comments!


Blog author

Frank Rosner
