Fork/Join and other Techniques to Improve Performance

2.11.2012 | 13 minutes reading time

In the last few years there has been nearly no improvement in single thread performance of CPUs. On the other hand, the number of cores increases: Laptops with eight cores are common (okay, including hyperthreading, only four real cores). Even modern smartphones often have four cores. To utilize these modern beasts, you need parallel programming.

In this article, I use a simple board game as an example for a parallel algorithm and other optimization techniques, a variant of peg solitaire . The problem to solve is: How many different solutions exist for a board with n pegs on a side? The focus is on different optimization techniques, not only the Fork/Join framework. You may be surprised to find other techniques are much more efficient for these problems.

Definition of the Problem

Lets start with a more precise definition of the problem. We play on a triangular board. A board with edge length 5 (n = 5) before any move has been done looks like this:

          x
         x x
        x o x
       x x x x
      x x x x x

The middle peg of the third row is empty. A legal move is a jump over one peg in one of the six different directions. The leapfrogged peg is removed from the board. So the board could look like this after one move:

          x
         x x
        x x x
       x o x x
      x o x x x

A solution is found when there is only one peg left, wherever it is located on the board. You get different results for different starting positions, see Dan O’Briens Puzzle Solution Page for some more information on the topic.

Given a Java class which can represent a position and which is capable to compute a list of all resulting positions after one move, the solver is a simple recursive function :

1long countSolutions(Board start) {
2      if (start.isSolution()) {
3          return 1;
4      } else {
5          long count = 0;
6          for (Board board : start.nextPositions()) {
7              count += countSolutions(board);
8          }
9          return count;
10      }
11  }

When you feed it with the start board with edge length five, it takes about a tenth of a second and you can see there are 1,550 solutions for n = 5. A tenth of a second is a short time, so why optimize? Let’s see bigger values, e.g. n = 6. Takes a little bit longer. Much longer. Not as long as to compute 42, but about 30 hours resulting in 29,235,690,234 (now it should be obvious why countSolutions() returns a long and not an int).

Why is there such a huge difference for a slightly larger board? Because the number of positions for a board of size n is 2^(n * (n+1)/2). The exponent is the number of holes/pegs on the board, which increases quadratically.

Fork/Join

When you know the Java Fork/Join framework (otherwise read the fork/join tutorial ), you should see the perfect match: In each recursion level, you can fork a thread for the list of next positions. Here is the code, first the initialization of the pool and the code for starting the computation:

1ForkJoinPool pool = new ForkJoinPool(numThreads);
2  RecursiveSolver root = new RecursiveSolver(startBoard, sequential);
3  solutions = pool.invoke(root);

Then the implementing class:

1class RecursiveSolver extends RecursiveTask<Long> {
2  private Board start;
3  private int sequential;
4 
5  public RecursiveSolver(Board start, int sequential) {
6    this.start = start;
7    this.sequential = sequential;
8  }
9 
10  @Override
11  protected Long compute() {
12    int card = start.cardinality();
13    if (card == 1) {
14       return Long.valueOf(1);
15    } else if (card < sequential) {
16       return Long.valueOf(countSolutions(start));
17    } else {
18      List<Board> nextPositions = start.nextPositions();
19      List<Board> tasks = new ArrayList<>(nextPositions.size());
20      for (Board b : nextPositions) {
21        tasks.add(new RecursiveSolver(b, sequential));
22      }
23      invokeAll(tasks);
24      long count = 0;
25      for (RecursiveSolver rs : tasks) {
26        count += rs.join();
27      }
28      return count;
29    }
30    return Long.valueOf(0);
31  }
32}

The recursion of the sequential algorithm has been replaced by the creation of new instances of RecursiveTask. I introduced another optimization (as proposed in the fork/join tutorial): The parallel algorithm switches back to a sequential one when there are less than sequential pegs left. This avoids the overhead of task creation for small problems. After some experiments I used eight as threshold in my test runs.

Starting this, my laptop (eight cores with hyperthreading, four real ones) was unusable for the next 7 hours and 28 minutes. Compared to the 30 hours of the sequential solver, a factor of four, which matches the number of “real” cores. So why bother? Four cores, four times faster than sequential, perfect speedup.

But what about n = 7? How many solutions are there for a board with edge length seven? I didn’t run this on my laptop, neither sequential nor parallel. I assume it would exceed the lifetime of the poor machine. So let’s look for some other optimizations.

Caching

As in most board games, there is often more than one sequence of moves which result in the same position. An obvious optimization is to store the number of solutions for already computed positions in a HashMap. This is a well known technique called transposition table . As a precondition, the class Board has to implement hashCode() and equals(). For n = 5 this doesn’t make a big difference, we get the answer in 0.07 seconds, 70% of the time needed by the simple sequential solver. For n = 6 we get a more impressing effect, only 0.4 seconds elapse before we can see the result. That’s about 270,000 times faster compared to the sequential solver and even 67,500 times faster compared to the parallel solver running with four cores.

This sounds very promising, so let’s try the next board size, n = 7. Starting this without any JVM options results in an OutOfMemoryError, the HashMap does not fit into the standard heap. Increasing the heap size with the well known -Xmx does not help on a 32 bit JVM: The memory needed does not fit into the 32 bit address space. The next step is to use the brute force approach: 64 bit JVM and the -d64 option to activate the 64 bit mode.

Stop!

I like the HashMap, it’s is one of my favorite data structures and amazingly fast. But in this case there is a simpler, more efficient data structure, the good old array. A position in the game can be represented by some bits, for n = 7 you need 7*(7+1)/2=28 bits, which fits into an integer which can be used as the index of the array. The value in the array is the number of solutions for this position, -1 for positions which have not been evaluated so far. This still doesn’t fit into the 32 bit address space for n = 7, but is more efficient (in time and space) than the HashMap solution. For n = 6, we need only 0.2 seconds compared to the 0.4 seconds.

When we have a 64 bit JVM, we can attack n = 7. But for one moment let’s assume we can’t afford the amount of memory and still want to solve the problem. When you add some debugging output to your code, you will find some strange behavior for n = 7: For n = 5 or n = 6 there are a lot of different solutions, usually the algorithms finds the first solutions quite fast. Not for n = 7. When I tried this first (some years ago, with C instead of Java on an old SUN workstation), the code found no solutions running several minutes. I had a strong suspicion: The triangle peg solitaire has no solution for n = 7. So I modified the code and used only one bit for each position: 0 = position not evaluated so far, 1 = position evaluated, no solution found.

Last week, when I tried this again I was too lazy to use bits, instead I changed the array from long to byte, which was small enough to fit into the 32 bit address space. I could have used a Java BitSet, which saves even more space, but was too lazy. It confirmed what I already knew: There is no solution for n = 7, it took 34 seconds to compute this. Using the 64 bit JVM and long is a little bit slower: 37 seconds. I attribute the three seconds to worse cache locality.

Parallelism Again

We have seen two orthogonal ways to improve performance: Parallelism and caching. Is it possible to combine the approaches? Will this be even faster? Yes, we can combine them, but it becomes uglier. The sheer elegance of the fork join is based on its simplicity: We create new tasks, invoke them in a parallel fashion, wait for the result: You need no synchronized blocks or synchronized methods, every thread works on its own data. A global data structure like a HashMap or array destroys this simplicity, they both need some way of synchronization. But what is the granularity? Locking the complete array for each access? This causes two problems:

Much of the parallelism will be destroyed because all threads compete for one resource.
It does not solve the problem of duplicate work: After one thread sees an unevaluated position and starts to evaluate it, another thread may evaluate the same position in parallel, wasting resources.

So let’s try a more fine grained approach: Locking an entry for one position. Because we need an object as lock holder, we have to change the array of longs to an array of some sort of objects:

1class Value {
2  public Value() {
3    v = -1;
4  }
5  public long v;
6}

The rest of the code looks similar, but with a synchronized block:

1long countSolutions(Board start) {
2  Integer startAsInt = Integer.valueOf(start.asInteger());
3  Value value = cache[startAsInt];
4  synchronized (value) {
5    if (value.v != -1) {
6      return value.v;
7    } else if (start.isSolution()) {
8      value.v = 1;
9      return 1;
10    } else {
11      long count = 0;
12      List nextPositions = start.nextPositions();
13      for (Board board : nextPositions) {
14        count += countSolutions(board);
15      }
16      value.v = count;
17      return count;
18    }
19  } // synchronized
20}

With this approach, we have a separate lock for each position. A thread holds the lock until the evaluation of the position is complete. This avoids duplicate work by several threads, but limits parallelism. For this reason, you should start this algorithm with more threads than CPUs on your system.

Unfortunately the overhead caused by the value object compared to the primitive data type and the synchronization is not compensated by the parallelism: For n = 6 we need 1 second, five times slower compared to the fastest sequential solution with the array of longs.

Lessons Learned

What can we learn from this experiment? Is there anything valuable learned here you can make use of when coding enterprise applications with boring/interesting (No)SQL-Databases as back end? For me it was the first time I used the Fork/Join framework, so I learned this :-). I was surprised, it’s quite easy. The load balancing and work stealing mechanisms seem to work good, the speedup compared to the sequential algorithm was as expected. This is definitively much easier comparing to create threads manually.

The second lesson is about better algorithms. As we have seen, this can make a world of difference, not just a factor of four gained by parallelism. This is far more important than eliminating some function calls or saving a few cycles by replacing double with float or some other tricky programming. This is especially true for large problems, where – for example – the time complexity n log(n) of a good algorithm is much smaller than a time complexity n^2 of a bad algorithm (hint: Sorting).

The third lesson is simple: Don’t do the work at all. At least, don’t repeat it, use caching instead of repeated expensive operations. In this example, the expensive operation was the evaluation of identical branches in the tree. In enterprise applications, accessing the database usually takes most of the time. Given a good JPA provider or application server, you don’t have to implement the caching yourself, just plug in the cache recommended/supported by your provider/server and use the saved time to find a good set of configuration parameters.

In other cases, you have to do some work yourself. But don’t implement everything, there are helping classes available. The HashMap or array used in this post are no real caches, they miss the feature of forgetting entries, so they will blow up your memory at some point. But the JDK has other classes which attack this problem: A WeakHashMap forgets entries automatically when the garbage collector is running, but you have no control when entries are removed or which entries are removed. So it is not recommended to implement a cache. To regain some sort of control, extend LinkedHashMap and override removeEldestEntry() (see javadoc for details). This gives you an LRU cache with just a few lines of code.

When you want even more control, I recommend the Google Guava Cache . It allows eviction on time base or on weight base with a user defined comparison function for the weight.

Another important lesson not learned here is the proper use of a profiler. It can give you valuable information where your application spends all the time. For this simple example, it was clear without a profiler.

Epilogue

It may come as a surprise that there is no solution for n = 7. In fact, you can prove there is no solution for every n where n modulo 3 = 1. I will give a short sketch of the parity based proof.

First let’s place numbers on the board according to the following two patterns:

     1                1
    1 0              0 1
   0[1]1            1[1]0
  1 1 0 1          1 0 1 1
 1 0 1 1 0        0 1 1 0 1
0 1 1 0 1 1      1 1 0 1 1 0

The field in brackets is the field with no peg at the start of a game. The parity is computed by adding all numbers of the fields with a peg and to apply modulo 2. For n = 6 there is an even number of ones on the board. Because the empty field has a one, too, the parity of the start position is odd. If you look at the pattern in a row or on one of the diagonals, you see a repeated sequence of 1 1 0. For every move in such a pattern, the parity stays the same.

Obviously, when the parity of the start position is odd (which is true for the left and right pattern), it must be odd for every position in the game, including the end position. An odd parity with one peg is possible only if this peg is on a field marked with a one.

If you record all end positions with one peg for n = 5, you se it’s always on the same place, which is marked with a one in both patterns:

    o
   o o
  o o o
 o o o o
o o x o o

For n = 6 there are several fields where the last peg may end. Note all these fields are marked with a one on both boards shown above:

     x
    o o
   o x o
  x o o x
 o o x o o
o x o o x o

When n modulo 3 = 1, the number of fields modulo three is one, too. If you extend the patterns shown above, you see there is always a one on the lower left and lower right corner. As a consequence, you have a number of 1 1 0 groups and one additional one. Together with the empty field in the start position located on a one, this results in an even parity for the start position. Even parity with one peg left implies the last peg has to end on a field marked with zero. But whenever a field is marked zero in the left pattern, it is marked with a one in the right pattern (and vice versa). So there is no possible end position left for the last peg…

Wouldn’t it be evil to sell this game with size n = 7?

Was this post helpful?

Blog author

Roger Butenuth

Do you still have questions? Just send me a message.

Nested Fixture Pattern for JUnit

JUnit's @Nested classes are usually presented as a way to group related tests. But combined with @RegisterExtension and ExtensionContext.Store, they become something more powerful: a declarative scenario tree where each level adds a scope in which fixtures...

Testing
Java
Software development

9.3.2026 | 11 minutes reading time

Rüdiger zu Dohna

Narwhals: Building Dataframe-Agnostic Libraries with Zero Dependencies

After the publication of our article about Ibis, Dr André Schemaitat pointed us to a similar tool with growing popularity – Narwhals. Narwhals describes itself as an "extremely lightweight and extensible compatibility layer between dataframe libraries...

Data
Python
Software development

3.3.2026 | 11 minutes reading time

Niklas Niggemann

How-To: Seamless development in WSL2 with git, SSH and podman desktop

Weather you want a more uniform development environment across your team to avoid compatibility issues between different operating systems, want to work closer to your target environment, or need to run a linux exclusive tool like Claude Code, an AI ...

Git
Microsoft
Software development

5.1.2026 | 5 minutes reading time

Why Scanners Fail in Practice: Lessons from the Shai-Hulud Attacks on ...

2025 marks the year supply chain security stopped being a theoretical risk and became a practical nightmare for anyone managing a package.json file. The recent attack waves on the NPM ecosystem demonstrated this vividly, turning trusted libraries into...

IT-Security
DevSecOps
CI/CD
JavaScript

9.12.2025 | 9 minutes reading time

Felix Dreißig

20 years of coding

We all grow older. It is simply inevitable. As the saying goes, The only way to not grow old is to die young. Recently, I've completed my 20th year in the development industry. Through academia, consulting, and a stint in product development, I've learned...

Software development
Training
Culture

11.4.2025 | 10 minutes reading time

Elisabeth Schulz

Hexagonal Architecture is just an island

Imagine an island called "Alistair Island." This island is a vibrant place with houses, fertile soil, and a well-coordinated community of residents who live by well-defined routines. Every activity on the island has significance and serves a specific...

Software architecture
Testing
Software development

22.1.2025 | 10 minutes reading time

Danny Keller

Spring and Vue - A setup for small projects (Part 2)

In the first part we presented a setup for a combination of Spring Boot and Vue.js. Now we have to look at how to connect two type-safe languages, TypeScript for the frontend and Java for the backend, through a REST-API and in a type-safe manner. We ...

Spring
Frontend
API
JavaScript
Java

17.1.2025 | 10 minutes reading time

Roger Butenuth

Nils Winking

Spring and Vue - A setup for small projects (Part 1)

Quickly adding a new Vue.js application to an existing Spring Boot project should be pretty easy, or at least a googleable problem, or so we thought. But in the end, it wasn't. However, with the right combination of configuration, components, and some...

Spring
Frontend
JavaScript
Java
API

10.1.2025 | 8 minutes reading time

Roger Butenuth

Nils Winking

ArchUnit in practice: Keep your Architecture Clean

Who hasn’t been there: A new project kicks off or the old code finally needs a cleanup. A big meeting with all the developers is called: “This time, we’ll do it right—clean, correct, and structured!” Architecture Decision Records (ADRs) are created to...

Software architecture
Java
Kotlin
Software development

20.9.2024 | 18 minutes reading time

Danny Keller

Integrating Dapr with Azure Kubernetes Service (AKS): Portability is key

In a recent blog post, we explored how Dapr works and how to test it on a simple local Kubernetes cluster. One of Dapr's key advantages is its component system, which enhances portability. In this post, we'll take our previously daperized demo app and...

Software development
Cloud
Azure
Cloud native

22.7.2024 | 10 minutes reading time

Manuel Zapf

React is dead, long live React - React 19 is here

The world of frontend development has changed once again, and this time React 19 is leading the way. This version brings a variety of new features and improvements, but the most exciting innovation is the brand new compiler, which already requires React...

React
Frontend
Software development
JavaScript
Webdevelopment

19.7.2024 | 6 minutes reading time

Michel Ehmen

Exploring Dapr: A Deep Dive into Distributed Application Runtime

In a recent blog post, we introduced Dapr (Distributed Application Runtime) and highlighted its potential as a valuable tool for cloud-native applications, in combination with Aspire. This post dives deeper into the inner workings of Dapr, explaining...

Software development
Cloud native
Software architecture
Open Source

10.7.2024 | 10 minutes reading time

Manuel Zapf

Spring Boot and HTMX: The boring app

Motivation Most apps I touched in the wild follow the same two tiered approach. A backend delivering JSON (some may call this REST) and a frontend framework, consuming JSON from the backend converting it to the HTML displayed to the user. Worst case,...

Software architecture
Software development
Spring
Kotlin

28.6.2024 | 16 minutes reading time

Server Actions in Next.js 14

Server Actions were introduced in Next.js 14 as a new method to send data to the server (see the documentation). They are asynchronous functions that can be used in server components, within server-side forms, as well as in client-side components. While...

Webdevelopment
React
JavaScript

10.6.2024 | 9 minutes reading time

Lukas Lehmann

Charge your APIs Volume 25: Contract Testing

I feel the way we do integration testing is sort of like setting your house on fire to test your smoke alarm. It is excessive, tiresome and way too costly. This is not a quote from myself. I typically don't come up with such good ideas when I need....

Testing
Software development
API

2.4.2024 | 11 minutes reading time

Pasquale Brunelli

A/B Testing: Tool support and testing GrowthBook

In the previous blog post we introduced some general concepts of A/B testing: we explored the main aspects, defined test types and explained the most common statistical methods. Now we want to explore the areas in which A/B testing tools can provide...

Testing
Python
Data
UX/UI
Analysis
JavaScript

18.3.2024 | 20 minutes reading time

Francesca Diana

How to gain visibility as a software developer?

No matter if junior, medior or senior, introverted or extroverted: Every software developer can increase their visibility with different tools and should treat the topic as important. The only question is: how and with what effort? In this blog post,...

Training
Software development
Community
Open Source

21.2.2024 | 6 minutes reading time

Building desktop apps with web technologies

Building desktop apps with web technologies In this article I share insights into Electron and what to consider when shipping an desktop app with Electron. After that I introduce you to a new alternative called Tauri. It the end I provide an estimation...

Frontend
JavaScript
Node.js
Open Source
Webdevelopment

20.9.2023 | 13 minutes reading time

Modern data fetching with Redux Toolkit Query

First released seven years ago, Redux was already modernized four years ago with Redux Toolkit (RTK). Then in June 2021, Redux reached the next stage of evolution by adding a dedicated data fetching solution with Redux Toolkit Query. With respect to ...

React
JavaScript
Frontend

28.2.2023 | 10 minutes reading time

Björn Heiß

The best of both worlds: Harnessing the benefits of object-oriented and...

Functional programming and OOP are often viewed as two separate paradigms in programming. And it is true that programming languages lean more towards one or the other, which influences how we are "supposed to" solve a problem in this language. In this...

Pattern
Functional programming
Software development

1.2.2023 | 8 minutes reading time

Thomas Buß

Fork/Join and other Techniques to Improve Performance

Definition of the Problem

Fork/Join

Caching

Parallelism Again

Lessons Learned

Epilogue

Was this post helpful?

Blog author

More articles in this subject area

Nested Fixture Pattern for JUnit

Narwhals: Building Dataframe-Agnostic Libraries with Zero Dependencies

How-To: Seamless development in WSL2 with git, SSH and podman desktop

Why Scanners Fail in Practice: Lessons from the Shai-Hulud Attacks on ...

20 years of coding

Hexagonal Architecture is just an island

Spring and Vue - A setup for small projects (Part 2)

Spring and Vue - A setup for small projects (Part 1)

ArchUnit in practice: Keep your Architecture Clean

Integrating Dapr with Azure Kubernetes Service (AKS): Portability is key

React is dead, long live React - React 19 is here

Exploring Dapr: A Deep Dive into Distributed Application Runtime

Spring Boot and HTMX: The boring app

Server Actions in Next.js 14

Charge your APIs Volume 25: Contract Testing

A/B Testing: Tool support and testing GrowthBook

How to gain visibility as a software developer?

Building desktop apps with web technologies

Modern data fetching with Redux Toolkit Query

The best of both worlds: Harnessing the benefits of object-oriented and...