Optimizing concurrent Go part 2: multicore performance
In part 1 we explored single-core performance. The vast majority of the performance gain came from algorithmic improvements, and yet we discovered that a further 40% performance improvement was available from making our code cache-friendly. In this second and final part, we make our code concurrent and, in doing so, discover that understanding cache effects is even more important when writing concurrent code.
The concurrent question
So let’s continue with the interview question. We’re now going to make our code from part 1 faster by using Go’s concurrency to share the work across multiple goroutines, which - assuming we have a multicore CPU - should let us go much faster.
The fourth solution
We spawn N goroutines (where N is the number of CPUs on our machine) and give each goroutine 1/Nth of the flows to check. As before, here’s the initial code. Can you spot how to make it faster?
// CountBadIPs4 returns the number of bad IPs in flows.
func CountBadIPs4(badIPs IPSet, flows []PackedFlow) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	totalBadIPs := 0
	numGoroutines := runtime.NumCPU()
	wg.Add(numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func(i int) {
			defer wg.Done()
			// Goroutine 0 processes 0, numGoroutines, 2*numGoroutines
			// Goroutine 1 processes 1, numGoroutines+1, 2*numGoroutines+1
			// Goroutine 2 processes 2, numGoroutines+2, 2*numGoroutines+2
			// etc.
			for j := i; j < len(flows); j += numGoroutines {
				flow := flows[j]
				if _, ok := badIPs[flow.SrcIP]; ok {
					mu.Lock()
					totalBadIPs++
					mu.Unlock()
				}
				if _, ok := badIPs[flow.DstIP]; ok {
					mu.Lock()
					totalBadIPs++
					mu.Unlock()
				}
			}
		}(i)
	}
	wg.Wait()
	return totalBadIPs
}
First let’s see how it performs on an Apple MacBook Air M1 with 8 cores (4 performance cores and 4 efficiency cores):
$ go test '-bench=[3-4]' .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs3-8 255 4690940 ns/op
BenchmarkCountBadIPs4-8 7 163487869 ns/op
Wait, what? Our concurrent implementation is over 30x slower than our single core implementation! What’s going wrong?
Perhaps you spotted the two problems:

- All goroutines update a single shared variable protected by a mutex, so there is going to be a lot of contention on that mutex. Instead, each goroutine should update its own count, and we should combine the counts at the end.
- Although each goroutine counts 1/Nth of the flows, every goroutine effectively has to read the entire dataset. CPUs load memory in cache lines, typically 64 bytes at a time, and accessing data within an already-loaded cache line is very fast. Each goroutine reads only a few bytes from each cache line that it loads, which wastes most of the loaded data.
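Some back-of-the-envelope arithmetic shows how wasteful the striding access pattern is. The sizes below are assumptions for illustration (PackedFlow’s real size is defined in part 1, and cache line sizes vary by CPU):

```go
package main

import "fmt"

// A rough sketch of why striding wastes cache bandwidth. Assuming 64-byte
// cache lines, a hypothetical 16-byte PackedFlow, and 8 goroutines striding
// by 8: a goroutine's consecutive accesses land in different cache lines,
// so each loaded line yields only one flow's worth of useful bytes.
func main() {
	const (
		cacheLineSize = 64 // bytes per cache line (typical)
		flowSize      = 16 // assumed size of PackedFlow
		numGoroutines = 8  // stride between a goroutine's accesses
	)
	flowsPerLine := cacheLineSize / flowSize // flows packed into one line
	utilization := 100 * flowSize / cacheLineSize
	fmt.Printf("flows per cache line: %d\n", flowsPerLine)
	fmt.Printf("cache line utilization when striding by %d: %d%%\n",
		numGoroutines, utilization)
}
```

In this hypothetical setup, each goroutine uses only 25% of every cache line it loads; contiguous chunks use 100%.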
The fifth solution
Let’s fix both these problems:
// CountBadIPs5 returns the number of bad IPs in flows.
func CountBadIPs5(badIPs IPSet, flows []PackedFlow) int {
	var wg sync.WaitGroup
	numGoroutines := runtime.NumCPU()
	chunkSize := len(flows) / numGoroutines
	badIPsByGoroutine := make([]int, numGoroutines)
	wg.Add(numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func(i int) {
			defer wg.Done()
			// Goroutine 0 gets the first 1/Nth chunk
			// Goroutine 1 gets the second 1/Nth chunk
			// Goroutine 2 gets the third 1/Nth chunk
			// etc.
			chunkStart := i * chunkSize
			chunkEnd := (i + 1) * chunkSize
			if chunkEnd > len(flows) {
				chunkEnd = len(flows)
			}
			for j := chunkStart; j < chunkEnd; j++ {
				flow := flows[j]
				if _, ok := badIPs[flow.SrcIP]; ok {
					badIPsByGoroutine[i]++
				}
				if _, ok := badIPs[flow.DstIP]; ok {
					badIPsByGoroutine[i]++
				}
			}
		}(i)
	}
	wg.Wait()
	totalBadIPs := 0
	for i := 0; i < numGoroutines; i++ {
		totalBadIPs += badIPsByGoroutine[i]
	}
	return totalBadIPs
}
Now each goroutine has its own counter in the badIPsByGoroutine slice, so there’s no longer any need for a shared mutex. We also split the input data more sensibly: each goroutine reads its own contiguous chunk of flows, which doesn’t overlap any other goroutine’s chunk.
Let’s run the benchmark:
$ go test '-bench=[1-5]' .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8 1 57264455083 ns/op
BenchmarkCountBadIPs2-8 178 6720723 ns/op
BenchmarkCountBadIPs3-8 255 4690940 ns/op
BenchmarkCountBadIPs4-8 7 163487869 ns/op
BenchmarkCountBadIPs5-8 614 1876674 ns/op
That’s more like it! We’re now 2.5x quicker than our already-high-performance single-core code. Depending on how the operating system allocates its cores to our process, we probably won’t get more than the four performance cores, so our theoretical maximum speedup is probably around 4x. Not bad!
The sixth solution
But wait, there are two subtle problems with our CountBadIPs5 function. Did you spot them?
The first is a typical chunking error: if the number of flows is not exactly divisible by the number of CPUs then we miss the last few flows (e.g. if the number of flows is seven and we have two CPUs, then we only count the first 3x2=6 flows, missing the last one). This can be fixed by adding an extra check at the end, which will not significantly affect the runtime.
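The fix can be sketched as a small helper (a hypothetical name, not from the original code) that makes the last goroutine absorb any remainder:

```go
package main

import "fmt"

// chunkBounds is a hypothetical helper that returns the half-open range
// [start, end) of items assigned to goroutine i of n. The last goroutine
// absorbs the remainder, so no items are missed when total is not exactly
// divisible by n.
func chunkBounds(i, n, total int) (start, end int) {
	chunkSize := total / n
	start = i * chunkSize
	end = start + chunkSize
	if i == n-1 {
		end = total // last chunk picks up any leftover items
	}
	return start, end
}

func main() {
	// Seven flows across two goroutines: chunkSize is 3, so the naive
	// scheme covers only items 0..5; with the fix, goroutine 1 gets
	// items 3..6 inclusive.
	for i := 0; i < 2; i++ {
		start, end := chunkBounds(i, 2, 7)
		fmt.Printf("goroutine %d: [%d, %d)\n", i, start, end)
	}
}
```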
The second is more subtle. Each of the goroutines we spawn has its own counter, but all these counters are adjacent in memory in the badIPsByGoroutine slice. This means that several counters share the same cache line, so separate goroutines running on different cores are each updating data in the same cache line, which leads to contention and performance degradation. This phenomenon is known as false sharing.
Let’s fix it with our final code:
// CountBadIPs6 returns the number of bad IPs in flows.
func CountBadIPs6(badIPs IPSet, flows []PackedFlow) int {
	numGoroutines := runtime.NumCPU()
	chunkSize := len(flows) / numGoroutines
	countCh := make(chan int, numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func(i int) {
			chunkStart := i * chunkSize
			chunkEnd := (i + 1) * chunkSize
			// The last goroutine picks up the remainder when len(flows)
			// is not exactly divisible by numGoroutines.
			if i == numGoroutines-1 {
				chunkEnd = len(flows)
			}
			count := 0
			for j := chunkStart; j < chunkEnd; j++ {
				flow := flows[j]
				if _, ok := badIPs[flow.SrcIP]; ok {
					count++
				}
				if _, ok := badIPs[flow.DstIP]; ok {
					count++
				}
			}
			countCh <- count
		}(i)
	}
	totalBadIPs := 0
	for i := 0; i < numGoroutines; i++ {
		totalBadIPs += <-countCh
	}
	close(countCh)
	return totalBadIPs
}
And run the full set of benchmarks:
$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8 1 57264455083 ns/op
BenchmarkCountBadIPs2-8 178 6720723 ns/op
BenchmarkCountBadIPs3-8 255 4690940 ns/op
BenchmarkCountBadIPs4-8 7 163487869 ns/op
BenchmarkCountBadIPs5-8 614 1876674 ns/op
BenchmarkCountBadIPs6-8 1009 1213470 ns/op
That’s more like it. We’ve got a further 1.5x speedup over our previous concurrent code, and are now almost 3.9x faster than our optimized single-core implementation.
Conclusion
Our final optimized concurrent implementation is over 25000x faster than our initial naive code. Roughly 8000x of the performance improvement came from using the right algorithms and data structures, but to get the final 3x to make really fast code, we had to understand caches.
All the source code for this post is in github.com/twpayne/gopher-advent-2022-hot-function.