Optimizing concurrent Go part 2: multicore performance
In part 1 we explored single-core performance. The vast majority of the performance gain came from algorithmic improvements, and yet we discovered that a further 40% performance improvement was available from making our code cache-friendly. In this second and final part, we make our code concurrent and, in doing so, discover that understanding cache effects is even more important when writing concurrent code.
The concurrent question
So let’s continue with the interview question. We’re now going to make our code from part 1 faster by using Go’s concurrency to share the work across multiple goroutines, which - assuming we have a multicore CPU - should let us go much faster.
The fourth solution
We spawn N goroutines (where N is the number of CPUs on our machine) and give each goroutine 1/Nth of the flows to check. As before, here’s the initial code. Can you spot how to make it faster?
// CountBadIPs4 returns the number of bad IPs in flows.
func CountBadIPs4(badIPs IPSet, flows []PackedFlow) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	totalBadIPs := 0
	numGoroutines := runtime.NumCPU()
	wg.Add(numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func(i int) {
			defer wg.Done()
			// Goroutine 0 processes 0, numGoroutines, 2*numGoroutines
			// Goroutine 1 processes 1, numGoroutines+1, 2*numGoroutines+1
			// Goroutine 2 processes 2, numGoroutines+2, 2*numGoroutines+2
			// etc.
			for j := i; j < len(flows); j += numGoroutines {
				flow := flows[j]
				if _, ok := badIPs[flow.SrcIP]; ok {
					mu.Lock()
					totalBadIPs++
					mu.Unlock()
				}
				if _, ok := badIPs[flow.DstIP]; ok {
					mu.Lock()
					totalBadIPs++
					mu.Unlock()
				}
			}
		}(i)
	}
	wg.Wait()
	return totalBadIPs
}
First let’s see how it performs on an Apple MacBook Air M1 with 8 cores (4 performance cores and 4 efficiency cores):
$ go test '-bench=[3-4]' .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs3-8 255 4690940 ns/op
BenchmarkCountBadIPs4-8 7 163487869 ns/op
Wait, what? Our concurrent implementation is over 30x slower than our single core implementation! What’s going wrong?
Perhaps you spotted the two problems:

- All goroutines update a single shared variable protected by a mutex, so there is going to be a lot of contention on that mutex. Instead, each goroutine should update its own count, and we should combine the counts at the end.
- Although each goroutine counts 1/Nth of the flows, every goroutine effectively has to read the entire dataset. CPUs load memory in cache lines, typically 64 bytes at a time, and accessing data within an already-loaded cache line is very fast. Each goroutine reads only a few bytes from each cache line that it loads, which wastes most of the loaded data.
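Some back-of-the-envelope arithmetic shows how wasteful the striding access pattern is. The sizes below are assumptions for illustration (PackedFlow’s real size is defined in part 1, and cache line sizes vary by CPU):

```go
package main

import "fmt"

// A rough sketch of why striding wastes cache bandwidth. Assuming 64-byte
// cache lines, a hypothetical 16-byte PackedFlow, and 8 goroutines striding
// by 8: a goroutine's consecutive accesses land in different cache lines,
// so each loaded line yields only one flow's worth of useful bytes.
func main() {
	const (
		cacheLineSize = 64 // bytes per cache line (typical)
		flowSize      = 16 // assumed size of PackedFlow
		numGoroutines = 8  // stride between a goroutine's accesses
	)
	flowsPerLine := cacheLineSize / flowSize // flows packed into one line
	utilization := 100 * flowSize / cacheLineSize
	fmt.Printf("flows per cache line: %d\n", flowsPerLine)
	fmt.Printf("cache line utilization when striding by %d: %d%%\n",
		numGoroutines, utilization)
}
```

In this hypothetical setup, each goroutine uses only 25% of every cache line it loads; contiguous chunks use 100%.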
The fifth solution
Let’s fix both these problems:
// CountBadIPs5 returns the number of bad IPs in flows.
func CountBadIPs5(badIPs IPSet, flows []PackedFlow) int {
	var wg sync.WaitGroup
	numGoroutines := runtime.NumCPU()
	chunkSize := len(flows) / numGoroutines
	badIPsByGoroutine := make([]int, numGoroutines)
	wg.Add(numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func(i int) {
			defer wg.Done()
			// Goroutine 0 gets the first 1/Nth chunk
			// Goroutine 1 gets the second 1/Nth chunk
			// Goroutine 2 gets the third 1/Nth chunk
			// etc.
			chunkStart := i * chunkSize
			chunkEnd := (i + 1) * chunkSize
			if chunkEnd > len(flows) {
				chunkEnd = len(flows)
			}
			for j := chunkStart; j < chunkEnd; j++ {
				flow := flows[j]
				if _, ok := badIPs[flow.SrcIP]; ok {
					badIPsByGoroutine[i]++
				}
				if _, ok := badIPs[flow.DstIP]; ok {
					badIPsByGoroutine[i]++
				}
			}
		}(i)
	}
	wg.Wait()
	totalBadIPs := 0
	for i := 0; i < numGoroutines; i++ {
		totalBadIPs += badIPsByGoroutine[i]
	}
	return totalBadIPs
}
Now each goroutine has its own counter in the badIPsByGoroutine slice, so there’s no longer any need for a shared mutex. We also split the input data more sensibly: each goroutine reads its own contiguous chunk of flows, which doesn’t overlap any other goroutine’s chunk.
Let’s run the benchmark:
$ go test '-bench=[1-5]' .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8 1 57264455083 ns/op
BenchmarkCountBadIPs2-8 178 6720723 ns/op
BenchmarkCountBadIPs3-8 255 4690940 ns/op
BenchmarkCountBadIPs4-8 7 163487869 ns/op
BenchmarkCountBadIPs5-8 614 1876674 ns/op
That’s more like it! We’re now 2.5x quicker than our already-high-performance single-core code. Depending on how the operating system allocates its cores to our process, we probably won’t get more than the four performance cores, so our theoretical maximum speedup is probably around 4x. Not bad!
The sixth solution
But wait, there are two subtle problems with our CountBadIPs5 function. Did you spot them?
The first is a typical chunking error: if the number of flows is not exactly divisible by the number of CPUs then we miss the last few flows (e.g. if the number of flows is seven and we have two CPUs, then we only count the first 3x2=6 flows, missing the last one). This can be fixed by adding an extra check at the end, which will not significantly affect the runtime.
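The fix can be sketched as a small helper (a hypothetical name, not from the original code) that makes the last goroutine absorb any remainder:

```go
package main

import "fmt"

// chunkBounds is a hypothetical helper that returns the half-open range
// [start, end) of items assigned to goroutine i of n. The last goroutine
// absorbs the remainder, so no items are missed when total is not exactly
// divisible by n.
func chunkBounds(i, n, total int) (start, end int) {
	chunkSize := total / n
	start = i * chunkSize
	end = start + chunkSize
	if i == n-1 {
		end = total // last chunk picks up any leftover items
	}
	return start, end
}

func main() {
	// Seven flows across two goroutines: chunkSize is 3, so the naive
	// scheme covers only items 0..5; with the fix, goroutine 1 gets
	// items 3..6 inclusive.
	for i := 0; i < 2; i++ {
		start, end := chunkBounds(i, 2, 7)
		fmt.Printf("goroutine %d: [%d, %d)\n", i, start, end)
	}
}
```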
The second is more subtle. Each of the goroutines we spawn has its own counter, but all these counters are adjacent in memory in the badIPsByGoroutine slice. This means that several counters share the same cache line, so separate goroutines running on different cores are each updating data in the same cache line, which leads to contention and performance degradation. This phenomenon is known as false sharing.
Let’s fix it with our final code:
// CountBadIPs6 returns the number of bad IPs in flows.
func CountBadIPs6(badIPs IPSet, flows []PackedFlow) int {
	numGoroutines := runtime.NumCPU()
	chunkSize := len(flows) / numGoroutines
	countCh := make(chan int, numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func(i int) {
			chunkStart := i * chunkSize
			chunkEnd := (i + 1) * chunkSize
			// The last goroutine picks up the remainder when len(flows)
			// is not exactly divisible by numGoroutines.
			if i == numGoroutines-1 {
				chunkEnd = len(flows)
			}
			count := 0
			for j := chunkStart; j < chunkEnd; j++ {
				flow := flows[j]
				if _, ok := badIPs[flow.SrcIP]; ok {
					count++
				}
				if _, ok := badIPs[flow.DstIP]; ok {
					count++
				}
			}
			countCh <- count
		}(i)
	}
	totalBadIPs := 0
	for i := 0; i < numGoroutines; i++ {
		totalBadIPs += <-countCh
	}
	close(countCh)
	return totalBadIPs
}
And run the full set of benchmarks:
$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8 1 57264455083 ns/op
BenchmarkCountBadIPs2-8 178 6720723 ns/op
BenchmarkCountBadIPs3-8 255 4690940 ns/op
BenchmarkCountBadIPs4-8 7 163487869 ns/op
BenchmarkCountBadIPs5-8 614 1876674 ns/op
BenchmarkCountBadIPs6-8 1009 1213470 ns/op
That’s more like it. We’ve got a further 1.5x speedup over our previous concurrent code, and are now almost 3.9x faster than our optimized single-core implementation.
Conclusion
Our final optimized concurrent implementation is over 25000x faster than our initial naive code. Roughly 8000x of the performance improvement came from using the right algorithms and data structures, but to get the final 3x to make really fast code, we had to understand caches.
All the source code for this post is in github.com/twpayne/gopher-advent-2022-hot-function.