Posted Sun Dec 4, 2022 •  Reading time: 5 minutes

Optimizing concurrent Go part 1: single core performance

This started out as an interview question for identifying engineers who had a good understanding of what’s needed to write high-performance code. It was never actually asked, but it explores a couple of possibly surprising effects.

Part 1 of this two-part series focuses on single-core performance. Part 2 adds concurrency to the mix. Spoiler alert: understanding cache behavior is critical.

The question

The technical interview starts like this:

Our company’s router logs every IPv4 network flow that it sees: the source and destination IP addresses, ports, and the network protocol. We’ve also got a long list of “bad” IPv4 addresses (malware, botnet C&C servers, piracy sites, etc.). How many times do we see any of these bad IPv4 addresses in our router logs?

We have some code that parses the logs to produce a slice of Flow structs in Go:

import "net"

// A Flow is an IPv4 network flow.
type Flow struct {
	SrcIP   net.IP
	SrcPort uint16
	DstIP   net.IP
	DstPort uint16
	Proto   byte
}

We have a list of bad IPs as []net.IP and we want a function called CountBadIPs:

// CountBadIPs1 returns the number of bad IPs in flows.
func CountBadIPs1(badIPs []net.IP, flows []*Flow) int

The first solution

The problem statement continues:

The performance of our system turns out to be very poor, so we’ve done some profiling and identified that most of the time is spent in the CountBadIPs1 function. Here’s the function; do you see anything that could be improved?

// isBadIP returns if ip is in badIPs.
func isBadIP(ip net.IP, badIPs []net.IP) bool {
	for _, badIP := range badIPs {
		if ip.Equal(badIP) {
			return true
		}
	}
	return false
}

// CountBadIPs1 returns the number of bad IPs in flows.
func CountBadIPs1(badIPs []net.IP, flows []*Flow) int {
	totalBadIPs := 0
	for _, flow := range flows {
		if isBadIP(flow.SrcIP, badIPs) {
			totalBadIPs++
		}
	}
	for _, flow := range flows {
		if isBadIP(flow.DstIP, badIPs) {
			totalBadIPs++
		}
	}
	return totalBadIPs
}

For reference, here is the benchmark of our CountBadIPs1 function running on one million flows and ten thousand bad IPs, where approximately 1% of the IPs are bad:

$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8                1        56966546041 ns/op
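
A benchmark like this can be sketched roughly as follows; generateBadIPs and generateFlows are hypothetical helpers standing in for the data generation in the linked repository:

import "testing"

// BenchmarkCountBadIPs1 measures CountBadIPs1 on one million flows and ten
// thousand bad IPs, where roughly 1% of the flow IPs are bad.
func BenchmarkCountBadIPs1(b *testing.B) {
	badIPs := generateBadIPs(10000)
	flows := generateFlows(1000000, badIPs, 0.01)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		CountBadIPs1(badIPs, flows)
	}
}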

That’s pretty slow: roughly 57 seconds per call on one million flows. Surely we can do better? As this is an interview question, there is some carefully chosen low-hanging fruit to pick. What problems can you spot in the code?

  1. We loop over the flows twice, once checking the source IPs and once checking the destination IPs. Surely these loops can be combined?
  2. We do a linear search through the bad IP addresses. Surely we can use a more efficient data structure, like a Go map?

The second solution

Let’s fix both of these problems:

  1. We merge loops over the same data into a single loop.
  2. We replace the linear search with a map lookup. We can’t use net.IPs as map keys because slices are not comparable, but we can pack each IPv4 address into a uint32 and use that as the key.

Let’s do it:

type IPSet map[uint32]struct{}

// netIPToUint32 converts a net.IP (which is assumed to be in its 4-byte IPv4
// representation) to a uint32.
func netIPToUint32(netIP net.IP) uint32 {
	return uint32(netIP[0])<<24 + uint32(netIP[1])<<16 + uint32(netIP[2])<<8 + uint32(netIP[3])
}

// newIPSet returns a new set of IPs.
func newIPSet(ips []net.IP) IPSet {
	ipSet := make(IPSet)
	for _, ip := range ips {
		ipSet[netIPToUint32(ip)] = struct{}{}
	}
	return ipSet
}

// CountBadIPs2 returns the number of bad IPs in flows.
func CountBadIPs2(badIPs IPSet, flows []*Flow) int {
	totalBadIPs := 0
	for _, flow := range flows {
		if _, ok := badIPs[netIPToUint32(flow.SrcIP)]; ok {
			totalBadIPs++
		}
		if _, ok := badIPs[netIPToUint32(flow.DstIP)]; ok {
			totalBadIPs++
		}
	}
	return totalBadIPs
}
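
Given the original list of bad IPs as a []net.IP, a caller builds the IPSet once and reuses it. A minimal usage sketch, where badIPList and flows are assumed to come from the existing parsing code:

badIPSet := newIPSet(badIPList)        // one-time conversion of the []net.IP list
count := CountBadIPs2(badIPSet, flows) // each check is now a single map lookup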

And let’s re-run the benchmarks:

$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8                1        56966546041 ns/op
BenchmarkCountBadIPs2-8              177           6755598 ns/op

That’s an over 8000x improvement (56966546041 / 6755598 ≈ 8400).

The third solution

Surely we’re done?

Well, no. Although the code is much, much better, it still has room for improvement. Currently, just reading each Flow’s source and destination IP addresses involves multiple hops through pointers. As the slice passed to our function is a []*Flow, we must dereference each *Flow pointer to reach the SrcIP and DstIP fields. Furthermore, as a net.IP is a []byte under the hood, each of those fields is a slice header, so we need a second and a third dereference to reach the actual IP address bytes. Every pointer dereference is a potential cache miss, and writing cache-friendly code is critical for performance on modern processors.

We can effectively eliminate the first possible cache miss by packing the Flows adjacently in memory, i.e. using []Flow instead of []*Flow. To eliminate the second and third cache misses, we can embed the SrcIP and DstIP fields as uint32s directly in the struct. This leads to:

// A PackedFlow is an IPv4 network flow.
type PackedFlow struct {
	SrcIP   uint32
	DstIP   uint32
	SrcPort uint16
	DstPort uint16
	Proto   byte
}

// CountBadIPs3 returns the number of bad IPs in flows.
func CountBadIPs3(badIPs IPSet, flows []PackedFlow) int {
	totalBadIPs := 0
	for _, flow := range flows {
		if _, ok := badIPs[flow.SrcIP]; ok {
			totalBadIPs++
		}
		if _, ok := badIPs[flow.DstIP]; ok {
			totalBadIPs++
		}
	}
	return totalBadIPs
}
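
The existing parsing code still produces a []*Flow, so something has to build the packed slice. Here is a hedged sketch of such a conversion (packFlows is not part of the original code; in practice the log parser would ideally emit []PackedFlow directly so that this one-time cost stays out of the hot path):

// packFlows converts flows into the cache-friendly packed layout in a single
// linear pass, reusing netIPToUint32 to pack each address into a uint32.
func packFlows(flows []*Flow) []PackedFlow {
	packed := make([]PackedFlow, 0, len(flows))
	for _, flow := range flows {
		packed = append(packed, PackedFlow{
			SrcIP:   netIPToUint32(flow.SrcIP),
			DstIP:   netIPToUint32(flow.DstIP),
			SrcPort: flow.SrcPort,
			DstPort: flow.DstPort,
			Proto:   flow.Proto,
		})
	}
	return packed
}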

Re-running our benchmarks:

$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8                1        56966546041 ns/op
BenchmarkCountBadIPs2-8              177           6755598 ns/op
BenchmarkCountBadIPs3-8              255           4674054 ns/op

That’s a further 1.4x speedup. Cache behavior has a significant impact on performance.

Conclusion

At this point, we’re at a reasonable local optimum for single-core performance. We can certainly do better, but not without significantly complicating the code. This concludes part 1.

In part 2, we’ll use Go’s concurrency to improve the wall clock performance, but not without a few hiccups along the way. It turns out that understanding cache behavior is very important when writing concurrent code.

All the source code for this post is in github.com/twpayne/gopher-advent-2022-hot-function.