Optimizing concurrent Go part 1: single core performance
This started out as an interview question for identifying engineers with a good understanding of what’s needed to write high-performance code. It was never actually asked, but it explores a couple of possibly surprising effects.
Part 1 of this two part series focuses on single core performance. Part 2 adds concurrency to the mix. Spoiler alert: understanding cache behavior is critical.
The question
The technical interview starts like this:
Our company’s router logs every IPv4 network flow that it sees: the source and destination IP addresses, ports, and the network protocol. We’ve also got a long list of “bad” IPv4 addresses (malware, botnet C&C servers, piracy sites, etc.). How many times do we see any of these bad IPv4 addresses in our router logs?
We have some code that parses the logs to produce a slice of `Flow` structs in Go:
```go
import "net"

// A Flow is an IPv4 network flow.
type Flow struct {
    SrcIP   net.IP
    SrcPort uint16
    DstIP   net.IP
    DstPort uint16
    Proto   byte
}
```
We have a list of bad IPs as a `[]net.IP`, and we want a function called `CountBadIPs1`:

```go
// CountBadIPs1 returns the number of bad IPs in flows.
func CountBadIPs1(badIPs []net.IP, flows []*Flow) int
```
The first solution
The problem statement continues:
The performance of our system turns out to be very poor, so we’ve done some profiling, and we’ve identified that most of the time is spent in the `CountBadIPs1` function. Here’s the function; do you see anything that could be improved?
```go
// isBadIP returns whether ip is in badIPs.
func isBadIP(ip net.IP, badIPs []net.IP) bool {
    for _, badIP := range badIPs {
        if ip.Equal(badIP) {
            return true
        }
    }
    return false
}
```
```go
// CountBadIPs1 returns the number of bad IPs in flows.
func CountBadIPs1(badIPs []net.IP, flows []*Flow) int {
    totalBadIPs := 0
    for _, flow := range flows {
        if isBadIP(flow.SrcIP, badIPs) {
            totalBadIPs++
        }
    }
    for _, flow := range flows {
        if isBadIP(flow.DstIP, badIPs) {
            totalBadIPs++
        }
    }
    return totalBadIPs
}
```
For reference, here is the benchmark of our `CountBadIPs1` function running on one million flows and ten thousand bad IPs, where approximately 1% of the IPs are bad:

```
$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8        1    56966546041 ns/op
```
That’s pretty slow. Surely we can do better? As this is an interview question, there is some carefully chosen low-hanging fruit to pick. What problems can you spot in the code?
- We loop over the router logs twice, once checking source IPs and once checking destination IPs. Surely these loops can be combined?
- We do a linear search through the bad IP addresses, so each lookup scans up to ten thousand entries. Surely we can use a more efficient data structure, like a Go `map`?
The second solution
Let’s fix both of these problems:
- We merge the two loops over the same data into a single loop.
- We replace the linear search with a map lookup. We can’t use `net.IP`s as map keys (slices are not comparable), but we can pack an IPv4 address into a `uint32` and use that as the map key.
Let’s do it:
```go
// An IPSet is a set of IPv4 addresses, keyed by their uint32 representation.
type IPSet map[uint32]struct{}

// netIPToUint32 converts a net.IP (which is assumed to be an IPv4 address in
// its 4-byte form) to a uint32.
func netIPToUint32(netIP net.IP) uint32 {
    return uint32(netIP[0])<<24 + uint32(netIP[1])<<16 + uint32(netIP[2])<<8 + uint32(netIP[3])
}

// newIPSet returns a new set of IPs.
func newIPSet(ips []net.IP) IPSet {
    ipSet := make(IPSet)
    for _, ip := range ips {
        ipSet[netIPToUint32(ip)] = struct{}{}
    }
    return ipSet
}
```
```go
// CountBadIPs2 returns the number of bad IPs in flows.
func CountBadIPs2(badIPs IPSet, flows []*Flow) int {
    totalBadIPs := 0
    for _, flow := range flows {
        if _, ok := badIPs[netIPToUint32(flow.SrcIP)]; ok {
            totalBadIPs++
        }
        if _, ok := badIPs[netIPToUint32(flow.DstIP)]; ok {
            totalBadIPs++
        }
    }
    return totalBadIPs
}
```
And let’s re-run the benchmarks:
```
$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8        1    56966546041 ns/op
BenchmarkCountBadIPs2-8      177        6755598 ns/op
```
That’s an over 8000x improvement.
The third solution
Surely we’re done?
Well, no. Although the code is much, much better, it still has room for improvement. Currently, just reading each `Flow`’s source and destination IP addresses involves multiple hops through pointers. As the slice passed to our function is a `[]*Flow`, we need to dereference each `*Flow` pointer to access the `SrcIP` and `DstIP` fields. Furthermore, as `net.IP`s are `[]byte`s under the hood, we need a second and third dereference to access the actual IP address bytes. Every pointer dereference is a potential cache miss, and writing code that is cache friendly is critical for performance on modern processors.

We can effectively eliminate the first possible cache miss by packing the `Flow`s adjacently in memory, i.e. using `[]Flow` instead of `[]*Flow`. To eliminate the second and third cache misses, we can embed the `SrcIP` and `DstIP`s as `uint32`s in the struct. This leads to:
```go
// A PackedFlow is an IPv4 network flow.
type PackedFlow struct {
    SrcIP   uint32
    DstIP   uint32
    SrcPort uint16
    DstPort uint16
    Proto   byte
}
```
```go
// CountBadIPs3 returns the number of bad IPs in flows.
func CountBadIPs3(badIPs IPSet, flows []PackedFlow) int {
    totalBadIPs := 0
    for _, flow := range flows {
        if _, ok := badIPs[flow.SrcIP]; ok {
            totalBadIPs++
        }
        if _, ok := badIPs[flow.DstIP]; ok {
            totalBadIPs++
        }
    }
    return totalBadIPs
}
```
Re-running our benchmarks:
```
$ go test -bench=. .
goos: darwin
goarch: arm64
pkg: github.com/twpayne/gopher-advent-2022-hot-function
BenchmarkCountBadIPs1-8        1    56966546041 ns/op
BenchmarkCountBadIPs2-8      177        6755598 ns/op
BenchmarkCountBadIPs3-8      255        4674054 ns/op
```
That’s a further 1.4x speedup. Cache behavior has a significant impact on performance.
Conclusion
At this point, we’re at a reasonable local optimum for single-core performance. We can certainly do better, but not without significantly complicating the code. This concludes part 1.
In part 2, we’ll use Go’s concurrency to improve the wall clock performance, but not without a few hiccups along the way. It turns out that understanding cache behavior is very important when writing concurrent code.
All the source code for this post is in github.com/twpayne/gopher-advent-2022-hot-function.