Of Donkeys, Mules & Horses

Storing and retrieving IP Prefixes efficiently

By Jasper den Hertog

Special thanks to Tom Carpay & Willem Toorop

Introduction

Storing and searching a full table of IP prefixes from the default-free zone appears to be a solved problem, since this is the bread-and-butter of BGP-speaking routers. Lots of research and engineering effort has gone into making longest-matching-prefix searches for IP addresses in routers happen at wire-speed. [1]

Our goals are a bit different, though: we don't require the fastest searches; modest performance will do in this area. But we do have additional wishes, like modest memory consumption, decent write performance, and so on.

This blog post is an account of my quest for the best fitting data structure given our wishes and constraints, from searching for and reading papers to implementing, modifying and benchmarking them.


  1. https://blog.apnic.net/2020/06/04/modern-router-architecture-and-ipv6/ ↩︎

Why do we want this?

Eventually we want to build an engine for BGP that's open source, modular and focused on analytics. Seen from afar, this looks very much like a Routing Information Base (RIB) stored in a router somewhere. The biggest differences are that we store way more data linked to each prefix and that we would like to query our data in other ways than just by IP address or prefix. We strongly believe that the Rust programming language is best suited to help us do this, and you'll find that all example implementations mentioned in this blog post are written in Rust.

So let's go over our wish list a bit. As we just saw, we need to analyze a full table of IPv4 (currently around 900,000 unique prefixes) and IPv6 (currently somewhat over 100,000 prefixes). We want to store data of arbitrary type and size accompanying these prefixes. We want this meta-data to be retrievable by IP address or IP prefix, but we also want to be able to aggregate on prefix, prefix ranges, time, and so on. We need to be able to retrieve the more/less specifics of a prefix with reasonable performance. We need to read and write to this data structure all the time, simultaneously from and to different streams. It needs to handle bursts of BGP messages from each of these streams. We need to be able to store the state at any time, preferably to different kinds of storage back-ends (in-memory data-structures, disk, messaging systems).

Finally, it needs to be able to run on commodity (server) hardware and mainstream server operating systems (most probably a Linux distribution), so we can leave our bag of tricks with TCAMs, FPGAs, AVX-512 and whatnot at the doorstep.

Of Donkeys, Mules and Horses

As mentioned in the introduction, there are plenty of research papers available that describe the thoroughbred race horses among algorithms. A horse can be made to do incredible things and can be incredibly strong, but sadly needs lots and lots of maintenance, special care, diet and training. A horse is expensive, both in purchase and in maintenance. Then there are donkeys: a donkey can be made to do only a few things successfully, but goes on forever if treated right and needs little food or maintenance. Also, donkeys are cheap. They're modest, humble animals that'll work until the day they die. Finally, we could move to using mules, cross-breeds between horses and donkeys. "Mules are reputed to be more patient, hardy and long-lived than horses and are described as less obstinate and more intelligent than donkeys"[1].

So I expect us to cross-breed a mule of our own in the end, but before we start doing so, I'd like to introduce you to someone else's fine mule.


  1. https://en.wikipedia.org/wiki/Mule#cite_note-Jackson2004-4 ↩︎

A Mule, the key-value collection

So let's start with the mule among data-structures: the Associative Array, HashMap, Dictionary, Object, or whatever it is called in your favorite programming language; basically a collection of (key, value) pairs. Under the hood they can be quite complex, but for the application programmer they are pretty easy to use. If you don't know what assumptions to make about the nature of the data and the ways you want to interrogate it, aggregate over it, transform it and so on, such a data structure is your safest bet: it is generally very fast and can easily be stored and retrieved permanently. It is somewhere halfway between a specialised data-structure (it only deals with key-value collections) and a generic, all-purpose one, since it has many ways of transforming and searching the data contained in it. A real mule, I'd say.

Let's consider a full table of IPv4 data. In order to simulate a full table I've downloaded a RIS Whois file from ripe.net[1], which holds a list of prefixes with their originating Autonomous Systems (AS) and the number of peers of the RIS collector that see this combination of prefix and AS. This resembles the number of prefixes and the topology of a full table in the default-free zone.

What we get is something like this:

[
	...,
	"100.4.146.0/24": { "origin_as": 701, ... },
 	"100.4.147.0/24": { "origin_as": 64777, ... },
 	"100.4.147.0/24": { "origin_as": 701, ... },
 	...
]

And that's a cool structure to have. It's easy to see how we can aggregate data on origin_as, for example. It's also easy to see how this could be stored as JSON (well, OK, good that you noticed, this actually is JSON, albeit awful JSON). If all you need is to store and retrieve data around isolated prefixes, then you have your answer about which data structure to use.

Matching on a specific prefix ("exact match") works really well for the data structure we've created so far. But that's as far as it goes; we can't even look up an IP address contained in a prefix, since the key is an arbitrary string to the system. One way around this is to store the first and last IP addresses of the prefix in the data-structure directly:

[
	"100.4.146.0/24": { 
		"start": 1678021120, 
		"end": 1678021375, 
		"origin_as": 701, ... 
	},
 	"100.4.147.0/24": { 
 		"start": 1678021376,
 		"end": 1678021631,
 		"origin_as": 64777, ...
 	},
 	"100.4.147.0/24": { 
 		"start": 1678021376,
 		"end": 1678021631,
 		"origin_as": 701, ...
  }
]

Now we can iterate over the data-structure while looking for the IP address we are searching for, represented as an integer (more about that later). If the address fits within the bounds of the start and end fields of an entry, it's a hit. This is as far as we can go on our mule, though. There seems to be no efficient way of keeping track of the relations between IP prefixes with this kind of data-structure. Before we go on with our quest for the perfect data-structure, let us first examine the precise relationships prefixes can have.
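A minimal sketch of this scan in Rust. The entries and field layout are simplified from the JSON examples above, and with a HashMap we can only keep one entry per prefix, so the duplicate-key problem of that JSON is ignored here:

```rust
use std::collections::HashMap;

// Dotted-quad IPv4 address to its integer representation.
fn to_u32(a: u8, b: u8, c: u8, d: u8) -> u32 {
    ((a as u32) << 24) | ((b as u32) << 16) | ((c as u32) << 8) | d as u32
}

// A simplified table: prefix string -> (start, end, origin_as).
fn example_table() -> HashMap<&'static str, (u32, u32, u32)> {
    let mut table = HashMap::new();
    table.insert("100.4.146.0/24", (to_u32(100, 4, 146, 0), to_u32(100, 4, 146, 255), 701));
    table.insert("100.4.147.0/24", (to_u32(100, 4, 147, 0), to_u32(100, 4, 147, 255), 64777));
    table
}

// Scan every entry; it's a hit if the address falls within [start, end].
fn lookup<'a>(table: &HashMap<&'a str, (u32, u32, u32)>, addr: u32) -> Vec<&'a str> {
    table
        .iter()
        .filter(|&(_, &(start, end, _))| addr >= start && addr <= end)
        .map(|(&k, _)| k)
        .collect()
}

fn main() {
    let table = example_table();
    // 100.4.147.42 falls inside 100.4.147.0/24
    println!("{:?}", lookup(&table, to_u32(100, 4, 147, 42)));
}
```

Note that this lookup visits every entry, so it scales linearly with the size of the table; that is exactly the weakness discussed above.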


  1. RIPE NCC RIS Whois downloads from https://www.ris.ripe.net/dumps/ ↩︎

IP Prefix Hierarchy

The relation of one prefix to another can be one of three types: the prefix is a less-specific of the other prefix, it is a more-specific of the other prefix, or it is neither of the two (they are disjoint). These relations cannot be efficiently modelled by HashMaps or whatever they are called. We need to get off our mule, take a step back and see if we can find ourselves a donkey. But before we can do that, I first need to tell you about the nature of prefixes.

A prefix is an IP address and a length. An IP address has many representations, but as we already saw, one of these representations is an integer number (a 32-bit integer for IPv4 and a 128-bit integer for IPv6). This integer in its turn represents a fixed series of bits. A prefix, now, can be regarded as the first n bits of the address part (the part before the slash), where n is the length part of the prefix. For example, 119.128.0.0/12 boils down to the first 12 bits of the IP address 119.128.0.0. Those bits are 0111 0111 1000 0000 0000 0000 0000 0000, the first 12 bits are 0111 0111 1000, and that's the whole prefix. Since most computers are pretty bad at storing a sequence of bits of arbitrary length, we store the address in a fixed number of bits and waste a few of them, letting the length indicate how much of the address we consider to be part of the prefix.
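As a small illustration (not part of the implementations benchmarked later), extracting the first n bits of an address stored as a fixed-size u32:

```rust
// Keep the address in a fixed-size integer and let `len` say how many
// leading bits are meaningful: the pair (bits, length) is the prefix.
fn prefix_bits(addr: u32, len: u8) -> u32 {
    // Shifting a u32 by 32 would overflow, so treat length 0 separately.
    if len == 0 { 0 } else { addr & (u32::MAX << (32 - len)) }
}

fn main() {
    let addr = (119u32 << 24) | (128u32 << 16); // 119.128.0.0 as an integer
    // The first 12 bits, 0111 0111 1000, are the whole prefix 119.128.0.0/12.
    println!("{:012b}", prefix_bits(addr, 12) >> 20);
}
```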

Two prefixes share a more/less-specific relation if all bits of the shorter prefix are the same as the corresponding bits of the other prefix, starting from the left. The prefix with the smallest length is called the less specific and the prefix with the greatest length is called the more specific. If the two prefixes already differ within these first bits, they are called disjoint prefixes. Consider our prefix from the first example as bits 0111 0111 1000. A prefix that looks like this: 0111 0111 1000 0100 111 would be called a more specific of our example, since it shares all first 12 bits, but it is longer. A prefix 0111 0111 would be a less specific of our prefix, and 1111 0111 1 would be disjoint from our prefix, since the first bit is already different.
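These three relations can be checked with a mask over the shared bits. A hypothetical sketch, with prefixes as (bits, length) pairs, bits left-aligned in a u32:

```rust
#[derive(Debug, PartialEq)]
enum Relation {
    LessSpecific, // first prefix is a less-specific of the second
    MoreSpecific, // first prefix is a more-specific of the second
    Exact,
    Disjoint,
}

// Compare two prefixes, each given as (bits, length), bits left-aligned.
fn relation(a: (u32, u8), b: (u32, u8)) -> Relation {
    let shared = a.1.min(b.1);
    let mask = if shared == 0 { 0 } else { u32::MAX << (32 - shared) };
    if a.0 & mask != b.0 & mask {
        Relation::Disjoint
    } else if a.1 < b.1 {
        Relation::LessSpecific
    } else if a.1 > b.1 {
        Relation::MoreSpecific
    } else {
        Relation::Exact
    }
}

fn main() {
    let p = (0b0111_0111_1000u32 << 20, 12); // our example prefix
    let more = (0b0111_0111_1000_0100_111u32 << 13, 19);
    println!("{:?}", relation(more, p)); // MoreSpecific
}
```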

Studying this relationship a bit more, we can see that there's a strict hierarchy of relations between prefixes. Each prefix can have only one direct parent, which has a prefix length of its own length minus one (yes, yes, except the prefix with length zero, thanks for being pedantic). Each prefix can also have no more than two children with its own length plus one (yes, yes, I know, prefixes of maximum length, etc.). If we think of each prefix as a node and connect all these nodes on one side to their parent (if any) and on the opposing side to their children (if any), you'll see that we're creating a tree! And a specific one at that: a binary tree, a tree where each node can have up to two children.

Now let's go back to our key-value collection data-structure. As we can see now, it will not help us with modelling this tree hierarchy. We can of course aggregate parts of our data structure into a temporary structure that models the more/less-specifics for a prefix on demand. Doing so would require either scanning all of the data structure (if it's unsorted) and aggregating from there, or scanning parts of it. In the latter case we need to make sure the structure is sorted at all times (a non-trivial task). If only we could have a data structure next to our beloved key-value structure that keeps the hierarchy...

A Donkey, the Binary Tree

But not so fast, let us first take a good, hard look at binary trees. Binary trees are the most unassuming of data-structures: cheap and easy to instruct. They cannot be beaten into submission; they just want to do the things they know well and are good at. Let's get off our mule and take our donkey for a ride instead.

A binary tree consists of a root node that can have zero, one or two branches; let's call them left and right. At the end of each of those branches there can be another node, which has the same properties as the root. That's it, that's the tree. It's a real donkey, isn't it? It's stupid simple, modest and takes a single thing a long way. So how can we make our prefixes collection fit on this donkey?

If we go back a bit to our Prefix-as-a-bunch of bits we can create a tree starting from a root prefix and then create two branches from that root, that go to nodes 0/1 (0 in binary) to the left and 1/1 (1 in binary) to the right. Note that the root prefix cannot be represented in binary. From the two nodes we've just created we can go on and create a new set of branches going to 0/2 (00), 1/2 (01), 2/2 (10) and 3/2 (11). And we can go on like this to map all prefixes we want to branches and nodes in the tree.

In this diagram you see an example of a list of prefixes (on the right), that are modelled into one tree based on this idea (on the left). Note that this list of prefixes is used throughout all examples in this blog post.

Example of a binary Trie

As you can see, we not only have all the prefixes in their nodes in the tree (the blue nodes), we also have all intermediary branches and nodes up to the node for each particular prefix. Some of these intermediary nodes are shared with other prefixes, some do not hold any prefix (the gray nodes). So we have to store in each node an indication of whether this node represents a prefix or not. On the other hand, we don't have to store the prefix identifier itself in the node: the path to the node constitutes the identifier for the prefix. In other words: the place of a node in the tree denotes the key of the values it stores. That's what's sometimes called a trie.

This trie has many attractive properties: algorithms to insert and search longest-matching prefixes are fairly elegant and straightforward to implement. There aren't many corner cases to handle (the default route jumps to mind as maybe the only one). The more/less-specific relations between prefixes are also closely modelled: the path down to a specific prefix is the chain of all less-specifics of that prefix, and the sub-tree below the matching prefix holds all of its more-specifics.
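To get a feel for that elegance, here is a bare-bones sketch of such a trie. This is an illustration, not the benchmarked implementation; the origin_as meta-data field is a hypothetical stand-in for arbitrary prefix data:

```rust
// A bare-bones binary trie: each branch is a pointer (a Box); a node marks
// a stored prefix by carrying its meta-data.
#[derive(Default)]
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
    origin_as: Option<u32>, // None means an intermediary (gray) node
}

#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    // Walk one bit at a time from the most significant bit, creating
    // intermediary nodes as needed; store the meta-data on the last node.
    fn insert(&mut self, addr: u32, len: u8, origin_as: u32) {
        let mut node = &mut self.root;
        for i in 0..len {
            let child = if (addr >> (31 - i)) & 1 == 0 {
                &mut node.left
            } else {
                &mut node.right
            };
            node = &mut **child.get_or_insert_with(Box::default);
        }
        node.origin_as = Some(origin_as);
    }

    // Follow the address bit by bit, remembering the last prefix seen:
    // the path down to a node passes all less-specifics of that node.
    fn longest_match(&self, addr: u32) -> Option<u32> {
        let mut node = &self.root;
        let mut best = node.origin_as;
        for i in 0..32 {
            let child = if (addr >> (31 - i)) & 1 == 0 {
                &node.left
            } else {
                &node.right
            };
            match child {
                Some(n) => {
                    node = &**n;
                    best = node.origin_as.or(best);
                }
                None => break,
            }
        }
        best
    }
}

fn main() {
    let mut trie = Trie::default();
    trie.insert(0x6404_9200, 24, 701);   // 100.4.146.0/24, AS701
    trie.insert(0x6404_9300, 24, 64777); // 100.4.147.0/24, AS64777
    println!("{:?}", trie.longest_match(0x6404_932A)); // 100.4.147.42
}
```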

As is to be expected, there are also severe drawbacks to using a trie like this. A straightforward implementation of this trie would use pointers heavily: basically each branch would be a pointer, so a prefix with length 24 would follow 24 pointers before reaching its destination. This leads to a fairly large memory consumption (each pointer on a 64-bit system requires 64 bits) and degrades performance, since following arbitrary pointers doesn't benefit from the caches close to the CPU cores, and the CPU needs to go out to main memory for each pointer. Also, current full IPv4 and IPv6 tables are really, really sparse up to lengths of 16 and 24 respectively. This means that this bunch of pointers is essentially useless, because there are no prefixes (and therefore no information whatsoever) on these nodes. But each insert and longest-match search will be required to traverse them. Lastly, I'd like to mention that persisting trees consisting of [pointer, meta-data] snippets of data to disk is really hard to do in any kind of generic way (how do you store pointers on disk?), so you'll probably end up rebuilding the tree over and over again when retrieving (parts of) trees from disk.

I've created an implementation of a binary tree that implements each node with pointers to its left and right children and a pointer to the prefix meta-data (if there is a prefix on that particular node). As said, we don't need to store the prefix itself (the integer) in the node.

Sizes for Strides of a binary Trie with a full Table

This is a diagram of the distribution of the nodes and prefixes of a full table over the depth of the tree. As you can see, the program created some 2 million nodes for 886,118 unique prefixes, which is an average of 0.45 prefixes per node. One node has a modest size of 24 bytes, so storing all the nodes in memory requires some 48Mb of RAM, and 14Mb for the prefixes. The green bars, indicating the number of prefixes at that particular level of depth in the tree, map directly to the length of the prefixes in the file. Up to /16 you can see the tree is really sparse; it drops a bit after /16, recovers at /19, peaks at /24 (more than half of all prefixes in a full table are of this length) and then almost completely drops off after that.

Building this tree by creating all the nodes takes about 350ms (excluding the loading of the file into memory) on a single-thread on a modern CPU. The insert time per prefix is about 390 nanoseconds.

Performing a longest-matching-prefix search on a full-table tree takes around 26 nanoseconds on average. As an indication of how fast this is: the wire-speed of a 10G Ethernet stream consisting of packets with an MTU of 1500 is in the order of 1000 nanoseconds per packet. This is one fast donkey!

Memory consumption isn't so good, though. It may sound silly to try to compress a memory use of less than 100Mb on modern systems ("my phone" etc.), until you realize that we need multiple copies of this tree in memory in a full-blown analytical engine (one read, one write, one per peer connection, etc.). Also, copying large amounts of memory around has a cost that probably grows worse than linearly with the amount being copied. So we want to reduce the memory consumption.

Radix Tree

So how can we use less memory? As we already saw, all the 64-bit pointers require lots of memory, as well as slowing things down unnecessarily. It would be great if we could somehow collapse the nodes that are not holding any prefix information, since then we would have fewer pointers, resulting in less memory consumption and less traversal time. This is what a radix tree tries to achieve. Let's just look at our example to see how a radix tree achieves this goal, and why we've reverted to using the word tree instead of trie (apart from aesthetics, that is).

Radix Tree Example

I've redrawn the prefixes from the previous simple trie example as a radix tree. So what's changed? A number in a small box appeared next to each prefix and two gray nodes disappeared. The idea of a radix tree is that nodes that have only one branch and do not carry a prefix can be eliminated from the tree. Doing so makes a more compact tree, where there are potentially fewer pointers to follow to get to the prefix being searched for. Hopefully it also saves on memory usage. The trade-off is that we have to keep two extra pieces of information around at each node.

The first is a number that indicates which bit this node branches on. Think about it: in our simple trie we can always count the number of pointers we've followed, where every pointer followed represents a move of one bit through the prefix we're looking for. We are now breaking this scheme, since following a pointer in our new radix tree might skip bits. So we need to indicate on the node which bit to skip to. In other words: each node has to carry a number that indicates which bit of its stored prefix is used to determine the next branch.

The second piece of information that needs to be stored in each node is the IP address part of the prefix of that particular node. In our simple trie we knew exactly what the value of each bit was while searching, since each node represented one bit of the prefix address. We have now eliminated nodes, so when we arrive at a node we cannot know what the values of the intermediate bits were, unless we store the values of those bits at the nodes we do hit. The number of these bits can vary between 0 and the number of bits of the longest prefix stored in the tree (32 for IPv4), so in reality this boils down to storing the complete prefix address in the node itself.
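Put together, a radix-tree node might look like this. This is a hypothetical sketch of the two extra pieces of information, not the benchmarked implementation:

```rust
// Compared to the simple trie node, a radix node carries two extra pieces
// of information: the bit it branches on, and the prefix bits themselves,
// since the eliminated nodes mean the path no longer spells out the prefix.
struct RadixNode {
    addr: u32,        // the complete prefix address part, left-aligned
    len: u8,          // the prefix length
    bit_pos: u8,      // the bit of the search key that decides the next branch
    has_prefix: bool, // false for purely intermediary nodes
    left: Option<Box<RadixNode>>,
    right: Option<Box<RadixNode>>,
}

// While searching, one bit of the searched-for address picks the branch.
fn branch_right(search_addr: u32, bit_pos: u8) -> bool {
    (search_addr >> (31 - bit_pos)) & 1 == 1
}

fn main() {
    // A node for 0111 0111 1000/12 that branches on bit 12:
    let node = RadixNode {
        addr: 0b0111_0111_1000u32 << 20,
        len: 12,
        bit_pos: 12,
        has_prefix: true,
        left: None,
        right: None,
    };
    println!("{}", branch_right(0b0111_0111_1000_0100_111u32 << 13, node.bit_pos));
}
```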

On the algorithm side, things are quite a bit different between our simple trie and the radix tree. I don't think it's an exaggeration to say that most of the elegance and simplicity is gone. Searching for the longest matching prefix is not much affected by the changes: you still hop from node to node, left and right, until you've reached the prefix you need. The only real difference is that you may overshoot and land on a prefix that doesn't fit your search prefix, so at each jump you have to verify that you're still on the right track. Once you've overshot, you can exit and you're done.

On the insert side things are quite a bit different. On paper it doesn't look too bad: you have the same overshoot that we saw while searching. But if you look at my implementation code, you'll see that the larger part of it is concerned with handling an overshoot. The implementation has to backtrack to the parent node, decide there whether to create an intermediary node, and then decide whether to store the prefix on the intermediary node or to create yet another node off the intermediary node that holds the prefix. Finally, the algorithm has to decide where to weld the intermediary node back onto the parent node we tracked back to. So all of a sudden we have to consider three nodes and all their possible interconnections at once, instead of just going over them one by one.

I made an implementation that doesn't use recursion and doesn't use pointers back to parent nodes (a suggestion made by Sedgewick[1]), to keep it as similar to the simple trie implementation as possible.


  1. Algorithms in C++, Parts 1-4; 3rd edition, R. Sedgewick, p. 638-639 ↩︎

Sizes of a Radix Tree with a full Table

As we can see, the total number of nodes in the radix tree has dropped to some 1.6 million (down from 2 million in the binary trie). The bars are now more evenly spread out over the levels of the tree until we reach depth 24, after which it drops off. Note that the depth level no longer corresponds to the length of the prefix: the prefixes are pushed down to fewer, deeper levels as a result of the compression. Insert performance is better than our binary trie's: we're creating fewer pointers, and creating them is costly. Read performance is worse though, which doesn't matter much to us, but is unexpected: the idea is that following pointers is a somewhat expensive operation for a CPU, but somehow this pans out differently.

But, wait, what? Our memory consumption actually went up! What happened? Our node size has gone up from 24 bytes for the simple binary trie to 32 bytes in our radix tree! Well, remember we had to store the prefix itself in the node? That accounts for the increase in node size. I used a type that can hold both an IPv6 and an IPv4 prefix, which requires 8 bytes. The net result is that the reduction in the number of nodes does not outweigh the increase in node size. We could make IPv4- and IPv6-specific implementations, and then the IPv4 tree would see a reduction in memory consumption compared to a binary trie. Even then it wouldn't be too impressive a reduction in memory footprint (approximately 45Mb).

It turns out that the level of compression possible in a radix tree depends heavily on the topology of the uncompressed tree. It basically shines in (very) sparse trees, where lots of prefixes share the same set of nodes up the tree. As IPv4 full tables grow denser over time, this kind of compression will probably lose most of its benefits [1]. On the other hand, if everybody and their aunt is going to disaggregate ("de-aggregate") their big prefixes into /24s, this compression will pay off better.


  1. BGP Design and Implementation Zhang, Bartell (2004), Cisco Press, p.36, 37 ↩︎

Other compression schemes

We are now venturing into the territory of the race horses: schemes that try really, really hard to squeeze out the last drop of read performance.

One of these horses is the level-compressed trie. A level-compressed trie tries really hard to overcome the problem of the empty slots of the multi-bit stride by using variable-sized strides, so that as few unused slots as possible are left in the tree. This, however, creates the necessity to continually rewrite parts of the tree directly after an insert, which makes the level-compressed trie rather unfit for our purpose: a write-heavy collection of prefixes.

Another scheme to compress a prefix tree is the Luleå compressed trie[1]. It pioneered using a bitmap (a fancy word for a sequence of bits) to replace swaths of nodes with one structure. This algorithm, however, has the same drawback as the level-compressed trie: inserts and updates are slow and expensive, and require rewriting parts of the tree.

See how these schemes are heavily skewed towards being good race horses?

Luckily there's one more data-structure that we can examine, and, as we will see, it lends itself to cross-breeding. It has fast reads (of course), but better memory consumption than radix trees, and reasonably fast inserts that don't require rewriting parts of the tree on update.


  1. Small forwarding tables for fast routing Lookups, M. Degermark, A. Brodnik, S. Carlsson, S. Pink, SIGCOMM '97 ↩︎

A Horse, the Tree-bitmap

The tree bitmap is a multi-bit tree with a fixed stride per level of nodes. It is very well documented in the original paper[1] and even more clearly explained in the book "Network routing, Algorithms, Protocols and Architectures"[2].

Per node it holds two bitmaps, one for the child-nodes and one for the prefixes. It also holds a child-node pointer into a data structure that holds all nodes contiguously in memory and a prefix pointer into a next-hop data structure, again holding all next-hop data contiguously in memory.

In the prefixes bitmap, aptly referred to as pfxbitarr in the paper, each bit maps to a specific sub-prefix, like we saw with the multi-bit strides. If a bit has the value 1, a prefix is present at that sub-prefix; likewise, a 0 means there's no prefix there. And here comes the trick: the pointer to the meta-data for a particular prefix can be retrieved by counting all the ones up to the bit of the prefix we're interested in. The pointer to the meta-data is then calculated by taking the prefix base-pointer and adding the number of counted ones times the meta-data size to it. Well, OK, counted ones minus one. The result of this calculation is the pointer to the meta-data for the prefix. The child-node pointer is calculated in the same vein: there's a ptrbitarr bitmap, where each bit maps to a child node; 1 means a child node is present, 0 means the buck stops here. Count the ones, do the math, and there's your pointer to the child node.
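A sketch of this trick for a single 32-bit bitmap, using indexes into a contiguous backing array instead of raw pointer arithmetic (Rust's count_ones compiles down to POPCNT where available):

```rust
// Given a bitmap and a bit position (0 = leftmost bit), return the index of
// that entry in the contiguous backing array: the number of 1-bits at and
// before the position, minus one for zero-based indexing.
fn index_of(bitmap: u32, bit_pos: u8) -> Option<usize> {
    if (bitmap >> (31 - bit_pos)) & 1 == 0 {
        return None; // nothing stored at this position
    }
    let mask = u32::MAX << (31 - bit_pos); // keeps bits 0..=bit_pos
    Some((bitmap & mask).count_ones() as usize - 1)
}

fn main() {
    // Entries present at positions 1, 4 and 9 (counted from the left):
    let pfxbitarr: u32 = 0b0100_1000_0100_0000_0000_0000_0000_0000;
    println!("{:?}", index_of(pfxbitarr, 9)); // third entry -> index 2
}
```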

There are two main reasons for this scheme to work this way. First, it assumes that modern hardware is really good at performing many, many CPU operations in the time it takes to retrieve one piece of data from main memory (RAM). Second, it assumes that bitmaps up to a certain size can be held in the caches close to the CPU and are therefore fast to retrieve. Most modern hardware has special instructions for counting the ones in a bitmap (e.g., the POPCNT instruction on the x86 64-bit architecture). So the tree bitmap scheme tries to do as little as possible through pointers, hence the calculation with pointers: it postpones following the actual pointer (which will probably go out into main memory) to the last possible moment, and does so no more than twice per node. It tries to stash as many prefixes and child nodes into as small a number of bitmaps as possible (thus avoiding pointers), while still observing the maximum size that the CPU can work on efficiently.

This is the race horse all over again: minimising pointers and keeping memory layouts as rigid as possible is great for read performance, but it has serious disadvantages for our purposes. First, each time a prefix or child node is inserted (in contrast to appended), the layout needs to be recalculated and the memory needs to be transformed. Also, we either need to allocate the memory for a whole node when creating it, or grow the allocation each time a child node or prefix is inserted or appended. In a write-heavy system, memory would be re-allocated and data would be moved around all the time. The second disadvantage is that we still have isolated data structures connected through pointers, and we can't persist these to disk and retrieve them without many (probably expensive) transformations.


  1. Tree Bitmap : Hardware/Software IP Lookups with Incremental Updates, W. Eatherton, G. Varghese, Z. Dittia, 2004 ↩︎

  2. Network Routing. Algorithms, Protocols, and Architectures; 2nd Ed. Mehdi, Ramasamy, 2018, p. 480-484 ↩︎

Pointer-free

What I would really like is to have nodes that do not contain pointers. This would allow me to store and retrieve nodes and prefixes in a logically separated data-structure. That separate data-structure could use different storage back-ends, probably at the cost of read/write performance and some extra memory consumption. But as we have seen, we don't care too much about race-horse performance.

What if, instead of using pointers, we used indexes into two global data structures, i.e. vectors or lists (or whatever your favourite programming language's name is for a bunch-of-elements-with-an-order)? We could have one for child nodes and one for prefixes. If we use a data structure that holds a collection of prefixes or child nodes, is growable, and is retrievable by index, then we can store a base index with each node and use the same "counting ones" trick to calculate the offset from the base index. From the implementation point of view this boils down to the same thing as the pointer arithmetic, only now we rely on the update/insert mechanisms of the data structure to do the memory juggling. In each node we only keep two small vectors, again one for the child nodes and one for the prefixes, that contain the indexes into the global vectors. The small index vectors inside the nodes have to be sorted whenever a child node or prefix is added to the node, but the global vectors are append-only; they will never be sorted, and they shouldn't be.
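A sketch of this layout. All names and field choices here are hypothetical, for illustration only; this is not the benchmarked implementation:

```rust
// Nodes refer to children and prefixes by index into two global,
// append-only vectors, instead of by pointer.
#[derive(Default)]
struct Node {
    ptrbitarr: u16,   // which child slots are occupied
    pfxbitarr: u32,   // which prefix slots are occupied
    ptrvec: Vec<u32>, // sorted indexes into the global `nodes` vector
    pfxvec: Vec<u32>, // sorted indexes into the global `prefixes` vector
}

#[derive(Default)]
struct TreeBitMap {
    nodes: Vec<Node>,              // global, append-only
    prefixes: Vec<(u32, u8, u32)>, // (addr, len, origin_as), append-only
}

impl TreeBitMap {
    // Appending to a Vec never invalidates earlier indexes (unlike raw
    // pointers after a re-allocation), so only the small per-node index
    // vector needs to be kept sorted.
    fn add_prefix(&mut self, node_idx: usize, addr: u32, len: u8, origin_as: u32) -> u32 {
        self.prefixes.push((addr, len, origin_as));
        let idx = (self.prefixes.len() - 1) as u32;
        let prefixes = &self.prefixes;
        let pfxvec = &mut self.nodes[node_idx].pfxvec;
        pfxvec.push(idx);
        pfxvec.sort_by_key(|&i| prefixes[i as usize]);
        idx
    }
}

fn main() {
    let mut tree = TreeBitMap::default();
    tree.nodes.push(Node::default()); // the root node
    tree.add_prefix(0, 0x6404_9300, 24, 64777); // 100.4.147.0/24
    tree.add_prefix(0, 0x6404_9200, 24, 701);   // 100.4.146.0/24
    // The global vector keeps insertion order; the node's index vector is sorted.
    println!("{:?}", tree.nodes[0].pfxvec); // [1, 0]
}
```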

I've implemented two versions of the tree bitmap: one with the child nodes and prefixes stored in vectors locally inside each node (respectively ptrvec and pfxvec), which I've named "local-vectors", and one where we use two global collections and the node only holds two collections of indexes into these global collections, dubbed "pointer-free". The latter version can be visualized like this:

Tree Bitmap diagram

This is an example of a tree bitmap with two nodes, both with a stride size of 3, necessary to model the same set of prefixes as in the other tree/trie examples. In the bitmap layout you can see how the bits of a stride of size 3 map to the, well, bitmap. Note that with each increment of the stride size, the size of the bitmaps doubles: a stride of size 4 takes a 32-bit integer for pfxbitarr and a 16-bit integer for ptrbitarr, a stride of size 5 takes a 64-bit and a 32-bit integer respectively, etc.

Furthermore, the implementation that I've created can be configured with arbitrary stride sizes ranging from 3 to 8. I've run the full table through trees with strides 8-8-8-8, 4-4-4-4-4-4-4-4, 6-6-6-6-4-4 and 3-4-4-6-7-8 respectively. Since the number of nodes is the same for both versions of the tree bitmap, I've collected them in the same diagram:

Sizes of Strides for Tree Bitmap for several Stride Layouts

As you can see, the number of nodes needed dropped dramatically in comparison with the binary trie and the radix tree. Even the variant with the biggest number of nodes, the all-4s, features a mere 172,000 nodes. The variant with the smallest number of nodes, the all-8s, has only some 30,000 nodes! Even though the nodes are really fat in comparison to those of the other trees (160 bytes, without the vectors inside them), the memory consumption has dropped significantly, ranging from 9Mb for the all-4s "local-vectors" version to 35Mb for the all-4s "pointer-free" version.

For read performance, the "pointer-free" version consistently scores better than the "local-vectors" version; the "pointer-free" all-4s even scores better than the radix tree. Reading from two global vectors by index requires following fewer pointers than jumping to the next node to find its vector. Also, smaller strides perform better than larger strides, which may surprise you. The explanation is that in my program I had to implement 256-bit and 512-bit scalars myself, since Rust stops at unsigned 128-bit integers. I therefore had to implement the routines that count trailing zeros and perform bitwise shifts myself. These routines are fairly naive on the one hand, and on the other hand they cannot take full advantage of things like the POPCNT instruction the way 8 to 64-bit integers can.
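To illustrate why such hand-rolled wide scalars cost performance, here is a naive sketch (not the actual implementation) of a 256-bit bitmap as four u64 words. Counting ones degrades gracefully, since each per-word count can still use POPCNT, but even a simple shift needs hand-stitched carries between the words:

```rust
// A 256-bit bitmap represented as four u64 words, most significant first.
fn count_ones_256(words: &[u64; 4]) -> u32 {
    // Sum of per-word POPCNTs.
    words.iter().map(|w| w.count_ones()).sum()
}

// Shift left by n bits, moving bits across word boundaries by hand.
fn shl_256(words: &[u64; 4], n: u32) -> [u64; 4] {
    let mut out = [0u64; 4];
    for i in 0..4 {
        let src = i as u32 + n / 64; // which source word lands in word i
        if src < 4 {
            out[i] = words[src as usize] << (n % 64);
            if n % 64 > 0 && src + 1 < 4 {
                out[i] |= words[src as usize + 1] >> (64 - n % 64);
            }
        }
    }
    out
}

fn main() {
    let bitmap = [0u64, 0, 0, 0b1011];
    println!("{}", count_ones_256(&bitmap)); // 3
    println!("{:?}", shl_256(&bitmap, 64)); // [0, 0, 11, 0]
}
```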

For write performance, the "local-vectors" version consistently scores better than the "pointer-free" version with the same strides, and by a large margin. Closer examination reveals that the global vectors are re-allocating memory for new nodes and prefixes, which takes a significant amount of time.

So here we see a clear trade-off between memory consumption (the all-8s has the lowest) and read/write performance (here the all-4s is best). Depending on the storage back-end, we may push the needle in a certain direction. For example, if we wanted to persist all nodes and prefixes into a key-value database (on disk or even over the network), we would care about the number of nodes and prefixes we need to store, so we would choose big stride sizes. If we need to keep up with a very bursty stream of updates, we would choose the best write performance and opt for small stride sizes. If we need to create temporary trees, e.g. for transformations, we could use a "local-vectors" tree bitmap with small stride sizes, and so on.

Cross Breeding to full Circle

Up till now we've considered the global data-structures to be list-like collections, where each element can be referred to by an index. That's easy and simple. We can, however, choose any old data-structure we want, as long as it can refer to each element in it by a unique identifier and that unique identifier can be stored inside the nodes. So if we choose, say, a (key, value) collection as the global data-structure, we come full circle! We started out with a (key, value) collection, went on to try a binary trie, a radix tree and a tree bitmap. Finally, we tried a tree bitmap that features our starting point, the (key, value) collection. We can now enjoy all the advantages of that collection, as we saw earlier on, while preserving all the information of the hierarchy of the prefix tree. There is the added benefit that an exact-match search can be done directly on the (key, value) data-structure instead of going through the tree from the root, and data aggregations can be done directly on (a copy of) the (key, value) data-structure.

So, this is what we end up with for now: our own cross-breed of donkey and horse, one that's actually closer to a horse than I expected. It turned out to be relatively easy to tune the tree-bitmap data-structure and its construction algorithm to suit our needs.

Up till now we have only considered our data-structures with single-threaded, consecutive reads and writes. In the next installment we will take a look at how to take our data-structures to the next level and insert into, update and read from them concurrently. Maybe the pendulum will swing back somewhat towards the donkey, I don't know, we'll see.

Benchmarks and Code

Benchmark Diagram comparing binary Trie, radix Tree and Tree Bitmap

Tests performed

  1. Load a full table of 893,943 prefixes in randomized order.
  2. Perform 2,080,800 consecutive longest-matching-prefix searches.

Each test was run five times; the average of the worst run was taken.

All code can be found in this GitHub repository.