Why Routinator Doesn’t Fall Back to Rsync
When creating software, we carefully weigh each design decision: security, resiliency, usability and many more factors play a role in the end result. This article explores the reasoning behind a behaviour that isn't specified in an RFC but which has a significant impact on operators deploying RPKI.
By Martin Hoffmann
Update 12 November 2020: We still believe that falling back to rsync may be problematic as outlined in this post. However, after consulting with the community we have decided to implement fallback to rsync in the next regular release of Routinator, while continuing work on deprecating rsync communication in the IETF.
RPKI has been designed as a distributed collection of individual objects – essentially files. When producing a data set, such as the Validated ROA Payloads (or VRPs) for route origin validation, a relying party needs to collect all these files before validating their content. The files can be stored in multiple physical locations: resource holders can maintain their own store for their own files or they can choose to use a store provided by someone else. These stores are called repositories. Within the repository, all the files for each resource holder are collected in a directory.
In order to gather all these files, a relying party needs to synchronize a number of directory trees from various locations on the Internet. Given that the overall amount of data in the RPKI is currently more than half a gigabyte, re-fetching it all every ten minutes isn’t a good idea. There is, however, a piece of software that is great for keeping a directory tree synchronized over the network: rsync. And that is what RPKI was originally designed to use.
Unfortunately, rsync has a number of drawbacks.
For one, there is no server authentication, making it relatively easy to impersonate a server. While this isn’t threatening the integrity of the RPKI – all data is cryptographically signed making it really difficult to forge data – it is possible to withhold information or replay old data.
Secondly, rsync was originally designed as a handy tool to help users synchronize their data. It wasn’t intended to scale to the 150,000 clients requesting updates every ten minutes that we are likely to see soon in RPKI – a number that translates to a quite impressive 250 requests per second. This is reflected in how rsync works. To oversimplify: the client gives the server a list of the files it has, the server compares that to what it has, and then produces a lot of little deltas to send back to the client. In other words: the server has to do a considerable amount of work comparing a lot of files; in the case of the RIPE NCC’s repository, 60,000 of them. Worse, the existing rsync software hasn’t been implemented with high performance in mind either.
This makes operating an RPKI repository that scales to data being fetched regularly by the entire routing world really quite expensive. And, with only a very small number of applications running rsync at scale, there is little operational experience and little opportunity to gather experience other than dealing with scaling issues as they arise.
When it comes to shifting large amounts of data, the modern Internet has but one answer: HTTP. An entire industry has sprung up around distributing data this way. So instead of dealing with an operational unknown, why not just use HTTP and gain the option to outsource everything if it becomes too troublesome? Enter RRDP a.k.a. the RPKI Repository Delta Protocol. It bundles up changes to a repository into a series of deltas and offers them for download via HTTPS. In addition, a full snapshot of the current state of the repository is available for when a relying party is bootstrapped or gets confused.
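The choice between deltas and the full snapshot can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not Routinator's implementation: the names Notification, LocalState and plan_update are hypothetical, and the session and serial fields follow the general idea of the RRDP notification file rather than its exact format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Notification:
    """Hypothetical summary of what a repository currently offers."""
    session_id: str
    serial: int
    delta_serials: frozenset  # serial numbers of the deltas on offer


@dataclass(frozen=True)
class LocalState:
    """Hypothetical record of what the relying party already has."""
    session_id: str
    serial: int


def plan_update(local: Optional[LocalState], note: Notification) -> list:
    """Return the downloads needed to catch up with the repository."""
    # Bootstrapping, or the repository started a new session: full snapshot.
    if local is None or local.session_id != note.session_id:
        return ["snapshot"]
    # Already up to date: nothing to fetch.
    if local.serial == note.serial:
        return []
    # Use deltas only if every step from our serial to theirs is offered.
    needed = range(local.serial + 1, note.serial + 1)
    if local.serial < note.serial and all(s in note.delta_serials for s in needed):
        return [f"delta {s}" for s in needed]
    # Gap in the deltas, or our state is "confused": start over.
    return ["snapshot"]


note = Notification(session_id="a", serial=5, delta_serials=frozenset({4, 5}))
print(plan_update(LocalState("a", 3), note))  # two deltas are enough
print(plan_update(LocalState("a", 1), note))  # too far behind, snapshot
```

The point of the design is that a client that polls regularly only ever downloads a few small files, all of which are static and can be served by any HTTP infrastructure.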
By now, most RPKI software, both for publishing and relying parties, supports RRDP. Because relying party software will prefer RRDP over rsync whenever it is available, an operator of a repository will find that almost all traffic arrives via RRDP – and will provision the two services accordingly: ample capacity for RRDP and only a rump service for rsync that can deal with the few clients that insist on rsync.
Now imagine what happens when the operator’s RRDP service becomes unavailable for some reason. If all the clients that normally use RRDP decide to try rsync instead, the rsync service runs the risk of getting overwhelmed. Suddenly, 250 requests per second arrive at a service that normally has to deal with maybe 3.
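To put the numbers quoted in this article side by side, here is a small back-of-the-envelope calculation in Python. The figures are the ones from the text; the even spread of clients over the refresh interval and the function names are assumptions made for illustration.

```python
# Figures quoted in the article.
CLIENTS = 150_000                   # relying parties polling the repository
REFRESH_INTERVAL_SECONDS = 10 * 60  # each one refreshes every ten minutes
NORMAL_RSYNC_RATE = 3               # "maybe 3" rsync requests per second


def requests_per_second(clients: int, interval_seconds: int) -> float:
    """Average rate, assuming clients are spread evenly over the interval."""
    return clients / interval_seconds


def overload_factor(fallback_rate: float, normal_rate: float) -> float:
    """How many times the usual load arrives if everyone falls back."""
    return fallback_rate / normal_rate


rate = requests_per_second(CLIENTS, REFRESH_INTERVAL_SECONDS)
factor = overload_factor(rate, NORMAL_RSYNC_RATE)
print(f"{rate:.0f} requests per second, about {factor:.0f} times the usual rsync load")
```

With these inputs the rsync service would face roughly 83 times its normal load, which is the kind of jump that a rump service is not provisioned for.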
This is the scenario we had in mind when we implemented RRDP in Routinator and had to decide how to deal with RRDP failure. While the standards documents mandate that a repository has to be offered via rsync and that a repository announcing availability via RRDP must indeed offer it, there are no RFC-documented rules on how a relying party should deal with either of the mechanisms being temporarily unavailable.
From the perspective of a single Routinator instance, trying to fetch the data with rsync if RRDP fails is a good idea: the software should try everything to get the data it is supposed to collect. But as part of a greater RPKI community, delaying data gathering and giving the repository operator a chance to fix an issue instead of just flooding the alternative under-dimensioned path surely is a better strategy. Yes, you might be one of the lucky ones that actually gets through and receives some data, but at the price of potentially making someone else’s life rather more difficult.
This is in particular true since delaying data gathering doesn’t mean the repository’s data is lost. Rather, cached data from previous validation rounds will be used for as long as it is valid, buying the operator of the repository some valuable time to troubleshoot and fix the service without too much panic.
But what if the RRDP service isn’t just broken but deliberately suppressed by an assailant? Wouldn’t it be better to deal with such a situation by falling back to an alternative source of data? One argument against it is that, surely, an attacker would simply suppress both sources if that were the case. Because of the performance issues of rsync, taking down a server is much easier for rsync than for HTTPS.
In addition, remember that rsync doesn’t have any server authentication. Meddling with the payload data of an HTTPS connection is significantly harder – not impossible, just a lot harder – than doing so with an unencrypted, unauthenticated protocol such as rsync. So, if an attacker wanted to do that, suppressing the RRDP service that uses HTTPS is certainly a good step one. And, again, while the cryptographic signatures on all objects make it really difficult to tamper with the data, replaying old data is possible this way.
For these reasons, we decided that Routinator will not fall back to rsync when it cannot access the RRDP server of a repository, provided that access to the repository via RRDP has succeeded at least once before. That condition is a helpful exception for an operator setting up their RRDP service for the first time: until RRDP has worked once, rsync is still used.
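The decision can be summarised as a small piece of Python. This is an illustrative sketch of the policy as described in this article, not Routinator's actual code; the names Action, RepositoryState and choose_transport are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    USE_RRDP = "use fresh data fetched via RRDP"
    USE_RSYNC = "fetch via rsync"
    USE_CACHE = "keep using cached data for as long as it is valid"


@dataclass
class RepositoryState:
    """Hypothetical per-repository bookkeeping kept by the relying party."""
    offers_rrdp: bool
    rrdp_succeeded_before: bool = False


def choose_transport(repo: RepositoryState, rrdp_ok_now: bool) -> Action:
    # Repositories that never announced RRDP are fetched via rsync as usual.
    if not repo.offers_rrdp:
        return Action.USE_RSYNC
    if rrdp_ok_now:
        repo.rrdp_succeeded_before = True
        return Action.USE_RRDP
    # RRDP worked in the past and is failing now: do not flood rsync,
    # rely on data from previous validation rounds instead.
    if repo.rrdp_succeeded_before:
        return Action.USE_CACHE
    # RRDP has never worked (e.g. an operator's first setup): rsync is fine.
    return Action.USE_RSYNC


repo = RepositoryState(offers_rrdp=True)
print(choose_transport(repo, rrdp_ok_now=False))  # never worked yet
print(choose_transport(repo, rrdp_ok_now=True))   # first success
print(choose_transport(repo, rrdp_ok_now=False))  # outage after success
```

The only state needed is a single flag per repository, which keeps the policy easy to reason about for operators.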
Not everyone agrees with that decision. In their paper “On Measuring RPKI Relying Parties,” John Kristoff et al. have argued that the strategy implemented by Routinator may lead “to erroneous invalidation of IP prefixes and likely widespread loss of network reachability.” Their argument is based on the observation that the distributed nature of RPKI allows the ROAs that are relevant for determining the RPKI status of a given route announcement to be published by different entities through different repositories. If one of those repositories becomes unavailable, the remaining ROAs found in other, still reachable repositories may invalidate an otherwise legitimate route.
This breaks expectations since, normally, RPKI is very forgiving: if the ROAs for a certain route disappear, the route becomes ‘RPKI unknown’ and is accepted. If a repository becomes unavailable, whether by fault or aggression, its ROAs will expire and disappear, not affecting routes in any way.
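The difference between the two outcomes can be made concrete with a simplified route origin validation in Python. The sketch follows the usual rule that a route is valid if a covering VRP matches its origin and length, invalid if it is covered but nothing matches, and unknown if nothing covers it. The prefixes and AS numbers come from documentation ranges, and the repository labels "A" and "B" are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Vrp:
    prefix: str
    max_length: int
    asn: int
    repository: str  # hypothetical label for where the ROA is published


def validate(route_prefix: str, origin_asn: int, vrps) -> str:
    """Simplified origin validation: 'valid', 'invalid' or 'unknown'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # Skip VRPs of the other address family or that do not cover the route.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    return "invalid" if covered else "unknown"


vrps = [
    Vrp("198.51.100.0/24", 24, 64500, repository="A"),  # the route's own ROA
    Vrp("198.51.100.0/23", 23, 64501, repository="B"),  # less specific ROA
]
route = ("198.51.100.0/24", 64500)

print(validate(*route, vrps))                                       # both present
print(validate(*route, [v for v in vrps if v.repository != "A"]))   # A expired
print(validate(*route, []))                                         # all expired
```

With both repositories present the route is valid. Once the data from repository A has expired, only the less specific VRP from repository B remains and the route turns invalid, which is the case the paper warns about. If no covering VRP is left at all, the route is merely unknown.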
The overwhelming majority of routes, however, will not suffer from this issue as they publish all their ROAs under a single repository. In a quick analysis, we found well under a hundred prefixes in the current dataset that had less specific ROAs published in a different repository – out of close to 180,000 unique prefixes.
Even if this number is likely to increase as more networks decide to operate their own RPKI infrastructure with more complex setups, failure of their RRDP service doesn’t spell instant doom as, again, previously gathered data is still used until it expires. Regular HTTPS monitoring of the service or outsourcing it outright to a CDN easily makes it the more reliable of the two transport mechanisms.
That said, it is certainly true that publishing ROAs contributing to the RPKI validity of a route announcement under multiple CAs – regardless of whether they are published in the same repository or not – carries a risk. As each CA is independently evaluated, it is entirely possible that a subset of the ROAs is rejected and the route marked as RPKI invalid. Worse, in such a scenario, the reason for such a rejection may very well be outside of the control of the holder of the prefix, making fixes significantly harder.
A network deploying a strategy of split authority over ROAs should be aware of this risk and have coping mechanisms in place. And as a community, we can certainly do better in describing these risks and in providing guidance on safely deploying RPKI to make Internet routing more secure.
Of course, we are very interested in your feedback, both from an operational and security standpoint.