These are unedited transcripts and may contain errors.
Afternoon Plenary Session at 4 p.m., May 3, 2010:
CHAIR: Hello. Welcome again, we are about to start the second session, so if you could take your seats we could go right ahead since the programme for the second slot is quite packed.
Let's get started. First speaker for the second slot is Geoff Huston giving his report on BGP in 2009.
GEOFF HUSTON: Anyone here from Iceland? Ashes, volcano. Funny that...
I was going to thank them, because I gave this talk at ARIN and I was kind of scared coming here two weeks later that most of you would have seen this talk already, but you haven't, have you? It's all new.
So, thank you, the folk from Iceland. Actually I don't think their airport is closed, isn't it? Ash or something. Serves them right.
This is a talk about policy and routing and the intersection of the two, because when we get down to it, and you think about why we have these meetings, other than the beer, you know, what's in it? What is address policy actually all about? Because one part of it is the distribution of numbers, but why do we distribute the numbers? What are they used for? A huge amount of what guides our policy process is actually the routing system. And a fair deal of the discussions we continuously have about address policy is the impact on routing. So, it makes sense to look at this same topic from the other side. What's routing been doing? How successful have we been, how successful have you been, in creating policies that have beneficial effects on routing or at least didn't stuff it up completely? So let's look at the stuff-up level and see how good or bad you have been in the last year. So, firstly, this is an important topic. I love this quote. This is great. This comes from October 2006. Just four years ago. Now, never let the IAB be accused of having tunnel thinking. What else was around in 2006? Security, v6, a whole raft of burning issues, but oh, no, the most important problem facing the Internet of the day was routing, and it had to be solved.
So, in that vein, let's see exactly how big that problem is and whether we have solved it or not. So, this is a talk a little bit about measuring BGP, and there are actually a lot of ways of doing this and each of them sees different parts of this large system. So, the first way is to actually assemble a whole bunch of peering sessions and pull in all of the data all at once. The RIPE NCC's RIS service is a classic example of that and it's a truly magnificent service. And equally, Dave Meyer's Route Views service, which has been going for over a decade now, is a brilliant and massive archive of the most astonishing information, huge amount of data, great work.
The other way of doing this is, rather than looking at the entire system, you carefully nudge it. You advertise a route, and you look around the rest of the network to say, what happened to it? And this kind of work covers beacons, AS set manipulation, bogon detection and triangulation. You would have heard Randy Bush talk on all three of these topics and explain what we can do in this area. Fascinating work to try and discover how the system reacts.
But I am not smart enough for that. I am actually really dumb. So, all I do is take a single BGP perspective and just do the same thing again and again and again and again and again and again and again and again. Which I have been doing for a long time.
This current data set, that's politically incorrect by the way. I am actually AS13107 something or other, because we adopted some new notation, didn't we?
This is a set of data that's been collected every day continuously since the middle of 2007. I don't advertise anything; it's entirely passive. It's a Quagga platform, runs v4 and v6 and archives everything it hears. I also sit behind a really simple upstream. So I don't see any duplicate announcements or iBGP/eBGP interaction. So I simply see the external world through as clean a view as I can. That's all it is.
So let's have a look at what happened in 2009 in the world of inter-domain routing. You know the up-and-to-the-right curves, up and to the right. This one started out at 285,000 and at the end of the year we are at 313,000 routes in the default-free zone. And it was kind of up and down through the year but on the whole we kind of got there. Variations, but up and to the right.
Here is the routed addresses, up and to the right. Those are /8s. Those slight squiggles, who the hell is flapping a /8? Come on, own up. It was you, round about July you were flapping a /8. Okay. No one at ARIN owned up either. I know who you are.
Up at the right, that's me. Okay. I'll own up to the last few but the rest wasn't me.
And the last one is a number of ASs up and to the right. This one is amazing. It's so flat. Every day we added 12.3 ASs to the routing system predominantly in Europe. There is this team of people, and there are only twelve of them and they only ever add one AS a day. One of them has to come to your site and add it and they never take a holiday. I find this bizarre, that the rest of the system, you know, you take a break for summer, you take, some time off whatever, even weekends, but every day, 12 ASs get added to the routing system, every day of the week, every day of the year, I find this weird, don't you?
I am sorry, numbers just amuse me. So there we go. What was 2009 looking like? It was a plus 10% year. It wasn't a plus 50%, it wasn't a plus 100%; most of the metrics you look at, plus 10%. More addresses, 10% for ASs.
Less than the year before. The year before, similar stats, 12 to 15 percent; growth is slowing. Now, there are two reasons. One is this is just business, and there seemed to be a kerfuffle on Wall Street, and you were doing collateralised debt obligations and life got [unclear], etc. Those 12 ASs a day, though, are interesting. I don't think this was finance. I think this was something else. I think, actually, that what's going on is that the diversity of the Internet is slowing down. That the entrants into the routing system, which used to be a large collection of big and little, an astonishing diversity, are slowing, and that now, on the supply side, you are seeing the larger factories doing massive deployment in mobile devices, and the pressure on the routing space is actually getting lighter. There is actually less growth as a result, as the big folk actually get bigger and take an increasingly market-dominant position. But that's heading off into another place. Let's not go there. Let's go to a happy place.
V6. Here is a happy place. This is an up-and-to-the-right curve, but for those of you up the front, you will see that it's a curvy curve, not a straight-line curve. So the IPv6 network was growing faster than linear. That's a really good sign. You know, the IPv4 Internet grew from 285,000 entries to 314,000, and IPv6 grew from 1,600 entries to 2,800. Well done. Well done. Big applause. Three a day, doing well.
I had to diddle this because there are a lot of addresses in v6 and some of them are big and some are small, but when I first plotted this, all I saw was a huge spike in December, because unbeknownst to almost everybody, someone back in November of 2008 decided to advertise in v6 the prefix D /3. Nobody noticed. We were advertising one-eighth of the entire v6 address space and it wasn't until January that someone said, "what's going on?" and turned it off. We know who you are. When I removed that, you did actually see some jumps, but the only ones are the two large ones, I think it was a /20 and a /19 but I need to check, that appeared in the routing system in terms of address span. So the v6 address space grew, but in large and small chunks.
Here is the AS count. The number of players is growing quite a lot. Interestingly, they are almost all transit. The folk who provide services to other folk get the picture. More than 25% of the ASs that provide transit do something in the v6 routing table. Less than 2% of the ASs on the edge do. So, the middle of the Internet is getting the message, the edge...
So the numbers are amazing compared to 10 percent in v4: you know, somewhere between a minimum of 30 percent and a maximum of up to 70 percent in more specifics. Not only are we doing well in routing in v6 but we are doing even better in deaggregation. Well done, everyone.
So you know, there we were, it's higher than v4, at this rate, at this rate the v6 network in routing terms will be the size of v4 by about 2025. Just in time as they say. It's not very good in terms of we seem to have a huge amount of work to do and about two years to do it. So, while a number of folk are well and truly doing the right thing in terms of visible movement in v6, the overall trend, not there.
However, in grand terms, where is all this heading? Well, we can do some maths and extract out those numbers and push them forward. You smooth it off, take a first order differential, generate a least-squares best fit, and I use a quadratic data model. You put a line across it. That's actually an interesting one, that's the first order differential of the routing system since [unclear]. You notice that last year was a lot smaller than the previous growth. So, the number of entries going into the routing system certainly fell quite significantly across 2009. But I can push this forward, draw a nice smooth curve, project it out and you get the following sets of numbers.
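The projection method described here can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual tooling: it takes the day-over-day differences of a table-size series, fits a least-squares straight line to those differences (which amounts to a quadratic model of the table size itself), and sums the fitted growth forward. The smoothing step is left out, the function names are hypothetical, and the sample series is synthetic.

```python
def first_difference(series):
    """Return the step-to-step changes: series[i + 1] - series[i]."""
    return [b - a for a, b in zip(series, series[1:])]


def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(series, steps):
    """Extend the series by `steps` points using the fitted growth line."""
    diffs = first_difference(series)
    slope, intercept = linear_fit(diffs)
    value = series[-1]
    projected = []
    # Difference number i is the step from point i to point i + 1.
    for i in range(len(diffs), len(diffs) + steps):
        value += slope * i + intercept
        projected.append(value)
    return projected


# Synthetic example only: a perfectly quadratic series 0, 1, 4, 9, 16.
print(project([n * n for n in range(5)], 2))  # [25.0, 36.0]
```

Pushing such a fit far beyond the observed data carries the same caveat the speaker gives for his own final numbers.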
Start of this year: 313,000 entries. Start of next year, 350,000. They are not frightening numbers. Year after, 391,000. After that you are going to run out of v4, aren't you? So I have no idea about those final two numbers. They are completely a shot in the dark. This is just taking a statistical model and pushing it well past exhaustion and any form of credibility. So I can sort of say with some confidence that as long as we all do tomorrow what we were doing yesterday, you'll end up with under half a million entries in the routing system, and then you have no clue.
Let's have a look at v6. Do the same extrapolation. We are putting about three entries a day into the routing system, so again one can put a quadratic against that and push forward. The numbers are not frightening. They are really very, very small. Something is going to have to change in this model of deployment. The amount of capital and resources going into v6 is, we feel, inadequate for the size and the pace of the task that's confronting this industry. We are not investing enough quickly enough. Clear. Clear. So you add the two together and you get relatively low numbers for routing table growth.
As long as you don't do anything stupid after v4 exhaustion, yeah, right... and as long as you continue to treat v6 in a half-hearted fashion, those numbers aren't cause for concern. Both of those assumptions should be wrong. Both of those assumptions should be dramatically wrong. So those projections of FIB size I gave you, I hope fervently, are way too low. Just how low I can't tell.
What kind of growth rate stresses us out? How fast does it need to grow to be a problem? Well, part of it is up to you. How long do you keep your routers before you flip them over? How long do your current FIBs and your current TCAM sizes last? What's the next boundary point for you, how much will the next router cost and how long do you want to keep it in service for? I don't know. That's something between you and your vendor. But you are dealing with a system that continuously grows. From the supply side, this is not quite Moore's law; doubling every 18 months does not happen in this kind of memory, these kinds of devices. So far what we have seen is that we can keep the unit cost of routing constant as long as routing doesn't grow by more than 20% per year. So far you are doing okay. But as to the future, hard to say.
So what's a reasonable margin here? I don't know. I have no idea, after two years, what this industry is going to do. I would fervently hope that routing gets stressed. I would fervently hope to see 100,000 entries in the v6 routing table. I would actually expect to see, after v4 exhaustion takes place, that the consequent secondary distribution will cause a dramatic rise in the number of v4 routes as this industry panics. The two combined make very large numbers, but I have no maths to do that prediction.
As they say, it's not the size that counts; it's what you do with it. So what have we been doing with routing? Let's have a look at these update trends.
This is the plot of the number of updates per day, and most of the days have been boring apart from one dramatic day when I had 16,000,000 updates in a day and my poor system melted. That's a local event. Quagga was giving me a hard time. What if I got rid of all of my local BGP resets? This is the year of updates. It's not up and to the right. Whatever else it is, it's noisy and between 50,000 and 150,000 updates, flat. One of the few graphs of the Internet that's flat. Why is the number of updates per day relatively constant?
I can go back to 2005, same data, same flat. Hang on, at the start of 2005 there were only 150,000 entries in the routing table, half. So we have sort of doubled the number of cars on the road. We haven't made the road any bigger, haven't changed the technology, but I can still drive to work in the same time. This is weird. This is not physics as we know it. It's not the world as we are used to, that somehow we have packed a dramatically larger number of players into the routing system, yet its behaviour is almost constant. I am interested in that. It's completely anti-intuitive. And you can have a look at the projections. They are flat. This is weird.
So what about withdrawals? I announce things, obviously I don't like them after a while. What about withdrawals? Flat. Go back a few years, flat. So weirdly, the growth rates are far smaller than the growth rates of the table. Why is that? So, I start to look at some other ways of trying to understand this, and the first thing I stress is I am not seeing duplicates. There is another presentation later this week that actually looks at the occurrence of duplicate announcements, the interaction of iBGP and eBGP. This is not that. I don't have a large iBGP near me. So this is the number of prefixes that are updated every day. And you should see two curves. The top one is where someone very close to me did a full table reset. But the bottom one isn't. So on those days when I didn't get a full table reset, I saw approximately 20,000 prefixes change a day. Every day for the last five years. In fact this one is since 2007, but it goes back further. So, in other words, in routing, there is a bad boy room. It's just over there. And it only has 20,000 seats. And if someone wants to be bad they have to wait for someone to leave the room before they can go into the room and be bad. Because it's a different bad boy room almost every day, right. But it's this constant size. I find that truly weird. It's the Kyoto protocol all over again. It's sort of self-constraint in the system. The number of unstable prefixes on any day is about 20,000. That's it. And that's not related to the size of the network, right.
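As a rough illustration of the "prefixes updated per day" count, the sketch below tallies distinct prefixes per day from a list of update records. The (day, prefix) record shape, the function name and the sample values are all hypothetical; a real BGP update archive would need parsing first.

```python
from collections import defaultdict


def unstable_prefixes_per_day(updates):
    """Map each day to the number of distinct prefixes that saw an update.

    `updates` is an iterable of (day, prefix) pairs. This shape is an
    assumption standing in for whatever a BGP update log provides.
    """
    seen = defaultdict(set)
    for day, prefix in updates:
        seen[day].add(prefix)  # repeated updates for a prefix count once
    return {day: len(prefixes) for day, prefixes in seen.items()}


# Made-up sample using documentation prefixes.
sample = [
    ("day-1", "192.0.2.0/24"),
    ("day-1", "192.0.2.0/24"),
    ("day-1", "198.51.100.0/24"),
    ("day-2", "203.0.113.0/24"),
]
print(unstable_prefixes_per_day(sample))  # {'day-1': 2, 'day-2': 1}
```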
So what's going on? I am still just fascinated by BGP. So, part of this is that BGP is an enormously chatty protocol and as it tries to find the right answer, it explores all the wrong ones. So, this is some work that's actually derivative of Randy's work, and this is a RIPE routing beacon sitting in Moscow, and every hour it advertises and every other hour it withdraws, so here is a withdrawal. So, there was an original path that I saw that was 4 ASs long. The first thing I see is not the withdrawal. I see an announcement of a longer path. 30 seconds later (big clue, MRAI timer) I don't see a withdrawal. I see an even longer path. A minute later, two by 30, I see an even longer path. 30 seconds later, an even longer path. And only then do I see the withdrawal. BGP takes time for the truth to pervade, right? 150 seconds it took and five events.
So, when I look at all of these bad people in the bad room, there are two kinds of people. There are these single flashes, right, they just do it once and that's the end of it. Let's forget about them, because they are boring. I am interested in the folk who persistently create noise. So, updates that are part of a sequence of updates tightly coupled in time.
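One way to picture "a sequence of updates tightly coupled in time" is to split the update timestamps for a single prefix wherever the gap between consecutive updates exceeds some threshold. The sketch below does that. The 130-second threshold is an assumed value, not one given in the talk, and the timestamps are one illustrative reading of the beacon withdrawal described above (five events spread over 150 seconds).

```python
def convergence_sequences(timestamps, max_gap=130):
    """Split one prefix's update times (in seconds) into sequences.

    Consecutive updates at most `max_gap` seconds apart belong to the
    same sequence. The default threshold is an assumption.
    Returns a list of (update_count, duration_seconds) tuples.
    """
    sequences = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] > max_gap:
            # Gap too long: close the running sequence and start a new one.
            sequences.append((len(current), current[-1] - current[0]))
            current = []
        current.append(t)
    if current:
        sequences.append((len(current), current[-1] - current[0]))
    return sequences


# Four ever-longer paths, then the withdrawal, then an unrelated single
# update much later (a "single flash").
events = [0, 30, 90, 120, 150, 1000]
print(convergence_sequences(events))  # [(5, 150), (1, 0)]
```

Single flashes show up as sequences with one update and zero duration, so they are easy to filter out before counting sequences per day.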
How many of these sequences do I see a day? Flat line again, not up and to the right. This is bizarre. 28,000 convergence sequences every single day. Been stable for years.
So, I am starting to wonder, what's going on? Why, underneath all of these up-and-to-the-right systems, are there these constants in routing? So the first thing I am trying to look at is, is the time to converge interesting or important? People always say, one of those common myths, that the more things we put into BGP, the longer it will take to converge. And if we put in a million entries, it will never converge; it will just constantly chatter to itself. We have jumped from 150,000 entries up to 313,000 entries, yet the average time for a sequence of updates to stop, to say that's it, no more, you have converged, is 72 seconds. And has been for a very long time. That's weird.
So, as far as I can see, that constant is a constant of the Internet. As the Internet expands, for some reason, that distance vector algorithm keeps on taking the same amount of time to converge. There is something deeply magic going on here and if anyone says the speed of light is increasing, you are wrong.
So the next thing to look at is, maybe it's not the time, it's actually the number of updates. So, this is, since 2007, the number of updates on average every day I see in a sequence. The number of updates to reach convergence. Same flat line, isn't it? So, again, what I am seeing is 2.7 updates for each event. As far as I can see, the growth of the network isn't making the distance vector algorithm explore more states. That means that as the network grows bigger, you don't populate new areas on the edges of the Internet town. You fill in all the bits in the middle. The diameter of the network, the number of hops, has actually remained constant and it's the density that seems to be changing. So, in other words, what is going on here is that the convergence behaviour is related to something other than the size. The convergence behaviour actually seems to be related to the density of connection.
So let's look at this a little bit more and see if there is a correlation we can find in the numbers that supports that theory. So, again, let's look at the time dimension. As I said before, the average amount of time to converge is 72 seconds. But what's the distribution of each convergence event? So if I plot the number of events by the time to reach convergence, and do it on a log scale so that I completely confuse you because nobody understands logs, the first thing you see is a really neat wavy pattern. The second thing you see is that the wavy pattern has peaks at every 28.5 seconds, which just happens to be the average of the MRAI timer. So, the first observation is there are an awful lot of Ciscos out there. The second observation is that the decay of the peaks is linear. If anyone can explain that to me, they can be awarded the Geoff Huston memorial prize for maths, because I wouldn't have a clue why that decay is linear, but it is. That's weird.
But convergence distribution doesn't tell me much so instead what I want to look at is the number of updates to reach convergence as a distribution. So, an awful lot of things converged in two updates, an awful lot. Fewer in three. Fewer in four, and so on. The bit on the left. So that's the distribution of the number of updates to reach convergence. And coincidentally, when I do that as a percentage, and also plot the distribution of the lengths of AS paths in the routing table, I find an amazing correlation. Now, either this is an accident of maths and we can all go well that's amazing but completely coincidental. But the fact that I keep on seeing it month after month says to me there is something else going on.
That there is a correlation there between the AS path length distribution and the number of updates required for BGP to converge. And for 98.66 percent of all updates, it correlates with the path length. There are a number of folk in the bad boy room who never stop. They just keep on updating and they are beyond all hope. 1.3 percent of folk are beyond hope. But the rest of you aren't actually doing anything wrong. The rest of you are actually just sitting behind the topology of the network, and what's keeping the network so amazingly constrained is actually the average AS path length, as far as I can tell. Since 1998, when the network was, oh, that big, until now, the average AS path length across the network (and the original data comes from Route Views, collected with a whole bunch of others) is about 3.8. No matter where you are, the rest of the network's about 3.8 AS hops away from you on average. So the network has grown like crazy but the diameter has been amazingly consistent.
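The comparison described here, two distributions expressed as percentages and laid side by side, can be sketched as below. Nothing in it reproduces the speaker's data or method: the helper names are hypothetical, AS path length is taken simply as the number of entries in the path (how prepending is treated in the talk is not stated), and the sample paths use AS numbers reserved for documentation.

```python
from collections import Counter


def percentage_distribution(values):
    """Return {value: percentage of items with that value}, sorted by value."""
    counts = Counter(values)
    total = len(values)
    return {v: 100.0 * n / total for v, n in sorted(counts.items())}


def as_path_lengths(as_paths):
    """Length of each AS path, counted as the number of entries in it."""
    return [len(path) for path in as_paths]


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Made-up inputs: AS paths from a routing table, and the number of
# updates each observed convergence sequence took.
paths = [[64496, 64497], [64496, 64498, 64499, 64500], [64496, 64497]]
updates_per_sequence = [2, 2, 4]

print(percentage_distribution(as_path_lengths(paths)))
print(percentage_distribution(updates_per_sequence))
print(average(as_path_lengths(paths)))
```

With real data, the two percentage distributions could then be plotted on the same axes to see how closely they track each other.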
So, as far as I can see, what that means is the network's growing by density, not growing by diameter. So, where does that leave you in terms of what the issues are? Was the IAB right? You know, is BGP scaling the absolute dramatic problem that we see? Is it the most important problem? The real answer is, so far so good. You know, it's like falling from the hundredth floor; even by the time you are down to the second floor, so far so good, we haven't crashed it yet. That's sort of good, but what's keeping it going? Why? That's a really good question. I don't know. I have some clues about why the system has behaved as well as it has, but I am not sure I could tell you that it's a natural law, that no matter what you do to how you distribute addresses, BGP will continue to behave. That's not true. We have done a number of things in this community that I think have been prudent. And we have all seen a number of things where the coincidence of money and good connectivity has actually done the right thing for us. So, yes, provider-based addressing works. It works because it does constrain the overall growth in the system. We have a viable routing infrastructure, because by and large the amount of deaggregation going on in the network is remarkably constrained. The policies we self-impose, that addresses follow topology, are generally self-imposed constraints that make routing viable. We continue to have strong awareness of address aggregation. That makes the system, I think, maintain a constant rather than a constantly growing curve. And the other thing is, what makes constant diameter and increasing density? Why is the network only 3.8 ASs away?
And I suspect that what's going on there is that the massive construction of local exchanges and also global transit providers has been beneficial. In other words, the policies and the practices have created a system that, oddly enough, exhibits constancy in the face of growth. As far as I can see, the principles under which we operate as RIRs and the broader industry principles have been along the basis that, in general, deregulated industries tend to self-optimise well, and when things are going right the sum of self-optimisation is common optimisation. That's not all the time and it doesn't happen everywhere every time, but, oddly enough, in routing that does appear to have worked so far, that the coincidence of self-interest on the part of the ISPs and common interest on the part of routing has actually created a system that's exhibited remarkable constancy. Will it continue that way? I have no idea. I know what will break it. I am pretty sure you know what will break it too.
If you allow addresses to move in weird and ad hoc manners, if you allow this system to have 20 degrees of variability, if you allow routing to work in strange and convoluted ways with bunches of local restrictions that tend to create a longer, stringier, more convoluted Internet, say goodbye to routing. So oddly enough, we know what breaks it. I am still not quite sure we know what keeps it at the level of stability, the remarkable stability, it has. But part of the reason why it's been so, I think, is that our common self-constraint in each of our individual circumstances as ISPs has created a universal constraint on the system that has worked amazingly well.
(Applause)
Are there any questions?
SPEAKER: Could the fact that the number of updates not increasing actually be something to do with the propagation of those updates not happening as the density of the routing tables gets bigger?
GEOFF HUSTON: No, routing does propagate everywhere globally. Have a look and you will see that information does indeed propagate across the entire system one way or another. So, no, the stuff doesn't get lost. It does indeed go everywhere, and it goes everywhere on average in 72 seconds. Don't forget that I am taking measurements from AS 2.0, a fairly isolated island at the bottom end of the South Pacific. If I get to see it, everyone gets to see it, believe me.
SPEAKER: I just wonder whether you will also take on some, well, exercise in just putting up the underlying assumption that provider-based address policy will no longer exist, and come to, well, conclusions about what are so far only theoretical assumptions?
GEOFF HUSTON: There is certainly a next body of work, if you want to do it: who are those 20,000 prefixes a day? Exactly what are they doing and why is it only that population? There is an interesting observation underneath that, I think, that, you know, is the next piece of work. So, you know, somehow the part of the Internet that creates instability seems to be tightly constrained. We understand a little bit about their updates now. But the next thing is: why are they in that box? You know, what defines the room of bad boys? How do you get in there and how do you get out of it is certainly something I am now interested in.
SPEAKER: I have a question on a Jabber channel. The questioner asks, does your presentation mean that PI space or politically motivated address stickiness to the end user will break the Internet? Should we propose mobile IPv6 to those people as a working alternative?
GEOFF HUSTON: I don't think I would actually go as far as saying that. What I have said is that the policies that we have used, which have been a degree of compromise between those folk who want provider independent space and providers who are comfortable in using large aggregates in the routing system, that degree of compromise has been successful. If you perturb that one way or the other, too much PI space, some large customers don't like the idea of provider lock-in... sorry, yes, too much PA space. Too much provider independent space, I have my own IP address, I want provider portability, at the level in v6 of a /64, starts to get a tiny bit overboard. So I suspect, you know, that the compromise we are in certainly visibly works. But should we adjust the rules in the future? I am not sure I am willing to go there.
SPEAKER: Randy Bush. It's hard to talk after you and [Karsen].
I am a little uncomfortable with saying we know what's going to break the Internet when we don't really understand why it's still working. So... That one I am not sure I buy entirely.
See a post on NANOG pointing to a recent paper. The paper is not out yet but it's been accepted at JSAC...
GEOFF HUSTON: I read it this morning.
RANDY BUSH: Which shows essentially the rate of deaggregation since 2001 is flat. And I think the dynamics are really tough. I do think that there is both a higher degree of connectivity within the topology, as you say, but I think we are also seeing a large expansion at the edge, which still would keep the average AS path lengths constant. I believe the number of stub ASs is increasing significantly.
GEOFF HUSTON: At this point, just to add to that, it's a very good paper and the reference was posted to NANOG, so, by all means, have a look at it if you are interested because it is a good study.
The number of stub ASs increased by 10 percent last year. The number of transits increased by 10 percent, so, yes, one was a much bigger increase, but relatively, oddly enough, that transit/stub balance has actually remained constant. Again, part of these sort of weird constants when the rest of the Internet is up and to the right. So, you know, maybe it's these weird constants that actually save our bacon.
Thank you very much.
CHAIR: Thank you, Geoff. Next speaker up is Elisa Jasinska, talking about using route servers.
ELISA JASINSKA: Hello. I am Elisa, I work for AMS-IX, the Amsterdam Internet Exchange, and I am going to talk about a slightly different use of BGP than in the conventional sense, namely route servers. This talk has actually already been around a little bit. It was presented at NANOG this year and various other industry meetings, so in case you have seen this before, I am sorry for the redundancy. I promise this is the last time. Okay.
We are actually going to start out slowly with what route servers actually are. I am not sure how many of you... how many of you know what a route server is? Okay. Then we can probably skip the first 20 slides and just get to the bottom of it. The interesting point is the testing, because there are a few implementations of route servers that may or may not scale in the sense we would like them to. And a group of people from various Internet exchanges, who are obviously interested in using route servers or providing them as a service at their exchange, are interested in seeing how those implementations scale and what they actually do. This is why we sat down by the end of last year and actually compared different implementations against each other.
This is going to be the last part of it.
But let's start out with why we actually need route servers. So, if you go to an Internet exchange, AMS-IX is an example here, you obviously want to peer, not obviously but mostly you want to peer, with as many parties as possible present at that exchange point. What you then need to do is set up a session to every single one of those parties over the exchange, which can generate a lot of work. AMS-IX has 350 members right now. So you're basically busy for days and days setting up all your sessions. And the Internet exchange wants to provide you a route server to make that easier.
What you can do is reach a lot of parties by connecting only to that route server, and the route server propagates all the information that it gets from everyone connected to it to every other person connected to it. You will see that in detail a little bit later, and you can use that for different reasons. You can use it once as a redundancy thing. You can have sessions set up to all your big peers and you can have a session to the route server, so in case something happens to your own sessions, you still have the route server, and the other way around. Redundancy: if the route server dies or if the session dies, you may still have one of those two present and still be receiving the routes from that party that you actually wanted to have.
Another reason may be that once you connect to an Internet Exchange, as you said, you have 350 parties, maybe more, maybe less, and you are busy for an insane amount of time to actually set up all those sessions. Route servers make it easier if you get there on day one you get the session set up to the route server, and then, basically, you have a lot of prefixes already from those route server sessions. So, it's an entry thing where you basically, from day one, already have traffic going over your exchange point, which is, well what you ultimately want to have.
Let's go over it in a little more detail. What do those things actually do? This is technically a normal BGP update. I have been told there is a laser pointer in here. Let's see how that's going to work.
This is our peer, and our peer sends an update to a second peer he has, providing him with, I mean, we all know BGP, right, AS path, next hop and the prefix that he actually wants to announce to that other party. In this case .1 is a route server, and here is a different peer who does the same. So they both provide the route server with a simple prefix they have.
Now, there may be people that do not want to talk to everyone connected to that route server. We all have our peering policies, we all have our filtering mechanisms, so the route server needs to be capable of performing filtering. You can tell it, okay, I am AS so and so, and in this case we basically accept anything anyway, but we could have exceptions where we don't want to have anything from a certain peer, so a route server needs to be capable of filtering that out for you.
And then we get to the part where the route server performs the best path selection for you. This can be tricky when you actually filter, because we don't have add-path yet: if it did one global best path selection, and you wanted to filter out someone who would be the best one in that best path selection, you may not receive that route at all. Once we get to things like add-path, you may still be provided with a second option that you could then use. This is not the case yet.
So another performance-influencing issue with route servers is that the best path selection basically needs to be performed per peer: every peer needs its own RIB, and in every RIB the entire table needs to be stored.
There are different options for how the implementations that currently exist do that. In a few, you can say that you only need a special routing table for a peer if you actually know that that party filters something. If nothing is filtered by a peer, then it can easily be fed from the global routing table you have, because that is going to be his best path anyway, and that's it.
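The filtering problem described here can be illustrated with a minimal Python sketch. It assumes a toy best-path rule (shortest AS path only) and made-up peer names and AS numbers; `Route`, `best`, `global_then_filter` and `per_peer_best` are illustrative names, not taken from Quagga, OpenBGPD or BIRD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    from_peer: str   # peer that announced the route to the route server
    as_path: tuple   # AS path exactly as received


def best(routes):
    """Toy best-path rule: shortest AS path, ties broken by peer name."""
    return min(routes, key=lambda r: (len(r.as_path), r.from_peer), default=None)


def global_then_filter(routes, client, rejects):
    """One global best path, filtered afterwards: the client may get nothing."""
    winner = best(routes)
    if winner is None or winner.from_peer == client or winner.from_peer in rejects:
        return None
    return winner


def per_peer_best(routes, client, rejects):
    """Filter first, then select: this is what requires a RIB per client."""
    allowed = [r for r in routes
               if r.from_peer != client and r.from_peer not in rejects]
    return best(allowed)


# Two peers announce the same prefix; peerC refuses routes from peerA.
routes = [
    Route("192.0.2.0/24", "peerA", (64501,)),
    Route("192.0.2.0/24", "peerB", (64502, 64501)),
]
print(global_then_filter(routes, "peerC", {"peerA"}))
print(per_peer_best(routes, "peerC", {"peerA"}))
```

With a single global selection the client that filters peerA ends up without any route for the prefix, while the per-client selection still hands it the route via peerB. That is the reason a filtering client costs the route server a separate table.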
So this is what it does. And then in the end, the route server that we had before at .1 sends out, and now, do you remember what we had two slides ago? It sends out to the peer on .4 the prefix that has been announced by the other peer, .5, and then to .5 it sends out the prefix that has been announced by .4, with him as next hop and the entire information.
So, as opposed to normal BGP, there is no change in those updates. The BGP route server doesn't insert itself into the AS path. It does not announce itself as the next hop or anything similar.
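To make the contrast concrete, here is a small Python sketch of how an update would be passed on by a conventional eBGP speaker versus a route server. The `Update` structure, the function names, the addresses and the AS numbers are assumptions for illustration only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Update:
    prefix: str
    as_path: tuple
    next_hop: str


def ebgp_forward(update, own_as, own_address):
    """Conventional eBGP speaker: prepend own AS and set itself as next hop."""
    return replace(update,
                   as_path=(own_as,) + update.as_path,
                   next_hop=own_address)


def route_server_forward(update):
    """Route server: AS path and next hop are passed through unchanged."""
    return update


# The peer at .5 announces a prefix; the route server sits at .1.
received = Update("198.51.100.0/24", (64505,), "192.0.2.5")
print(ebgp_forward(received, 64500, "192.0.2.1"))
print(route_server_forward(received))
```

Because the next hop stays at the announcing peer, traffic goes directly between the two members across the exchange and never through the route server itself.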
So, it's a slight change to the conventional BGP protocol as we know it.
To illustrate, this is actually from Quagga. Quagga is one of the BGP implementations that have route server functionality, and this is from the Quagga documentation, where they actually explain how it has to look. So, again, they show that you have one best path selection in the middle, and everything that comes in from every peer goes to that selection and filtering. And then it goes out to exactly those same peers again. And now, if you actually want to consider the filtering and the possibility that you may filter out your best path, you basically need that middle part multiple times.
Clear so far? Okay. I'll just go on.
What we have right now is a few implementations that actually have those functionalities.
None of them is really based on anything written down. None of them is really based on any standard. All of them just thought, okay, let's see how we can do that. Quagga was the first one, so everyone kind of followed what Quagga implemented: OpenBGPD, and BIRD, which is actually developed here in the Czech Republic. And as I said, there is this bunch of people at Internet Exchanges that want to use those things, and we didn't really know which one: how do they scale, what do they do exactly? So all of us, Andy, Chris from Switch and Data, I think they are Equinix as of today, me, Mo from LINX, Robert from PL-IX and the others, we sat down and combined our forces, so to speak, to run a round of proper testing of all those route servers against each other and see how they perform. And it was a fun week.
So, what we did: Obviously there are those things that you want to make sure actually work right: does it support IPv6, does it support AS4, does it properly and transparently forward all your information without changing anything along the way? So this is the first thing that we looked at. And there is actually not much to say about the functional part of it, because all three of those implementations do support AS4; we do not have any issues with this, at least as of the end of last year when we tested it, and I don't think anything has changed since.
IPv6: Well, right now all three implementations support IPv6 properly and it actually does work. We ran into a few issues; for example, OpenBGPD had a problem with IPv6 withdrawals when we started touching that software. That has been fixed in the recent releases, so it's not really a problem any more. But this basically explains why, in case you want to use any of those packages, you should make a point of using the most recent versions, possibly whatever you can get from CVS, because they are all under very heavy development as of this moment, and you don't want to use older ports or, I don't know, ports that have been in FreeBSD somewhere, because they are years old and possibly still have all the bugs that we were actually trying to find and have the developers fix at that time.
This is the general part of the thing. Now, scalability is an issue you run into, well, being at AMS-IX, having 350 parties that actually want to connect to the route server, possibly having every single one of them actually perform filtering, and possibly needing around 350 best path selections and 350 routing tables in your route server, which makes it kind of difficult for them to properly scale on common hardware. These are all software implementations, right, so you put them on a server that you have somewhere and hope for the best.
This is what we did. We had an Ixia, and we used that Ixia to set up 100 sessions. Some of you may know that Quagga has not been scaling very, very well, so depending on the software used, we either had it announce 500 or 1,000 prefixes per session. We introduced some random flapping, withdraw, update, withdraw, update, basically to see how it's going to work, what it's going to do. We did a couple of rounds with fewer peers and more peers, but this actually turned out to be the most interesting one. So, with Quagga, what you have is a single-threaded implementation, and the problem you have with Quagga is that its scheduling is somehow pretty ... well, weird, let's put it like that. So you run into issues where it's so busy calculating, receiving other things, keeping up with whatever, that it sometimes tends to forget to send out keepalives or something like that, and then your sessions go bye, bye, I am done here, which is not very good.
It's generally just very, very busy. There have actually been multiple bugs that made it crash in certain situations. You suddenly had things flapping and it went to 100 percent CPU load and, I am gone again, because I ran into some odd construction where I think I am not going to work any more.
So this is how Quagga looks when you actually try to use it, and Quagga is the one, as I mentioned before, where we only sent 500 prefixes per session to it, because we knew that it was not going to perform that well.
So now, 100 sessions and 500 prefixes is technically not that much, right? On the AMS-IX route servers, from the 350 members we have, we have around 200, 220, 230 ASs connected to the route servers, and they all have v6 and v4 sessions, sometimes multiple v4 addresses on the exchange point, so we have well over 300, 400 sessions. So this is not too much. It's not doing very well.
Now, compare this to OpenBGPD, where we right away sent 1,000 prefixes per session. It's basically busy constantly; remember, those sessions are flapping, right? So it's constantly busy with receiving new updates and doing things. But this is a load where we are sort of all right, we can handle that. We can deal with that. It's working, it's not crashing. We are able to survive if it's doing something like that, no?
The differences: Okay. This is BIRD, and BIRD basically looks pretty similar to OpenBGPD at that point. Now, the difference between OpenBGPD and the Quagga implementation is that OpenBGPD is multi-threaded. OpenBGPD has a separate thread for calculating your routing tables and a different session thread to keep up your sessions, so one of those threads can still keep your sessions alive, send all the necessary keepalives back and forth, etc., while the other one is busy doing all the calculations. Even if that one were really, really busy for quite some time, at least your peers wouldn't say bye, bye, I am gone.
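The effect of a busy single-threaded daemon on session liveness can be sketched as a small simulation. This is a deliberately simplified model, not how any of the tested daemons is scheduled: time advances in whole seconds, the 30-second keepalive interval and 90-second hold time are assumed values, and `session_survives` is a hypothetical helper.

```python
def session_survives(busy_periods, keepalive_interval=30, hold_time=90,
                     duration=600, separate_session_thread=False):
    """Return True if the peer hears a keepalive at least every hold_time seconds.

    busy_periods is a list of (start, end) second ranges during which route
    calculation occupies the CPU. In the single-threaded model no keepalive
    can be sent while busy; it goes out at the first free second instead.
    """
    def busy(t):
        return any(start <= t < end for start, end in busy_periods)

    last_heard = 0
    next_due = keepalive_interval
    for t in range(duration + 1):
        if t - last_heard > hold_time:
            return False  # peer's hold timer expired, session is torn down
        if t >= next_due and (separate_session_thread or not busy(t)):
            last_heard = t
            next_due = t + keepalive_interval
    return True


long_burst = [(100, 250)]  # 150 seconds of uninterrupted route calculation
print(session_survives(long_burst))
print(session_survives(long_burst, separate_session_thread=True))
```

In this model a short burst only delays a keepalive, but a burst longer than the hold time drops the session unless keepalives are handled independently of the route calculation.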
The, well, somehow not so good thing, but it again depends on how you actually use it, is that on certain architectures you are not able to use a lot of memory in OpenBGPD, or rather in OpenBSD, which OpenBGPD runs on, right, so you have a 1 Gig memory limitation, or a 4 Gig memory limitation on 64-bit systems. We haven't gotten that far, at least not in our production implementation and not in the tests either. But you may. Especially if you use multiple routing tables for your peers, you end up having a lot of data to store somewhere really, really quickly, and this generates this.
So, with the same test again, where we just saw the CPU usage, this is what we use in memory for OpenBGPD. Now, this is the free memory, right, and this is actually the used one. So we have 400 megabytes of data stored. All of those 100 sessions that we set up there actually had their own routing tables, so all of those prefixes are duplicated over all those routing tables. It still leaves you some room, but if you are running that on an older system you can hit the Gig very, very quickly, right?
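As a back-of-the-envelope illustration of why per-peer tables fill memory so quickly, the following Python sketch counts table entries only. It assumes every client table holds every prefix announced by all other peers, which is a simplification, and `rib_entries` is a hypothetical helper rather than anything measured in the tests.

```python
def rib_entries(sessions, prefixes_per_session, per_peer_ribs=True):
    """Count stored route entries on a route server.

    With one shared table each prefix is stored once. With per-peer tables
    each client table holds the prefixes of all *other* sessions.
    """
    if not per_peer_ribs:
        return sessions * prefixes_per_session
    return sessions * (sessions - 1) * prefixes_per_session


# Test setup from the talk: 100 sessions, 1,000 prefixes each.
print(rib_entries(100, 1000, per_peer_ribs=False))
print(rib_entries(100, 1000))
# Same prefix count per session at a few hundred sessions.
print(rib_entries(350, 1000))
```

The entry count grows with the square of the number of sessions, which is why going from the 100-session test to a few hundred production sessions matters far more than the raw prefix count suggests.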
This is what Quagga does. I don't comment. This is just weird. BIRD is a single-threaded implementation as well, and I have no idea how the guys did that, but for some reason their scheduling seems to make a lot of sense, because they are able to be busy with calculations and still keep up with all the on-time tasks BGP has to do. So it's actually very impressive, and it was the first time that a few of us, while we were performing those tests, actually had a look at BIRD at all, and we are still pretty impressed. We saw something weird, and we had a few other graphs where you could see it in more detail: it was not moving its memory whatsoever. That apparently has something to do with Linux memory management, which basically keeps that memory reserved for that process for a longer time and then reuses it again. So, yeah... it's doing a little less than OpenBGPD, but it's pretty much the same amount.
All right... I think this is it.
Is there any questions?
SPEAKER: Did you make any observation or experimentation of multicast using route servers?
SPEAKER: No we did not. Sorry.
CHAIR: Anyone else?
SPEAKER: Ruben [surname unclear], NIC Brazil. Would you rather have an Internet Exchange point choose only one of the three pieces of software, or would you rather run one route server with one and the other route server with another?
SPEAKER: No. The problem many Internet Exchanges had with Quagga before was that they usually have two route servers, and they both ran into the same condition at exactly the same time and both crashed at the same time, which is not what you really want to have. So I think what you actually want to do, for diversity reasons, is have two route servers based on different software. And there is a lot of movement going on at this time, because what I didn't mention so far is that there is someone from the European Internet Exchange Association busy with rewriting Quagga in a threaded fashion so that, hopefully, Quagga will be usable at some point, eventually, as well. And Cisco is actually busy with implementing route server functionality in hardware. So we may have a few other options in the future. At AMS-IX, we migrated away from two Quagga boxes pretty quickly to two OpenBGPD boxes, and now BIRD, over the past year, has been developed to a state where you can actually use it very well. So we are planning to move one OpenBGPD box to a BIRD implementation, to have two different ones.
CHAIR: Next up, Randy, please:
RANDY BUSH: I am Randy Bush, from IIJ. I have 50 files to get to in 20 minutes so I am going to talk as fast as Geoff Huston does.
The other problem is I always like to point, and this wide room with three screens is going to be a problem. So we'll do the best we can.
Routing is very fragile. That's duct tape. Jonathan Zittrain gave a presentation on the way we are surviving by random acts of kindness. When the YouTube incident happened, at first Pakistan didn't fix it; we all fixed it by routing around it. Okay. This is not the way to survive.
Routing errors are significant and have massive customer impact. We do not wish to be on the front page of the Wall Street Journal. Okay.
99 percent of misannouncements are accidental originations of someone else's prefix. Okay. So, we are talking about preventing the YouTube accident, preventing the 7007 accident; for those of you who aren't old enough, [Vinnie Bono] took the whole BGP table, ran it through RIP and back out again, sliced into /24s. Sprint, UUNET, each fell over for two days. So the goal here for origin validation is to prevent most accidental announcements. This does not prevent malicious path attacks. That requires path validation, and my children will do that.
This stuff is not new. Steve Bellovin wrote the original paper in '86. In 2007, BBN, Steve Kent and other people that are still with us, and some that aren't, did S-BGP, which has its problems, soBGP, etc. In 2003 we tried to do a workshop; in 2006 ARIN and APNIC started work on the RPKI, which I'll describe shortly; RIPE picked it up in 2008. In 2009, we started an open test bed running code, all the way to routers, which I will describe. And in 2009 ISOC discovered it.
The goal is to keep the Internet working, and to seriously reduce routing damage from misconfiguration and misorigination. The goal of this work is not to prevent malicious attacks, and it's also not to keep RIRs in business selling certificates.
This is all based on the resource public key infrastructure work, which was developed outside the IETF and is now making its way through the IETF sausage machine.
X.509 cert: I am going to assume you know something about it. That little 'CA' in the upper right-hand corner does not stand for California; it stands for certificate authority, which means that this certificate can sign other certificates, and you are going to need to know that later. This will be on the test. RFC 3779 extended X.509 certificates to describe IPv6 and IPv4 address space and ASNs. The other cute thing to know, because you'll need it later, is that there is one field in X.509, the SIA, which says: when I sign stuff, where am I supposed to put it? When this certificate is used to sign things, where in the Internet can you find that?
The software you see is being developed by RIRs, deployed by operators, etc., etc., all the way to the router. As I said, it's based on these certificates, and, just like address space, it's hierarchic. So here is an ARIN cert for this space. That's a /16. It chops it up into a /19 and a /20 and a /20, and chops those up into /24s. It's turtles all the way down. That's who owns it, but who can route it?
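The hierarchy described here implies a containment rule: the address space on a child certificate has to fall inside the space on the certificate that issued it. A minimal Python sketch of that check follows; the 10.0.0.0/16 block and the helper name `resources_contained` are assumptions for illustration, and only IPv4 prefixes are handled.

```python
import ipaddress


def resources_contained(child_prefixes, parent_prefixes):
    """True if every child prefix falls inside some parent prefix (IPv4 only)."""
    parents = [ipaddress.ip_network(p) for p in parent_prefixes]
    return all(
        any(ipaddress.ip_network(c).subnet_of(p) for p in parents)
        for c in child_prefixes
    )


# A /16 chopped into a /19 and two /20s, as in the slide being described.
parent = ["10.0.0.0/16"]
children = ["10.0.0.0/19", "10.0.32.0/20", "10.0.48.0/20"]
print(resources_contained(children, parent))
print(resources_contained(["10.1.0.0/24"], parent))
```

The same check applies at every level of the tree, so a /24 certificate is checked against the /20 above it, the /20 against the /16, and so on up to the root.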
There is this thingy called a ROA, a Route Origin Authorisation, that binds an address block to an AS that can route it, that can announce it. Okay. That can be the origin AS. Note this is not a certificate. So, it can be signed by a certificate that doesn't have that CA bit because this neutered certificate is not signing a certificate; it's signing a blob.
So, I own those two bits of address block. I create this stub certificate to do this because I don't want it tied to this address space. The AS in the game does not have to agree to this; it does not have to sign the Route Origin Authorisation, because it says, I am willing to participate, by announcing it in BGP. It's got all the power in the world; it doesn't need more.
So, what happens with one of those experiments which Geoff mentioned a few minutes ago? I needed to slice that /16, and this is real; if you remember all the screaming on NANOG about my stealing other people's ASs, that talk will be Wednesday, I think. I had to slice it into 256 different /24s because I had to wait for route flap damping to settle, 3 hours for each announcement, and I wanted to do 30,000 announcements. So that would have been a lot of certificates and a lot of ROAs. That's what we call ugly. There is a macro. In a ROA, you can say here is the prefix, here is the prefix length, and then that 24 at the end says you can slice it and dice it up to /24s: max length. Thank you, Ruediger Volk, for showing us that trick.
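The saving from the max length field can be shown with a short Python sketch. The 10.0.0.0/16 block stands in for the real /16, and both helper names are hypothetical.

```python
import ipaddress


def roas_without_maxlength(prefix, longest):
    """Number of separate ROAs needed to cover every sub-prefix of length `longest`."""
    net = ipaddress.ip_network(prefix)
    return 2 ** (longest - net.prefixlen)


def roa_covers(roa_prefix, max_length, announced):
    """True if `announced` lies inside the ROA prefix and is no longer than max_length."""
    roa_net = ipaddress.ip_network(roa_prefix)
    ann = ipaddress.ip_network(announced)
    return ann.subnet_of(roa_net) and ann.prefixlen <= max_length


print(roas_without_maxlength("10.0.0.0/16", 24))
print(roa_covers("10.0.0.0/16", 24, "10.0.37.0/24"))
print(roa_covers("10.0.0.0/16", 24, "10.0.37.0/25"))
```

A single ROA for the /16 with a max length of 24 authorises any of the 256 /24s, and anything in between, while a /25 still falls outside it.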
How do I use these ROAs in reality? I get a /16 from the Internet registry, say. It consists of four flavours of ice cream. There is the stuff that you wouldn't want to touch, it's unused; there is my infrastructure, my static internal addresses, my loopbacks, my point-to-point links, etc.; there is no separate BGP announcement for them. There is the light green, which is my static customers. They don't speak BGP, but I route their space under my AS. They don't need a separate ROA. And then there are the BGP customers in the dark green. They are going to announce the address space that I am delegating to them out of my /16, and they are going to announce it from their AS. So, I am going to issue a ROA under my AS for the whole thing, which means nobody else can punch holes in this and announce it. Nobody else can originate it, okay, except these three customers who are BGP speakers to whom I have delegated space. She has to make a ROA for this space. He has to make a ROA for that space. This customer says: Don't tell me about certificates, I am a dentist's office, I pay you, you do it. Just like everything else: we provision the circuit, we do all the stuff. We will issue the ROA with their AS on it.
So, that's all cute. You have got all these ROAs; what the heck do you do with them? Or where are they, anyway? Here is an example of a software implementation of the whole RPKI system from beginning to end. This is an open source one that a couple of us are working on, funded by various people; it was originally funded by ARIN, it is now funded by the US Government, and it's open source.
But anyway, here is the RPKI engine itself, which issues certificates, etc. Separate from it is the signing module, which could be a hardware signing module if you are that paranoid and want to pay that kind of money for a little bit of hardware, though it doesn't necessarily have to be that expensive. Here is a hardware signing module. It costs 60 bucks. And I am serious. But it doesn't sign very fast or hold a lot of keys. What do you want for 60 bucks? You can get one for 60,000; it's real fast. And here is your registry's back end. If you are an ISP running this, this is your database, your Oracle or MySQL, etc. This is labelled the back end, even though, since this is the user, this is what's facing the user: it's the front end. So maybe I'll fix that. But what's important here is something called the up/down protocol, which is how ARIN talks to UUNET: how ARIN's instance of this gives UUNET their own certificate, so UUNET can issue certificates to its customer, so its customer can issue further certificates. It is how IANA issues certificates to ARIN, right, through this up/down protocol. All the protocols here are ASN.1 wrapped in XML wrapped in CMS, and they are currently also wrapped in TLS, but that's being taken off. There is just so much wrapping you can do.
There is this other protocol, the publication protocol, which, as I will show, means that the person running the engine can separate the engine from where the repository is, because they may not want to have to put up a reliable repository. Yes, Geoff, that draft is about to come out.
So, it is not a big centralised database; we don't do that. It doesn't work on the Internet. I have a T-shirt that says BGP never converges, all devices fail. Both of those devices will fail at the same time. The last talk, she just said that, didn't she? Both Quaggas fail at the same time.
So, how does this distributed database work? We have, for instance, the IANA publishing the root certificate here, and when it makes ARIN's certificate, ARIN's certificate is here, but remember that SIA pointer: it says everything ARIN publishes using that certificate goes into the repository that ARIN keeps. And ARIN does the same to UUNET, down, down, so the repository is distributed around the Internet, with these people running their own instances of the software. It is kind of like the IRR is today: you have got instances, 30-odd of them, distributed around the world, and each of the ISPs who run an instance of the IRR has their own repository.
Okay. So that's cute. How the heck do I get a validated version of this whole thing? Well, what I have is a trust anchor for the IANA, and that essentially is a URL for this and the public key. So, this gatherer, RCynic; it's cynical rsync, sorry for the joke, I usually don't get involved in these things. This RCynic gatherer has a left-hand buffer. It goes and gets that IANA data, puts it in the left-hand buffer and says, I don't trust anything here. I have the trust anchor. I start at the top and I go through the certificate chain in there and, one by one, I move them over to the trusted buffer. Notice I cannot be fooled into following a bad pointer. I then have all of the IANA data gathered, and I do a recursive descent through the rest of the registries in the same way, and, lo and behold, I have ended up with a validated cache of the entire Internet. This is cute, but who the heck is this, and what happens when their server goes down? This is expensive and unreliable. Okay. Because that server is down, I don't have their data. Now, in fact, when we get 30 slides further, you will find that we can deal with the fact that a server has gone down, but we shouldn't have to.
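The top-down walk from the trust anchor can be sketched in a few lines of Python. This is only a toy model of the idea, not RCynic's code: the boolean `signature_ok` stands in for real signature and resource checks, and the object names and the function `validate_tree` are made up for illustration.

```python
def validate_tree(unvalidated, trust_anchor):
    """Move objects from the untrusted buffer to the trusted set, top-down.

    unvalidated maps an object name to {"issuer": name, "signature_ok": bool}.
    An object is only examined once its issuer is already trusted, so nothing
    below an unverified object is ever followed.
    """
    trusted = {trust_anchor}
    frontier = [trust_anchor]
    while frontier:
        issuer = frontier.pop()
        for name, obj in unvalidated.items():
            if name in trusted:
                continue
            if obj["issuer"] == issuer and obj["signature_ok"]:
                trusted.add(name)
                frontier.append(name)
    return trusted


fetched = {
    "ARIN":   {"issuer": "IANA",   "signature_ok": True},
    "UUNET":  {"issuer": "ARIN",   "signature_ok": True},
    "bogus":  {"issuer": "ARIN",   "signature_ok": False},
    "child":  {"issuer": "bogus",  "signature_ok": True},
    "orphan": {"issuer": "nobody", "signature_ok": True},
}
print(sorted(validate_tree(fetched, "IANA")))
```

The object that fails its check is left behind, and so is everything issued beneath it or beneath an unknown issuer, which is the sense in which the gatherer cannot be led down a bad pointer.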
So what we do is, remember that publication protocol. PSGnet can make a contract with IIJ and say, I want to use your publication server. So when PSGnet goes to ARIN to get their certificate, they say, my publication point actually is over here. That SIA pointer, that's over here. So the tree flattens out. It goes to very reliable servers, or tends towards more reliable servers, towards flatness.
How might I run this? How might I actually use it in a very large ISP or an RIR? I have got the front end; the software runs on a Mac. This is the front end, my database. So I say, hey, generate the following ROAs from my address space. Oh, and I want to delegate this address space to Mary, etc., etc. I get off the aeroplane. I plug it in. It updates the RPKI engine, which is running on the server. That does the up/down protocol to delegate space to Mary, or if I am, you know, the manager of RIPE, it delegates space to Ruediger, whatever, and it publishes the stuff over at its publication point, which maybe we contract out to somebody who is really good at deploying global data infrastructure. So, maybe Google's done it for the DNS, maybe they'll do it for this. So we can all publish there. Maybe yes, maybe no. If you don't want to, you don't have to.
It might also provide a web GUI. As John Curran says, 98% of an RIR's users are just going to want to go to a web GUI and say, issue a ROA for me already, I don't want to run this stuff. But that is only 10 percent of an RIR's address space that wants to use a GUI. 90 percent of the RIR's IP space is issued to five big providers, and they want to run their own: DTAG, Level 3, NTT, I can go through the long list. And so 2% of the RIR's users want to run their own. Just the big ones. It's like IS-IS; nobody uses it except the world's largest ISPs.
So that's a usage scenario. You could do differently.
I have got these certificates, I have got certificates in publication points and ROAs coming out of my ears; there must be some use to all this work. Remember, I wanted to do origin validation. I am going to talk about running code. Cisco IOS and XR test code exists now; another vendor, who we all know, is right behind.
It works, and I am just now debugging a looking glass so you can see it in the test routers from anywhere on the Internet. The compute load: Oh, these certificates and everything, it's horrible, my routers can't do that. But it's running on a 7200 and a GSR. If I am protecting myself with an access list on, for instance, a GSR, every entry in that access list takes around 10 microseconds to evaluate. So if I have got, on average, 1,000 of them, that's 10 milliseconds gone on the announcement. With this code, it's 10 microseconds. Period. This is cheaper and faster than what you have got today.
How do we do the trick? We don't make the router do the crypto. We do this. Remember, the RCynic gatherer is making a cache. It gathers it from the global RPKI. I have this cache, I own it. I, the ISP. It is near, in my POP. The gathering uses an object security protocol; in other words, when it fetches this, remember, the RCynic gatherer goes through and actually does certificate validation. The cache server I own is in my trust domain with my router. Therefore, I don't need to revalidate the certificates. So I need only pass to the router what it needs for the BGP decision process, and that's really none of those certificates. And since I am not using object security, I can do simple SSH to secure the transport and authenticate it. That protocol is known as the RPKI to router protocol. It's the third protocol and it's the only one you have to learn. You have got them all, up/down, publication, RPKI to router, and you will pass the quiz at the end of this.
The router says, hey cache, I have woken up. The cache says, okay, I am going to give you some data, and it gives it all the prefixes and then says, I haven't got any more. And, for those of you who know DNS, here is a notify: the cache says, hey, I have got some new data, you may want to ask me for it. So the router either listens to that or periodically polls, and with the query it gives a serial number, which is the high watermark, the serial number of the last transaction, so that what's transferred is only the new stuff. End.
What's transferred here is the v4 data. It's the prefix and the AS number that can announce it. Of course, with the prefix there is the prefix length, and you remember the maximum length field, so we don't have to enumerate all those ROAs, and there is a flags field that says whether this is an announce or a withdraw. And then there is the normal other protocol garbage. An IPv6 prefix looks very similar: 96 more bits. No magic.
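The serial-number exchange between cache and router can be modelled with a short Python sketch. The field names follow the description above (prefix, max length, origin AS, announce or withdraw flag), but the classes `Record`, `Cache` and `Router` are illustrative assumptions and do not reproduce the wire format of the protocol draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    prefix: str       # prefix including its length, e.g. "10.0.0.0/16"
    max_length: int
    origin_as: int
    announce: bool    # True = announce, False = withdraw


class Cache:
    def __init__(self):
        self.serial = 0
        self.log = []  # list of (serial, Record), oldest first

    def publish(self, records):
        """Record a new batch of changes under the next serial number."""
        self.serial += 1
        self.log.extend((self.serial, r) for r in records)

    def changes_since(self, router_serial):
        """Return the current serial and only the records newer than router_serial."""
        return self.serial, [r for s, r in self.log if s > router_serial]


class Router:
    def __init__(self):
        self.serial = 0     # high watermark of what has been received so far
        self.table = set()  # (prefix, max_length, origin_as)

    def poll(self, cache):
        """Fetch and apply new records; return how many were transferred."""
        self.serial, records = cache.changes_since(self.serial)
        for r in records:
            key = (r.prefix, r.max_length, r.origin_as)
            if r.announce:
                self.table.add(key)
            else:
                self.table.discard(key)
        return len(records)


cache, router = Cache(), Router()
cache.publish([Record("10.0.0.0/16", 24, 64500, True)])
print(router.poll(cache), router.poll(cache))
```

The second poll transfers nothing because the router's serial already matches the cache, which is the point of the high watermark: a full table is only sent once, when the router first wakes up.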
So that's pretty dumb. The router doesn't really have to hold much. In the implementations, by the way, they put it right into the Patricia tree, so it's cooked and it's zippy fast. If I were a very large global ISP, I wouldn't want to go out to the global RPKI for all my POPs. So in fact what I'd have is probably, you know, let's imagine three continental POPs, excuse me for not having any in the southern hemisphere. Then each POP might feed off that global one. I might double up the RPKI to router protocol: the client has a list of servers it can go to, and the SSH keys of each of them, and they are in priority order, so this POP might prefer this one, but if that fails, it's going to go to that one. And it can also go to its neighbour. Okay. And your BGP-speaking customers are not going to want to add a server to run this cache. So, just like DNS today, you might want to give them a cache that's validated that they can fetch from.
Now, when they do that, I might warn them that they are entering your trust domain. They are trusting you to have validated that cache. Then again, they are trusting you to carry the packet. So it's neither here nor there.
So, how do you configure it? This is IOS XR. Under router BGP, you say here is the cache, here is the port, poll it every 600 seconds. Finished.
Okay. The result of a check is going to be one of three things. Valid: I found a matching ROA with the right AS number. Invalid: I found a covering ROA, it had a different AS, and there was no matching ROA. Or it's not found: I don't have a ROA that covers the address space. Initially, everything is going to be this. Then it's going to become those. Okay.
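The three outcomes can be written down as a small Python function. This is a sketch of the decision as described in the talk, not the router code; the ROA list, the AS numbers and the name `validate_origin` are assumptions for illustration.

```python
import ipaddress


def validate_origin(announced_prefix, origin_as, roas):
    """Classify an announcement as "valid", "invalid" or "not found".

    roas is an iterable of (prefix, max_length, asn) tuples.
    """
    ann = ipaddress.ip_network(announced_prefix)
    covered = False
    for prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(prefix)
        if ann.version != roa_net.version or not ann.subnet_of(roa_net):
            continue  # this ROA says nothing about the announced space
        covered = True
        if asn == origin_as and ann.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not found"


# A provider ROA for the whole /16 plus one for a BGP-speaking customer.
roas = [
    ("10.0.0.0/16", 16, 64500),
    ("10.0.8.0/21", 24, 64510),
]
print(validate_origin("10.0.0.0/16", 64500, roas))
print(validate_origin("10.0.8.0/24", 64510, roas))
print(validate_origin("10.0.8.0/24", 64511, roas))
print(validate_origin("10.0.200.0/24", 64500, roas))
print(validate_origin("192.0.2.0/24", 64500, roas))
```

Note that the fourth case comes out invalid even with the right AS: the provider's ROA stops at /16, so a more specific hole punched into that space is rejected, while space with no covering ROA at all is merely not found.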
Here is the logic. We are not going to walk through it and it won't be on the test. Of course, Dr. Knob, Dave Ward: you can disable the validity check entirely, you can disable the validity checking for a peer, you can disable it for a prefix. When a check is disabled, the result is not found, as if there was no ROA. Okay.
Here, again on XR: dump me the table of ROAs. In other words, here is the prefix, the max length, the origin AS I expect. You can dump this today. You don't want to dump it five years from now.
The defaults are: origin validation is enabled if you have configured a cache server peering; the poll interval is 30 minutes; and there is no effect on policy unless you configure the effect. In other words, you can turn this on and the packets still go where they go today. And then here, back in the database, is how I say what ROAs to issue, by the way. So here is a ROA I want for this prefix, etc., etc. Here, I show, just in show BGP, this one is valid. In other words, 27318, it's okay to announce this prefix. Here is one that's invalid. Here is one where there was no matching ROA, or it was disabled for this prefix, or it was disabled for this peer.
But we have got a problem. This router can test and come to a conclusion about validity, and it's going to pass the route via iBGP, but what does this router know about that? We haven't extended iBGP to carry validity information. Okay. And this router got a spoof, and this router didn't have a ROA at all.
What do I choose? So one proposal was ... and here are some bad examples, by the way, or not bad, but I don't think it's what you would want for an intended policy. This route was valid even though it has a longer AS path. The shorter AS path caused this one to be chosen over the valid route. You might have preferred the longer path, because it was secure. Here is one where it was valid with a worse metric. You might have wanted to choose it. Both have equal path length.
So, how do I do this? How do I work validity into policy? Well, I don't want to set a community, because then I have to interpret it everywhere, etc., etc. The fact is, I want to take the validity state and deal with it just as I deal with all other local policy. So I am going to make the validity state testable. This is not meant to be a Cisco route map; this is just kind of like a Cisco route map. I can test the validity state here. This is somebody running pretty secure: it came out invalid, drop the route; it came out not found, set the local pref down. So my entire AS's policy is here, and the RPKI-based validity decision blends right into my entire existing policy model.
Here is a really paranoid AS. If it's valid, set the local pref above the default. All other validity states, drop it.
Very paranoid. Here is somebody who wants AS path to have more strength. So they use validity state to set metric. So the point is, it goes into your normal policy. Okay. There is an open test bed running. You can join it. Okay. These are the actual players; that didn't wrap very well. And due to the fact that none of the registries are really running public, available stuff, and especially doing up/down, we have spoofed the registries using the real registry data, and so APNIC and JPNIC and all the way down, these guys are publishing here, Level 3 publishing themselves; Cristel is from Belgium, so we gave her chocolate, and that's kind of the game. You will notice this one has space from both APNIC and ARIN. We hold parties where we essentially do like we used to do with DNSSEC, but you can join it by mail; Level 3 joined it entirely by mail in three days.
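The three example policies (pretty secure, really paranoid, and AS path first) can be sketched as ordinary policy functions that test the validity state. This is an illustrative model only, not router configuration syntax; the `Route` type, the function names and the concrete local-pref and metric values are assumptions, and the best-path comparison is reduced to local pref, AS path length and metric.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Iterable, Optional, Tuple


class Validity(Enum):
    VALID = "valid"
    INVALID = "invalid"
    NOT_FOUND = "not-found"


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]
    validity: Validity
    local_pref: int = 100  # assumed default local pref
    med: int = 0


def pretty_secure(route: Route) -> Optional[Route]:
    """Drop invalid routes, lower the local pref of not-found routes."""
    if route.validity is Validity.INVALID:
        return None
    if route.validity is Validity.NOT_FOUND:
        return replace(route, local_pref=50)  # assumed value below the default
    return route


def paranoid(route: Route) -> Optional[Route]:
    """Accept only valid routes, with local pref above the default."""
    if route.validity is Validity.VALID:
        return replace(route, local_pref=110)  # assumed value above the default
    return None


def path_length_first(route: Route) -> Optional[Route]:
    """Leave local pref alone so AS path length dominates; validity only sets metric."""
    penalty = {Validity.VALID: 0, Validity.NOT_FOUND: 50, Validity.INVALID: 100}
    return replace(route, med=route.med + penalty[route.validity])


def best_path(routes: Iterable[Route],
              policy: Callable[[Route], Optional[Route]]) -> Optional[Route]:
    """Simplified decision: highest local pref, then shortest AS path, then lowest metric."""
    accepted = [r for r in (policy(x) for x in routes) if r is not None]
    if not accepted:
        return None
    return min(accepted, key=lambda r: (-r.local_pref, len(r.as_path), r.med))
```

Given a valid route with a longer AS path and a not-found route with a shorter one, `pretty_secure` and `paranoid` select the valid route, while `path_length_first` still selects the shorter path and uses validity only as a tie-breaker.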
There is one problem, especially in the ARIN region. Operators are exceedingly concerned that all of a sudden my routing is controlled by the RIR. And the RIRs always said they didn't control routing. Okay. But yet, we have the problem of wanting better security. And as most security things go, there is a trade-off. Okay.
And, you know, who do you trust? Here, in 2001, Verisign issued fake certificates for Microsoft. Microsoft was severely damaged, so I wouldn't trust them either. At least with the RIRs, you know, we have some community, and you'll see later this week suggested policy and how we deal with it.
But, especially in the ARIN region, there is extreme fear of RIR control of my routing and ARIN is doing its best to say we are issuing certificates, and we are not going to take them away. And people are figuring that out. You will see RIPE figuring that out later this week.
There is a full implementation. There it is. And there is a mailing list if you want to join the test bed.
And this work is supported by the Department of Homeland Security, believe it or not. Those people who won't let you into the United States.
And it's actually their science and technology division, and they believe in a safer Internet and they are paying for open source implementation and all that stuff.
ARIN funded one third of this stuff. This is my employer. Cisco gives us routers, Google gives us rack space, NTT and Equinix give us fibre, transit and all those wonderful things. And you have made it to the end of the talk.
SPEAKER: When do we take the test?
RANDY BUSH: The next break.
SPEAKER: Jeremy Stewart. I have got a question. Why did you use a top-down approach for issuing the ROAs rather than a peer-to-peer method of actually distributing them?
RANDY BUSH: Because the IETF PKI Working Group sat on its ass for ten years and didn't develop a mesh model. There is, unfortunately, no technology on which to base it; we weren't going to go through the hell of trying to invent a mesh PKI. I certainly wish there was, because that's the trust model among us, a mesh model.
SPEAKER: You can use the mesh for distribution?
RANDY BUSH: He asks, can't you use the mesh for distribution? The publication protocol lets you use the mesh for distribution. There was an early version of the test bed where, in fact, APNIC was being served by PSG Net and publishing over it. You can make an arbitrary mess of this. I suggest you don't. Thanks.
SUZANNE WOLFE: I am well aware that I am the last thing between all of us and the evening reception, and also I can't talk as fast as Randy or Geoff, so I'll try and go through the presentation.
I am Suzanne Wolfe from ISC and, for those who may not know us, we are a US-based public benefit company; we build Internet infrastructure, software and services. We are best known for our DNS software BIND, but we also operate one of the DNS root servers, and I am here to talk about a new venture for us. Also I'd like to say thanks to our hosts, and to the RIPE NCC for a great venue. I have been here before and liked it very much, so I am very happy to be back.
We'll start off with the news that isn't news. I noticed yesterday when checking dates that Geoff Huston's model for IPv4 exhaustion is now predicting that we have got exactly two years, 1st May, 2012, and, of course, it's a model that's subject to changes and conditions and so on, but that's still a sobering moment there, predicting that the RIRs will be out of unallocated IPv4 space in two years, with IANA running out in September of next year, I believe, is the current prediction.
But the important thing here is that, functionally, this date has already come and gone for large network operators, and the event horizon is moving closer and closer to smaller and smaller operators. And where we are now, again, we have talked a little bit about this already today and will talk more during the meeting, but applications are not v6 enabled, and content providers are just getting their feet wet in a lot of ways. There is going to be a gap between when there is no more unallocated v4 and when v6 is widespread in native use, and it falls to ISPs and access providers to bridge that gap.
So, the original model for how v6 was going to appear in the world and in the net was that we would have dual-stack for a while and then we'd have v6, and it hasn't worked out that way. And folks have started, just in the last couple of years, making serious efforts to work on coexistence technologies, perhaps for a long time, where we will still have v4 and we'll have to be able to operate with v6, at least for folks who can't get v4 any more and need to grow their networks. A bunch of different approaches are being discussed. We are here to talk about one, about Dual-Stack Lite, and the URL is the current draft of the specification in the IETF.
But it makes sense first to kind of do an overview of some of the ways people are thinking about these coexistence technologies. There is a variety of use cases that goes with a variety of ways of managing the transition for a network. For a lot of people, v6 is not that relevant; it's somebody else's problem. And NAT44, 6to4 relays and tunnel brokers are some of the ways that users can deal with some of these things, but for folks running networks, a great many want v6 to be completely irrelevant and, for a great many of them, that's a reasonable stance. Another model of deployment: you are building v6 at the edges but keeping v4 at the core. There are ways to do that too. NAT64 and DNS64 can be complex to operate, maybe they don't have to be; these are the things that people are just beginning to experiment seriously with.
For folks who can maintain good connectivity for v4 as they move into v6, conventional dual-stack makes some sense, or user-facing equivalents. People are starting to talk, particularly large content providers, about some interesting ways to make sure that they are not degrading the user experience by serving their content over v6 transport, which they don't know that much about and haven't a lot of experience with, when asked, perhaps in preference to v4. So there is some interesting work going on there. And the other large-scale way people are looking at this: if you are building v6 in your core but you still have to assume v4 out at the edges, perhaps CPE for a large access network, Dual-Stack Lite is designed to help you out with that.
So exactly how does this particular technology work? What does it look like in the network? What it allows is v4 applications behind v6 CPE to communicate with v4 servers and peers over v6 infrastructure. Pretty straightforward in concept. One Address Family Transition Router can handle many clients. The mechanism: there is tunnelling over the v6 infrastructure, and then NAT to the IPv4 Internet. The client side is meant to be simple and lightweight. And the diagram here is from Alain Durand at Comcast. He is the principal author of the DS-Lite spec and the graphic is used by permission. It shows the overall architecture of DS-Lite. The takeaways:
The main advantages are that it enables sharing of IPv4 addresses, maybe among many clients. There is a great deal of ability to scale horizontally within the carrier network. You can have as many of these AFTRs or as few as suits your technology and operations model. And it's relatively simple. None of this is as simple as we'd like, but this is a fairly straightforward way of approaching gluing these things together for both the user and the carrier. And in particular the architecture is transparent to native IPv6 as your network moves to that, even behind the CPE. If you have mixed v6 and v4, v6 will be handled transparently and only v4 will be encapsulated and tunnelled. And it only involves one layer of NAT for the v4 clients.
So, this can obviously get fairly complex, but the basic structure is fairly simple. It starts with the client, which the spec calls a B4, the B4 element of the topology. A DHCP option supplies the target address for the tunnel end point, and this can be changed. And there is actually a draft in the DHCP Working Group to standardise the option. The encapsulation of v4 packets is pretty conventional. Some amount of state on port mappings is maintained here. I need to be more careful how I state that, because the AFTR does the primary work of that, but there are some settings based on the client side, and port reservation extensions are under discussion. We will be implementing UPnP and probably NAT-PMP, but the details, the implementation specifics, are still under development.
The rest of the basic structure is the AFTR, which is a tunnel concentrator, in very straightforward terms. It decapsulates the incoming v4-in-v6 packets. Native v6 is passed through the infrastructure unaltered. It doesn't see the tunnel. It doesn't have to be touched by any of this. For the v4 pieces, this is straightforward. It's intended to be fast. It's intended to be easy to manage and maintain.
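The tunnelling step between the B4 and the AFTR can be illustrated with a minimal sketch of IPv4-in-IPv6 encapsulation. This is not ISC's implementation; the function names and the hop limit are assumptions, and the sketch only builds and strips a fixed 40-byte IPv6 header whose next-header value 4 marks an encapsulated IPv4 packet.

```python
import ipaddress
import struct
from typing import Tuple

IPV6_HEADER_LEN = 40
NEXT_HEADER_IPV4 = 4  # protocol number for IPv4 carried inside another IP packet


def encapsulate(ipv4_packet: bytes, b4_addr: str, aftr_addr: str,
                hop_limit: int = 64) -> bytes:
    """B4 side: wrap a complete IPv4 packet in an IPv6 header addressed to the AFTR."""
    first_word = 6 << 28  # version 6, traffic class 0, flow label 0
    header = struct.pack("!IHBB", first_word, len(ipv4_packet),
                         NEXT_HEADER_IPV4, hop_limit)
    header += ipaddress.IPv6Address(b4_addr).packed
    header += ipaddress.IPv6Address(aftr_addr).packed
    return header + ipv4_packet


def decapsulate(ipv6_packet: bytes) -> Tuple[ipaddress.IPv6Address, bytes]:
    """AFTR side: return the tunnel source (the B4) and the inner IPv4 packet."""
    if len(ipv6_packet) < IPV6_HEADER_LEN:
        raise ValueError("truncated IPv6 header")
    first_word, payload_len, next_header, _hop = struct.unpack(
        "!IHBB", ipv6_packet[:8])
    if first_word >> 28 != 6:
        raise ValueError("not an IPv6 packet")
    if next_header != NEXT_HEADER_IPV4:
        raise ValueError("native IPv6 traffic, not a tunnelled IPv4 packet")
    b4_addr = ipaddress.IPv6Address(ipv6_packet[8:24])
    inner = ipv6_packet[IPV6_HEADER_LEN:IPV6_HEADER_LEN + payload_len]
    return b4_addr, inner
```

The AFTR keeps the tunnel source address returned by `decapsulate`, because it identifies the customer when several customers use the same private IPv4 addresses.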
What you get on the other side of the tunnel is conventional NAT with the shared IPv4 address: managing the details of the port mappings and reservations between the v6 and the v4 Internet, and static or dynamic use of limited v4 addresses. Again, it's designed to be limited as v4 gets limited.
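The address sharing on the far side of the tunnel can be sketched as a NAT table whose key includes the B4's IPv6 tunnel address, so that customers with overlapping private IPv4 addresses do not collide. This is a simplified model under stated assumptions, not the AFTR code: the class name, the port range and the first-free port allocation are all hypothetical, and timeouts and port reservations are left out.

```python
import ipaddress
from typing import Dict, Optional, Tuple

InsideKey = Tuple[ipaddress.IPv6Address, ipaddress.IPv4Address, int, str]


class AftrNat:
    """Hypothetical sketch of an AFTR mapping table sharing one public IPv4 address."""

    def __init__(self, public_ip: str, first_port: int = 1024,
                 last_port: int = 65535) -> None:
        self.public_ip = ipaddress.IPv4Address(public_ip)
        self.ports = range(first_port, last_port + 1)
        self.out: Dict[InsideKey, int] = {}             # inside flow -> public port
        self.back: Dict[Tuple[str, int], InsideKey] = {}  # (proto, port) -> inside flow

    def outbound(self, b4_addr: str, src_ip: str, src_port: int,
                 proto: str) -> Tuple[ipaddress.IPv4Address, int]:
        """Map a tunnelled flow to (public address, public port)."""
        key: InsideKey = (ipaddress.IPv6Address(b4_addr),
                          ipaddress.IPv4Address(src_ip), src_port, proto)
        if key in self.out:
            return self.public_ip, self.out[key]
        for port in self.ports:  # naive first-free allocation
            if (proto, port) not in self.back:
                self.out[key] = port
                self.back[(proto, port)] = key
                return self.public_ip, port
        raise RuntimeError("port pool exhausted on the shared address")

    def inbound(self, dst_port: int, proto: str) -> Optional[InsideKey]:
        """Find which B4 tunnel and inside host a returning packet belongs to."""
        return self.back.get((proto, dst_port))
```

Two customers who both send from 192.168.1.2 port 5000 get different public ports, and the reverse lookup returns the right tunnel for each.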
How does it look in the network? The network sees both v4 and v6 in use. One v6 delegation for the customer site, according to whatever your practices may be. One v4 delegation for the AFTR. The customer sees business as usual. And this is a tremendous advantage for folks with large installed bases of people who don't upgrade often. You know, v6-aware applications use v6. V4-only applications, nobody needs to know. Control points are shared, which was also a goal. You know, this was designed and built by people who don't like the idea of a large CGN or a large NAT sitting in the middle of a network as a single, very expensive point of failure. So there is a great deal of attention to horizontal resource scalability in the network.
ISC has done an open source implementation of Dual-Stack Lite, with help from Comcast. The design goals included that it should run on commodity hardware.
The open source distribution is available now. On the client side there is DHCP support for the B4 element; that's actually under OpenWRT, and we have contributed the patches back. I am not sure when they will be taken up, but the widget we have built is part of the distribution. On the server side there is the tunnel concentrator, which runs on conventional servers. And the URL will get you to it.
Now, we have another release later this year. In the long term, AFTR has to go where users want it to. In the short term, we know a few of our next steps. UPnP is definitely on the road map for the immediate future. We are just beginning to get information about how this performs, how it scales, how you get a lot of clients to be able to use it easily. There are a couple of things going on within the IETF, actually, with Dual-Stack Lite. There is some notion of being able to support redundancy or failover by being able to have a B4 client choose among AFTRs. There is some discussion about extending the topology model into some more dynamic ways of doing things, so we are participating in that. And to the extent that those things seem like good ideas, we'll go out and implement them and see if they work in the real world and people want them. We are hoping that carriers of all sizes will find this a useful technology to cross that gap. There are as many ways of dealing with v4 and v6 coexistence as there are technology models or business models, but we think this will help some folks with some set of problems in the real world.
So, try it out. It's just one of the tools in the v4/v6 kit. But we'd like to hear from folks. I think it goes beyond large access networks. If you are growing a network and can only get v6 addresses, if you have an installed base of v4-only nodes, this is probably worth looking at for you.
We want to hear from you. A few of you know that this is what we do: we build open source software that's freely available to anyone who wants it, no strings attached. Just go download it. We also try and build communities around our software, where folks can work with us on evolving it, and making it better and making it more useful.
So we have the usual mailing lists and the forum for discussion of the road map and the software. We want to hear if people are finding it useful. Look for us here. We are usually at IETF and RIR meetings. Send us mail, let us know what you think, whether it's useful for you or what we could do to make it more useful. So, go try it out. And now I will stop standing between us and the beer.
SPEAKER: Hello. Jan George from Slovenia. Do you have any useful data from real deployment, how this thing scales?
SUZANNE WOLFE: Not yet, which is one of the reasons I am up here encouraging people to try it out. The design criteria included that we'd be able to support thousands of clients on one AFTR, but we don't actually have as much information as we'd like. And if you want to try it out, please do and please let us know, because whatever we have now, we need to build better. We need to go further.
SPEAKER: And this is good enough for large deployments like Comcast?
SUZANNE WOLFE: I am going to be very careful. This is an early implementation with that as the eventual goal, but in addition, we need to know how it scales in practice, and we will get there eventually, but I am not going to say we are there yet.
AUDIENCE SPEAKER: This data will be very useful. Thanks.
AUDIENCE SPEAKER: We are slowly moving into that part where native v6 in the access layer is actually becoming a reality, so I guess that's the first step towards DS-Lite, but do you know of any vendors who are thinking about implementing this in their CPE? Have you been in contact with them? That's the other big part: you build a server, but we need stuff from the client as well.
SUZANNE WOLFE: There have been some early conversations. It's early enough that I am not comfortable going into those details. That is one of the reasons why we built against OpenWRT, so that it would be easy for people to take it up should they so desire, and if you have vendors you talk to and this seems to be of interest to you, please feel free to tell them that.
AUDIENCE SPEAKER: Thanks. Will do.
CHAIR: So the last talk for the day is from Ondrej Filip, and I picked the right Ondrej this time, on DNSSEC in .cz.
ONDREJ FILIP: Thank you very much. Unlike normally, I don't have a very technical speech, so I know we are close to the beer and I know that anything that is too technical will be, you know, problematic this time. So this is some sort of announcement or marketing speech, just to let you know that BIRD is not the only great project from the Czech Republic. That is a key project, but so is the way we are dealing with DNSSEC.
So, the announcement is on the first slide and not to bother you too long.
Up to today we have about 15 percent of domains in .cz signed, which means close to 100,000 domains. I hope we have more signed domains than anyone else in the world in total. So, that makes us a country with hopefully the biggest deployment of DNSSEC. And since we announced it publicly, many people just wrote to us: how do you do that? Why does this work in the Czech Republic? So let me spend a couple of seconds just explaining what we do with DNSSEC and how we got to those numbers.
In every meeting that is somewhat related to DNSSEC, I always hear the same complaints, like there is no business case, there is no demand on the ... side, no registrars know how to use it, they are not demanding it. It's too complicated, too expensive. Everything. And many people just mention the chicken and egg problem. There are no ISPs validating domains, that's why people are not signing them, and so on.
You know, we really wanted to somehow change this thing, so we put some energy into this problem. We came with a little different philosophy. We knew that somebody had to start it. We were thankful to the Swedish registry, who were the first ones to implement DNSSEC, and we really thought about how we could add something to the whole thing.
Also we thought security is not a special service. So we need to make it an integral part of the domain name service that should be included naturally with the domain name.
And also we knew that this should probably be our responsibility. There would probably be no one else in our country that would take care of DNSSEC. We knew that we couldn't do it alone, so we tried to find some allies in the battle. Most importantly registrars, some content providers and registrants.
So we started to communicate heavily with all those groups, and I will go into a little bit more detail on the registrars. Because, you know, they are really key players in the game, because you need to be able to sell DNSSEC to the end users. So, we launched DNSSEC at the end of 2008. Before that we did a series of seminars explaining to them how easy it is to work with DNSSEC. And, well, of course it's not so easy, but I hope they understood and they were very cooperative.
We set up very nice conditions for them. We didn't charge any fee or anything like that. We really wanted those things to be very smooth.
We also offered some free DNSSEC training, heavy trainings, one or two days of training, just really how to use DNSSEC, how to use an HSM, how to use all that stuff. We went a little bit further actually: we came with some positive discrimination in the marketing, which I will describe later. And also we did some technical work for supporting bulk DS record registration. The trick is we don't have DS records in the registry.
That means that our domain looks like this. Every domain has somebody who is the holder of the domain, some registrant. There is a bunch of people who are called administrative contacts, and each such domain points to another object, which is called a KeySet, and this KeySet could be shared by potentially millions of domains. And each KeySet has somebody who is the technical contact, a technical person, and a DNSKEY, not the DS record, so that's the difference. And if this object is linked to multiple domains, we compute the DS records instead of those people.
So, that means that if, for example, you do a key rollover and you have a million domains and you use a single key for all of them, which is probably the easiest way to deploy DNSSEC for multiple registrations, then you just update one object and everything works smoothly. So that's how we try to help them to support multiple signing of domains.
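The KeySet idea can be sketched as follows: domains only reference a shared KeySet holding a DNSKEY, and the registry derives each domain's DS record from it. This is an illustrative model, not the .cz registry software; the class and field names, the domain names and the key bytes are placeholders, while the key tag and SHA-256 digest follow the standard DNSSEC definitions.

```python
import hashlib
import struct
from dataclasses import dataclass, field
from typing import Dict


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in lowercase DNS wire format."""
    wire = b""
    for label in name.rstrip(".").lower().split("."):
        wire += bytes([len(label)]) + label.encode("ascii")
    return wire + b"\x00"


@dataclass
class DnsKey:
    flags: int         # 257 marks a key signing key
    protocol: int      # always 3
    algorithm: int     # e.g. 8 for RSA/SHA-256
    public_key: bytes  # placeholder bytes in this sketch, not a real key

    def rdata(self) -> bytes:
        return struct.pack("!HBB", self.flags, self.protocol,
                           self.algorithm) + self.public_key

    def key_tag(self) -> int:
        """Standard DNSSEC key tag: a 16-bit checksum over the DNSKEY RDATA."""
        acc = 0
        for i, byte in enumerate(self.rdata()):
            acc += byte << 8 if i % 2 == 0 else byte
        acc += (acc >> 16) & 0xFFFF
        return acc & 0xFFFF


@dataclass
class KeySet:
    handle: str
    tech_contact: str
    dnskey: DnsKey


@dataclass
class Registry:
    keysets: Dict[str, KeySet] = field(default_factory=dict)
    domains: Dict[str, str] = field(default_factory=dict)  # domain -> KeySet handle

    def ds_record(self, domain: str) -> str:
        """Derive the DS record (digest type 2, SHA-256) from the linked KeySet."""
        key = self.keysets[self.domains[domain]].dnskey
        digest = hashlib.sha256(name_to_wire(domain) + key.rdata()).hexdigest()
        return f"{domain}. IN DS {key.key_tag()} {key.algorithm} 2 {digest.upper()}"


registry = Registry()
registry.keysets["KEYSET-1"] = KeySet(
    "KEYSET-1", "TECH-1", DnsKey(257, 3, 8, b"placeholder-key-A"))
registry.domains = {"alpha.example": "KEYSET-1", "beta.example": "KEYSET-1"}
```

A key rollover is then a single update, `registry.keysets["KEYSET-1"].dnskey = new_key`, after which `ds_record` returns a new DS for every linked domain without touching the domain objects.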
Another thing is the co-marketing, as I mentioned. What we usually do is we give some money back to the registrars for their marketing. We pay 50% of their marketing campaigns that somehow promote our domain, and the limit is somehow tied to the registrar's performance. And what we did, we offered that if they have more signed domains, then for each signed domain they will get some bonus, so if they want to have bigger campaigns they sign more domains, and we pay a little bit more for the marketing efforts.
And surprisingly, one of the registrars even has a marketing campaign directly targeted at DNSSEC, which sounds weird because it's hard to communicate those things to the end user. But, yeah, that was one effort.
We didn't just stay with the registrars, of course. We also try to educate some end users, so we were presenting at various fora and we are also hosting some of them, like local fora for Czech people, that the DNS is still ... if there may be some problems, that they can contact us. We also communicated with some important players. We were able to change the mind of several newspapers, the big newspapers, so they signed their domains and also informed the end users about the fact. We used some different communication channels, like our campaign with some crazy guy explaining how great it is to have your domain signed and so on.
And last but not least, we also have some projects specially targeted at DNSSEC in our laboratories, which you can download for free. And I'll just name two of them. The first one is the DNSSEC tester. It's basically a portal where you can just download an application which tests your home device and tells you if the device is able to at least work with DNSSEC, how crap the implementation of DNSSEC on the device is. So that's something that is currently translated to English, Czech and Hungarian, so if you want to, for example, use it, share this resource with us, just let us know and we can add some other languages, different URLs, whatever you want of course.
You know, it's supported by three major platforms.
Another thing which we did, and that was a problem we had, is that it's hard to visualise DNSSEC, actually. Nobody understands that you start it and nothing is visible and it works normally. So, we wanted to find a way to visualise it, so we developed a Firefox add-on that you add to your Firefox and that shows you that the domain you are downloading pages from is signed by DNSSEC, so then you have a green key or you have different colours of keys depending on whether you validate, your ISP validates, or the domain is signed or not signed. So that's the one thing, and I will not describe it in detail because this will be presented by my colleague during the DNS Working Group on Thursday. What happened?
We had all those bunches of communications, and what happened? We launched it at the end of ... September 2008, and you see it started to rise slowly. I think at the end of the year, 2009, we were somewhere close to 160 ... 16100 domains, which is probably quite different from the numbers I presented in the beginning. So, what happened later: in January we had some small addition, well, small, like 20,000 domains, and in September we had another addition, which somehow led us to the number 100,000. So, what happened? You probably know, you are not surprised. Two big registrars decided to support DNSSEC by default. So they signed all the domains they were holding; they are not just registrars, but also DNS providers of course, and they decided not to collect any money for that. They just decided that they agree that DNSSEC should be free and they just simply sign it. So, basically, if you are their customer, you have to protest if you don't want to have your domain signed. And of course, no one really does.
They took it as some marketing advantage. They took it as a competitive advantage, and surprisingly, this occasion was very well communicated. It was very well covered by many media in the Czech Republic, mainly because some of the domains were the domains of those media and they were interested in what really happened. So, it got a lot of attention, and there were two registrars; I think some of those guys are here in the room, so maybe if you want to ask them why they did it and what their problems were, you can do that.
So, I am at the end. I promised to be short. I know everybody is thirsty.
So first of all, DNSSEC matters in .cz, which is good news. We are getting close to root signing, so I think we will gain some more points from this occasion.
We are somehow pioneering because we have more domains than anywhere else, and honestly we found bugs in some implementations, so if you want to do it on a large scale, please let us know; we are trying to fix some bugs in validators and so on. It's not as smooth as you would expect. We had some problems with that. But everything is going to be fixed, so it's on the right way.
Last thing: so welcome to Prague and welcome to the country with the most DNSSEC-secured domains in the world. And I want to give special thanks to the RIPE NCC because they enabled DNSSEC validation on the local resolvers here, so all of you are now using DNSSEC even if you don't know it, so that's great news.
And because you stayed to the end of the presentation, I will show you just how the whole thing started in the Czech Republic, when we took the decision to start DNSSEC. It's a historical picture, we saved it, and this is how it started. The typical Czech way, in a pub drinking beer: we spoke with Richard Lamb and Steve Crocker and all of them, and after a few beers we decided yes, we will go further, we will try to be ... in DNSSEC, so that's it. Thank you very much.
CHAIR: Any questions? Thank you very much. It's quite impressive and sets an example for everyone else.
That was today's last talk. In one minute's time, the welcome reception starts in the other rooms. We start tomorrow at nine o'clock, so I hope you will all be here by then. Enjoy the evening.