Wednesday, 16 May 2018
At 9 a.m.:
DAVE KNIGHT: We should get started, we have quite a packed agenda this morning. Welcome to the RIPE 76 meeting of the DNS Working Group. I would like to start by thanking our scribe and monitor from the NCC. We don't have a lot of time to bash the agenda; we will note that we have made changes in the last couple of days to what was published on the list in terms of running order. Any comments on that? No, good. And I think our only administrative order of business is to approve the minutes from RIPE 75. I am not seeing people rushing to the microphone to object to that so I think we can consider that approved and move on. And we will get stuck straight in with the first presentation and that is from Petr Spacek.
PETR SPACEK: Thank you for coming after the social so early, I am impressed by so many people. So first question, so we wake up little bit, who already knows something about aggressive caching? That is quite a lot. So I will keep the theory short, if possible.
So the theory, that is mostly, you know it. And then we will look at how it behaves in practice and compare this with the behaviour of the resolver under attack. That is basically the outline. So, theory. Proof of non‑existence, you have already seen it; the important thing is that there is a range of names which don't exist in the proof of non‑existence, so we ask for a non‑existent name and we get back the information that there is nothing in between example.com and www.example.com. This brings us potentially a lot of benefits: theoretically it may decrease latency for negative answers and lower resource utilisation, whatever that might mean in this context. It should improve privacy and resiliency, so let's see.
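As a rough illustration of the range proof described above, the sketch below checks whether a queried name falls inside the gap announced by a single NSEC record. The function names are invented for this example, it assumes plain ASCII names that all belong to the same zone, and it follows the canonical name ordering of RFC 4034 only in simplified form; it is not taken from any resolver's code.

```python
def canonical_key(name: str) -> list[bytes]:
    """Sort key approximating DNSSEC canonical ordering:
    lowercase labels, compared from the rightmost label to the leftmost."""
    labels = name.rstrip(".").lower().split(".")
    return [label.encode("ascii") for label in reversed(labels)]


def nsec_covers(owner: str, next_name: str, qname: str) -> bool:
    """True if qname sorts strictly between the NSEC owner and its 'next' name,
    i.e. the record proves that qname does not exist."""
    o, n, q = canonical_key(owner), canonical_key(next_name), canonical_key(qname)
    if o < n:
        return o < q < n
    # The last NSEC in a zone wraps around to the apex: it covers
    # everything after the owner (qname is assumed to be inside the zone).
    return q > o or q < n


if __name__ == "__main__":
    # One cached record "nothing between example.com and www.example.com"
    # is enough to answer many different junk names from cache.
    for junk in ("junk123.example.com", "abc.example.com", "zzz.example.com"):
        print(junk, nsec_covers("example.com", "www.example.com", junk))
```

A validating resolver that keeps such records in its cache can answer any covered name itself, which is where the reduction in upstream queries comes from.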
In practice, it's not that easy, because the real behaviour depends very much on the query pattern, on the TTL (it's DNS, so everything depends on TTL) and also on the way names are spread in the DNS zone, so we need to play a little bit with these parameters to see how it works in practice.
So, experimental set‑up. We have a bunch of clients asking weird questions and we have an authoritative server. The authoritative server is configured differently for different parts of the experiment, and we were using a signed zone so the aggressive caching can work and we can see how it works or not. And we were playing the junk ‑‑ random name queries to the Knot Resolver, which was in turn asking the authoritative server, and we were analysing the number of packets between the recursive resolver and the authoritative server and the bandwidth consumed by this junk traffic. So, tools:
We were using the Knot DNS authoritative server to sign the zone, and the most important part here is that it was intentionally configured to produce big signatures, because we wanted to compare, let's say, the worst‑case scenario when you enable signing and then you are under attack, so how big the bandwidth consumed by the attack will or will not be. And Knot Resolver had a huge cache, the reason for which you will see on the next slide, and it was just for the measurement and the attack.
As we have mentioned earlier, the behaviour of the aggressive caching very much depends on the particular scenario: for an unsigned zone it doesn't work, of course, because this is DNSSEC aggressive caching, and for signed zones it very much depends on the TTL and also on the way names are spread in the zone file. So for the purpose of this experiment, we were using four different zone files which were real zones from real deployments, to see how the name distribution actually affects the behaviour.
So, unsigned zone. It doesn't look right. That is interesting, because my preview is different ‑‑ never mind, I will try the next slide. Still doesn't work. That is interesting because the preview of the next slide is different from what you see here. May someone from the technical team tell me, is it the PDF? No, it's not. Can you open the PDF file, I don't mind, there is no animation or anything. If you just switch to the PDF. You wouldn't appreciate the nice graphs. Talk is cheap, right. So we were measuring behaviour over time, so we were analysing ‑‑ ah, perfect ‑‑ and if you could get rid of the bar on the very bottom. Perfect. Thank you very much. So on the horizontal axis here we have time from the beginning of the attack, which is here, zero, and the attack was running for one hour, so that is in here. And on the left axis we have the number of packets between the recursive resolver and the authoritative server, and on the right side we have bandwidth, and cache size is in here, and now you can see why it was configured to 20 gigabytes, right? After one hour of attack the cache of the resolver was approximately 20 gigabytes without a flush. So these are some absolute numbers, which of course don't make much sense, because absolute numbers depend very much on the software and so on. We are not interested in those. We want to look for the differences between the unsigned zone and the scenario where the zone is signed and caching is enabled. For the purposes of this talk we normalised the numbers, so these are now the same numbers for the unsigned zone: the number of packets and the bandwidth for the unsigned zone are 100 percent and we will be comparing the behaviour to this. And again, cache size, approximately 20 gigabytes, is 100 percent now. Finally, some real numbers. This is how it behaves with a little zone; we have a little zone with just 50 names, that is almost nothing, one wild card TTL.
This is a zone from a Czech blogging platform, and if I was this blogging platform and somebody ran a random sub‑domain attack against me it would be super easy, because in the first second of the attack the number of packets is just less than 1% of the original amount and the bandwidth consumed by the attack is about 2% for the NSEC signed zone; if we were using NSEC3 it would be like 3% or something. Still peanuts. And the nice property here is that the attack basically ends after one second, right? It's somewhere here; in the second second everything in the resolver is cached and nothing gets sent. Effectively this means that the resolver is still 100 percent utilised, because it has to reply to all the junk queries and the resolver is doing it as fast as it can. The authoritative server doesn't see anything after two seconds; that is one property, and the second property is that cache consumption on the resolver is way, way below 1%, which effectively means that these junk queries don't force the resolver to flush the useful information from its cache, so even though the resolver is utilised to 100%, it can still use the information from cache and doesn't need to contact the legitimate servers again, as long as the TTL doesn't expire of course. So that is super easy. 14,000 names, again a real zone, something harder. Numbers are a little bit higher but the characteristic is quite similar. Again, in the first second we have about 8% of the original number of packets, and about 25% or one‑third of the bandwidth consumed by the original attack. In the first second. In the second second, or at the end of the second second, we have like 1% of packets and less than 5% of the bandwidth consumed by the attack, which is quite a difference because in the original attack the amount of packets and bandwidth was the same for one hour, no change. In two seconds in here we are slowly approaching zero; after ten seconds there is like no attack, the authoritative server gets bored. Let's try bigger zones. This is more interesting.
With 110,000 names in the zone, we have like 20% of packets and something about, let's say, 70% of the original bandwidth in the first second, okay? This is quite a lot, but still, it decreases quite quickly, so after ten seconds again the attack is getting slower and slower from the authoritative point of view, and if you look at the longer timescale, after one minute the attack still can be seen but it's like less than 2 percent of the original number of packets and about 5% of bandwidth; if you were using NSEC3 instead of NSEC it would be, let's say, 7 or 8% of the bandwidth, not bad. If you look at the scale of minutes, after 5 minutes there is basically nothing on the authoritative side. So as long as the TTL doesn't expire, we are protected. And again, cache consumption is super nice, less than 1% of the original amount of junk, so again it helps the recursive resolver to work normally even under attack. So let's try huge zones; I am not sure how many people have such a big zone but let's see. In the first two seconds, we have something about one quarter of the number of packets, which is not bad, and the amount of bandwidth is 80% in the first second, for NSEC. Again, for NSEC3 it would be like 100, let's say 140% or something. That is of course more than the original, which is, oh, help me, I am not going to do this. But if we look at ten seconds after the beginning of the attack, we will have approximately 20% of the number of packets, that is nice, and 60% of bandwidth for NSEC and approximately 90% for NSEC3, so after ten seconds even with NSEC3 it is better than the original situation. Longer time scales. Show me more. After one minute, it's very nice because the bandwidth is below 20%; if you look at the longer timescale, after 10 minutes of the attack it's like 2% maybe. After 20 minutes there is again nothing on the authoritative side. That is it. Of course, now you can say, oh but you have a long TTL, I don't like long TTLs, show me more.
So we shortened the TTL to one minute to see what happens. Of course the characteristic in the first minute is the same; it would be very surprising if it wasn't, in DNS it might happen of course but it didn't. And there was a spike in the traffic at the end of the first minute, which is expected, but the nice property here is that the TTLs and queries spread over time, over the entire range of names, so if the attacker is patient enough it will eventually stabilise at approximately 10% of the number of packets and approximately one‑third of the bandwidth consumed by the original attack, which is super nice because we have a huge zone, one million names, and a very short TTL. So even in this, let's say, hostile configuration it works.
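The decay described in these measurements can be reproduced qualitatively with a toy model. The sketch below is an assumption‑laden illustration, not the experiment from the talk: names are modelled as random points on a line, junk queries are uniformly random, every cache miss caches one gap between two existing names, and TTL expiry, packet sizes and NSEC3 are ignored. The function name and parameters are made up for this example.

```python
import bisect
import random


def simulate(zone_size: int, queries_per_tick: int, ticks: int, seed: int = 1) -> list[float]:
    """Return, per tick, the fraction of junk queries that had to go upstream."""
    rng = random.Random(seed)
    # Toy zone: existing names are points in [0, 1), kept in sorted order.
    zone = sorted(rng.random() for _ in range(zone_size))
    cached_gaps: set[int] = set()  # gaps between names already proven empty
    miss_ratios = []
    for _ in range(ticks):
        misses = 0
        for _ in range(queries_per_tick):
            # A random junk name lands in exactly one gap between two names.
            gap = bisect.bisect(zone, rng.random())
            if gap not in cached_gaps:
                cached_gaps.add(gap)  # one upstream query caches the whole gap
                misses += 1
        miss_ratios.append(misses / queries_per_tick)
    return miss_ratios


if __name__ == "__main__":
    for size in (50, 14_000, 110_000):
        print(size, simulate(size, queries_per_tick=10_000, ticks=5))
```

In this model a zone with 50 names can never cause more than 51 upstream queries, however many junk queries arrive, while larger zones take more ticks before every gap is cached. Changing how the zone points or the query points are drawn is a simple way to explore how the two distributions interact.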
Now, we can sum up how it behaves. The aggressive caching clearly helps with the random sub domain attack; much better cache usage helps the recursive side because it doesn't evict everything from its cache, so the legitimate clients asking sensible questions are getting their answers mostly from the cache, so business as usual. Of course the CPU utilisation will be a record here, but that is it, you are under attack so there is some price to pay for it. It very much helps to eliminate the random sub domain attack on the authoritative side over time, so if the attack is long enough it will eventually reach zero or, depending on TTL, it will converge to zero or some stable point. Of course NSEC is more efficient than NSEC3 but even with NSEC3 it helps. So that was for the random sub domain attack. So what is the takeaway? What is my urge to you? Please, most important of all the possible steps you can take, because the new software really helps, even though the old software works: please upgrade. It will help your users to increase their privacy, because if you start to validate on the recursive resolver it will help to stop leaked queries. If you are on the authoritative side and sign your domain, you will get over time protection against the random sub domain attack. Of course, this will work only if a lot of people validate and the authoritative side signs; the nice property of the aggressive caching is that it helps both sides. It's not just only for recursives or only for authoritatives, it helps on both sides, so finally we have something which is mutually beneficial for validators and the authoritative side, and of course if you upgrade you will get rid of all the problems from old times. I will not delve into details, but if you have old broken software you can expect problems next year, so please upgrade.
And as I have mentioned on the previous slide, NSEC3 wasn't supported by the recursive resolvers for the aggressive caching but that is changing, so you can look forward to Knot Resolver 2.4, which should be out in the next couple of weeks, and it will support NSEC3 aggressive caching, so now the authoritative side can pick: am I interested in just being more efficient with NSEC, or in hiding something using NSEC3? The protection as was presented here will work for both. So, thank you very much, and please upgrade and ask questions.
DAVE KNIGHT: Thank you. Do we have any questions?
GEOFF HUSTON: APNIC. I am all in favour of this and I think it's really cool and everyone should do it, so I am not criticising this, but if someone does it, it is unlikely they will reproduce your outcomes, and as far as I can see, the reason why is that your distribution of names is uniform and it's the same distribution as the distribution of queries, isn't it? Because, in some ways, if every miss spans the same range of the zone, you will get relatively uniform behaviour. But if, for example, my zone only has names that start with A and the random name attack comes in, as soon as a name in the random name attack starts with a B, all the rest of the zone is covered with an NSEC response; in other words, I get a huge sort of win because my zone is not uniformly populated and, in some ways, these results that you are showing here are effectively results where, as far as I can see, you are smearing the zone with information relatively uniformly, is that right?
PETR SPACEK: Let me correct this. These are actual zones we have taken from production, so this is a blogging platform from the Czech Republic, this is a Czech university, which certainly doesn't have a uniform distribution of names in the zone, this is an unnamed web hosting company and this is one smallish or biggish TLD, depends on point of view. These are not zones with uniformly spread names.
GEOFF HUSTON: They are not synthetic. Is it the Czech alphabet?
PETR SPACEK: It's not. This is not the Czech TLD. In DNS zones, for example at the university, you have a lot of like English‑sounding names for servers and so on.
GEOFF HUSTON: The point is though that the distribution of the random name attack versus the distribution of names in the zone being attacked actually has an impact on the way NSEC caching works. It's all good, but you might find better or slightly worse results than you find here depending on those two variables.
SPEAKER: Warren. If you do this testing again, what would be nice is to have CPU utilisation better displayed.
DAVE KNIGHT: Thank you again.
Up next we have Geoff Huston from APNIC presenting work he and Joao Damas have done on measuring additional truncated response.
GEOFF HUSTON: Morning everyone. I am with APNIC, and this is work that I have done with Joao Damas. It's work done earlier this year. Last year, we did some measurements and if anyone thinks that the Internet works, I think they are on drugs, so is the Internet, yeah. It strikes me as truly bizarre that when you try and push a large response, i.e. a DNSSEC response with RSA signatures, as soon as you exceed 1,500 octets you run into an awful place. For .org, the DNSKEY response is 1,650 octets. Now, I don't know what all of those validating resolvers are doing, but when we test every resolver, not just the validating ones but every resolver, and we push back a really large response, 38% of them don't get it in v6. The network just goes clunk. And it seems to me that whatever Internet we are using now is not the Internet we thought we had, because rather than being flexible, rather than being able to fragment, rather than the entire issue that says that big UDP packets in particular work, they don't. They don't work in 4 and they really, really don't work in 6. And so there is this whole sort of area of the Internet, and it's massive, where end‑to‑end paths can't carry fragmented datagrams, they just can't. And our concern when we look at this is that not only are none of you fixing it, that is bad enough, but you are deploying more gear that makes it worse, and so the problem is actually getting worse, not better.
The DNS is an obvious victim. The original specification of the DNS was interesting. When you asked a question, I have lots of things missing, but when you asked a query, either the response was 512 octets or less, or you got back a truncated response and you had to go to TCP if you wanted something better, and that was the original spec, back in RFC 001 or ‑‑ someone will quote it at me, I don't know it. With DNSSEC and EDNS 0 we kind of made the assumption that we have gone beyond that; 512 bytes is such a strange number, if the Internet has an MTU it's 1,500, right? Except if it's v6, because it's 1,280, I accept, but the theory was that we could do DNSSEC and keep UDP, and we are relying on that really heavily, but when we introduce v6 into this, it doesn't work. So, the reason why it doesn't is that you will probably get an answer but it will take you much longer than you are prepared to wait. Because the first time you ask, the big response gets dropped, you get silence. So what do you do? Ask another resolver, but it's the same answer so what do you get? If the problem is close to you, you get silence. And so you try a few more times, time is ticking at this point, the extended silence then tells your stub resolver, okay, now, let me drop the EDNS 0 UDP buffer size to 512, let's see if there is any kind of response. At that point you get a response, you flick to TCP and things work, but as a user you have died and gone to heaven. Users don't wait that long. And so we get answers eventually but in terms of the timing, there is a lot of impatience going on. So the real question is, how often is this happening? What is going on and can we make it better? Well, the answer to the last part of those questions actually came in a draft that first appeared in September last year; I didn't write it, I had nothing to do with it. Joao wasn't a conspirator with this. This was Davey Song, and this was the Yeti work.
I think it came out of the Yeti work, and the draft was this additional truncated response. He has pushed out a new version a couple of weeks back, and it's a hybrid sort of response to this problem, and it applies to v4 and v6 and, as the name suggests, it's really quite easy. One query generates two responses: the original UDP response, whatever it was going to be, and possibly a trailing fragmented response ‑‑ sorry, trailing truncated response, so the big response goes out and then behind it comes the same response truncated. So, the way this is intended to work; here is the kind of algorithm that I explained a few seconds ago: the original queries time out because the frags get dropped, right? You re‑query using more resolvers, they frag and time out. You then re‑query and you use a 512 octet buffer size. You get the truncated response and you go to TCP. Whereas ATR says I get a truncated response because it trails the fragmented one. So if I can't receive the truncated ‑‑ the fragmented response, I will get a truncated response within ten milliseconds. And because that is the first answer I see, it's the real answer, and I immediately flick to TCP. So if the fragmented responses are dropped, the truncated response will be following. If you get back the fragmented response you are fine, the truncated response is a spurious response that you will ignore, your DNS doesn't care any more, it's not listening. If it gets lost you are going to pick up the truncated response and flick. You send the query, you get back the fragmented response that may or may not make it, and you also get sent a truncated response, that always happens. If the fragmented response didn't work, the truncated response will work and you will do the TCP query and answer. So that is the intended mode of operation. Of course, the answer to the question of all of this: what could possibly go wrong?
Well, you know, I'm sure there are some networking folk in here as much as there are DNS folk, and one of the things networking folk love to do is reorder the packets, the bastards, and it's surprisingly common. Reordering the frags doesn't really matter, but if you reorder such that the truncated response gets there first, you are going to get some spurious TCP, and reassembling a fragmented response takes time in the stack, you know, it takes some time, so there is a certain amount of the smaller packet overtaking the larger fragmented packet which will happen. The other issue, which is also to my mind bizarre and surprising, but in the wonderful world of the DNS not only is everything possible but a massive number of people are doing it: a whole bunch of you don't do TCP in the DNS. You get back a truncated response and go ‑‑ next question. Just do nothing. So not everyone supports TCP, so you kind of get these three islands of brokenness in the DNS. There is a whole bunch of folk who can't receive fragmented UDP. It seems to be almost 38% of users sit behind them in v6, there is a lot of them. There is another bunch of folk who when you send them a truncated response will not do TCP. Weird but true. And there is a few that have both problems, you know, both legs are broken. This is a problem. ATR won't help, indeed nothing will help. If you can't do TCP and can't get large fragmented responses, I don't know. Don't do the DNS. It won't work.
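The intended ATR behaviour and the three "islands of brokenness" can be put into a small decision sketch. This is a simplified model under stated assumptions, not code from the draft or from any DNS server: the 1,500‑octet threshold, the `Client` fields and the function names are all hypothetical, and reordering and the 10 ms delay are left out.

```python
from dataclasses import dataclass

FRAGMENT_THRESHOLD = 1500  # assumed size in octets above which a UDP response fragments


@dataclass
class Client:
    receives_fragments: bool  # do large fragmented UDP responses reach it?
    does_tcp: bool            # does it retry over TCP after a truncated response?


def server_udp_responses(response_size: int, atr: bool) -> list[str]:
    """What the server sends over UDP for one query."""
    if response_size <= FRAGMENT_THRESHOLD:
        return ["full"]
    responses = ["full (fragmented)"]
    if atr:
        # ATR: trail the large response with a truncated copy.
        responses.append("truncated")
    return responses


def outcome(client: Client, response_size: int, atr: bool) -> str:
    sent = server_udp_responses(response_size, atr)
    if sent == ["full"] or client.receives_fragments:
        return "answer over UDP"  # any trailing truncated response is ignored
    if not client.does_tcp:
        return "no answer"        # both legs broken, or no TCP fallback
    if "truncated" in sent:
        return "answer over TCP, right after the trailing truncated response"
    return "answer over TCP, only after timeouts and a 512-octet retry"


if __name__ == "__main__":
    for frag_ok in (True, False):
        for tcp_ok in (True, False):
            c = Client(receives_fragments=frag_ok, does_tcp=tcp_ok)
            print(c, "->", outcome(c, response_size=1600, atr=True))
```

The only group for which ATR changes the result in this model is the one that cannot receive fragments but does do TCP, which is the group the measurements in the talk set out to size.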
But we'd like to know how big these pools are, because it would really help us understand if this is going to be helpful or just a cute trick. So, we started to measure this using this ad technique; we used six tests and separated v4 and v6 so we weren't crossing the beams in ways we didn't understand. So the first two always, always generated an ATR response. Now, there is a few tricks to doing this, and one of them is, I have a large response and I don't care what EDNS 0 buffer size you are offering me, I am ignoring you, it was merely a suggestion, it wasn't a rule, I am going to give you a large fragmented response no matter what, and I am going to trail it with a truncated response. So I am ignoring what you are asking, I am just sending you back basically the big fragmented response and truncated response, no matter what you said. The second time, the second pair of tests, I only ever send back truncated, irrespective of EDNS buffer size. If you don't do TCP you are not going to get an answer. And the third case is the other part of ATR, I just send you the big response that is fragmented, no truncation, nothing. I don't care about your EDNS, your UDP buffer size. I am not listening. This is the way I am forcing the resolvers into a particular behaviour and not going down the emergency "let's haul ourselves out of this" path. Did this 55 million times because Google, they do this stuff brilliantly and help us a lot, thank you.
Looking at resolvers, I find this really weird. On the face of it, it just looks like the Internet is truly amazingly broken. Almost 30% of resolvers in v4 cannot process an ATR response; in other words, they didn't like the fragmented response and the truncated response. This is v4, right? This is where fragmentation works, yes. 40% of resolvers got a large answer ‑‑ sorry, 60% got a large answer. 40% didn't. And yes, you all do TCP, right? Except for the 20% of resolvers that don't. And that is v4, that is the thing we have been working on for 30‑odd years, this is what we are meant to be good at and if this is good you guys are crap. And you look at v6, I found 20,000 resolvers in these tests over 55 million users; ATR only worked slightly more than half the time. Large UDP responses, half the time. Who would deploy v6 with a service with a 50% failure rate? TCP, 45% failure rate. This is just the inverse. So these failure rates are stunning. I don't see how anyone could say we are doing a good job, because we are not. There is no way that we can seriously think about DNSSEC in the light of that, you would think. Okay. As I said, seriously, we are doing that shit a job? And we are. And if someone says deploy v6 I would say deploy v6 carefully, because if you can't make it work properly don't bother. Because making life worse is not helping anyone. Half of the visible v6 resolvers can't get a large fragmented packet; that is not a good reflection on our work. It's shocking. Who are they? Anyone from AS 9644 in the room? Probably not, it's in Korea. The Americans. Another Korea, Korea, China, Italy, Aruba. No one is putting their hand up, it's crap. OpenDNS, weren't they bought by Cisco and meant to know what they are doing? These can't handle large UDP responses. And some of these are professionals, they are meant to do this for a living. It's shit. V6, Google cleaned up a lot, but a lot of people use Google and not every part is cleaned up. Frontier, OpenDNS again.
These are the v6 ones again who cannot handle large UDP responses. Deutsche Telekom. Where is Rudiger? Absent. Not good, really not good. What about the ones who don't do TCP, this is kind of, you don't do TCP, seriously? Philippines, yes, seriously. Telefonica Spain, yeah, seriously. I don't know, this is v4, you are meant to know this shit. You were taught this stuff. You didn't even have to read a book. So not good. Orange, they are local, what is going wrong with these people?
In v6, same lists of folk, again they cannot do TCP. How does their DNS work? So, whenever you count the DNS, what you immediately notice is that counting resolvers isn't really the truth, because 15% of the world's users eventually send their queries to Google's 8.8.8.8 service. So some resolvers do a lot more work than others, and around the top 10,000 IP addresses of resolvers handle 90% of the world's users. So while those numbers may be crap they may only reflect a small number of users because of this distortion in use. So maybe instead of counting resolvers, let's look at this from a user impact. What happens to you and me and everyone else's users, and without looking at which particular resolvers we use, let's look at our resolution servers. So what we find is that with v4, if I send back only a large fragmented response, no truncation, no fall back to TCP, you have either got it or you haven't, approximately one eighth, 12.5% of users out there can't get it. That is a lot better than, you know, 30 to 40%. Multiple resolvers: when you get failure you move on to someone else, something will eventually work, but 12.5% of the time nothing works; would you deploy a service with a 12.5% loss rate? Seriously. It's a disgusting number. Even can't do TCP, it's 4%, cannot do TCP. No matter what resolvers you have got selected, they won't do TCP. That is v4. V6, as I said, v6 is always worse. 20% can't do fragmentation. 20% of users, if v6 is all they have got, will not get back a large answer unless they resort to TCP, but TCP has an 8.4% failure rate. If you combine the two you make life slightly better; with ATR, for 3.9% of users it's not helping. With v6, for 6.5% of users it's just not helping. So we can now put some numbers on those buckets. For v4, around 8.6% of users can't do fragmented UDP, 0.1% can't use TCP and 3.9% are in the nothing is going to work. They can't do both. For v6 the numbers are slightly worse: 14.3% can't do UDP, 1.9% can't do TCP and 6.5% can't do either.
So what does ATR do? It doesn't make perfection, it doesn't clean up the shit. It will help some folk. It will take a 12.5% frag UDP loss rate and drop it down to 3.9% in v4, and the same in v6: take 20% and drop it down to 6.5% immediately. There is no delay, there is no time out, there is no go and search for something else. So you drop by 14% and you increase the speed for those 14% of folk. The other 6.5% of folk, it's v6, well, you know, maybe you shouldn't have deployed it. So why use ATR? It's one more mechanism and for those familiar with any of the current DNS discussions it's one more straw on the camel's back. So why do it? Because it makes the DNS faster for that class of folk and it's not a small class of folk, 12% of folk using v4, 20% using v6; ATR will help in a lot of those cases. It will help. And when those frag UDP responses are blocked, it will make it faster. So why use it? It will improve folks' life out there on the DNS. It's not perfection but it will make it better for some people some of the time. Why not use it? Well, it makes DDOS just that little bit better, doesn't it? I.e. worse. Because now one query gives you another gratuitous packet added on to the rest, so the DDOS attack factor works just that little bit better. Also now, I'm starting to send truncated responses to a closed UDP port, and many of your resolvers, one in five, don't do UDP blackholing. Why don't you? I don't know, you are just lazy. But what I get back is, one‑fifth of the time I get back ICMP port unreachables; I shouldn't be but I do, at about a fifth of the rate back. So there is the potential DDOS factor, why? Because for ten milliseconds I'm hanging on to the response, and sending it truncated, so if I'm doing five responses a second, that's five items in my ten millisecond cache. If I am doing 300 million responses a second, does anyone do that many? That is a big cache of ten millisecond delay.
You might want to run a piece of fibre to the moon and back again or something just to induce the delay without memory; there is a potential DDOS factor. I mentioned reordering, and it is one more feature. Don't forget you only ever send it if you think you are going to fragment. Now, you either have a direct query down to your interface and go, did I fragment, or you are going to have to guess. But you are only meant to do this if the initial UDP response is truncated. It's not for everyone, it's just when you are sending out responses bigger than your local MTU. Adding more packets is certainly never good. The real question to you folk, particularly the folk who are thinking, the vendor side, is: is it worth it? The theory goes that with EDNS 0 buffer size hunting you will eventually fix up the problem anyway, because all those null responses will convince you to retry the query with a 512 octet buffer size; you are going to haul yourself out of the hole anyway at some point. Is it worth the additional packet to haul yourself out of the hole in 10 milliseconds? I don't know. And the second thing is, does anyone have any better ideas, because I haven't seen any other ideas about this? But there is certainly no doubt in my mind that there is a huge intolerance of delay these days, and the massive timers and the huge hunting that the DNS does in its obsessive search for an answer go against the trend of trying to make the DNS faster and faster and faster, and when we try and add more stuff, like DNSSEC, we have got these two pressures flying around and I am not sure how they get resolved. So that is my assessment of it. I am happy to take comments and queries, but don't forget I didn't write the draft; I really have been neutral on whether you should use it or not.
I don't write this code so it's up to you folk to think of this and how you are going to respond to this combined issue that large packets and the Internet aren't working well. Thank you.
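The user‑level percentages quoted in the talk fit together arithmetically, which the short sketch below checks. The bucket values are the ones spoken in the presentation; the IPv6 "TCP only" figure is taken as 1.9%, the value the transcript's own bracketed correction gives, and the dictionary keys and function name are made up for this example.

```python
# Share of users (percent) in each failure bucket, as quoted in the talk.
BUCKETS = {
    "IPv4": {"no_frag_only": 8.6, "no_tcp_only": 0.1, "neither": 3.9},
    "IPv6": {"no_frag_only": 14.3, "no_tcp_only": 1.9, "neither": 6.5},
}


def summarise(b: dict[str, float]) -> dict[str, float]:
    frag_fail = b["no_frag_only"] + b["neither"]  # large fragmented UDP only
    tcp_fail = b["no_tcp_only"] + b["neither"]    # truncated response, then TCP only
    atr_fail = b["neither"]                       # ATR fails only if both fail
    return {
        "frag_udp_fail": round(frag_fail, 1),
        "tcp_fail": round(tcp_fail, 1),
        "atr_fail": round(atr_fail, 1),
        "helped_by_atr": round(frag_fail - atr_fail, 1),
    }


if __name__ == "__main__":
    for proto, b in BUCKETS.items():
        print(proto, summarise(b))
```

The sums come out at 12.5% and 4.0% for IPv4 and 20.8% and 8.4% for IPv6, in line with the 12.5%, 4%, roughly 20% and 8.4% failure rates mentioned, and the users ATR can speed up are the "no fragments only" bucket: 8.6% on IPv4 and 14.3% on IPv6.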
SPEAKER: Hi, my name is Ileach Menam, from Logious. So there are two things on my mind. The first one is that you ignored what the ‑‑ what the resolver said about EDNS 0 size, so suppose they support 2K instead of 3K, no wonder they couldn't process this because they haven't a large enough buffer, so that doesn't make me feel very happy, that choice.
GEOFF HUSTON: Let me explain a bit more about how this works and why we had to do this. For a lot of the time the way we were measuring whether the DNS worked or not is whether you managed to resolve the name, and the way the ad works is you resolve the name, you go and get the web object. But it's really, really noisy, there is almost a 5 to 6% noise component, and so we thought, is there a better way of doing this? And there is. You don't pass glue records. So when I say this is the name you are after, here are the name servers, I neglect to say, and here are their address records, here are the glue records. No, no. You have to go hunt. So now, your resolver puts that to one side and says, well, what are the IP addresses of these servers? A lot of resolvers, when they are trying to resolve missing glue, do not use EDNS 0 options at all, none. And so how do I give them back a large response? I am not allowed. So it's my server, my server, my rules, my game, I just send them a large response, because no resolver that I'm aware of ever says, I asked a query with a 20 octet buffer size, if I get 21 it's not a go. The buffer size is a suggestion, not a rule. If someone has written code that actually matches the answer to the query with the buffer size, put your hand up now and I will say fair cop, you got me, but I don't think resolvers do that. They are trying to create a behaviour in the server; it's not a limitation of the querier.
SPEAKER: I am not sure I understand. But I ‑‑ what size are the packets that you sent?
GEOFF HUSTON: 1,600. Just too big.
SPEAKER: Right, okay. I would be happier if you didn't have to do this but maybe ‑‑
GEOFF HUSTON: It's the DNS ‑‑
SPEAKER: About the fragmentation issue, do we know what the problem is, why so many people have a problem with this?
GEOFF HUSTON: There are two sets of problems, you know. Firewalls really have a problem with fragmentation, as you are well aware, and in general, rather than sending through the frag they like to reassemble the entire packet and have their timers and get bored. In v6, because the fragmentation header pushes everything down by a bit, there is an amount of cheap hardware out there, a disgustingly large amount it turns out, that when it sees extension headers and that the transport offset is further down the stack says too hard, and drops the packet.
SPEAKER: Right. So ‑‑
GEOFF HUSTON: I think a cheap vendor that started with C and then A and then a T did this.
SPEAKER: Another problem that you may have been running into: if there is actually a limitation on the path MTU, you are also triggering path MTU discovery, and since you are only sending one packet, if that packet is lost you would have to rely on retransmission, which there won't be, so that could be an issue. So maybe if you are going to do more testing you could check for that.
GEOFF HUSTON: It's not a path MTU issue ‑‑ it's microscopic. What you are searching for, and we are, are paths less than 1,500; there are vanishingly small numbers of those. 6to4 and Teredo don't work, really bad idea.
SPEAKER: My conclusion about this is if we do this ATR we are basically saying we are giving up on fragmentation and it's okay to be this broken, that is what I read if we do this?
GEOFF HUSTON: No, what we are really saying is, we can make things faster for big answers if you go there, that is all. And it won't be perfect and it won't fix all the problems but it can be faster.
WARREN: Can you go back to the slide about looking at users? And I think we have sort of had this discussion before. So I don't quite understand these numbers, because I'm a user and I resolve www.ietf.org, which apparently has big responses.
GEOFF HUSTON: In the DNS? www.ietf.org, and you validate?
SPEAKER: Validate local ‑‑
GEOFF HUSTON: And things aren't cached and you are going to get the DNSKEY set for .org at some point, 1630 octets.
SPEAKER: Yes. So I wander around and connect to a bunch of v4 and v6 networks, and I haven't seen 4% breakage or 8.4% or 20 or 12 or any other real set of numbers, somehow it all just works.
GEOFF HUSTON: We run the ad 55 million times and 4% of them fail, Warren.
SPEAKER: I have not been to 55 million different places.
GEOFF HUSTON: You need to see our ad more.
SPEAKER: You can't really be claiming that a large percentage of .org people are broken?
GEOFF HUSTON: Don't forget one thing here, I am pushing this far harder than reality. Only 12% of users exclusively sit behind validating resolvers. Another 12 to 20% sit behind some that validate and some that don't, God knows what they are thinking. So if the entire world validated, big if, you get a 4% loss rate. If I only tested validating resolvers the loss rate would be a lot lower. But I am stressing everybody in this test, not just the validating resolvers.
SPEAKER: Sure I get that.
GEOFF HUSTON: That is the reason why the number is high, but your experience doesn't show it.
SPEAKER: My experience doesn't show it, and talking to .org people, they would presumably have also had a number of people bitching ‑‑ small percentage. And just to add to this point, you are forgetting caches. .org is popular enough that most commonly used resolvers will have that cached. What you will see in that case is that when the TTL expires or the signature expires, for a few seconds the resolver does all the, oh, it doesn't work, I will try again and use TCP and figure it out and I get it to work again, which it eventually will, right? If Google's resolvers have a cache flush for .org, it will take a few seconds and eventually it will get there. After that is done, everyone who asks for .org after that, during the validity of the TTL, will be fine.
SPEAKER: Except some of it shows that TCP doesn't work either, so one never gets it in cache.
JOAO DAMAS: For some people, yes.
JOAO DAMAS: Okay. And that is why they use more than one resolver.
GEOFF HUSTON: We really hope you are not validating or wondering what they are doing.
SPEAKER: Lars‑Johan Liman from Netnod. On the TCP side as well, because you want to see problems with TCP ramping up, wouldn't you? If you have problems fragmenting UDP or handling fragmented UDP, you ought to have problems with fragmented TCP as well, or is that jumping to a conclusion?
GEOFF HUSTON: That is jumping a long way further than I was thinking of jumping. Insofar as there are problems with TCP and MSS size, those problems do get slightly higher in v6; I didn't really look very hard into why it's failing on this. We are using 1,500 octet MTUs on TCP.
SPEAKER: And on UDP?
GEOFF HUSTON: That was 1,500 octet fragmentation, yes.
SPEAKER: Second comment is, this would be an excellent presentation to give to firewall people and, you know, networking people. I would argue that the DNS people are a smaller part of the problem here.
GEOFF HUSTON: I would argue actually the opposite, that the only folk who rely on large UDP packets for their livelihood, their speed and customer satisfaction are actually the people in this room. It's your problem more than anyone else's. Other folk have problems, you know, but most of the time, QUIC, 1350 octet MTU, problem, what problem, the day is sunny? There are workarounds in TCP and most folk rely on those workarounds, but you are the continues and you have no plan B. There is no plan B. The ICMP means nothing, I have sent the answer, and so now I have got a caching problem with all this stuff, and so yes, it's the DNS that is staring at this going, I wish we had an answer.
SPEAKER: I kind of disagree but never mind.
JEN LINKOVA: First of all, thanks for reminding me, because I need to keep looking into the results numbers, because I know what it's not, I need to find out what it is. But what is interesting: UDP fragmentation, it's most likely the network or something, it's ‑‑ functions of middle boxes, but for TCP and ATR, it looks like something on the resolver itself is behaving strangely.
GEOFF HUSTON: Other than enthusiastic firewalls right in front of resolvers, and don't forget my suspicion, although I haven't really blocked it, is that those resolvers that can't do TCP aren't serving a lot of people, they are serving a smaller number, and I gave you a list of folks, the big ones, but for the majority of those, enthusiastic firewall: TCP over port 53 is obviously wrong. And as long as you are not validating you won't even notice you did it.
JEN LINKOVA: So my question was, it's interesting, all this interesting behaviour in resolvers, if you have more of them in IPv6 ‑‑ those resolvers which do not respond in TCP, right? So we have more of them in the v6 world than in v4, right?
GEOFF HUSTON: Yes.
JEN LINKOVA: So it's, okay, that is interesting, so people are kind of adopting v6 and enabling it and their resolvers are doing some strange thing.
GEOFF HUSTON: Yes. I hit OpenDNS a large number of times, 206,000 times, and it failed.
JEN LINKOVA: I won't repeat ‑‑ I am actually pleasantly surprised that v6 is not much more than v4.
GEOFF HUSTON: Is that a good statement?
JEN LINKOVA: It's a good statement. We cannot say v6 is 25 ‑‑ it's comparable.
GEOFF HUSTON: It's only slightly worse than 4, I suppose that is an optimistic statement.
JEN LINKOVA: I want to add to your comment about why we have blocked fragments in v6: it's not just cheap hardware. Looking at how firewalls are organised, UDP 53, but to permit fragmented stuff you need to ‑‑ permit the v6 fragment header and the header after ‑‑ explicit ‑‑ policy, you need to have something for IPv6, and people do not do that.
SHANE KERR: Hi, I have a quick question from the IRC chat from rod I can, Internet citizen. He wants to know, is there a test query DNS server address, are you able to run this for yourself?
GEOFF HUSTON: No. As I said before, the way we constructed it in the experiment was remarkably non‑obvious, because what we wanted to do was not just give you an answer, but make sure the DNS asked a subsequent question that proved that the first answer got back. And so we are playing around with glue records, but as soon as you start playing around with glue, most resolvers use a different behaviour, so our server then has to behave differently, and so this whole thing gets incredibly customised incredibly quickly. We run basically, Joe did a huge amount of work, Ray wrote the original code, on LBNS. It's a library based server and the questions are programs, so when a question comes in the code executes and a custom answer comes out. It's not a public service, we just do this right behind the ‑‑ just for the ad. So, nice question, but no, I haven't put one up.
CHAIR: Thank you very much.
Next up we have Moritz Muller from SIDN who is going to talk about DNSSEC roll‑overs.
MORITZ MULLER: Good morning. I am going to talk about rolling with confidence and how to manage the complexity of DNSSEC operations, and in this case I will present a methodology with which operators can monitor their DNSSEC roll‑overs, specifically algorithm roll‑overs, so they can be more confident and make sure nothing goes wrong. This is joint work together with universities, the registry of ‑‑ we have more than 5.8 million registered domain names, out of which more than 3 million are signed with DNSSEC, and I am part of the research department.
So, as RFC 6781 says, key roll‑overs are a fact of life when using DNSSEC, and there are multiple reasons why you have to carry out a key roll‑over: maybe you have some key management policy that says so, maybe you have some compromised keys you have to roll, or you upgrade your algorithm because you want to move to a more secure and more efficient algorithm. So you have to carry them out periodically. And basically, as you probably know, there are three different types: ZSK, which is probably the most periodic roll‑over type; KSK, where you have to change the key signing key and therefore also have to replace the DS record at the parent; and algorithm roll‑overs, where you have to change both and replace the DS at the parent.
Algorithm roll‑overs specifically can look kind of scary. This is again from the RFC of best practices; it shows the algorithm rollover in different stages, and this is the conservative approach of the algorithm rollover, but as you can see, it doesn't look that straightforward, and we think this might also be one of the reasons for the low deployment of DNSSEC, because this process is kind of annoying and things can go wrong there, so we want to fix it by providing this methodology. What can go wrong there is something that also happened to us in 2012, back in the days when DNSSEC was quite new. Sorry to my colleague Mark for dragging him into this; this is a mail he sent when there were DNSSEC validation failures of .nl, and it said the new ZSK was not pre‑published long enough, which finally resulted in validation errors. Back then the impact was not that big because validation was not that common, but now, with the quad 8s and quad 1s and what else, quad 9s, validation is becoming more and more popular. And what you get from this slide or from this message already is that timing is quite an important part when carrying out a rollover. And I will briefly explain why that is. I hope I am not preaching to the choir; I think there are many people who know why it is quite hard, and hopefully I will reach some who don't use DNSSEC yet.
I can explain the background in a bit more detail. Imagine that you have a forwarding validating resolver very far on the left, the light blue box. You have a resolver in the middle, which is the upstream resolver, and you have a name server which is authoritative for example.com and which is also signed. And example.com is right now carrying out a key rollover. The resolver has cached the A records of example.com and the signature signed with the old key. And now the forwarding resolver wants to know who has example.com and therefore queries the upstream resolver, and it gets as the response the cached answer signed with the old key. 
Because the forwarding resolver is a ‑‑ it also wants to get the key to validate this record, and the upstream resolver, the orange one, doesn't have the key in its cache, so it has to query the authoritative name server, which has replaced the old key with the new key and therefore responds with the new key. And that means that the forwarding resolver cannot validate the old signature with the new key and therefore fails validation. This is just one of the cases where validation goes wrong because the operator hasn't taken caching into account when rolling its keys. We replicated this scenario with RIPE Atlas probes and we found that around 35 probes were behind resolvers that failed in this kind of scenario, but there are way more scenarios where things can go wrong.
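The failure described above can be reduced to a toy model: a cached signature can only be checked if the key that produced it is still in the DNSKEY set the validator receives. The sketch below assumes that simplification; the key tags and the function name are hypothetical and no real cryptography or DNS library is involved.

```python
def can_validate(rrsig_key_tag, dnskey_key_tags):
    """A signature can only be checked if the key that made it is available."""
    return rrsig_key_tag in dnskey_key_tags


OLD_KEY, NEW_KEY = 11111, 22222  # hypothetical key tags

# The upstream resolver still serves the A record and its signature from cache.
cached_signature_key = OLD_KEY

# Key swapped too early: the authoritative server publishes only the new key.
print(can_validate(cached_signature_key, {NEW_KEY}))           # False

# Old key kept published until the cached signatures have expired.
print(can_validate(cached_signature_key, {OLD_KEY, NEW_KEY}))  # True
```

The second call corresponds to keeping the old key in the zone for at least as long as old signatures may still sit in caches.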
So as I said, timing is quite crucial, and you have to take two delays into account when carrying out a rollover and when exchanging your keys. The first is the publication delay and the second is the propagation delay. The publication delay, by my definition, is the time it takes until every name server is in sync again, and depending on how you spread your records through different sites this may take seconds to minutes. The propagation delay, by my quite simple definition, is the time it takes until resolvers have picked up the new state, which means how long it takes until resolvers have dropped the old record out of the cache and can potentially query for new ones, and depending on the TTL you are using, it takes minutes, hours or even days.
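The two delays combine into a simple waiting rule between stages. A minimal sketch, assuming the propagation delay is bounded by the largest TTL of the records being replaced plus an optional safety margin; the function name, the margin parameter and the timestamp are illustrative and not taken from the talk.

```python
from datetime import datetime, timedelta, timezone


def earliest_next_stage(change_time, publication_delay, max_ttl,
                        safety_margin=timedelta(0)):
    """Return the earliest time at which the next rollover stage is safe.

    change_time:       when the zone change was made
    publication_delay: time until all authoritative servers are in sync
    max_ttl:           largest TTL of the records that were replaced
    safety_margin:     extra slack, e.g. for resolvers that stretch TTLs
    """
    return change_time + publication_delay + max_ttl + safety_margin


# Hypothetical change time; 10 minutes publication delay, 24 hour TTL.
change = datetime(2018, 1, 1, 0, 0, tzinfo=timezone.utc)
ready = earliest_next_stage(change, timedelta(minutes=10), timedelta(hours=24))
print(ready.isoformat())
```

A measured propagation delay, where available, can be passed in place of the TTL to replace the assumption with an observation.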
So, back to the algorithm rollover. As you can see here it has five different stages; you have six stages but the first one is just the initial stage, and between each of the stages you have to wait for the publication delay and propagation delay to expire until you can move safely to the next stage, because otherwise resolvers which still have cached records may fail validation. Because it's an algorithm rollover you have the interaction with the parent as well. You can automate roll‑overs quite well nowadays, ZSK or KSK; you can automate them with OpenDNSSEC, for example, which takes care of most of the parts. You still have to know what your propagation delays are, but it can take care of most of the rolling process. But especially if you have a KSK rollover you still have to interact with the parent, and therefore you have to define yourself when it is safe to move to the next stage.
I mentioned that this is the conservative approach of the algorithm rollover, so you have additional stages when adding the new signatures and the new keys and when withdrawing the keys and withdrawing the signatures. This was introduced because some resolvers, old Unbound versions, expected one signature for each algorithm used in the zone apex of a zone, so when you carry out an algorithm rollover you will have two different keys with two different algorithms, and if there aren't two different signatures then some old Unbound resolvers, I am not sure which versions they are, suspected a downgrade attack and failed validation. We checked with RIPE Atlas probes and out of 10,000 only six were using recursive resolvers that failed validation in this case, so you could say that the conservative approach of the algorithm rollover is not necessary any more, but if you want to be really sure you can still do that.
So I don't want to give only a theoretical explanation here, but want to use a practical example of an algorithm rollover where we carried out our measurement methodology, and therefore we used the .se algorithm rollover, which was carried out in December last year, I think. .se has more than 50% of its domains signed, and what is also interesting, roughly 70% of the population in Sweden relies on validating resolvers, so if the rollover had failed the impact would have been quite massive. It's the first algorithm rollover of .se ever, and they changed from SHA‑1 to SHA‑256, and I will have two slides in my presentation later on, two links, to the blogs of IIS, which explain in more detail why they chose certain things.
So, basically, for measuring the whole algorithm rollover we introduced three different measurements. The first was to monitor the publication delay, so at what time is every name server in sync. We measured the propagation delay, so when did the resolvers drop the old keys and records out of cache and were ready to pick up the new ones, and we monitored the trust chains, to make sure that everything stays intact there and resolvers were able to keep on validating.
As an example I will pick stage 3 or 4, depending on how you count, when the new DS was introduced or replaced in the root zone. Before you go to the stage of the new DS you have to make sure every single resolver has the new keys and signatures, so there should be no resolver any more which only has the old keys in its cache. And you can move safely to the next stage as soon as every recursive resolver has dropped the old DS out of its cache or has the new DS in its cache as well. To measure the publication delay we relied on RIPE Atlas; here we queried the authoritative name servers directly, so we don't rely on any caching resolvers but send the queries directly to the authoritative name servers, and in this example we queried the root name servers directly. And what you can see here is the time it took until every single RIPE Atlas probe observed the new records from the different root servers, and you can see here roughly 3530 and every single RIPE Atlas probe got the new DS from the different root server instances. You can see that it depends a bit on which root server you query; some were in sync quite fast, so, for example, take J‑Root, which is on the very left between all the other funny colours, they were in sync right away, basically. And if you look at L‑Root, they took a bit more time until every one of their name servers was in sync again. Note that this apparently doesn't have anything to do with the number of Anycast sites they have; the numbers in the brackets are the numbers of Anycast sites, and having many Anycast sites doesn't mean your servers take a long time to be in sync. L has many and they took a bit longer than J. This is just a snapshot, so this is the publication delay just when the DS of .se was replaced; it doesn't mean L always takes longer to be in sync, so there might be many factors that have an influence on this delay, and this was just the case for .se, for this zone update, and it might be different for the next zone update.
So the publication delay is roughly 10 minutes. For the propagation delay we again used RIPE Atlas probes but queried the recursive resolvers, so we checked what records the recursive resolvers of the RIPE Atlas probes have in their cache. The TTL of the DS in the root is one day, so 24 hours, and as we would have expected, after roughly 24 hours almost 99% of the recursive resolvers had the new DS in their cache. But you can also see that a very small number of resolvers, roughly 1% or less, took even another day to update, to get the new DS. This might be because they run the root locally for some reason; there might be many, many reasons for that. So you could say that the propagation delay was roughly 48 hours, if you really want to make sure that every resolver has picked up the new DS. So for the timing of this stage, when was it safe to move to the next stage of the rollover, when was it safe to remove the DS keys, you could say publication delay 10 minutes, propagation delay 48 hours, and .se took even more time; I am not sure how many hours they waited, but they were even more on the safe side in our opinion. And we verified that by monitoring the trust chain as well, and we used RIPE Atlas and the vantage points of the HTTP proxy network, which allows us to trigger DNS queries through HTTP requests, and that gave us more than 46 thousand vantage points, out of which more were behind validating resolvers. We queried for a test domain which has a bogus record, which allowed us to get the state of the recursive resolvers, so whether they were validating resolvers, non‑validating or bogus resolvers, and if something had gone wrong during the rollover we would have expected an increase in bogus resolvers over time and a decrease in validating resolvers over time. 
And this is something you could have seen in this figure, but again, here on the very left the new DS was replaced at the root zone, and after that we could have expected an increase in bogus resolvers, but we don't see an increase, which basically means that the rollover of .se was successful, because we also didn't see any increase in bogus records through any of the other stages of the .se rollover, which is a great thing, hooray. So the .se rollover was successful; the conservative rollover, which .se didn't carry out, was not necessary. But in general, if you have time for your rollover, just take your time. It doesn't hurt to wait another day and then be sure that resolvers have dropped the old records out of the cache.
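The propagation figures quoted above (almost 99% after about 24 hours, everyone after about 48 hours) are percentiles over per-resolver observations. A minimal sketch of that calculation, assuming each resolver is reduced to the first time it returned the new record; the data layout and function name are hypothetical and the sample values are made up.

```python
import math
from datetime import datetime, timedelta


def propagation_delay(first_seen, start_time, fraction=1.0):
    """Time after start_time until `fraction` of resolvers had the new record.

    first_seen maps a resolver identifier to the time it first returned the
    new record, or None if it never did during the measurement.
    Returns None if the requested fraction was never reached.
    """
    if not first_seen:
        return None
    needed = max(1, math.ceil(fraction * len(first_seen)))
    seen_times = sorted(t for t in first_seen.values() if t is not None)
    if len(seen_times) < needed:
        return None
    return seen_times[needed - 1] - start_time


# Made-up observations for four resolvers.
t0 = datetime(2018, 1, 1)
observations = {
    "resolver-a": t0 + timedelta(hours=1),
    "resolver-b": t0 + timedelta(hours=24),
    "resolver-c": t0 + timedelta(hours=47),
    "resolver-d": None,  # never picked up the new record
}
print(propagation_delay(observations, t0, fraction=0.5))
print(propagation_delay(observations, t0, fraction=1.0))
```

Returning None when the fraction is never reached keeps a stuck resolver visible instead of silently shortening the reported delay.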
So, what we also wanted to achieve with this paper or with this article was that other operators, who are maybe not that familiar with DNSSEC, can also be more confident when deploying DNSSEC and when rolling their keys, and therefore I describe the measurements in a bit more detail, and I am going to publish a small tool with which you can create these queries and measurements on RIPE Atlas using the API, so that other operators, maybe of smaller zones, can also manage their rollover easily.
You might ask, do I need all the 10,000 RIPE Atlas probes for that? Of course not. If you know your audience, for example, if you know that you have a certain kind of audience in a certain country, you can also use the RIPE Atlas API to select certain probes in certain countries, so you don't have to rely on 10,000 probes and use a lot of credits. There will be a detailed paper if it's accepted, and there are the two links which explain the operational decisions that IIS made during the rollover. With that, thanks to the operators of IIS for allowing us to monitor their rollover and for cooperating with us. Thanks to RIPE for giving us a lot of credits and raising a lot of limits in the RIPE Atlas API, and with that, thank you very much.
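Selecting probes by country could look roughly like the request body below. This is only a sketch: it builds a Python dictionary and sends nothing, the field names (`definitions`, `probes`, `use_probe_resolver` and so on) are assumptions about the RIPE Atlas measurement API that should be checked against its documentation, and the zone, country and probe count are placeholders.

```python
import json


def ds_measurement_spec(zone, country, probes=50, interval=600):
    """Build a request body for a recurring DS lookup via probe resolvers.

    Field names are assumed, not verified against the live API.
    """
    return {
        "definitions": [{
            "type": "dns",
            "af": 4,
            "query_class": "IN",
            "query_type": "DS",
            "query_argument": zone,
            "use_probe_resolver": True,  # ask each probe's own recursive resolver
            "interval": interval,        # seconds between queries
            "description": f"DS propagation for {zone}",
        }],
        "probes": [{
            "type": "country",           # restrict the probe selection to one country
            "value": country,
            "requested": probes,
        }],
    }


print(json.dumps(ds_measurement_spec("example.", "NL", probes=25), indent=2))
```

Requesting a few dozen probes in the relevant country keeps the credit cost low while still covering the resolvers the zone's users actually sit behind.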
SPEAKER: Oracle. Thanks for this. I was just looking for some clarifications on publication and propagation delay. It looks like you are only taking into account the DS, or you only looked at the DS records when they reach the caches and the name servers, is that correct?
MORITZ MULLER: In this presentation, yeah, but in general we looked at also the propagation of the DNS keys, the signatures so for every single record which is added or removed in each stage we measured that.
SPEAKER: I would like to maybe point you to RFC 7583, I am not sure if you know it. It talks about key timing in DNSSEC roll‑overs, because I noticed that you use propagation delay for, let's say, the time it takes to reach the caches, whereas there propagation delay is the time it takes to reach the secondary name servers.
MORITZ MULLER: I will clarify that.
SPEAKER: And then final thing, the 48 hours, I guess that is only for .se because that depends on TTL and would be nice to see if the equation in the RFC and in other documents match the 48 hours that you have seen in your monitoring research.
MORITZ MULLER: A colleague of mine looked at this caching behaviour a bit more deeply and he saw more resolvers stretch TTLs in general. So it's not only for ‑‑
SPEAKER: So the equation probably is very conservative and you expect a lower number there.
SPEAKER: Lars from Netnod. Can you go back to the page, I believe it was 29. No, before that. Where you had the propagation delays, that one, yes, please. There is a starting gap there, so how did you determine point zero?
MORITZ MULLER: Point zero ‑‑
SPEAKER: 17:30, how was that determined?
MORITZ MULLER: 17:30 is basically just ‑‑ where you see the first increase is the first time we saw the new DS popping up at one of our RIPE Atlas probes, but we are not sure when exactly the new zone was published.
SPEAKER: So from which point did you start to measure the delay?
MORITZ MULLER: We measured basically the delay from the first time we saw the DS at any of the RIPE Atlas probes.
SPEAKER: Fair enough. Then I misunderstood. All right. That is a good way. When you talked about timers and automation, OpenDNSSEC handles all that for you, doesn't it? If you happen to use ‑‑ there is already software out there that handles these rolling timers.
MORITZ MULLER: I think OpenDNSSEC can do that mostly as well; if you configure, not the propagation delay, but the time it takes until the records are dropped out of the cache, then I think it can also automate that for you.
SPEAKER: And also, how does this work relate to ‑‑ there was an Internet draft, which I think was turned into an RFC, about the rolling timers.
MORITZ MULLER: I am not 100 percent aware of that.
SPEAKER: It was Stephen and my colleague ‑‑ years ago and not entirely sure what the current status of it is. But if you are not aware, if you are not aware of it, I can point you to it because there is lots of thoughts in that document.
MORITZ MULLER: Definitely. I think there is RFC which is called something like timing in DNSSEC roll‑overs or something like this which discusses that in more detail but they are not giving strong recommendations how to do that. So this is why we said okay we issued ‑‑
SPEAKER: Peter, CZ.NIC. I would like to follow on the automation line, because don't ever do an algorithm rollover manually, it's a nightmare, and I am not sure if OpenDNSSEC can do algorithm roll‑overs, hopefully somebody ‑‑ it can, Benno is nodding. And for example, OpenDNSSEC is not the only one: if you have a Knot DNS authoritative server you say this is the new algorithm and that is it, and if you publish your tools for how you monitor propagation of DS records it would be wonderful if they have some nice interface which can be integrated with, because then it's really a matter of changing two lines in a configuration file, and it would render your presentation just an advertisement for the automation. Which would be, I think, the best outcome, because whenever people do something manually they will screw up something. So please publish your tools and thank you very much for this work.
BENNO OVEREINDER: Thank you for the presentation, and indeed, OpenDNSSEC 2.0 and newer can do algorithm rollover, so ‑‑ before I start, I think Ulrich can comment on that in more detail about .se and OpenDNSSEC.
SPEAKER: From IIS, we are the registry for .se. First I want to thank NLnet Labs for helping us with OpenDNSSEC and SIDN for doing all the measurements, and I think the measurements were for us a big part of giving us the confidence to proceed in the whole rollover, because we weren't flying blind, we had measurements, we could point to them and see that the DS was everywhere, and it gave us confidence to introduce changes to the zone, and we were quite confident that it would work. And the parts where we took more time than necessary were more like, yeah, now it's Friday, 5:00, what could possibly go wrong? So let's wait for Monday. So, thanks, and we are quite happy with the results.
ANAND BUDDHDEV: From the RIPE NCC. I just have a small comment about Liman's question to you and your graph here. He asked you about how you determined point zero, the publication time of the zone. For those who are not aware, all the root server operators publish statistics called the RSSAC002 version 3 statistics, and one of the metrics in there is called publication time, and this is published by the A‑root server operator, Verisign, because they publish the root zone, so you can always find the information in there about when a particular serial was published by Verisign. And then you can compare that to the propagation delays in case you are interested in seeing how long it took.
MORITZ MULLER: Yes.
SPEAKER: Just one follow‑up on the thing that Liman said earlier about the propagation delay in OpenDNSSEC: you still have to do the calculations yourself, so you have to think about how long it takes to end up in your zone and in your parent zone, and you configure it and then OpenDNSSEC does the calculations for you, but it's not like OpenDNSSEC can figure out the propagation delay for you.
MORITZ MULLER: Yes.
CHAIR: Thank you very much.
That concludes this morning's session. We have a second session this afternoon right after lunch, it's not in this room, it's over in the main room, thank you again to all of this morning's presenters.
LIVE CAPTIONING BY AOIFE DOWNES RPR WWW.DCR.IE