RIPE 76
Plenary session
Monday, 14th May 2018:
HANS PETTER HOLEN: Good afternoon, welcome to this RIPE meeting in Marseilles. My name is Hans Petter Holen, I am the Chair of RIPE so I am going to open this meeting. It's the biggest RIPE meeting ever, hey, we are in France I was told, 814 registered attendees, and as of 13:30, 286 checked in. So it's great that we have a big room.
263 newcomers, so recruitment through the RIPE community is going really good. The RIPE meeting, and I am saying meeting, not conference, is a meeting open to everyone. We are here to bring people together to get to know each other, to discuss, to interact, to communicate, that is what we are doing in our day job, shifting IP packets around, now we are here we can actually interact directly socially, and we want to be a safe and supportive and respectful community. So we do have a meeting code of conduct which you can read here which basically says that you should be nice to each other. We really want this to be an open and welcoming community to all community members so please be respectful.
In the event that you experience something where you don't feel that you are treated in a respectful way, we have three trusted contacts, you have the pictures up here, Mirjam, Becka and Rob Evans so if you are here in the room you can, please contact these and they can hopefully assist you in any way possible.
This meeting is put together partly from Working Groups. You can see the Working Group Chairs up here, two or three Chairs for each Working Group. They have yellow badges, so you can see who they are, and in some of the Working Groups like the Address Policy Working Group, there is now a selection of a successor for one of the Chairs, so engage and talk to these people, if you are interested in the topics on their agendas.
For the plenary programme, we have a Programme Committee. Benno is chairing the committee, he will come on stage later, and Leslie and Brian are our vice chairs. Then we have host representatives, and we have representatives from the three regional NOGs, MENOG, SEE and ENOG, and another six, plus the three chairs nine, elected by you guys. So it is possible to nominate yourself to be on the PC until Tuesday and the voting system will be open from Tuesday to Thursday, I am sure Benno will say more about that later on.
Microphone etiquette: There are microphones here so you may want to say; please state your name and affiliation. Yes, we know you speak in your private capacity but we still want to know where you come from and who you are. But please engage in discussions, presentation ‑‑ presenters know they should set aside some minutes afterwards for questions and dialogue and especially in the Working Groups there can be substantial discussions as well.
We do have a meeting plan. There are some interesting things on this, there are some BoFs in the evening. There has been accountability task force that you can find in one of the BoFs, and there are some working groups here and if you are interested in the new European legislation that happens on the 25th of May, go to the working groups. I think there is a lot of interesting content this time, I get some really good presentations this time so I can actually justify being here not only as the RIPE Chair but also from a professional point of view. On Tuesday, you will see that there is a women in tech lunch, that could also be interesting. In the evening and some say this is the most important part of the meeting, we have socials. On Monday, today, we can meet the Executive Board and there is a welcome reception after that. Tuesday, there is a networking event and on Thursday there is the RIPE dinner. Unfortunately, this is now sold out, I have heard rumours that there may be a waiting list but one of the things that is going to happen for the first time at this RIPE dinner is handing out the Rob Blokzijl award. So if you are there you can look forward to that. And we will also on ‑‑ announce that on Friday in case you are not at the dinner.
We have been working to analyse and better understand how to increase the RIPE meeting diversity, so we have programmes like the RIPE fellowship and the RACI programme where we reach out to the community and to researchers to bring them actively into this community. We now have the Women in Tech Lunch where women can share experiences on how ‑‑ what it's like to be a woman in this environment and how to bring more women in because seriously, half of you should have been women, right? We are ‑‑ half of the population are women and it really shouldn't be so that the proportions are different in any industry, but, well, that is life. We have a task force working on this as well, and we have on‑site childcare, that is a new thing, I think it's sponsored by ISOC I was told, so this is a new experiment which was proposed by the diversity task force, and we really want to make sure that what we are doing is inclusive and open. And sometimes it helps just to talk about things like this.
Accountability has been a very popular word in the transition, in the environment that we are working in, the Internet governance areas. RIPE put together solid documentation of the accountability framework for the RIPE NCC, but we have now also got a task force to work on the accountability of the RIPE community and on documenting that. So there is going to be a BoF tonight where you can come and hear the status of what they have been doing and how we are going to bring that forward. I think there has been great work done there and I think it's great to see a lot of these things; this is the way we have done it, but it may not have been written down all the time, and now we can be proud of ourselves, and we are actually accountable to this community and can show that to the outside.
There has been some talk about how to replace me, or actually what will happen when my term is over. My predecessor was chair for 25 years so it's still a while until my 25 year term is over. Some people think that that is a bit too long, my wife is one of them, so maybe we need a better process to do that. There have been discussions on the mailing list in the past, and on Friday morning in the session Anna Wilson will summarise what has been said so far and moderate a discussion on this topic, so maybe we will converge towards consensus on how to select me and select my successor for a defined term and with a defined process.
There is a RIPE Networking App, of course we can't interact and talk to each other without technical support so there is an app to help you doing that. Please download it and use it to find interesting people that you haven't talked to and you can book meetings with them and chat with them. So go and test it out.
And then this meeting would not have happened, I would say, without the support from the RIPE NCC, but we know that, but also from the host and sponsors, so you can see all of them up here. And with those words, I will give the stage to first Franck Simon, one of our hosts, so welcome to the stage, Franck.
FRANCK SIMON: Good afternoon everyone. Thank you Hans Petter. So I will try to be short. First of all I wanted to say a few words about the partnership we have with RIPE. For us the relationship with RIPE is a long‑term relationship. As you may know, France‑IX is one of the Internet Exchanges in France, in fact the leading Internet Exchange, but with RIPE we have a special relationship. We are involved in the RIPE Atlas project, we help with the organisation of local trainings in France when RIPE asks for it and, like RIPE, we have some common values. You may know that France‑IX is a neutral Internet exchange. It is a non‑profit association, so we really believe that the Internet does need some neutrality, independence and organisations like RIPE, and we share those values with them at France‑IX.
You have seen in the presentation of Hans Petter that we have a record number of participants for this RIPE meeting and last time the RIPE meeting it was organised in France it was 26 years ago. So I just hope that by the end of the week when you leave, you will get a nice souvenir, nice memory of France and Marseille because I guess next time it is organised in France we will have to wait for a few years.
Also, it is the biggest RIPE event in terms of participants, as I was mentioning, and it is the first event since the last five years after RIPE's event at Amsterdam and Copenhagen, and usually at RIPE meetings there are not that many French participants. We have checked the stats for the last few years and usually there are between 15 and 20 French representatives, so we decided that as a co‑host we needed to do some lobbying on the social networks and so on, and it worked pretty well because as of the last check this morning we have something like 80 French participants, which is definitely a record in terms of numbers and I believe it won't be beaten for the next few years for the French participants. So thanks to the French community for having heard the message we wanted to share about the importance of attending the RIPE meeting.
And also, a few words about Marseille. Marseille is not only a city with sun ‑‑ not today, but that will change by tomorrow, I hope so. But usually it's a sunny city and it's the second biggest city in France, not only in terms of inhabitants but also now in terms of network infrastructure. You may know that there were some new ‑‑ bringing some new connectivity from the Middle East, Asia, Africa, and so on, and so Marseille in terms of connectivity is growing very fast and I believe in the next few years Marseille may become even bigger than Paris. Whatever, Paris is the capital of France. So we believe at France‑IX that Marseille is growing and becoming a strategic city, that is why we have set up, as an Internet Exchange, two POPs for now in Marseilles over the last few years, and I bring you this exclusive: we will announce the opening of a third POP in Marseilles, at Interxion MRS2, and this will be the third in Marseilles and this will show our presence and increase our presence for the next few years. So Marseille is strategic in terms of networks and I believe for all the attendees today Marseille is a nice city, not only for business but also a nice city to visit.
And as a conclusion, I would like to thank a few people. First of all, of course, I would like to thank RIPE for the confidence and the trust in France‑IX to be a co‑host of this event. I would like to thank of course all the participants, I think some people are still missing because they have met some issues in some transportation but I hope that they will be able to join the RIPE meeting as soon as possible.
I would like to thank also the French community for being fully represented here, and finally I would like to thank my team in France‑IX and our partners for helping in the co‑hosting of this event, which is some hard work but we are so happy to do it. Thank you, and I wish you a very nice RIPE meeting.
(Applause)
HANS PETTER HOLEN: So thank you, and then the next word goes to the other co‑host, and his name was suddenly removed. Olivier, please come to the stage.
OLIVIER CAZZULO: Ladies and gentlemen, good evening, good afternoon everybody. Welcome to Marseille. My name is Olivier Cazzulo and I am proud to be a part of the RIPE 76 launch as Medinsoft and AMFT representative. Choosing Marseille for this event is a great symbol because we are a connection land and a connected city. From its beginning, it has been chosen because of its fantastic and strategic location, seaside landscape, based in the south of Europe and EMEA orientated. As you are those who ensure nowadays people, business and organisation connections, you are at the right place. French Tech is a collective brand, rolled out by the French government in 2014. It is a digital showroom, internationally orientated, with coordination of businesses in international trade shows like CES and MWC or TLD. This label aims at accelerating the growth of the French digital economy and its international competitiveness. The aim is to federate, coordinate and boost the whole digital ecosystem, private and public accelerators, high tech clusters, digital and management schools, economic development agencies, around four main goals:
The first one is to foster start‑ups, 250 incubated and accelerated.
The second is to enable them to grow quickly, 30% of annual growth.
The third, to propose support systems to conquer global markets; there are 12 French Tech Hubs all over the world.
The fourth, to encourage businesses to locate in the territory and create local jobs. 25 events were organised by the ecosystem on average per week in 2017.
AMFT has the ambition to become the European place to be. To achieve this goal the territory has major assets: 50,000 jobs in the digital economy within 8,000 companies for 10 billion turnover, and among them two main things: a strategic geographical position with the arrival of the biggest Internet backbones in Europe and the second one, much more important, people's commitment and energy. I hope you will enjoy your meeting and take some time to appreciate our fantastic spot, rich with more than 2,600 years of history, culture and trading, and ready for the future. Enjoy your stay, thank you very much.
(Applause)
BENNO OVEREINDER: So thank you. So short, Hans Petter also introduced the Programme Committee, very briefly, I want to pitch, well, what we do, what we don't do and how we can ‑‑ how we select our presentations and also our next members. Well, how do I continue now? So this is the Programme Committee, Hans Petter already introduced them ‑‑ us, and as you see, we are about, how many people nowadays? 15. And we are all equal, but some people, some of us are more equal than others, to use a famous quote and the people more equal than others, the PC members more, are in the grey box, I will come to that later because we are elected by you and you could be one of these people in the grey area, the new PC members we are looking for. I come to that later.
So Simon is our host representative and Brian is the Working Group representative and the other three, Osama, John and Pavel, are from the different regions. What are we doing? The PC is organising, is making a selection for the plenary, tutorials and the BoFs. We have to thank the presenters for the plenary because they make it interesting; we only do the selection here, but all the speakers showing up the next two days and on Friday make this an interesting programme for the week. We have lightning talks every day, on Monday and Tuesday and Friday, three lightning talks each day. For today, we have three lightning talks already selected. For Tuesday and Friday we haven't selected yet, so send in your interesting ideas, your provocative ideas to discuss, send in some slides, we make a selection and maybe the next day you are up here and can speak to the audience. We will send out some e‑mails and some tweets to solicit presentations. Any mad idea will do but not too crazy, not too crazy but mad of course. A BoF: you can pitch an idea that needs some more attention and some follow‑up in a Working Group or whatever. It's important for us that you rate the plenary talks, the tutorials and the BoFs. It gives us very valuable feedback, and also the individual presenters can learn how they can improve, what did and didn't work. Also important for us as the PC: the PC elections. Two seats are up for election, you can nominate yourself, we also ask people but definitely nominate yourself, until Tuesday afternoon, 3 o'clock, 3:30. Then you can present yourself at 4:00 Tuesday, tomorrow, at the opening of the 4:00 session, and the election will close on Thursday at 5:30, and Friday we will announce it. So please, talk with one of us here if you are interested in joining us, nominate yourself, send an e‑mail to pc [at] ripe [dot] net, and I hope you will send e‑mails and join us for the PC.
That is about it. I want to start with the plenary programme now, and I want to invite Artyom Gavrichenkov, I pronounce his name good, thank you. Invite you to the floor. Would you prefer...
(Applause)
ARTYOM GAVRICHENKOV: So here is a long title: memcached amplification, lessons learned. But what we are actually going to talk about, you are right, is that 1.7 terabit attack at the end of February this year. Let's first define some terms we will use later. So what is an amplification DDOS attack itself? In a nutshell, most servers on the Internet serve data to clients, and they usually send more data to a client than they receive from it. This is more or less okay for TCP‑based servers, as the TCP protocol has a built‑in mechanism, the three‑way handshake, for verifying the remote IP address of a client. However, there is no such mechanism present in the datagram protocol, UDP, so each author of a UDP‑based protocol must design their own handshake, and sometimes they just skip that in the design phase. That is, an attacker is able to impersonate a victim by spoofing the source IP address and to send a request for a large portion of data on behalf of the victim. An attacker will generate one gigabit of requests but the victim will receive like 30 gigabits of responses from the amplifier, which may have serious consequences. There is a long list of protocols vulnerable to amplification. Those are mostly obsolete protocols like Routing Information Protocol version 1, but there are modern protocols as well, mostly gaming protocols such as Quake. As it's mostly obsolete servers, they eventually get updated or replaced or just trashed, so the number of amplifiers shows a downward trend. However, once in a while a new vulnerable protocol is discovered, and the number of amplifiers on the Internet skyrockets for a moment and then starts to decline again. The amplification factor is how much data you can get in return towards a victim from an amplifying server, compared to the size of a spoofed request.
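As an editorial illustration of the definition just given, the amplification factor can be written as a simple ratio. This is a minimal sketch, not something shown in the talk; the function names are hypothetical, and the only figures used are the speaker's own example of one gigabit of requests producing about 30 gigabits of responses.

```python
def amplification_factor(request_size: float, response_size: float) -> float:
    """Data delivered to the victim divided by the spoofed data the attacker sent."""
    if request_size <= 0:
        raise ValueError("request size must be positive")
    return response_size / request_size


def victim_traffic(attacker_traffic: float, factor: float) -> float:
    """Traffic arriving at the victim for a given amount of spoofed request traffic."""
    return attacker_traffic * factor


# The talk's example: 1 Gbit/s of requests, about 30 Gbit/s of responses.
example_factor = amplification_factor(1.0, 30.0)
example_victim_gbps = victim_traffic(1.0, example_factor)
```

The same ratio can be taken over bytes or over bits per second, as long as both sides use the same unit.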
There is also a stable downward trend in terms of available amplification power. The factor is quite different for each protocol; say, NTP is about 10 times more powerful than LDAP. And most of the amplification attacks are easy to fight, as the source UDP port, which is the server's port, is fixed. So at first glance you can use a simple bit mask to track down and filter packets which come from a typical amplification port. Moreover, there is a mechanism you can use to actually order that filtering from your upstream connectivity provider. It's called flow specification rules: using Flow Spec you can notify your upstream provider that you do not want to receive, say, packets from a particular UDP port, which comes in handy when you face a powerful amplification attack. However, there are at least two major problems with Flow Spec for that particular task. The first one is obviously ICMP amplification: if you use Flow Spec to drop those packets, this will have serious consequences for other protocols and network operability, so don't do that. And the other one is when an amplification attack doesn't actually have a fixed UDP port. One common example of this used to be WordPress pingback. During a pingback DDOS an attacker forces a vulnerable WordPress installation to issue HTTP or HTTPS requests towards a victim. We have literally millions of those installations available on the Internet, as WordPress is the most popular CMS in the world, and an attacker is practically capable of generating a few gigabits of just SYN TCP segments using this method, here is, for example, six gigabits, and up to a few dozen gigabits of total attack traffic. As the WordPress installation starts a new request itself each time, the source port is chosen randomly by the amplifying server so it's always different, and Flow Spec is no help to us in this case.
With a TLS‑enabled resource, the only information you have to reliably tell a WordPress server participating in an attack from a legitimate user is encrypted. There is also a different problem: Flow Spec consumes resources, and network operators usually hate giving the capability to export Flow Spec rules to anyone.
So, pingback was the first case of web development causing DDOS problems for ISPs. Well, does anyone think it would be the last case? So here is our memcached server, which is essentially a database. It's frequently used by web developers; it's an in‑memory key‑value store which can be used to put and retrieve values for the keys specified, so it's often used as a fast cache. Hence the name. A long time ago, the memcached authors made a huge mistake. They made Memcache listen on all interfaces, including externally facing ones, by default. There are a lot of machines with port 11211 open, and this is the Memcache listener. Its basic ASCII protocol got exposed to the Internet, and it doesn't do any authentication. It works over UDP, and this weakness has been known for long. Four years ago, numerous methods to exploit the Memcache ASCII protocol were discovered, and the corresponding methods are outlined in his talk at Black Hat.
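The contrast between fixed-port amplifiers and pingback traffic can be sketched as a tiny classifier. This is an editorial illustration only: the `Packet` type, the function name and the port table are assumptions made for the example (the table is illustrative, not exhaustive), and a real deployment would express the match as a router ACL or a Flow Spec rule rather than Python.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative, not exhaustive: UDP source ports of some well-known amplifiers.
AMPLIFIER_SOURCE_PORTS = {
    123: "NTP",
    389: "CLDAP",
    520: "RIPv1",
    1900: "SSDP",
    11211: "memcached",
}


@dataclass(frozen=True)
class Packet:
    protocol: str  # "udp" or "tcp"
    src_port: int
    dst_port: int


def amplifier_name(pkt: Packet) -> Optional[str]:
    """Return the amplifier protocol name if the packet matches a fixed source port."""
    if pkt.protocol != "udp":
        return None
    return AMPLIFIER_SOURCE_PORTS.get(pkt.src_port)


ntp_reflection = Packet("udp", 123, 40000)
memcached_reflection = Packet("udp", 11211, 40000)
# A pingback request is TCP from a random source port, so nothing matches.
pingback_request = Packet("tcp", 51234, 443)
```

A match only says the packet comes from a port typical for amplification; whether dropping or rate limiting it is safe is a separate operational decision.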
In November 2017 security researchers discovered a way for an attacker to use the ASCII protocol to send large portions of data from cache memory towards the victim. Here is how it works. This is all you should do to inject a value into, and retrieve it from, the memory of a remote vulnerable memcached server, here named reflector.example.co. Usually a couple of megabits. For the sake of DDOS amplification this snippet should be run only once. Then, notice the key A, which is set to a value. Then an attacker needs to send packets towards that vulnerable server with the victim's source IP address and a proper payload, which Memcache will treat as a correct request. Here is how the payload is structured. This is again all an attacker needs to do to send a large portion of the data they have put there before. The value is reusable and moreover, what is even more convenient for web developers and sort of disastrous for the Internet, the method accepts multiple arguments, so here is what you can do to send the value five times in a row, multiplying the total attack traffic five times or ten times or 100. So, back to the amplification factor. It is how much data you can get towards a victim. Our previous champion was NTP with an amplification factor, according to NIST, as high as 557. Other protocols have much lower factor values. For Memcache amplification, obviously, in theory the factor would be like millions; luckily, in practice all the packets of the response aren't sent at once. In practice, we see an amplification factor close to 9,000 or 10,000. This is still 20 times our previous champion, NTP amplification. Seeing Memcache incidents in February ranging between some hundreds of gigabits per second, we expected more, up to a figure of 1.5 terabits per second maybe. What actually happened is 1.7 terabits per second ‑‑ boy, were we close.
So how do we mitigate that? Once again, BCP 38: we are all tired of repeating this, but please do not allow spoofed traffic to pass through your network. Make sure you don't have a vulnerable memcached instance in your network. Use firewalls or, again, Flow Spec if you can, to filter port 11211 across the border of your network, as it's highly unlikely there is any legitimate traffic coming to or from that port today. There is also a snippet posted by Job Snijders from NTT, here is a link, on how you can rate limit it, including memcached, in Cisco IOS XR. I won't go deep into that, the slides are available on the RIPE meeting website.
What I really want to talk about now is what are we going to do next time such a vulnerability is discovered? Because we know web developers won't stop here, won't stop producing these, and the gaming industry won't stop either. It seems like it's high time we discussed that across the connectivity providers' community. See, one‑and‑a‑half years ago we already experienced this. Remember all those frightening headlines in the news about the Internet of Things threat and terabit DDOS. Ironically, that happened a couple of months after the tech security Working Group in RIPE concluded. So numerous other working groups were launched to specifically address Internet of Things security. However, it's now 1.7 terabits, which is almost 1.7 times more than the Mirai attacks, and memcached is not IoT. Should we expect a memcached Working Group now? I don't think so. It took three months for Memcache to go from the disclosure to the actual threat; three months is an overly short interval, and we have an even better example now: the Cisco Smart Install vulnerability, which got exploited a couple of weeks after those vulnerabilities were patched and the security advisories were posted. We can make some use of an embargo approach, which is when a vulnerability is first disclosed among a few organisations generally capable of fixing it. However, as ‑‑ show us, the embargo approach often doesn't work as expected for a community large enough.
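To put the speaker's figures side by side, here is a short editorial calculation. It is a sketch under stated assumptions: the constants are the numbers quoted in the talk (557 for NTP, roughly 10,000 observed for memcached, a 1.7 terabit peak), the function names are hypothetical, and the result is a back-of-the-envelope estimate rather than a measurement.

```python
NTP_FACTOR = 557                      # figure quoted in the talk
MEMCACHED_FACTOR_OBSERVED = 10_000    # upper end of the "9,000 or 10,000" range
PEAK_ATTACK_BPS = 1.7e12              # 1.7 terabits per second


def factor_ratio(new_factor: float, old_factor: float) -> float:
    """How many times stronger one amplification vector is than another."""
    return new_factor / old_factor


def required_request_bps(attack_bps: float, factor: float) -> float:
    """Spoofed request bandwidth needed to produce a given attack volume."""
    return attack_bps / factor


memcached_vs_ntp = factor_ratio(MEMCACHED_FACTOR_OBSERVED, NTP_FACTOR)
request_mbps = required_request_bps(PEAK_ATTACK_BPS, MEMCACHED_FACTOR_OBSERVED) / 1e6
```

With these inputs the ratio comes out at about 18, in line with the "20 times" in the talk, and the spoofed request traffic needed for the peak is on the order of 170 megabits per second.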
Maybe that is what we want to focus on. Collaboration, timely reaction across the providers, community, security Working Group again, or even better a security incident response team capable of releasing a disaster recovery plan for us next time. So that is basically what I propose to talk about so the next time we face it we will be ready to face it.
Discussion time. Thank you.
(Applause)
JAN ZORZ: Okay. Are there any questions for Artyom? Or suggestions?
JOB SNIJDERS: NTT. This is not specifically a question but encouragement to the audience. When NTT implemented mitigation methods to dampen the effects of NTP amplification it took us weeks, an extended period of time, before we could mentally accept that we would have to impact traffic flowing through our network for operational purposes. Now, when Memcache came along we were way quicker in deciding this, it took us about 32 hours to deploy this because we have been through the exercise before and come to accept it is simply not sustainable to carry costs in the range of 1 to 2 terabits. So I would encourage this audience to look at the configuration examples and start thinking about implementing configuration snippets on your routers so that when the next Memcache comes along you can respond just like this. And the next time I bet there will be public discussion on RIPE mailing lists where assessments are made and we figure out how big the amplification damage is, but there are ways to prepare for this and one way is to put the framework for rate limiting in place in your routers. Thank you.
ARTYOM GAVRICHENKOV: A good discussion.
JAN ZORZ: We are way ahead of the agenda so if there is any other question.
DANIEL KARRENBERG: RIPE historian. I ‑‑ all what Job said, number one. Number two, I am a bit concerned about mentioning the words CERT and CSIRT and stuff like that in this context. We have a long history in, when was it, RIPE was five years old when we proposed to have RIPE involved in this, foreseeing some of the stuff that is going on now, so let's be careful which words we invoke. On the other hand, going further on Job's ideas, maybe we should think about whether we instigate some communications and vetting mechanism in the operator community saying, hey, there is this memcached thing going on, it's got this magnitude, here are a few code snippets, vet them among a few people who say yeah, okay, those work for me, people we trust, and sort of oil the machinery of getting the information out and getting mitigation methods out. But don't call it a CERT or CSIRT or something like this, this should remain operational. But let's think about how we can oil that machinery beyond just some stuff on the mailing list.
ARTYOM GAVRICHENKOV: I am not for the naming.
DANIEL KARRENBERG: I am sensitive there, that is why I started with historian.
RANDY BUSH: IIJ and Arrcus. CERTs were originally operators. They became bureaucracies for damping information so that only the bad guys got it. And if we form a bureaucracy to serve the function we will end up the same.
DANIEL KARRENBERG: I wasn't going to be so blunt.
RANDY BUSH: I believe we have had this tradition for a couple of years now.
SPEAKER: A question from Sasha from remote participation. He is asking how do they intend to square constant rate limiting with the EU net neutrality directive?
ARTYOM GAVRICHENKOV: That is not about dropping the traffic for that port completely, that is ‑‑ well, I guess limiting doesn't have any concept against that. So at least I have been told so.
DANIEL KARRENBERG: I am not a lawyer, I don't play one on TV, but operational stuff is exempted, so if it basically hurts your network or hurts other people ‑‑ I think, the last time I read it is a long time ago, but it says that.
ARTYOM GAVRICHENKOV: If you are going to drop traffic towards some port completely, that might interfere with net neutrality. If you are going to limit the excess of attack traffic, I don't think so. But I thought you were going to add something.
JOB SNIJDERS: NTT. I would like to ‑‑
JAN ZORZ: Is this a direct response?
JOB SNIJDERS: Yes. Dropping traffic puts you, can put you in jeopardy but if you rate limit it and you rate limit everybody the same, so every port is down to 100 megabits or 1%, then you treat everybody equally. And in the case of NTP, for normal operations you don't need gigabits of NTP traffic usually. So it's something to keep in mind but if you applied to protect your operations and it's applied to everyone I wouldn't see a problem personally. I am not a lawyer, guys don't gang up on me.
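The idea of policing a traffic class to a fixed share of a port, rather than dropping it, is commonly implemented with a token bucket. The sketch below is an editorial illustration and not NTT's configuration: the class and function names, the 1% default (taken from the figure mentioned above) and the 10 millisecond burst allowance are all assumptions, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Single-rate policer: packets pass while tokens (bits) are available."""

    def __init__(self, rate_bps: float, burst_bits: float):
        self.rate_bps = rate_bps
        self.burst_bits = burst_bits
        self.tokens = burst_bits  # start with a full bucket
        self.last = 0.0           # timestamp of the previous packet, in seconds

    def allow(self, now: float, packet_bits: int) -> bool:
        # Refill for the time elapsed since the last packet, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.last = now
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False  # over the limit: a router would drop or remark the packet


def policer_for_port(port_speed_bps: float, fraction: float = 0.01,
                     burst_seconds: float = 0.01) -> TokenBucket:
    """Build a policer limited to a fraction of the port speed (hypothetical helper)."""
    rate = port_speed_bps * fraction
    return TokenBucket(rate, rate * burst_seconds)


# 1% of a 10 Gbit/s port is 100 Mbit/s.
policer = policer_for_port(10e9)
# 100 full-size (1,500 byte) packets arriving at the same instant: only the burst passes.
passed_in_burst = sum(policer.allow(0.0, 12_000) for _ in range(100))
# One second later the bucket has refilled.
passed_later = policer.allow(1.0, 12_000)
```

Because the same limit is applied to every port, legitimate low-volume traffic of that class keeps flowing while an amplification flood is capped at the configured share.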
SPEAKER: Janos Zsako from Hungary. I am a bit surprised that it's not ‑‑ it's a comment, a general comment I would like to make, so I am a bit surprised because we come to, from time to time to new and new possibilities to have amplification DDOS which you very well mention that could be avoided with the BCP 38 which is very, very old recommendation, and it appears that nobody does nothing about it, so ‑‑ or probably it's a wrong statement, I want to correct it, some people do about it but there still are problems with implementing it globally. I think it would be very useful if somebody found some way to solve this problem because that then we shouldn't talk about new and new possibilities of exploiting it.
ARTYOM GAVRICHENKOV: Let me invite you all to the presentation of Alexander tomorrow morning, when Alex is going to discuss some ways to ‑‑ some other ways to deal with the spoof traffic. Some clever ideas.
RANDY BUSH: IIJ and Arrcus again. I would suggest that ‑‑ let me play my traditional role with Daniel and say what he said quietly more directly: It might be best if Sasha and Job read the document. It is not a problem.
PETER KOCH: DENIC. It's not only important to read the respective directive on net neutrality, which also does not constrain operational issues where there are ‑‑ and Job was right but his reasoning was slightly off. I suggest another read, which is the BEREC report. BEREC is the European organisation, the EU collective of the national regulators, and some of them have dug very deeply into port blocking and so on and so forth, and there is a distinction between proactive blocking for, like, business reasons and incident response, and of course incident response is not affected, but this BEREC report is very much worth a read. On the other issue, if I may, I think what we might look into here is something like an operational security Working Group, BoF, something. I don't think anybody suggested so far that the RIPE NCC or the RIPE community engage in making up another CERT. There is a plethora of CERTs, umpteen of them popping up every now and then, and all this cyber crowd. What we do have is a very much fragmented security community, and this community has the operational insight that might help there, and some of them are very much willing to interact and interoperate, and maybe RIPE can be helpful in addressing operational security a bit more. Thank you.
JAN ZORZ: Daniel is off, he gave up. So if there are no other questions, then I would like to, Artyom, to thank you for your presentation.
(Applause)
So my name is Jan and I will take care of the rest of the session. Welcome to Marseille, and I hope we will have a good and productive meeting. And our next speaker needs no special introduction, Randy Bush from IIJ/Arrcus, and he decided to share with us an approach to routing in a massive data centre Clos.
RANDY BUSH: Thank you. Excuse me, we are in Marseille, I had to change my e‑mail address; for those of you who don't get it, my sympathies. This works. I can connect two racks with Ethernet; big enough, it's going to work. This might work. Pushing it a little bit, might work. This is not going to work. Anybody have the belief that it would? Good. So what do you do? This is what you do. It's called a Clos network ‑‑ this is a very simple example of one form of Clos. The things to note here are top of rack switches, the spine, which here is a two level hierarchy, and the external connectivity, which is generally known as egress instead of ingress, and this is because these things tend to produce a lot of data. I think that is just incidental.
Clos is not an acronym: good old Charles Clos, in 1953, working in AT&T research on non‑blocking switching networks; this was generally meant for telephone networks and crossbar switches etc. But just don't let anybody tell you that Clos should be capitalised. Just an example, IIJ is building a second medium scale data centre, this acronym is usually massive scale data centre, in Shiroi, capacity of 6,000 racks. For 6,000 ‑‑ how do you route in something of this scale? OSPF is good to about 5,000 nodes. IS‑IS is good ‑‑ 500 nodes, sorry. IS‑IS is good to about 1,000. And this is, we know, on our backbones. Why are they limited? They are limited because they are very chatty protocols, because they are repeatedly flooding everything they know, every 30 seconds or whatever their timers are. This is your network on IS‑IS or OSPF. Those are kodama from ‑‑ one of the really wonderful movies. And this is true, this is what IS‑IS and OSPF do to your network, and it works at reasonable scale, okay? It's not going to work with 6,000 nodes, okay? So, what do you do? BGP, on the other hand, signals only changes, not retelling everything it knows, every hit. So BGP has become quite common in large scale data centres already. It's great because the updates are infrequent, okay, but equal cost multipath can be very wide, 32, 64, they have even seen 128, okay? So BGP equal cost multipath: am I going to write the decision process for this, am I going to write BGP policy for big ECMP? Heck, no. I can't write BGP policy that I trust for ‑‑ let's not go there.
So we consulted the professor, one of the greatest computer scientists to ever live, Edsger Dijkstra, known for one of his algorithms, shortest path, okay? And you have shortest path in both ‑‑ shortest path is the decision process in both IS‑IS and OSPF, the SPF in OSPF stands for shortest path first. So let's take the quieter protocol and breed it with the shortest path decision process. The path calculation of IS‑IS and the update rate of BGP, okay? SPF? I thought BGP was vector, not link state. Well, we can screw up anything. We are going to replace best path with shortest path, and it's a new SAFI, address family; the format of the reachability is exactly the same, the same as link state, the same address family is carried, except BGP runs Dijkstra instead of best path. It supports multi protocol, it has to because it's a new SAFI, so I have to say I am doing this new SAFI; it handles the BGP link state, and I will get into that a little later; and it has the standard peering models, eBGP, iBGP, route reflection, and all the others we are used to. What it really is, is it takes the core decision process out of BGP 4 and replaces it with SPF. It's really fairly simple. So, it has the hop and path attributes, they come for free, okay? It's got the decision process gone, replaced by shortest path. Tie breakers, what tie breakers? It's already decided. Okay. You just have to make sure that your database is correct, so you do that by augmenting with sequence numbers. So, the current definition has a very simplified SPF with point‑to‑point links in a single area, because it scales because of BGP, so it can be a single area with 10,000 nodes, okay? You could support computation of LFAs, segment routing and SIDs, etc, and all the fancy stuff, okay? The link state address family is dual stack, so you get v4 and v6, it's all there.
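As a rough illustration of the shortest path decision process described above, here is a minimal Python sketch of Dijkstra's algorithm that keeps every equal-cost first hop, which is the property that makes wide ECMP fall out of the computation. The function name, the tiny two-spine topology and the unit link costs are assumptions made for the example; none of it comes from the slides or from any BGP SPF implementation.

```python
import heapq
from collections import defaultdict


def spf_with_ecmp(links, source):
    """Run Dijkstra from `source`; return (distance, equal-cost first hops) per node.

    `links` is an iterable of (node_a, node_b, cost) with positive costs,
    treated as bidirectional point-to-point links.
    """
    graph = defaultdict(list)
    for a, b, cost in links:
        graph[a].append((b, cost))
        graph[b].append((a, cost))

    dist = {source: 0}
    first_hops = {source: set()}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter path was already found
        for neighbour, cost in graph[node]:
            nd = d + cost
            # Leaving the source, the first hop is the neighbour itself;
            # further out, the first hops are inherited from the parent.
            hops = {neighbour} if node == source else first_hops[node]
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                first_hops[neighbour] = set(hops)
                heapq.heappush(heap, (nd, neighbour))
            elif nd == dist[neighbour]:
                first_hops[neighbour] |= hops  # another equal-cost path
    return dist, first_hops


# Hypothetical topology: two top-of-rack switches, two spines, all links cost 1.
links = [
    ("tor1", "spine1", 1), ("tor1", "spine2", 1),
    ("tor2", "spine1", 1), ("tor2", "spine2", 1),
]
dist, first_hops = spf_with_ecmp(links, "tor1")
print(dist["tor2"], sorted(first_hops["tor2"]))
```

In this sketch there is no tie-breaking step at all: paths of equal cost are simply merged into the set of usable first hops.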
The peering model is normal BGP sessions, optionally with route reflection or controller hierarchy. And that little bullet‑point is important because I am going to waste half my time with it, which is: link discovery and liveness detection are outside of BGP. One thing to note about route reflection is, it doesn't have to have the restrictions that we have with classic BGP to deal with the fact that BGP is the world's best information hiding protocol, okay? So the route reflection can be relaxed and be a more complex and less strict topology. The controller can learn the expected topology through some other means. I strongly suggest ‑‑ oh, wrong button ‑‑ I strongly suggest you read ‑‑ pushing buttons is not my career. I strongly suggest you read this classic Google paper, a couple of years old now, Jupiter Rising, it's really good, really good. I even have a sign for my laptop, "do not push buttons randomly".
So, in this mess, every rack is an AS, because it's BGP and because they are announcing something, and this is the way; it took me a while to get over it, you can get over it, too. So how does BGP SPF learn link state? This is the game and what I have added to this cake. It needs neighbour discovery, it needs liveness and it needs addressability. LLDP is cute, it's a little complex because it's IEEE, but it's dead because it has IPR on everything over 1,500 bytes, and I intend to go over 1,500 bytes because of security stuff, etc. I want something simple, clean and brutal, so CLNP, multicast and all that stuff. So, let's imagine a brand new ether type, and we are going to use type length value stuff just like we traditionally do with everything, and I am just going to discuss Ethernet payloads, not framing. Now, due to the fact that I stupidly left my power brick at home, I had five days in Arles, which I suggest you go to, without my laptop. So I didn't get to make the second half of this presentation reasonable and simple, so there is a lot of packet porn here for the IETF that I am going to blow through. You can always stop me if there is something you are really interested in.
So, the game is this: There is BGP SPF talking between the devices, it gets link state and the AFI/SAFIs from Ethernet PDUs being exchanged in pure Ethernet, MAC link state over raw Ethernet pushed up the stack to BGP. So BGP learns the topology from this, boom, runs BGP SPF. It's pretty simple, which is good. The east west protocol is the exchange of Ethernet PDUs. All the PDUs are sequenced, to make sure you haven't dropped something; you can make them be multi frame if an Ethernet PDU isn't big enough for you; we have got to have flags, we are nationalistic. There is a checksum, and that is a lie, I have increased it to 32 bits, but you will live. There is a length because it's TLV, except this is LTV, somehow I never got it right. I get silly about byte alignment. And every frame is the same and there is a checksum, because never trust anything below your level in the stack. So here is a checksum algorithm. This is the slide on this section of the protocol that is worth suffering through all the packet porn. There are three stages. We start to find each other's MAC, with Ethernet, pure Ethernet discovery, no funny stuff on top, okay? We then have some negotiation that is optional, which is we negotiate the timers for ‑‑ then we have mandatory negotiation to say, hey, I support these AFI/SAFIs, I support v4 addresses, v6, I support MPLS even, I support MPLS. Some people here may not believe that. So, there is a negotiation, and there is the simple thing, the two of us exchanging, and I am not exchanging anything beyond my interface, I am just saying here we are on two sides of an Ethernet, my IPv4 address is this, what is yours? Boom. If we support IPv6, the same; the same for labels. And that is the whole story for the exchange. We can go into disgusting detail.
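To make the PDU description above a little more concrete, the following Python sketch packs and parses a frame payload carrying a length, a type, flags, a sequence number and a 32-bit checksum. The field order, the field widths and the use of CRC-32 are all assumptions chosen for the illustration: the actual layout and checksum algorithm are on the slides and are not reproduced in this transcript.

```python
import struct
import zlib

# Hypothetical header, NOT the wire format from the talk:
# length (2 bytes), type (1), flags (1), sequence (4), checksum (4), network byte order.
HEADER = struct.Struct("!HBBII")


def _checksum(length, pdu_type, flags, seq, payload):
    """CRC-32 over the PDU with the checksum field zeroed (a stand-in algorithm)."""
    return zlib.crc32(HEADER.pack(length, pdu_type, flags, seq, 0) + payload)


def build_pdu(pdu_type, flags, seq, payload):
    """Return the bytes of one PDU: header followed by payload."""
    length = HEADER.size + len(payload)
    checksum = _checksum(length, pdu_type, flags, seq, payload)
    return HEADER.pack(length, pdu_type, flags, seq, checksum) + payload


def parse_pdu(frame):
    """Validate length and checksum, then return (type, flags, seq, payload)."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    length, pdu_type, flags, seq, checksum = HEADER.unpack_from(frame)
    if length != len(frame):
        raise ValueError("length field does not match frame size")
    payload = frame[HEADER.size:]
    if _checksum(length, pdu_type, flags, seq, payload) != checksum:
        raise ValueError("checksum mismatch")
    return pdu_type, flags, seq, payload


frame = build_pdu(pdu_type=1, flags=0, seq=42, payload=b"hello")
print(parse_pdu(frame))
```

Carrying the checksum inside the PDU, rather than relying on the Ethernet frame check alone, follows the point made in the talk about not trusting the layers below.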
As I said, the identifier is the ASN, so at the initial exchange I know my ASN, so I send to you my ASN, and you send ‑‑ but I don't know your ASN yet, so the field is zero, I will get it back, fill it in, blah‑blah‑blah. Okay. Once we know each other's MACs, we can start Ethernet keep alives, at a fairly high rate, okay? So you have real liveness at the Ethernet layer. Okay. Then we can negotiate, and the negotiation is which capability, okay? I request something, you agree to it, you deny it, etc. The things we might negotiate are, one is the timer negotiation, which is the frequency of the keep alive timers, can we miss some of them? How many? And for the AFI/SAFI exchange, what the time out will be. Okay. So now we know the MAC and ether link state and each other's ASNs, what AFI/SAFIs are we willing to talk? That is, again, a capability exchange, okay? It's capability 4, it has again request, agree, deny, and here are four initial capabilities, and we can imagine others.
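The timer negotiation and keep alive checking described above could be sketched roughly as follows in Python. The request/agree/deny vocabulary comes from the talk; the acceptance rule, the class and function names, and the millisecond values are assumptions made for the example.

```python
from dataclasses import dataclass


def answer_timer_request(requested_interval_ms, fastest_supported_ms):
    """Responder side of a timer capability request.

    Assumed policy: agree if we can send keep alives at least as often
    as requested, otherwise deny.
    """
    return "agree" if requested_interval_ms >= fastest_supported_ms else "deny"


@dataclass
class KeepaliveMonitor:
    """Tracks link liveness from the negotiated interval and allowed misses."""
    interval_ms: int
    allowed_misses: int
    last_heard_ms: int = 0

    def heard(self, now_ms):
        """Record that a keep alive arrived at time now_ms."""
        self.last_heard_ms = now_ms

    def is_alive(self, now_ms):
        """The link is alive until more than `allowed_misses` intervals were missed."""
        deadline = self.last_heard_ms + self.interval_ms * (self.allowed_misses + 1)
        return now_ms <= deadline


print(answer_timer_request(100, fastest_supported_ms=50))
monitor = KeepaliveMonitor(interval_ms=100, allowed_misses=3)
monitor.heard(now_ms=1000)
print(monitor.is_alive(1400), monitor.is_alive(1401))
```

Passing the clock in as an argument keeps the sketch deterministic; a real implementation would read a monotonic timer instead.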
So, we now know what layer 3 addressing each of us supports, we have some hot fast Ethernet timers going, and we are ready to announce and withdraw what your and my interface addresses are, and we are also saying, if you have multiple addresses, you can say which one is primary. So, there are sequence numbers and ACKs here because it's an unreliable transport. Just exactly what you expect, it's boring. I skipped one. There is IPv4, announce or withdraw, and there is IPv6 announce or withdraw; as Gert said, 96 more bits, no magic. There is MPLS IPv4, there is MPLS IPv6. So, MPLS can have multiple labels, one label should be associated with the AFI/SAFI multiple IP addresses. We now have layer 3 addressability between the two devices. Layer 3 liveness should be tested with keep alives just as layer 2 was. We have the Ethernet keep alives but you don't know the addressing really works, so use BFD, ping, whatever your mother taught you, to keep layer 3 liveness known. Okay?
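The announce and withdraw exchange with sequence numbers and ACKs described above might look something like this Python sketch. It assumes a simple in-order scheme in which the receiver applies only the next expected sequence number and the sender retransmits whatever is still unacknowledged; the class names and that policy are illustrative assumptions, not the protocol as specified.

```python
class AddressSender:
    """Numbers each announce/withdraw and remembers it until it is ACKed."""

    def __init__(self):
        self.next_seq = 1
        self.unacked = {}

    def send(self, action, address):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (action, address)
        return seq, action, address

    def ack(self, seq):
        self.unacked.pop(seq, None)

    def pending(self):
        """PDUs to retransmit, oldest first."""
        return [(seq, *msg) for seq, msg in sorted(self.unacked.items())]


class AddressReceiver:
    """Applies announce/withdraw PDUs strictly in sequence order."""

    def __init__(self):
        self.addresses = set()
        self.last_seq = 0

    def receive(self, seq, action, address):
        """Return the sequence number to ACK, or None if the PDU must be resent later."""
        if seq == self.last_seq + 1:
            if action == "announce":
                self.addresses.add(address)
            elif action == "withdraw":
                self.addresses.discard(address)
            self.last_seq = seq
            return seq
        if seq <= self.last_seq:
            return seq  # duplicate of something already applied: ACK again
        return None  # gap: an earlier PDU was lost, wait for retransmission


sender, receiver = AddressSender(), AddressReceiver()
first = sender.send("announce", "192.0.2.1")
second = sender.send("announce", "2001:db8::1")
receiver.receive(*second)  # first PDU was "lost", so this one is not applied yet
for pdu in sender.pending():  # retransmit everything still unacknowledged
    acked = receiver.receive(*pdu)
    if acked is not None:
        sender.ack(acked)
print(sorted(receiver.addresses), sender.pending())
```

The example addresses are taken from the documentation ranges 192.0.2.0/24 and 2001:db8::/32.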
We now know everything and we know it's alive, push it up the stack. There is an RFC, 7752, which is an extension to BGP, which is said to distribute link state, right? It was meant for this, which is: you have link state somehow discovered, mainly with IGPs, you use 7752 to push it up the stack to the control element. Okay. I am running short. Instead, we push it up to BGP‑SPF. So what we are doing here is, we now know the link state, we use 7752 to push it up and Bob's your uncle, another protocol with more packet porn, it's all the same type of junk. Addressing and routing are done upstairs in BGP‑SPF and Bob's your uncle! 7752, I'm really thinking about throwing that away, not the entire 7752 but the use of it here, because despite the name it's not just link state, it encodes every feature and attribute of OSPF, IS‑IS, LS T, RSVP, everything they could think of; it's been IETFed to death. It's only missing SMTP, HTTPS, mobility and multicast. So think of dumping it and just going hard‑core to BGP SPF from the link state discovery, and there is no IPR in this.
(Applause)
20 seconds to go. We are cruising.
JAN ZORZ: Thank you, we have 14 minutes to go.
RANDY BUSH: I am reading this.
JAN ZORZ: You have Japanese time.
RANDY BUSH: I have whatever RIPE tells me.
JAN ZORZ: We have 15 minutes for questions.
RANDY BUSH: Or answers
JAN ZORZ: I am so glad that every chair does not have its own microphone, believe me.
MARTIN WINTER: I am curious, you mentioned how bad IS‑IS is for that; what is your opinion on OpenFabric, which basically takes IS‑IS and improves it to do exactly this?
RANDY BUSH: It's still chatty, right.
MARTIN WINTER: Not OpenFabric ‑‑ it's reduced a lot in the standard, if you read the draft for OpenFabric.
RANDY BUSH: I haven't. I have to, clearly.
AUDIENCE SPEAKER: Read the document.
MARTIN WINTER: OpenFabric, it's a draft in the implementation and we are implementing that in FRRouting, that part, so yeah, it's basically IS‑IS taken and fixed to basically do it. It's not IS‑IS any more, it's called OpenFabric and should solve that issue, and we are looking at scalability in the ‑‑
RANDY BUSH: What is the propagation?
MARTIN WINTER: Your application what you show there.
RANDY BUSH: Yes, yes. Good. Something to do.
JAN ZORZ: Any other questions? Come on, people. Wake up.
SPEAKER: I start. ‑‑ do you have this actually running right now?
RANDY BUSH: There is an experimental implementation in a back pocket, yes.
SPEAKER: And that is on open source hardware or some vendorish ‑‑
RANDY BUSH: I can't answer the question.
SPEAKER: Okay.
DANIEL KARRENBERG: That was an answer. Can you go back to, turn on the slides again, the one that has the east, west, north, south, I have a ‑‑
RANDY BUSH: I can't.
DANIEL KARRENBERG: The very colourful one.
RANDY BUSH: There are a couple of them. There is one coming up in a moment.
DANIEL KARRENBERG: That one. That is good enough. I get the red part and the SPF and I really like sort of going back to that stuff. I get the blue part. I just don't get what link check means and what ‑‑
RANDY BUSH: Link check means liveness, so that is the Layer 2 liveness and the fast ether checks, just at MAC ether level but once you have discovered the AFI/SAFIs and have each other's addresses, in IETF you should do layer 3 liveness with, for instance, BFD.
DANIEL KARRENBERG: So that is basically just liveness and local topology?
RANDY BUSH: Yes. All I know at this point is, her address and her address and my address and my address, and I pushed that up. And everything else is SPF.
DANIEL KARRENBERG: In the beginning you were talking about the SPF being pre loaded with some constraints or some expected topology and stuff like that.
RANDY BUSH: If you will notice, I don't know if I can get back there far enough, that the top and bottom of BGP are there, right? In the picture, I stole the centre, threw it out and replaced it with SPF, and you can apply policy ‑‑ three more, you actually know ‑‑
DANIEL KARRENBERG: Which will guess.
RANDY BUSH: I have to learn to click more slowly here. Coffee here is not bad. There we go. So you will notice. So you could start weighting the SPF decision ‑‑ SPF handles weights, etc. My example of that beautiful three Clos is oversimplified because the network is going to have some huffy stuff that it does, but notice of course, since you have got SPF, you can handle cross links, you can handle arbitrary kludges on the topology.
DANIEL KARRENBERG: But you have to distribute that ‑‑
RANDY BUSH: With BGP.
DANIEL KARRENBERG: I get it. Thank you.
SPEAKER: Alison Wood, State of Oregon and ARIN Advisory Council. You kind of talked a bit about keep alives in this scenario. Can you talk about the overhead or the congestion you would have on a large or medium‑sized data centre network?
RANDY BUSH: They are just on the link, the scale of the network makes no difference.
SPEAKER: Dimitry: You have two topologies, your v4 and v6 topology, right?
RANDY BUSH: You might and you might have an MPLS 4 topology and MPLS 6 topology and GRE topology.
SPEAKER: So the chance will be basically always Ethernet ‑‑
RANDY BUSH: The chance for?
SPEAKER: Would be Ethernet for all of your discovery, all of your states, because I am confused about the link state on layer 2 and link state on layer 3: is one dependent on the other, in your case?
RANDY BUSH: The layer 3 discovery of the link state at layer 2 is provided at layer 2. Layer 3 adds the layer 3 addresses to it but doesn't change the link state. Stop. The liveness at layer 3, like BFD, should, in capital letters, be checking all the address families on the link, so if you have a link that is supporting 6 and 4, you should have 6 BFD and 4 BFD.
SPEAKER: Okay. Now that is clear. I thought of that but I wanted you to say this explicitly. Thanks.
RANDY BUSH: No extra charge.
JAN ZORZ: We have six minutes to go. Whoever wants to go to the mic, go now or ‑‑ okay. Please, two people in the back and here in front and then Victor.
SPEAKER: BGP SPF, is there any implementation or is this just on your slides right now? A patch set for GoBGP?
RANDY BUSH: There has been experimental implementation.
SPEAKER: This is still ‑‑ you can't say anything beyond it?
RANDY BUSH: I don't know that I can say anything beyond it. I literally don't know whether it's okay or not.
SPEAKER: Maria Matejka, CZ.NIC, I am thinking about connecting this with the outside world. What is the relationship of BGP SPF to the whole world BGP? Should I consider it's something like OSPF or should I exchange routes between BGP SPF and the other BGPs? It's another AFI/SAFI.
RANDY BUSH: For the moment, I can cheat and say no, but that's not really very polite. The right slide is somewhere, right about here. This is where ‑‑ this is where BGP SPF ends and here is where you are announcing to the global Internet. Now, there is a question, since the SPF is merely one capability in the BGP MP capability exchange, whether I could have both capabilities on the same network. I don't think I want to go there right now; that is being thought about and discussed and I haven't heard anything sufficiently brilliant to make me feel quite comfortable. There is a gentleman standing here at this mic who probably has a better opinion of it.
SPEAKER: ‑‑ What happens when you have like let's say 10 routers on the same Ethernet network?
RANDY BUSH: Just multiple exchanges, they are there, each one of them will do the hello and as if I have ten links from the same interface
SPEAKER: You will have N square peer to peer links basically?
RANDY BUSH: No, oh they are all on a mesh and you have N minus one factorial, don't you? N minus one squared. Yes. But if you had wired them up you would still have N minus one squared. The fact that it's a multi support CDMA exchange ‑‑
SPEAKER: Because you could imagine something like route servers, like some nodes that will take special role and all other routers will only connect to these nodes, it will have ‑‑ tens of ‑‑
RANDY BUSH: I can imagine it, I am not sure I see why I would do it.
VICTOR: Oracle Cloud. So I am highly biased on this topic, as one of the co‑chairs for LSVR in the IETF. I have a comment, question, and a response. So you meant ‑‑ and N times minus one squared, right?
RANDY BUSH: Yes, yes.
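For reference on the shared-Ethernet question above: when N routers sit on one segment and each discovers every other one, the number of distinct router pairs is N(N-1)/2, or N(N-1) if each direction is counted separately. The short Python sketch below simply enumerates the pairs for the ten-router case raised in the question; the router names are placeholders.

```python
from itertools import combinations


def full_mesh_adjacencies(routers):
    """Return every unordered pair of routers on a shared segment."""
    return list(combinations(routers, 2))


routers = [f"r{i}" for i in range(1, 11)]  # ten hypothetical routers
pairs = full_mesh_adjacencies(routers)
n = len(routers)
print(len(pairs), n * (n - 1) // 2)
```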
VICTOR: So, comment: I think one of the advantages ‑‑ there was a comment earlier about there being other opportunities, there is IS‑IS and other protocol work going on. One of the advantages that we had heard coming from the community was that BGP is already used in a lot of BGP fabrics today, in a lot of other areas there is a strong crossover, so folks can ‑‑ although it would be different instances of BGP, you might use it in different nodes, there is a lot of crossover as to how one would use it; a lot of automation is put into the networks today, and to build automation in a more common way, there is another strong advantage. I am not personally saying I really think it's great or not great, but those are some of the comments we heard coming in from the various contributors. A question is kind of like, Randy: we're in early stages in the development of this right now in the IETF, so there was a comment, Randy, earlier you said I might have an opinion on whether we put ‑‑ shall we stack a node with multiple, with BGP MP, and do we run this both externally and internally on common nodes. I have a personal opinion, and I also have, hat on, chair: I think the group is still out on that. Personal opinion, as an operator, I like to keep things separate. But I think with the operator community we want to get that right. I think that is the opportunity, why I like this exposed here, so if there are operators who are interested in this work, to kind of bring that.
RANDY BUSH: Please tell us why you want to combine the two on the same network?
VICTOR: My personal opinion or hat on opinion?
RANDY BUSH: Not you, the operators to whom you are talking. Why the hell would you want to do it?
SPEAKER: Why you want to do it and why it makes sense. I think there is a case to be made for both ways but we like this to represent the actual community's desires, if there are strong desires in this area.
JAN ZORZ: Thanks, Vic. Thank you, Randy.
RANDY BUSH: Thank you.
(Applause)
JAN ZORZ: All right. We are one minute late. We at the Programme Committee get very valuable information if you people rate the talks, so please rate the talks throughout the week, because this is the feedback to the Programme Committee on whether we select the talks properly. There are also, again, I must emphasise, two seats for the elections to the Programme Committee; whoever wants to volunteer, or volunteer somebody else, please do. Now we have a coffee break for half an hour, then it's the next plenary session, and in the evening you are all welcome to join us at the BCOP task force meeting where we will discuss some interesting stuff. Thank you very much.
(Applause)
(Coffee break)