Plenary session

Tuesday, 15 May 2018

At 11 a.m.

CHAIR: We have our next plenary coming up, please take your seats. Hello everybody, I hope you enjoyed your coffee break. I am with the Programme Committee, I am very happy to introduce Geoff Huston from APNIC who is about to give our presentation on TCP and BBR.

GEOFF HUSTON: Good morning all. I work with APNIC. Or at least I claim I work with APNIC, I am not sure what APNIC think. He is not here, he is somewhere else in the world. So, really, the reason we build computer networks is not to do routing, that is incidental. The only reason we really build these networks is to shift data. And oddly enough, it's shifting data that is the real heart of what the Internet is about. The previous way we thought about networking, and the telephone network was a classic example, was that the handsets were dumb and the network controlled the traffic. Now, it may have only been 64 kilobits per second voice channels, but it was all network‑centric. Computer networks changed all that over: the protocol stacks at either end control how their traffic moves through an essentially passive network. Now, we have been doing that for about 30 years and we think we are pretty good at it, but what I want to talk about now is this new generation of flow control protocols that offer us some promise, over the next few years, of how to get even faster and perhaps even cheaper.

It's about speed. It's always about speed, isn't it? How do you make sessions go really, really fast? Well, the fibre optic guys have been delivering. Megabits, tens of megabits, hundreds, gigabits, tens, hundreds, we are looking at terabit now, and we are looking at multi‑terabit very, very soon. So transmission speeds are being solved at the photon level. End‑to‑end latency: if you really want stuff to move faster, get closer. Continental drift isn't helping anybody, the world is still the world, I can't do much about latency, that is just physics. No, I really can't. And then there is the last thing, which I really can do something about: protocol efficiency. We haven't been tracking it. In the 1980s, kilobits per second was a good answer. The work in the 1990s raised that up, we could saturate a T1 at 1.5 megabits per second, and saturate an E3 at around 34 megabits per second and then a T3 at 45. We fought our way up to gigabits per second. That was ten years ago. Where are we now? We are actually not much further down the track: the maximum speed we can get out of TCP is still around a gigabit per second. So why have we managed to do optical transmission really efficiently, yet the protocol elements of this transmission are actually holding us back?

So, let's look a little bit more at TCP and the way TCP works, because that really is the heart of the Internet. TCP has no maximum speed and no minimum speed. TCP is adaptive: like your car, it will go as fast, if you will, as the car in front. So if there is nothing in front it will keep on trying to go faster. If you are in traffic you will find you are going extraordinarily slowly: same car, different environment. Now, this adaptive rate control mechanism tries to do two things. The first thing is it never, ever tries to leave bandwidth on the table. So if there is available network capacity it tries to be efficient and use it.
At the same time, it really does try to avoid losing or dropping packets in the network, because you have got to spend time repairing this, and it's also quite bad when packet reordering happens. So you are trying to be efficient, not lose it, not reorder it, but not leave bandwidth on the table. Go as fast as you reasonably can. But at the same time, you should be fair, you shouldn't crowd out all the other TCP sessions: if there are ten other sessions, you get one eleventh of the bandwidth, in theory. So if you think about it, this is a flow control process in an analogue computing sense. You are injecting flow and other folk are, and they pressure each other, and you get some kind of notion of fair sharing. That is the idea.

So how does TCP actually work? TCP is one of these instances of what we call an ACK pacing protocol: I send data to the other end, and for every data segment or data packet that is received, the other end sends me back an acknowledgement. Interestingly, the pace of these acknowledgements arriving back tells me the pace at which data is leaving the network. So if I send precisely the same amount of data as each ACK acknowledges, I am running at a constant speed, because for every packet that leaves the network, I put another packet in, based on this ACK flow. So a steady state in TCP is actually keeping that ACK flow and my data rate constant.

So, the problem with this kind of protocol is that it's always old news. What you are getting back in the ACKs are things that happened half an RTT ago. It's not right now, it's not what is happening at the other end this instant, it's what happened at the other end a little while ago. So if you think about that, because you have got this delay, you have got to be a little bit conservative about reaction. If my ACK flow is getting impaired, bad news, I should react quickly, because whatever bad thing is happening has probably got worse. So when you get bad news, packet loss, you really should react quickly. But if there is no data loss, if things are going fine, that does not mean things are going fine, they were going fine just a little while ago, so you should react and increase your rate, but conservatively. This leads to something that we all should be familiar with, the classic flow control algorithm called Reno. It's one of these additive increase algorithms: when there is no packet loss, for every RTT put one more segment on the wire, so you slowly ramp up the speed. If you get a packet loss, panic. Panic. Immediately drop your sending rate by half. And of course there is slow start, which is anything but slow: every RTT you double your sending rate, because you can. What does this look like? The classic sawtooth. If you have got a certain capacity on the wire, you double up very quickly and overshoot and lose packets and react, and you keep on reacting until you get down below the link capacity, then slowly ramp up, and immediately down. In this kind of simulation, if we also add the queue, what is going on in Reno in an ideal sense is: you build the queue to full, it drops packets, you react, you react to below the link capacity, the queue drains, and it starts up again, so the queue is constantly going from full to empty. Sounds terrific, right? It's shit. Anything that works really, really quickly can't use Reno.
If you have got a 10 gigabit per second rate over a reasonable line from here to the other side of the Atlantic Ocean, your packet loss rate has to be impossibly low. Totally impossible. If you have got a 1% loss rate, which is a lot except if you are running in mobile networks, you can't go faster than 3 megabits per second over Reno. If you want speed, it's not your answer. There have been lots of PhDs and papers and all kinds of things on how to change the parameters of how quickly you grow up and how slowly or quickly you come down again, and there are a few examples of this. But what if we don't use linear increase? This is where a lot of the world was up until a short time ago. CUBIC is one of these ones that, instead of doing an additive increase, uses the power of a cubic function to actually oscillate just at that point of viability. So with CUBIC what happens is, you sense where you think the link capacity might be and edge in towards it using slower and slower increments. Eventually it will overflow and you will get a packet loss, it backs off by a little less than half and edges up again. So in some senses, CUBIC is doing brilliantly. It's just moving towards that point of packet loss, it gets there quickly and sustains it. But why does the queue never drain? Why has the queue constantly got packets in it? This is delay. This is added latency. This is frustration. Because while a ping might tell you that things are close, CUBIC adds this dimension that the queue never slows down, it never fully drains out. So CUBIC is good, but...
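The Reno rules described above (double per RTT in slow start, add one segment per RTT otherwise, halve on loss) can be sketched in a few lines of Python. This is an illustrative toy only: the window is counted in segments, loss is assumed to occur whenever the window exceeds link capacity plus queue size, and the function names are invented for this sketch. The second function uses the widely quoted Mathis approximation for Reno throughput, rate ≈ (MSS / RTT) × 1.22 / √p, which is not named in the talk; the 1460-byte segment and 50 ms RTT in the usage line are assumptions chosen to show how a figure of roughly 3 megabits per second at 1% loss can arise.

```python
import math


def reno_step(cwnd, ssthresh, loss):
    """Advance a Reno-style congestion window (in segments) by one RTT."""
    if loss:
        # Multiplicative decrease: halve the window on packet loss.
        ssthresh = max(cwnd / 2.0, 1.0)
        return ssthresh, ssthresh
    if cwnd < ssthresh:
        # Slow start: double the window every RTT.
        return cwnd * 2.0, ssthresh
    # Congestion avoidance: one more segment per RTT.
    return cwnd + 1.0, ssthresh


def simulate_reno(capacity, queue_size, rounds):
    """Return a list of (cwnd, loss) pairs, one per RTT (toy loss model)."""
    cwnd, ssthresh = 1.0, float("inf")
    history = []
    for _ in range(rounds):
        # Assumed loss model: the queue overflows once the window
        # exceeds what the link plus the queue can hold.
        loss = cwnd > capacity + queue_size
        history.append((cwnd, loss))
        cwnd, ssthresh = reno_step(cwnd, ssthresh, loss)
    return history


def mathis_rate_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on Reno throughput in bits per second."""
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss_rate))


history = simulate_reno(capacity=100, queue_size=50, rounds=60)
rate = mathis_rate_bps(mss_bytes=1460, rtt_s=0.05, loss_rate=0.01)
print(f"{rate / 1e6:.2f} Mbit/s")
```

Plotting the first element of each pair in `history` gives the sawtooth: a fast overshoot during slow start, then a slow linear climb and a sudden halving, repeated.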

So maybe there are other solutions than CUBIC, because while it can react quickly to available capacity, it sits in this phase of queue formation, it still exacerbates buffer bloat, and that is not a good idea.
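The shape of CUBIC's window growth can be illustrated with the window function published in RFC 8312, W(t) = C·(t − K)³ + W_max with K = ∛(W_max·(1 − β)/C). The constants C = 0.4 and β = 0.7 are the RFC's defaults, not figures from the talk, and the function names are made up for this sketch. It shows the two properties described above: the window backs off by less than half after a loss, and the increments get smaller and smaller as the window approaches the previous loss point.

```python
C = 0.4      # RFC 8312 default scaling constant (assumption, not from the talk)
BETA = 0.7   # RFC 8312 default: window drops to 0.7 * W_max after a loss


def cubic_k(w_max):
    """Time (seconds) for the window to climb back to w_max after a loss."""
    return (w_max * (1.0 - BETA) / C) ** (1.0 / 3.0)


def cubic_window(t, w_max):
    """Congestion window (segments) t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100.0
for t in range(0, 6):
    print(t, round(cubic_window(t, w_max), 2))
```

Just after the loss the window grows quickly, then flattens as it nears `w_max`, which is why the queue at the bottleneck tends to stay occupied rather than draining.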

Maybe we are thinking about it wrong, maybe the last 30 years have been aiming at the wrong target. Because if you look at a network's buffer states, there are kind of three states going on. In the first, I am not going as fast as the link capacity, and in that state there is no queue. Every packet that gets passed through the network immediately gets to the next hop, because you are never going as fast as the link. Once you get to link capacity and you put more packets in, the router's queue will start to fill, so you are in the next state, which is overutilised, where the flow rate is just slightly more than the link capacity and you start to induce delay. And then of course you get to the point that all of these algorithms are aimed at, saturated, where the queue is completely full and the next packet you put in will get dropped. All loss‑based control systems, as the name suggests, try to operate just at the point where you get saturation. And they try to react to let the queues drain.

Wrong target. Wrong point. If you really want speed, you want to do something else. So look at your round trip time against the volume of data you are sending. When the queue is not being used, you are under‑utilised and the round trip time is always the same: it's the latency of all of those links, so as you push packets through, the ACKs will come back after exactly the same time. Once you get to the point where the queues are being formed, the more data you push in, the longer the round trip time. Then you get to the point where all the queues are full, the critical queue is full, and if you push more data in you get loss. All of the loss‑based algorithms work there. Now look at it from the point of view of the delivery rate and the ACK rate. The ACK rate is only ever going to be as fast as the bandwidth; no matter how much more data you put in, the ACK speed will be the same. Loss‑based algorithms optimise over here. What you really want to do is optimise at the point of queue formation.
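The three states can be written down as a small model of a single bottleneck. This is a sketch under simplifying assumptions (one flow, one queue, data counted in packets), and the function and parameter names are invented for illustration. The point of queue formation is where the data in flight equals the bandwidth-delay product, the bottleneck bandwidth multiplied by the base round trip time.

```python
def path_state(inflight, bottleneck_bw, base_rtt, queue_capacity):
    """Return (state, observed_rtt, delivery_rate) for a toy single-queue path.

    inflight        packets in flight
    bottleneck_bw   packets per second the bottleneck link can carry
    base_rtt        round trip time in seconds with empty queues
    queue_capacity  packets the bottleneck queue can hold
    """
    bdp = bottleneck_bw * base_rtt  # bandwidth-delay product, in packets
    if inflight <= bdp:
        # No queue: RTT is flat, delivery rate grows with data in flight.
        return "under-utilised", base_rtt, inflight / base_rtt
    queued = inflight - bdp
    if queued <= queue_capacity:
        # Queue forming: delivery rate is pinned at the bottleneck, RTT grows.
        return "over-utilised", base_rtt + queued / bottleneck_bw, bottleneck_bw
    # Queue full: extra packets are dropped, RTT cannot grow any further.
    return "saturated", base_rtt + queue_capacity / bottleneck_bw, bottleneck_bw


for inflight in (50, 100, 140, 200):
    print(inflight, path_state(inflight, bottleneck_bw=400, base_rtt=0.25,
                               queue_capacity=80))
```

In this model loss-based algorithms end up operating at the boundary with "saturated", while the target described in the talk is the boundary between "under-utilised" and "over-utilised".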

So, how do you detect the formation of queues? That subtle change in delay. Because as you push more data in, the queue starts to form and the delay gets higher, and this is the philosophy behind BBR. It's an entirely different algorithm. What it does is it runs blind six eighths of the time: it has an estimate of how much data it should push in, and for six round trip times it pushes it in. For the seventh round trip it increases that rate for a short burst of 25%, so it deliberately pulses more data into the network, and for the next round trip time it drops by 25% to let the queue drain. At the time when I am pulsing, if the RTT doesn't change, that means there is more capacity in the network. Stay there. If the RTT increases, then I have got an onset of queuing. How does this work? Here is again the same kind of abstract model: most of the time you are sitting in a quiet state, every sixth you pulse up by 25% and the next down, and you stay there. What does the queue look like? Well, in an idealised queue it's largely empty, except when you pulse the queue will start to form and immediately drain.

What is the cheapest switching chip you can buy? One without memory. Because really fast memory in a switching chip costs a fortune, and if you are trying to run at terabits per second, you need the most exotic, fastest memory you can buy, and I am not even sure you can buy it. Because while processor speeds are slowly rising, memory speed is a constant. This idea of relieving the pressure on memory in a router actually makes a huge amount of sense when we try to make these networks go faster and faster and faster, because what BBR tries to do is actually not use the memory buffers at all inside the network. It pulses in and then immediately backs off, just to see if it touched memory. If it touched memory, stay the same; if there is available space, go.
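The probing cycle as described in the talk can be sketched as follows. This follows the talk's simplified account (six steady round trips, one at +25%, one at −25%, keep the higher rate only if the RTT did not move) and is not the Linux implementation, which estimates bandwidth from measured delivery rates. The path model `toy_rtt` and all names are assumptions made for this sketch.

```python
# Eight-phase gain cycle as described in the talk:
# six steady round trips, one probe up, one drain.
GAIN_CYCLE = [1.0] * 6 + [1.25, 0.75]


def toy_rtt(send_rate, capacity, base_rtt):
    """Toy path: RTT is flat below capacity and grows once a queue forms."""
    if send_rate <= capacity:
        return base_rtt
    return base_rtt * send_rate / capacity


def run_cycles(bw_estimate, capacity, base_rtt, cycles):
    """Run whole gain cycles and return the final sending-rate estimate."""
    for _ in range(cycles):
        for gain in GAIN_CYCLE:
            send_rate = bw_estimate * gain
            rtt = toy_rtt(send_rate, capacity, base_rtt)
            if gain > 1.0 and rtt <= base_rtt:
                # The pulse did not add delay: there is spare capacity,
                # so stay at the higher rate.
                bw_estimate = send_rate
            # Otherwise the pulse touched the queue: keep the old estimate.
    return bw_estimate


print(run_cycles(bw_estimate=50.0, capacity=100.0, base_rtt=0.1, cycles=10))
```

Starting at half the capacity, the estimate climbs by 25% per cycle until a pulse would exceed the link, then holds just below it, with the queue touched only during the probe round.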

Now, this is like driving a car and for six minutes out of every eight shutting your eyes: do not look, nothing. Go blind. So it does run blind for a lot of the time, and it will not pull back when other things happen to it, except when it pulses the network, so it's a completely different style of networking. It doesn't react quickly to other people. It will react eventually, but not quickly. And that makes the whole issue of politeness and fairness totally different.

That is the theory. Let's try the practice. The beauty of BBR is that Google released it into the Linux distribution and it has been around for over a year now, so I was running iperf3 on Linode. I am running it most of the time across the Internet, so I can't repeat any experiment, that is the Internet, and I am kind of looking at how it works. First test: Canberra to Melbourne in Australia, a 10 gig circuit, I was the only person on it, two machines. I start with CUBIC and it immediately goes up to 7.5 gigs, and you can see a slight CUBIC shape there, and then it is more obvious: it's moving around 8.5 gigs on a 10 gig wire. This is classic CUBIC. I start up BBR, the green. Wow. BBR is like unleashing a bulldozer down the local freeway, it rips it up and no one else gets a look‑in. The collision with CUBIC is brutal because, as BBR, which runs blind, started to place pressure on CUBIC by pulsing into the queue, CUBIC saw loss and backed off. BBR is running blind, it doesn't care about loss, it repairs at full speed and doesn't react to loss. So every single time CUBIC tried to restart, it got absolutely nowhere, you know, the little pulses just went ‑‑ I am impressed.

Now, inside APNIC we run this advertising system and we move a lot of data around the world, so we thought, okay, let's do this from Australia to Germany, because you can, and try the commodity Internet and not warn anyone, because you can. And all of a sudden I'm getting 400 megabits per second at a sustained rate, Australia to Germany, across the entire commodity Internet. I am deeply impressed. Here is a little test: I start up BBR and it starts to work between 200 and 400 megs, pulsing upwards, and I start up a simultaneous CUBIC and it goes, oh shit. That is not going to start.
And then, just to give it a bit of competition, I start up a second BBR, just to see if they will play with each other, and they kind of equilibrate a bit and sort themselves out. For a couple of seconds they leave some room on the table that CUBIC moves into, and then they go nonlinear: BBR starts racing against itself, kills off CUBIC, and they have a fine old time with one dominating and then the other, incredibly unstable. The second session dies and the first goes up to extraordinary amounts of bandwidth again. This is weird behaviour, but if I really want to move data, phenomenally good. Why does it do that pulsing? Because of when it tries to recover: those are the loss events in red. When I get loss, BBR does go nonlinear in repairing it; it seems the repairs stack up on top of the normal sending rate, so when it repairs loss it really stresses out the network. When there is no loss it's quite stable at 600 megabits per second for seconds. This is so cool.

So I thought, okay, let's stress this out completely: 50 BBR streams, because it's just a computer. And interestingly, they do kind of equilibrate around each of them getting 10 megabits per second out of the aggregate flow. So in some ways, if you stand back a bit, they do share, they do kind of share. The result is chaotic, God knows where CUBIC would sit in all of that, but BBR itself is extremely efficient. In English there is this word interesting, which is kind of, I am not going to say it's good or disastrous, but really interesting. And BBR is incredibly interesting. This move away from loss‑based control has actually given you an extraordinary new opportunity to go incredibly fast. And if you really want to operate at a terabit, BBR might be the only way we get there. Because the current systems, which stress out on memory and loss, have a real problem in going incredibly fast.
Interestingly, equal‑cost multipath, which is what you do when you get to terabit, BBR loves. It doesn't mind packet reordering because it's running blind, it repairs at full speed. So it looks like this is amazing: it operates with small memory footprints and on cheap switching chips. It's incredible. It competes brilliantly against the rest of you using CUBIC. I win. None of you run it, you are not allowed. It's incredibly efficient, really fast, well worth using, except I am now the bully, I am the bulldozer on the road, I am the problem. And the real classic question in the Internet: what if everyone does it? Everyone has a problem. Because in this sense, it acts so differently, it slaughters everyone else, that if you want a look‑in on a BBR‑based network you had better run BBR. Why shouldn't you use it? It really doesn't play well with everyone else, and it's kind of forcing the ante up, with everyone running this. Is it a failure? Is it just too greedy?

A lot of the networks that you run, particularly in the mobile world, use rate‑based policing: you play with the ACKs, you play with people's traffic to do your own traffic policing. BBR says, I just don't care. And if I am running it inside QUIC, so it's all encrypted, I just don't care and you can't see me do it anyway. That is the new reality for a lot of these networks, and particularly in the 5G world what we are going to see is encrypted BBR, and it will run the way it wants to, not the way the network operator wants it to run. This could well be a real moment for the network operators, and they might well be burying their heads in their hands and saying, oh my God, you can't do this to us, it's so bad. I can't do selective packet drop or weighted RED, all of those instruments don't work, it ploughs on relentlessly.

So, Google are aware of this, and they are making some changes to be a little bit nicer to the other kiddies, BBR 2.0 is coming out which is trying to modify some of the aspects of this algorithm but ultimately this is, as we say in English, chalk and cheese, they are completely different ways of looking at the problem, and BBR will never play that fairly with everyone else, it's aiming at a different control point. And maybe they are right. Maybe if kilobits isn't enough, if megabits isn't enough, if gigabits isn't enough and you really want to go fast, you have to do these kinds of rate controlled algorithms rather than loss‑based if you really want to move into the next generation of networking. And perhaps that is a new network architecture in transmission, perhaps this is indeed the way to go.

So, what we don't know about networking is so much more than what we think we know. There is much to understand. We are moving very, very large data sets, and we will continue to do so. BBR is a really interesting step in that direction of throwing everything up in the air and coming down at a different point. And if you are at all in the research community, if you are in the high‑speed networking community, this is an amazing step. What is even more amazing is that the source is out there, it's open code, you too can play with it on Linux, on borrowed machines across the Internet, and play havoc with everyone else's CUBIC sessions. Thank you.

CHAIR: Thank you, Geoff. If you would like to ask questions, please come to the mic.

ROBERT KISTELEKI: Incentives question, maybe game theory a bit: if I ever turn on BBR, I will win?

GEOFF HUSTON: I have turned it on and I think I am winning.

ROBERT KISTELEKI: If anyone ever does it, anyone else who does not will lose?

GEOFF HUSTON: If you are on the same paths you are going to lose unless you have it turned on too.

ROBERT KISTELEKI: Exactly. What will stop anyone from turning it on? What gives anyone the incentive to move on to BBR 2 if BBR 1 is better because it gives me more bandwidth?

GEOFF HUSTON: You know, I did put out a patch, it's just out there now and I haven't installed it because, you know. Incentives are a strange thing, and we are all at heart quite selfish about our part of the network, and when you see a protocol like this, it's just greedy, it's kind of, whoa, social conscience versus selfishness. So yes, it's an interesting dynamic, totally.

ROBERT KISTELEKI: I don't see why BBR 2 would be successful in this market, because BBR is better for me and, you know, at the cost of everyone else; therefore, in game theory, that is just a win for me. That is assuming that we will keep on using TCP and not just go for QUIC for everything.

GEOFF HUSTON: BBR lies behind QUIC as well, it's a package. You can put this in the user space anywhere you want. QUIC is packaging, not flow control, so, you know, yes, you will see a lot more of this totally, in the coming years.

ROBERT KISTELEKI: If anyone ever comes up with a BBR, or whatever it's going to be called, that is even greedier, the whole world will most likely jump on it.

GEOFF HUSTON: Popular protocols. What is our problem?

SPEAKER: Thanks a lot for this talk. I have two questions. First of all, I worked in exactly this field before, and all the tests we do when we validate TCP are large transfers and things like that, but in practice what happens is you have a dichotomy: you have zero point whatever percent of big flows that need this kind of protocol and you have the 99 point whatever percent that are short. Did you ever evaluate BBR on short flows and its interaction in this use case?

GEOFF HUSTON: No, I have not evaluated short flows, I used iperf3 and enormous flows, as fast as I could. The start‑up algorithm is very similar to slow start, and if most of your data will fit inside a couple of rounds of slow start, it will behave exactly the same as a conventional slow start. If you can get rid of your data before the first packet loss event, it's exactly the same behaviour. I am not sure the mice have a problem, it's really the elephant versus the rhinoceros, and the elephant is winning.

SPEAKER: Second comment. Because I was in the same field also, I really think that there is a problem, because we are thinking of the design of congestion control protocols as implemented on the end host, and we are thinking about what will happen on the network devices. If you are from an ISP you can put in whatever switch you want, you will never change how the end host will work, and if you work ‑‑ I work for Facebook, if we change how the Facebook servers do TCP congestion control, the ISPs, they can just see that. How do you ‑‑ how do you match these two views that are completely ‑‑

GEOFF HUSTON: This is the new world of networking, where the applications folk have grown tired of the network folk. They are sick and tired of the network folk looking at their packets, stretching their ACKs, mucking around with flow, and this is really part of a statement that includes QUIC, that says: I'm going to move the transport protocol into user space, I am going to cloak it with encryption, I am going to protect this from everyone except my server at the other end, and I am going to go as fast as I possibly can or want to, and I don't care any more, you have tested my patience and it's broken. We are not working together, don't even dream that networking and applications folk get on well together. This is warfare mark one, quite frankly, and it is indeed a savage statement, going: not playing with you any more. Not giving my ACKs visible space, not letting you muck around with my flow, it's my flow, hands off. Like it or not, this is where it is.

SPEAKER: Thank you very much.

CHAIR: Please remember to state your name and affiliation. Make it short and concise.

SPEAKER: My name is Iljitsch van Beijnum from Logius. So I think the philosophical questions around this are very interesting, but we are stretched for time, so let's discuss this another time. I think we do need something like this, because just being nice and not causing any packet loss means leaving bandwidth on the table. But what I am wondering about, apart from that, is ECN: I didn't hear those three letters, so how can you add ECN and make this better?

GEOFF HUSTON: I did make a point on ECMP, equal‑cost multipath ‑‑

SPEAKER: Explicit congestion notification.

GEOFF HUSTON: Irrelevant. Really irrelevant. Because in some ways you are not trying to stress the queue, you are trying to go way, way back and get at the point where there is a slight onset of queuing, not the point at which you are about to drop a packet. You are trying to bring the control point all the way back to just on the onset of queuing, you are aiming at a different target. ECN is I am going to be nice, before I drop the packet I will let you know you are stressing me out which for CUBIC might work really well. For this thing it's irrelevant, you are aiming at a different control point.

SPEAKER: You said you want to avoid building up queues, and if you tune your ECN so it starts marking early, then it will help you do that, right?

GEOFF HUSTON: If you tune it way, way down, yes. Most folk don't do that and I am not sure they ever will. We can talk about it later.

SPEAKER: Felippe Duke, NETASSIST. I have actually tested the algorithm you are talking about, in Linux, and it was fine, in a fibre network it was fine, and then I switched to a radio‑based network, and my friend had told me it would help, so why did it just slow down so much? And I should say one thing: yes, BBR and cooperation in radio networks are disastrous, they are really destroying any idea of a radio network, there is no way to work a multipoint radio network with such a strategy. I can't see how we can fix that. Any idea of cooperation, queues, are fine and good, but Google don't like it, really. They really don't like it ‑‑ they really hate it, from the papers I read before. What is your point about queues or no queues?

GEOFF HUSTON: As I said, there is so much we don't know about networking, and what BBR has introduced is a different kind of control algorithm competing with loss‑based ones. And in some ways, the way it's currently set up, it wins every fight most of the time. And maybe that is unfair, and either everyone has to do this and the networks get scrambled, or there is much more for us to learn and understand. What I do know is that while the optics are getting faster, the speeds of memory are not. Memory is the same speed it was ten years ago. If you really want to make things go faster and faster and faster, you cannot rely on ultra‑fast queuing systems inside your routers that are affordable, because we can't build them any more. And so in some ways the protocol is being forced into a different control point because the silicon is failing, and that really does require not only a new conversation about control but a conversation about network architecture, because if you want multiple terabits you have got to adjust your thinking. Thank you.


CHAIR: Our next speaker is Louis Poinsignon.

LOUIS POINSIGNON: I build tools ‑‑ what is Cloudflare? Just a quick reminder: it's a CDN, we are also a DNS resolver since 1st April, and we receive terabits of traffic. We have 140 points of presence and we are present at 170 IXPs. So we need to monitor our network, because we are a CDN: we want to know when there is an anomaly before the users notice it. We want to reduce the transit costs, to serve the same bits for cheaper, and to optimise the network, because we want to know which route is the best, which countries are the most efficient, where we can build new data centres and new connections. So this talk is going to be in two parts: the first part is on flows and the second on BGP, how we collect data.

On flows, a quick reminder about network samples: a flow sample is destination IP, interface, size of the payload, time stamp, port, VLANs. This has high cardinality, so it needs a lot of storage, and you can aggregate to reduce the information. It has a high frequency, because depending on your sampling rate you may be sending more samples per second, and if you want to build services on it you need reliability. So these are the three key factors that we needed at Cloudflare. Our existing pipeline was very monolithic. It was good enough for one‑off lookups: you just start up an nfdump, you query it and you get the information you want. But we want to automate, to do rerouting, to have periodic statistics: every day we want to know the top traffic, the top ISPs, the top IPs and the top interfaces. And we didn't entirely control how we stored the data, and we wanted to compress it and just keep what is relevant after a while. After some time you just don't need each unique flow, you just need how much traffic you did on the day.

So, the limitations of the existing pipeline. We were using nfdump and nfacctd. One was storing raw data, and aggregation was done live, so it was on a single machine: you could not shard, you could not query it on a cluster, so it was a bit limited when you wanted to do APIs and a lot of queries. And nfacctd was really nice, you could configure all the aggregations and you have plenty of outputs possible, but every time you need to configure it you need to restart, and when we wanted to add sFlow we had to set up a whole new pipeline because nfacctd ‑‑ we realised after a while that we had outgrown our collection tools, and we had performance issues that we didn't know about. Why did we build something new? We wanted to use our internal Cloud: containers, load‑balanced IPs, storage and clusters like S3 for instance, and also message brokers, which are a critical part of a data pipeline. So we are going to use those.
From this point on we did not want any single point of failure, because we don't want one single machine to run everything. We can tolerate losing part of the data, but not everything. So this would increase reliability, accessibility, maintenance, everything. Besides the analytics, for traffic engineering we wanted live data, just pushing the information that at this point in time we are doing this amount of traffic. And you increase reliability by just doing tasks in parallel: just split everything, you put one computation in one rack and a second computation in another rack, all of these things. And also monitoring of the monitoring system: knowing how many samples we were processing, how many samples we were losing, how many were corrupted, how much time it takes to decode everything. And the last part: all the teams wanted to access the data through a common format, not nfdump files or just something stored somewhere on a single machine, across all of our network, so we needed to build a common database and use a common format and common tools.

So, introducing GoFlow. GoFlow is just a decoder for NetFlow v9, IPFIX and sFlow. It takes a sample and abstracts all it can from it: it takes the interface and the IP and converts them to an abstracted format, which is a format that can be extended. The collector just converts, it just does one thing, which means we had to build the rest. The pipeline was the following. You had the decoded packets in protobuf, and you had something to add more information to that: we wanted to say, okay, we have the IP address, but we want to add the country, we want to add the AS number, we want to add the Cloudflare plan, whether it's a free customer, business customer or enterprise customer, to see how much traffic we do on each. That is done in this step: it's a process that takes the protobuf, gets the IP addresses, does a mapping and rebalances.
It's a one‑to‑one mapping. Then, in the next step, you extract from the extended feed and send it to a database, the equivalent of nfdump, and another part goes to an aggregation pipeline. The aggregation is also a cluster, just summing everything based on a configuration we did. So the aggregation is very simple, MapReduce on the cluster: it keys the packets and it just sums, and you get this kind of triangle shape, and in the end you get a reduced amount of information and you just send it again into Kafka to be inserted and stored.

You may want to think, should I use that? It's only worthwhile if you want flexibility in your pipeline, if you want to develop things, if you want to have very custom information, because you have to develop it, add the mapping, add your custom data. If you want to add country information, you have GeoIP databases, and you have to do the mapping and create ‑‑ but you may use Amazon or Google or HBase or any other type of database. In our case we use Clickhouse, and a special team maintains it so everybody in the company can use the data that we inserted. And then the aggregators: we also use Flink, which is a big data framework, to do the summing on the key, and it's going to scale on the Flink cluster and just send it back to Kafka. You can also do it in many languages, you can pre‑compute. That allowed us to have a live feed of traffic.

Just a small comparison. From the two pipelines we had, we now have one single pipeline, unified and abstracted. And performance: we reached 20,000 flow packets a second, which means even more flows, because a flow packet is going to contain multiple samples, and we got to more than 20,000 a second. It's about 30 microseconds for decoding a sample, depending on whether it's NetFlow or sFlow, which means you can scale ‑‑ you can monitor everything ‑‑ you know how much time it takes, so if you are getting congested, if you take more time to decode, you know there is an issue.
And it's very modular, because you can create your own producer and send to another queue instead of Kafka, and you can add NetFlow fields: you just say, instead of collecting interfaces, we want to collect this other field because the router has it. A small reminder: the routers send the samples to the decoder, the decoder sends them to Kafka, Kafka sends them to the processor, and then one part is inserted into ClickHouse and one goes to Flink. The results: it's just as simple, and it gives you all the results you need. We very quickly built a website just showing all the IPv6 traffic over a day, with the percentage per ASN. It's just SQL queries. We also want to do anomaly detection: we have live traffic and historical data, so we can compute outliers, we can detect things. We can have the best maintenance time per data centre, per time zone, market shares, transit reports.
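
A minimal sketch of the enrich-then-sum steps described above, in Python rather than the Go and Flink components used in the talk. The field names (`src_ip`, `bytes`, `sampling_rate`, `asn`), the helper names and the sample data are assumptions made for illustration; they are not GoFlow's actual schema.

```python
from collections import defaultdict


def enrich(sample, lookup_asn):
    """Return a copy of the sample with an 'asn' field added (one-to-one mapping)."""
    enriched = dict(sample)
    enriched["asn"] = lookup_asn(sample["src_ip"])
    return enriched


def aggregate(samples, keys):
    """Sum estimated bytes per key tuple, scaling each sample by its sampling rate."""
    totals = defaultdict(int)
    for sample in samples:
        key = tuple(sample[k] for k in keys)
        totals[key] += sample["bytes"] * sample["sampling_rate"]
    return dict(totals)


# Hypothetical data: documentation IP addresses and private-use AS numbers.
ASN_BY_IP = {"192.0.2.1": 64500, "192.0.2.2": 64500, "198.51.100.7": 64501}
SAMPLES = [
    {"src_ip": "192.0.2.1", "bytes": 100, "sampling_rate": 10},
    {"src_ip": "192.0.2.2", "bytes": 50, "sampling_rate": 10},
    {"src_ip": "198.51.100.7", "bytes": 200, "sampling_rate": 1},
]

enriched_samples = [enrich(s, ASN_BY_IP.get) for s in SAMPLES]
traffic_per_asn = aggregate(enriched_samples, keys=("asn",))
```

Changing `keys` changes the shape of the reduced output, which is roughly the role the aggregation configuration plays in the pipeline described here.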

For BGP we did the same. We want prefix to ASN, an API, but we have 140 data centres and we have almost full tables in all of them, which means millions and millions of routes, more than 60 million. And the main issue is that we also want to scale everything: we want to have many sessions, sessions living on the cloud, like a collector living on a cloud, so we need load-balanced IPs and dynamic configuration, not the thing that most collectors do. Even the automatable clients would not be sufficient, so we developed a BGP server on a library that we just open sourced. It just accepts connections and forwards: it doesn't even maintain a full table, it's very lightweight. It receives an update and forwards it. By moving it into Kafka we remove this problem of back pressure, especially when we generate table dumps. And we have other things that maintain a looking glass and maintain the full table live.

When you have 140 data centre routers you have 140 sessions, and if a machine crashes behind the load balancers we only lose one part of the view; it's not like everything is crashing. You don't have to reset 140 sessions, you just reset a small part, so it's easier to recompute, and it puts less pressure on the routers and on the whole cluster. Because it's automatically rebalanced, you don't need static configuration; it's just more flexible. The back-pressure problem I was talking about: when everything is done on a single box, you have updates that are coming in on a periodic basis, but at some point you generate a full table dump. While you are generating it, which takes maybe a few minutes, updates keep coming. What are you going to do with them in the meantime? You would have to create a whole queue. Kafka is going to take care of that, so you just don't have to worry about it.

So in our case the pipeline looks like this: you have the decoders, which receive the updates; you have archivers which generate 5-minute update dumps and archivers which generate 8-hour full table dumps; and some of it goes to ClickHouse if you want to do SQL queries, and to a live looking glass as well.

So we have live APIs that you can query and that give JSON results, and we store everything on an S3-type cluster, so anybody in the company can download the update files and read them with common tools.

We also developed a storage system, a data structure for storing routes, IPv4 and IPv6. We reached 300 megabytes for a full table, which means around 40 gigabytes in RAM over a dozen machines, so you don't have to store everything on one machine: you can have 100 machines and each is just going to hold 400 megabytes. Which means you can distribute lookups; it's around a millisecond to do a lookup over all of our network. So we did some tryouts, and some random facts: we realised people were sending us IX LAN prefixes, prefixes smaller than /48 in IPv6 and smaller than /24 in IPv4. And what was the longest path we ever saw? It's like 37 AS numbers. And the small graph is just the compute load when processing all the files at the same time.

Yes, how we use it in the previous pipeline: we take the archives, we convert them to CSV and use them in ClickHouse, which provides dictionaries, so you can do the mapping when you query. We also create MMDB files, so you can have IP lookups not only for countries but also for ASNs, and we generate these tables that we use in our previous pipeline, the flow pipeline. The BGP library is available on GitHub as well. It's just boilerplate: it creates and maintains BGP connections, it decodes and encodes BGP messages and table dumps, and it maintains a RIB, so you just implement the behaviour. If you want to, I don't know, filter routes, you just implement a custom behaviour, like we do, just binding to almost everything. You can use that. Some more to come: more code examples, so you can get started, create the whole pipeline at once, visualise it immediately and play with it. And one last tool is just a converter, to create IP-to-something tables, also open sourced at this URL.
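
To make the IP-to-ASN lookup concrete, here is a small Python sketch of a longest-prefix match over IPv4 and IPv6 routes. It is not the data structure from the talk, which was not described in detail; the `PrefixTable` class, its method names and the prefixes are assumptions for illustration, using only the standard `ipaddress` module.

```python
import ipaddress


class PrefixTable:
    """Toy longest-prefix-match table: one dict per (IP version, prefix length)."""

    def __init__(self):
        self._by_length = {}

    def add(self, prefix, asn):
        net = ipaddress.ip_network(prefix)
        bucket = self._by_length.setdefault((net.version, net.prefixlen), {})
        bucket[int(net.network_address)] = asn

    def lookup(self, address):
        ip = ipaddress.ip_address(address)
        value, width = int(ip), ip.max_prefixlen
        # Try the most specific prefix length first, then shorter ones.
        for length in range(width, -1, -1):
            bucket = self._by_length.get((ip.version, length))
            if bucket is None:
                continue
            mask = ((1 << length) - 1) << (width - length)
            asn = bucket.get(value & mask)
            if asn is not None:
                return asn
        return None


# Hypothetical routes: documentation prefixes and private-use AS numbers.
table = PrefixTable()
table.add("2001:db8::/32", 64500)
table.add("2001:db8:1::/48", 64501)
table.add("192.0.2.0/24", 64502)
```

A lookup for `2001:db8:1::1` matches both IPv6 routes and returns the more specific /48. A production table would more likely use a trie or a precomputed MMDB file, as mentioned above, instead of scanning prefix lengths.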

Any questions?

CHAIR: We have like ten minutes, about that, actually, more. Please go ahead, there are two mics to choose from. Just to make it clear: I see you have a GitHub repository. Is it trailing behind your production code, or is it pretty much current?

LOUIS POINSIGNON: Everything is the same rate.

CHAIR: Great.

LOUIS POINSIGNON: The BGP connection uses the BGP library, which is open source. We just made a small modification for a custom thing at Cloudflare, but it's exactly the same code.

SPEAKER: Hi, Blake. Thanks very much for putting this together and open sourcing it. First question: have you thought about making the BGP collector also able to eat BMP, BGP Monitoring Protocol data?


SPEAKER: And the second one is: when can I get this as software as a service?

LOUIS POINSIGNON: We don't do it as software as a service, but for the first question, BMP, yes, we have thought about it, and will probably implement it when I get some time. Otherwise, we just receive BGP messages, and it's the BMP layer that we just have to implement. For the software as a service, I don't know, maybe.

SPEAKER: Joe, Afilias. I was interested that you seem to have built a lot of machinery and are shifting a lot of data just to perform a ‑‑ I was wondering, in the final analysis, how unstable is that data set? Do you find all the extra machinery gives you enough extra accuracy that it's worth it, or is the mapping constant, so that the errors you would get from sampling once a week or something would be minor?

LOUIS POINSIGNON: It's not only for generating IP to ASN. Maybe we can use it to automate on route leaks or ‑‑ just when we want to do look‑ups or traffic engineering, we want to know where we peer, which traffic we peer and which transits people use. It's just very interesting for analysis, not only for generating IP to ASN.

SPEAKER: Hi, Kostas. Thank you for the very interesting talk. My question is, have you evaluated open source tools that are already available, because I see that you rewrote quite a lot of stuff? Did you do it on purpose, because you wanted to control the code entirely? Did you just want to rewrite it in Go for performance, or what was your incentive for rewriting stuff?

LOUIS POINSIGNON: It depends on your size, actually. We reached a point where we wanted more monitoring and flexibility, and at that point we didn't find any software that would fit, so this is why we developed this.

SPEAKER: So you evaluated stuff?

LOUIS POINSIGNON: Mm‑hmm. Like I said, we used nfdump ‑‑ I mean, most of you probably already use it, and that's fine. But we could not live on that and do traffic engineering on that any more.

CHAIR: Cool. You can probably turn all this into a service.

SPEAKER: Patrik Gilmore, ARIN. I noticed you were mapping source IP to Cloudflare class: business, free, etc. Does that mean that IP addresses are dedicated to a class, or do you have an IP address dedicated to individual customers? Also, does the web server give any information to the system, for instance if a single IP address is used by more than one customer, so you can differentiate that?

LOUIS POINSIGNON: We do some mapping for that. I am not sure ‑‑

SPEAKER: If you have a web server with one IP address and lots of customers on it, how do you differentiate between the individual customers? Are you only doing that per class?


SPEAKER: So the web server is not injecting anything into the system to differentiate individual customers?

LOUIS POINSIGNON: It's the way we are set up, and I am just talking about the network pipeline. This is really from the packets, and we just try to get the most information we can. So this is actually how it was structured, just by class of customer.

SPEAKER: Thank you.

SPEAKER: Geoff ‑‑ have you evaluated Panda?

LOUIS POINSIGNON: I heard about it. I haven't tried it though.

SPEAKER: Doing the support for BMP, it's just really well. We will have to reinvent.

LOUIS POINSIGNON: Only an entire ‑‑

CHAIR: Cool. So I think we are done with that. Thank you.


CHAIR: And here we have Louis Plissonneau. And he was ‑‑

LOUIS PLISSONNEAU: I am the real Louis, he is the fake one.

CHAIR: I know, Facebook is more real.

LOUIS PLISSONNEAU: So I am working at Facebook, doing infrastructure. What I am going to talk about is TTLd. I will explain all that to you. I worked for a long time at France Telecom, then I worked a bit for Amazon, and now I am with Facebook, so I am mostly interested in network monitoring: previously on the ISP side and now on the application side.

So TTLd, it's a fancy acronym. The prerequisite is to completely own your infrastructure. What does that mean? First, you own your data centre; it's not strictly necessary but it makes things easier. Then you want to own your racks. When I say own your racks, I mean you own all your top-of-rack switches, and owning your top of rack doesn't mean going to the shop and buying some; it means you take the chips, you assemble them yourself, build the systems and deploy them. So it's completely owned. Then what else? You want to own your hosts. Again, from scratch. All the hosts at Facebook are ‑‑ monitored by us. You will see why you need that. And then you need to own your network, which means you not only buy the cables and the network devices but own them and do whatever you want with them. So as you can see, this prerequisite is quite big, and it's the opposite of the talk that Charles from Facebook will give later. I want to show you what you can do when you own everything in your infrastructure; he will address the other side of the coin, when you start small. Here, I am going all in. We own everything and we can do crazy things.

What do we want to do? To surface end-to-end TCP retransmits throughout the network. I can go to any network device and know exactly how many end-to-end TCP retransmissions happened. If you try to do that, and I tried to do that at France Telecom, it's crazy hard. And on top of that, I want to use all production packets as probes, which means I have more probes than I need, which has the nice side effect that if there is a small imprecision in some counters or whatever, I don't care, because I have billions of other probes coming in and flowing through the network. So what if every packet was a probe? What can you do?

You can measure an end-to-end performance metric for TCP, precisely, wherever you want. And because every probe is a production packet, or every packet is a probe, the probes follow the production traffic, because they are the production traffic, so they follow the ECMP load balancing. If my production packets go one way, all the probes go that way; if it's the other way, I don't care. And as I said already, the main point is that I can collect a counter on any device in the network and know exactly how many end-to-end TCP connections have experienced loss on this path, which is a big deal to me.

So, a brief recap on network monitoring. I have spent most of my career on network monitoring, and you have mainly two types of network monitoring, active and passive. With this tool, I will introduce a middle ground. So, why is passive monitoring not enough? First of all, again, this is a talk for when you are at very large scale, so you should not try to implement the TTLd solution if you are small or mid-size; it's for once you have implemented all the passive and active monitoring solutions that you can. So, why was passive monitoring not enough for us? If you collect SNMP counters on devices, you get how many packets were dropped from the device's point of view. If you trust the device. So the network device that is flaky enough to drop your packets, you want to trust it to tell you, I dropped these packets and here is the number I dropped. They don't always do that. It's a very good starting point; if you are small, this is the first thing you should implement, but soon enough you should not trust devices. On the other side of the spectrum, because I want to see end-to-end TCP retransmits, I can look at the TCP retransmit counters exposed by Linux. All our hosts have that, it's all aggregated, and we can do plenty of things with that. But if I tell you that this rack or this host lost that many packets, where is the loss happening? In the cluster switch, in the spine, in the backbone, wherever; nobody knows. So it's not enough.

So then we have active monitoring. Active monitoring is really my baby, much more than passive monitoring. I love active monitoring: injecting packets in the network to detect service- or customer-impacting loss and to triangulate loss to a device. You can have a great time designing these tools and scaling them, and I am doing that all day, but there are limitations to that.

First, potentially all injected packets could be dropped without any single production packet being lost, or vice versa. Let's imagine, and I have seen that already, that you build an active monitoring system injecting UDP probes in your network and you have a TCP blackout not affecting UDP, for whatever reason. Then your UDP probes will flow through a network which has nothing else in it; it will be the best network state ever according to your UDP probes, but not a single production packet will flow through the network. So this is the main drawback to me.

The second drawback of active monitoring is that the number of probes is many orders of magnitude lower than the number of packets in the network. Obviously you don't want to saturate your network with control traffic only, which means events that happen only under high load may not always be triggered by your probes, or you have some tiny, low-signal things that cannot be detected. I had a bit-flipping problem: a network device decided at some point to flip a bit only under high load, one particular bit in the middle of the payload, I don't know why. The checksum ‑‑ TCP was rejecting the packet, but according to everything else there was no loss. Our active monitoring didn't have randomisation of the payload and, out of luck, it was not flipping that bit. I mean, the bit flipping was not straightforward; it only flipped in a common case that was not triggered by our active monitoring solution. And yeah, this was bad, and we were looking for that for days.

So, TTLd. The main concept is that every packet in the network is a probe. Did I say that already? I really love it. So every probe is a packet and every packet is a probe. What you want is one bit in the packet header to tell you that this packet was retransmitted. When I was doing my PhD at France Telecom I had some fancy reverse-engineering things for that; I don't want all this crap. I know it from the TCP stack on the host: the TCP stack will tell me if the packet was retransmitted. This is the most precise evaluation of loss that I can get. So I use the host to mark the packet, and obviously my packets, which are my probes, must be indistinguishable to network devices whether they are marked or not, because otherwise the retransmitted packets could flow somewhere else. This will be explained on the graph in the next slide.

As you know, in a big fabric network you have network devices in groups that balance the traffic, hopefully equally, through ECMP, which gives capacity and redundancy if it's implemented correctly. What we expect with TTLd is that if you have one device exposing more retransmits than the others in the group, it may be the root cause of the dropped packets. This gives a view on congestion on the device from an end-to-end packet performance view and not from a dropped-packets-in-the-network view, because obviously a lost packet has a very different impact on a flow when it's an elephant flow or a mouse flow. And obviously you will have neighbouring devices that see the impact of a bad device.

Let's see a typical network fabric design: you have racks at the bottom with hosts, then you have cluster switches and spine switches. So these are the racks, with all the hosts, and there are plenty of them; these are groups of cluster switches; and on the top, the spine. This is a typical design with, like, four different spines that are connected to all the clusters in that way. Okay. So, let's blank it so that it's clear. Let's say that this device at the top in the spine has loss. As you may know, it's very difficult to detect loss high in the spine because no host is connected to this device. If I want to measure it individually, I cannot. All my hosts are living here, and there is ECMP: if I want to target this one, I end up there. I am not going through the impacted device. So it's a very hard problem to inject probes to that device; I have designed active monitoring for it, but it's very low load. So I have this device, it's losing packets and it's not telling me that the packets are dropped. With TTLd, what will happen is that this device will be coloured red, if counters are collected correctly, and all the neighbouring devices will also show loss. So it will show not only that this device is the root cause of the loss but also all the impacted clusters and switches. Imagine that you have this device which is bad, but only in certain ways, and you don't have enough traffic there: you can selectively drain, like, one cluster switch to remediate the problem in this rack. That gives a lot of ability on top of detection, because you know the root cause and all the paths.
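
The comparison within an ECMP group can be sketched as follows, in Python. This is only an illustration of the idea that one device showing more retransmits than its peers is a suspect; the function name, the median baseline, the `factor` threshold and the counter values are all assumptions, not the method used at Facebook.

```python
from statistics import median


def suspect_devices(counters, factor=3.0, min_packets=1000):
    """Flag devices whose retransmit ratio is well above the group median.

    counters maps device name -> (marked_packets, total_packets).
    """
    ratios = {
        name: marked / total
        for name, (marked, total) in counters.items()
        if total >= min_packets  # ignore devices with too little traffic
    }
    if not ratios:
        return []
    baseline = median(ratios.values())
    return sorted(
        name for name, ratio in ratios.items()
        if ratio > 0 and ratio > factor * baseline
    )


# Hypothetical counters for four spine switches in the same ECMP group.
spine_counters = {
    "spine1": (10, 100_000),
    "spine2": (12, 100_000),
    "spine3": (11, 100_000),
    "spine4": (400, 100_000),
}
suspects = suspect_devices(spine_counters)
```

Because ECMP is expected to spread traffic roughly equally, peers in a group make a natural baseline for each other; the neighbouring layers would then show which clusters and racks sit on the affected paths.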

So, technical details. I hope I have aroused your curiosity. We want to find, ideally, a single bit in the TCP/IP header that would be easy to check, would be treated the same by all network devices and would not change TCP behaviour ‑‑ I don't want to enter that argument ‑‑ and, yes, would not change the packet flow whether it's set or unset. We had a hard time finding one, because obviously there are not so many empty bits flying around. At first I wanted to use the ECN bits that were mentioned earlier, but if we decide to use them we cannot implement DCTCP or anything like that. We also had the idea of using DSCP labels, but we are heavily using them and it would be a step back to use them for that. So do you see a bit that is not used in this header? We can make a bit of room, because here ‑‑ so you have it. Too bad. We can use the most significant bit of the TTL field, which, because we are using IPv6, is the hop limit. If we decide that the width of our network will never be more than 127 hops, which is way enough: the hop limit field is one byte, so we can go up to 255 for the max TTL, but no packet will ever take 255 hops to reach a destination, even if you route it twice throughout the Internet. So we decided to limit it to 127, which means the most significant bit of the hop limit is free: currently it's always zero, and I can set it to one if I want to.

How do we do that? We use eBPF. Everybody here, I guess, has used standard BPF packet filtering for tcpdump and all that. But nowadays we have a new generation, eBPF, which is able to tune the kernel handling of network events, promoted by Brendan Gregg from Netflix and by Alexei Starovoitov from Facebook. It's a safe and efficient way to insert hooks into the kernel at run time, which fits our use case for network events very well.

So we hooked it up in dedicated places. In this case we hook it at TC egress, after the TCP stack, which means that once the TCP stack has done all it wants with the packet and is handing the packet to the network card, nothing else in the kernel will happen to this packet, and we touch it before it's put on the network. I mean, not exactly on the network: before it's given to the network card. Then, what did we do? We changed the behaviour on a kernel event, which is the TCP retransmit. When the packet is going through the TCP stack in the kernel and is given to the network card, it takes a slightly different path if it's coming out of the TCP retransmit function. So I plug my programme there, right at the last minute and right at the point where the TCP stack on the host is telling me there is a retransmit. When there is a retransmit I set the high bit of the TTL to one, which means all of the non-retransmitted packets have it at zero and start with a TTL of 127, and my retransmitted packets start with the high bit set, a TTL of 255. And that is the beauty of it. Then I can take any packet, anywhere in the network, look at one single bit, this high bit of the TTL ‑‑ it's also more convenient to have it in the IP header than in the TCP header ‑‑ and I can tell you whether this packet was retransmitted by the host.
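
The marking arithmetic can be modelled in a few lines of Python. This is only a model of the bit manipulation: the real marking described here is done by an eBPF programme at TC egress in the kernel, and the function and constant names below are hypothetical. The one fixed fact used is that the hop limit is the eighth byte (index 7) of the 40-byte IPv6 header.

```python
RETRANSMIT_BIT = 0x80      # most significant bit of the one-byte hop limit
BASE_HOP_LIMIT = 127       # assumed maximum network width, per the talk
HOP_LIMIT_OFFSET = 7       # hop limit is byte 7 of the fixed IPv6 header


def initial_hop_limit(retransmitted):
    """Hop limit a host would set: 127 normally, 255 for a retransmit."""
    if retransmitted:
        return BASE_HOP_LIMIT | RETRANSMIT_BIT
    return BASE_HOP_LIMIT


def mark_ipv6_header(header, retransmitted):
    """Return a copy of a raw 40-byte IPv6 header with the hop limit rewritten."""
    if len(header) != 40:
        raise ValueError("expected a 40-byte IPv6 fixed header")
    marked = bytearray(header)
    marked[HOP_LIMIT_OFFSET] = initial_hop_limit(retransmitted)
    return bytes(marked)


def is_retransmit(hop_limit):
    """What any device on the path can check: is the high bit still set?"""
    return bool(hop_limit & RETRANSMIT_BIT)
```

The scheme works because decrementing never crosses the boundary within the allowed width: after up to 127 hops a retransmit has gone from 255 down to at most 128, so the high bit stays set, while an ordinary packet goes from 127 down towards 0 and never has it set.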

How do we count the packets? First of all, if we can inject whatever default TTL we want in our network and change the behaviour of the host by injecting an eBPF programme, you surely need to have a good hold on your hosts. You need to have a good hold on your racks, because they will decrement the TTL, and you need to be able to inject whatever you want on the host at the kernel level, so that is the prerequisite of owning your hosts. Then there is the beauty of this bit: we have an ACL match on the most significant bit of the TTL and a counter that is bumped on the network device. We implement it on FBOSS devices, but the best part of it, as everyone knows, is that the TTL is decremented every time the packet crosses a network device, which means the device already has a hold on this TTL, and checking the high bit is practically free, because every network device has to remove one from this number. So at the same time as it's removing one, it's just checking, oh, is the high bit set to one? So I can do all this measuring without impacting the performance of network devices, because obviously I don't want to slow down the network devices to have them tell me they are working.
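
A toy Python model of the device side, assuming each device keeps two counters and bumps one of them when the high bit of the hop limit is set, in the same step where it decrements the field. The `Device` class and `send` helper are hypothetical names; real devices do this with an ACL match and a hardware counter.

```python
RETRANSMIT_BIT = 0x80  # high bit of the hop limit, set by the host on retransmits


class Device:
    """Toy forwarding device with a counter for marked packets."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.retransmits = 0

    def forward(self, hop_limit):
        """Count the packet, count it as a retransmit if marked, decrement the hop limit."""
        self.total += 1
        if hop_limit & RETRANSMIT_BIT:
            self.retransmits += 1
        return hop_limit - 1


def send(path, hop_limit):
    """Pass one packet through every device on the path, in order."""
    for device in path:
        hop_limit = device.forward(hop_limit)
    return hop_limit


# Hypothetical three-hop path: top of rack, cluster switch, spine switch.
path = [Device("tor"), Device("cluster"), Device("spine")]
for _ in range(3):
    send(path, 127)          # ordinary packets
remaining = send(path, 255)  # one retransmitted packet
```

Every device on the path sees the same marked packet, which is why counters from upstream and downstream neighbours can be cross-checked against each other.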

So this is currently implemented on all top-of-rack switches and Wedges, on the cluster switches, and also at the very high layer, between the buildings of a region. Okay. But I don't like having holes in the network, and we have some network devices that are not FBOSS devices. So, I collect my counters and I don't have to do sampling, because every packet has to have its TTL decremented: at the same time as you remove one, you check this high bit. So I have a precise view, and I can aggregate wherever I want: per source, per destination, per rack, or per source cluster and destination cluster. If it's only one device in the cluster group, I will see it more red than the others, and things like that. As I was saying, we don't have only FBOSS devices, so for the others we have a collection framework that allows us to expose counters from non-FBOSS devices inside Facebook, and we do the same trick, because guess what? It's not only FBOSS devices: every single device has to decrement the TTL, so whether it's Cisco or Juniper, it has to do it also. We put an ACL in place and, again, I benefit from everything: there is no sampling, the packets follow ECMP hashing, and I get retransmissions for a specific device.

Okay. So we have a few visualisation tools. This one is not the most fancy looking, I will show you the fancy ones on the next slide, but it is the most useful. We have multiple aggregations, and we have them per data centre; this is a type-of-service view, from a back-end DB, something like that. And you see the colouring is per cluster: if it's white there is no loss ‑‑ it's not exactly zero loss, but I have ranges and the white is the lowest range. And if it's dark red there is loss. Easily, I can see here that for one of my services I have one, two, three clusters that do not look as good, and I can click on them and hover over them and see from where to where, how many switches are impacted and all that in this layer. So this is very useful; I can group our services here and see all that. This is cool, but not very nice looking. For the nice-looking ones, we have this one. This is a view of part of a data centre where you see the spine switches, you remember the colours from the previous slide, and these are clusters. I mean, we have an internal layer between the spine switches and the racks, so we have the spine switches, then you have the bigger clusters and the small clusters, and then the small dots are, again, plenty of top-of-rack switches. So it's a very fancy-looking thing. We even have a zoom, I love it. If I put the data of the TTLd retransmits on that, I get this. So I see that these guys are affected, the spine switches are fine, plenty of other clusters are unaffected, and these are a bit affected. To tell you the truth, this is not really so useful in practice, but as a graph it's so nice I wanted to put it on the slides.

So, that is it. I am done. I hope I convinced you that we can do really cool stuff when we own the stack and have the freedom to do so. If you have any questions, I am happy to answer.

CHAIR: Thank you very much. If you have questions, please come on down to the microphones.


SPEAKER: Brian Trammell. When you were talking about looking at, you know, which bits you were going to abuse: I really like the TTL hack, it's super cool, and I am going to steal it and try to get people to stop abusing ECN and DSCP; that is what we do in research and ‑‑

LOUIS PLISSONNEAU: We also considered, because we are on IPv6, the hop-by-hop header of IPv6, but it causes extra computation on network devices, and we don't want extra computation.

BRIAN TRAMMELL: You said you considered using the ECN bits. What I don't understand is why you didn't just turn on ECN, because you get exactly the same signalling and less loss?

LOUIS PLISSONNEAU: No, because, I mean, this is a totally different problem to me. ECN is like asking the network device: if your queue is 80% full, tell me it's going to be 100% full soon, or whatever the number you want. Which means that we put a lot of trust in the network device. What we have seen happen in the network is that network devices break in unexpected ways, and will probably tell you, yes, I am marking all ECN bits as I should, while not doing it, or dropping packets, or things like that. And this is the problem of grey failures: you have some device dropping packets and the network device will never tell you.

BRIAN TRAMMELL: So ‑‑ let me rephrase the explanation. The handling of the TTL bits and the ACL on the routers is more trustworthy for you than the handling of the ECN?


BRIAN TRAMMELL: A problem we should fix on the routers.

LOUIS PLISSONNEAU: And the main thing, which I mentioned but maybe it was not very clear, is that all the hosts will inject a certain number of retransmits; this is fine. And these will be seen by all the network devices, so if one network device is misbehaving in terms of counting the high TTL bit, it may not record the loss, but the neighbouring devices will, especially the ones upstream and downstream. If they are not collaborating to misbehave, I will see that I have X on the previous layer, if I aggregate at the cluster level, and on the next one it is missing, so I will see that I have a hole. Obviously I will not be able to triangulate whether it's this guy or one of the others, but it's a lot of information. And in practice TTL is very well implemented, because decrementing this field is something devices have done for a long time.

SPEAKER: So I think all that argumentation applies to ECN as well but I will take that argument off‑line. I don't have anything else so thank you very much.

BENEDIKT STOCKEBRAND: Aside from the fact that I'm feeling kind of uncomfortable reusing various bits that are defined for a certain purpose for something else, with IPv6 especially there is this thing that a number of packets, especially ICMP packets, use a TTL or hop limit of 255 and are otherwise considered invalid, for example ICMP redirects, neighbour discovery, etc. Would that interfere with what you are doing, and did you fix that in the stack itself to avoid these things, or does it just ‑‑

LOUIS PLISSONNEAU: I didn't fix it, on purpose, because how many ICMP packets do you have in the network compared to TCP packets? It's a tiny bit. I can count them, and I surely cannot count all the TCP packets we have at Facebook. So the beauty of having every packet as a probe ‑‑ in fact, to tell you the truth, there are some racks that retained some other configuration that came from IPv4, with a TTL of 64, and that don't want to change it ‑‑ is that I don't care if I have an outlier or a rack of outliers, because with the number of packets that are giving me information, if I neglect or miscount ICMP probes or something like that, I don't really care. That is my way of handling it. I know it may not be very satisfying for a scientific mind, but in practice I have, like, four orders of magnitude more normal packets than these cases, and I will not lose time on this.

SPEAKER: Maybe I misunderstood you there. I thought you would have the hosts implement these things, but basically ‑‑ I think I am getting the idea, okay, but I am still not happy about reusing bits for other purposes. Anyway. But never mind.

SPEAKER: You spent probably five, seven years working on ‑‑ we also managed to get hardware vendors to implement it, would give you some. So why are we ‑‑ abusing IP rather than doing something that is well‑designed, supported by hardware vendors, and could really be used for that?

CHAIR: State your affiliation.

SPEAKER: Jeff Tantsura, Nuage Networks.

LOUIS PLISSONNEAU: I am not sure I understood your question, but what I guess I understand is: why not use more network-supported protocols and have it in the network? I want the most precise view. Even if a network device tells me that there is a problem there, I know the next hop, so I know there is loss on the packets going to this next hop, but for the next-to-next hop, with ECMP hashing, I have no clue where the loss will propagate. And this is very important information, because I want to see if, by chance or by bad luck, all the loss is converging through a single cluster, which would impact my services inside this cluster. I haven't seen any possibility to do it that way, and that is why I want to use end‑to‑end information, but I am happy to discuss it.

SPEAKER: It came from Facebook, from ‑‑ how to mark stuff in the data plane.

SPEAKER: We have spoken to Broadcom quite a lot about this. Unfortunately we have a very large number of older chips deployed on which we can't harness that. We are retiring all the stuff we have now, but until we get to a critical mass of the newer chips, until it's in most of the network, it doesn't become super useful, unless you can see it everywhere. So we haven't implemented it, that I know of. The big thing that is going to hold us back for the next couple of years is that we will have all of the old Broadcom chips deployed in the fleet for a minimum of two or three years, until we retire them all.

CHAIR: Thank you very much.


So that concludes the two morning sessions. Thank you for attending. Please remember to rate the talks. We have lovely prizes, and the information will help us in choosing talks in the future. The Programme Committee has two slots open for election; if you would like to nominate yourself, please email pc [at] ripe [dot] net. You have until 15:30, which is the end of the session after lunch. As part of lunch, there is a women in technology lunch, which will be upstairs in the ballroom where we had the welcome reception yesterday, and the regular lunch will be in the regular part of the building. Thank you.