Reliability and Agility in Alibaba Global Network
Ennan Zhai (Alibaba)
As one of the largest cloud service providers, Alibaba Cloud serves over one billion customers around the world. To ensure quality service at this scale, the underlying network infrastructure is of critical importance. This talk focuses on two essential aspects of our network management: reliability and agility. I will discuss recent examples of our efforts in these areas, including, first, our configuration verification system, Hoyan Jinjing, built to ensure the reliability of our global-scale network, and second, our new cutting-edge compiler, Lyra, designed to support flexible data plane programming on heterogeneous ASICs. In addition, I will also touch on our lessons learned and open questions in these areas.
- 1:30pm - Automatic optimization of endpoint packet-processing programs
Qiongwen Xu (Rutgers University)
We are currently developing a program-synthesis-based compiler that performs automatic performance optimization of low-level packet-processing code. Given an instruction sequence, our compiler leverages the stochastic superoptimization framework to search for a semantically equivalent instruction sequence with higher performance than the input sequence. We have prototyped this compiler in the context of BPF, an in-kernel virtual machine instruction set deployed in several production systems, for example, to implement load balancing, DDoS protection, and container security policies. We have formalized the semantics of the BPF instruction set, including domain-specific aspects such as memory access and key-value maps, and introduced several domain-specific optimizations that reduce verification time by 4-5 orders of magnitude. Importantly, our optimization framework can incorporate safety conditions based on data flow, which is necessary for the programs to pass the in-kernel BPF static checker and be executed in the kernel context. The development of this compiler is a work in progress. However, preliminary results are promising, with reliable performance improvements relative to `clang -O3` for BPF programs from the kernel samples as well as programs currently run on production systems.
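To illustrate the stochastic-superoptimization loop the abstract describes, the sketch below searches over a toy instruction set with random mutations and a Metropolis-style acceptance rule. The toy instruction set, the length-based cost model, and the test-based equivalence check are all illustrative assumptions; the actual compiler operates on BPF bytecode and verifies equivalence formally rather than by testing.

```python
import random

# Toy "instruction set": a program is a list of (op, arg) pairs that
# transforms an integer register r. This stands in for BPF bytecode.
def run(prog, r):
    for op, arg in prog:
        if op == "add":
            r += arg
        elif op == "mul":
            r *= arg
        elif op == "shl":
            r <<= arg
    return r

def cost(prog, spec, tests):
    # Heavily penalize behavioral mismatches on sample inputs, then prefer
    # shorter programs (a stand-in for a real performance model).
    errors = sum(run(prog, t) != run(spec, t) for t in tests)
    return 1000 * errors + len(prog)

def superoptimize(spec, tests, iters=5000, seed=0):
    rng = random.Random(seed)
    ops = [("add", 1), ("add", 2), ("mul", 2), ("shl", 1), ("shl", 2)]
    best = cur = list(spec)
    for _ in range(iters):
        cand = list(cur)
        move = rng.random()
        if move < 0.4 and cand:                  # delete an instruction
            del cand[rng.randrange(len(cand))]
        elif move < 0.8 and cand:                # replace an instruction
            cand[rng.randrange(len(cand))] = rng.choice(ops)
        else:                                    # insert an instruction
            cand.insert(rng.randrange(len(cand) + 1), rng.choice(ops))
        # Metropolis-style acceptance: always take non-worsening moves,
        # occasionally accept regressions to escape local minima.
        if cost(cand, spec, tests) <= cost(cur, spec, tests) or rng.random() < 0.01:
            cur = cand
        if cost(cur, spec, tests) < cost(best, spec, tests):
            best = cur
    return best
```

Because `best` is only replaced by strictly cheaper candidates, the search never returns a program that misbehaves on the test inputs or is longer than the specification it started from.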
- 1:40pm - Multitenancy for Fast and Programmable Networks in the Cloud
Tao Wang (NYU)
Fast and programmable network devices are now readily available, both in the form of programmable switches and smart network-interface cards. Going forward, we envision that these devices will be widely deployed in the networks of cloud providers (e.g., AWS, Azure, and GCP) and exposed as a programmable surface for cloud customers, similar to how cloud customers can today rent CPUs, GPUs, FPGAs, and ML accelerators. Making this vision a reality requires us to develop a mechanism to share the resources of a programmable network device across multiple cloud tenants. In other words, we need to provide multitenancy on these devices. To this end, we design compile- and run-time approaches to multitenancy. We present preliminary results showing that our design provides both efficient resource utilization and isolation of tenant programs from each other.
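One part of the compile-time side of multitenancy is an admission check: do the tenants' combined resource demands fit the device? The sketch below shows that shape of check; the resource names (`stages`, `sram_kb`) are hypothetical, and the actual design described in the talk also involves run-time isolation mechanisms not shown here.

```python
def admit_tenants(device, requests):
    """Check whether tenant programs' combined demands fit the device.

    `device` maps resource names to capacities; each entry of `requests`
    maps resource names to one tenant's demand. Returns (True, None) if
    everything fits, else (False, first_overcommitted_resource).
    """
    for resource, capacity in device.items():
        demand = sum(r.get(resource, 0) for r in requests)
        if demand > capacity:
            return False, resource
    return True, None
```

A real system would also need per-tenant run-time enforcement (e.g., metering and memory isolation), since a static check alone cannot stop a misbehaving tenant program.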
- 1:50pm - Lyra: A Cross-Platform Language and Compiler for Data Plane Programming on Heterogeneous ASICs
Jiaqi Gao (Harvard University)
Programmable data planes have been moving toward deployment in data centers as mainstream vendors of switching ASICs enable programmability in their newly launched products, such as Broadcom's Trident-4, Intel/Barefoot's Tofino, and Cisco's Silicon One. However, current data plane programs are written in low-level, chip-specific languages (e.g., P4 and NPL) and are thus tightly coupled to the chip-specific architecture. As a result, it is arduous and error-prone to develop, maintain, and compose data plane programs in production networks. This paper presents Lyra, the first cross-platform, high-level language and compiler system that aids programmers in programming data planes efficiently. Lyra offers a one-big-pipeline abstraction that allows programmers to use simple statements to express their intent, without laboriously taking care of the details in hardware; Lyra also proposes a set of synthesis and optimization techniques to automatically compile this "big-pipeline" program into multiple pieces of runnable chip-specific code that can be launched directly on the individual programmable switches of the target network. We built and evaluated Lyra. Lyra not only generates runnable real-world programs (in both P4 and NPL), but also uses up to 87.5% fewer hardware resources and up to 78% fewer lines of code than human-written programs.
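The core compilation problem has the flavor of partitioning one ordered pipeline across several switches of limited capacity. The sketch below shows that shape of the problem with a greedy splitter; Lyra itself uses synthesis and optimization techniques rather than this greedy heuristic, and the stage names and costs are invented for illustration.

```python
def partition_pipeline(stages, capacities):
    """Split a one-big-pipeline program into per-switch pieces.

    `stages` is an ordered list of (name, cost) pairs; `capacities` lists
    how many cost units each switch along the path can hold, in order.
    """
    pieces, i = [], 0
    for cap in capacities:
        piece, used = [], 0
        while i < len(stages) and used + stages[i][1] <= cap:
            piece.append(stages[i][0])
            used += stages[i][1]
            i += 1
        pieces.append(piece)
    if i < len(stages):
        raise ValueError("program does not fit on the target network")
    return pieces
```

A greedy split like this can fail on programs a smarter search would place successfully, which is one reason an optimizing compiler is needed for real chip resource models.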
- 2:00pm - RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers
Hang Zhu (JHU)
Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides performance close to that of centralized scheduling policies and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
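The power-of-k-choices policy mentioned above is simple to state: sample k servers at random and dispatch to the least loaded of them. The sketch below simulates it in software under the simplifying assumptions that requests never complete and the load view is exact; in RackSched this decision is made in the ToR switch data plane against approximate load counters.

```python
import random

def pow_k_choices_dispatch(loads, k, rng):
    """Sample k random servers and dispatch to the least loaded one.

    `loads` is the scheduler's view of per-server outstanding requests
    (here exact; the real switch tracks approximate counters).
    """
    sample = rng.sample(range(len(loads)), k)
    choice = min(sample, key=lambda s: loads[s])
    loads[choice] += 1   # request dispatched; that server's load rises
    return choice
```

Even k=2 dramatically tightens the load imbalance relative to purely random dispatch (k=1), which is the classic "power of two choices" effect the inter-server scheduler exploits.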
- 2:10pm - Forwarding and Routing with Packet Subscriptions
Theo Jepsen (Stanford University)
In this work, we explore how programmable data planes can naturally provide a higher level of service to user applications via a new abstraction called packet subscriptions. Packet subscriptions generalize forwarding rules and can be used to express both traditional routing and more esoteric, content-based approaches. We present strategies for routing with packet subscriptions in which a centralized controller has a global view of the network, and the network topology is organized as a hierarchical structure. We also describe a compiler for packet subscriptions that uses a novel BDD-based algorithm to efficiently translate predicates into P4 tables that can support O(100K) expressions. Using our system, we have built three diverse applications. We show that these applications can be deployed in brownfield networks while performing line-rate message processing, using the full switch bandwidth of 6.5Tbps.
- 2:20pm - Neptune: Balancing Flexibility, Performance, and Consistency on Heterogeneous Packet Processing Architectures
Praveen Kumar (Cornell University)
Effectively using SmartNIC-based accelerators is challenging because of the fundamental trade-off between flexibility and performance. A promising approach is to use a heterogeneous architecture, but the difficulty of ensuring state consistency with distributed processing across the CPU and NIC adds another dimension to the trade-off. To harness the power of reconfigurable hardware, we present Neptune, a system that guarantees strong consistency while offering flexibility and hardware-level performance in the common case. Neptune features (i) a NIC architecture that flexibly accommodates a broad range of packet-processing tasks while providing state transactions and (ii) a CPU-based runtime system that maps packet processing onto this architecture and dynamically adapts to changing workloads. We present an FPGA-based implementation to show that Neptune flexibly accommodates a diverse set of applications with high throughput (100 Gbps) and competitive latency (1 μs).
- 2:30pm - Elastic Switch Programming with P4All
Mary Hogan (Princeton University)
The P4 language enables a range of new network applications. However, it is still far from easy to implement and optimize P4 programs for PISA hardware. Programmers must engage in a tedious "trial and error" process wherein they write their program (guessing it will fit within the hardware) and then check by compiling it. If it fails, they repeat the process. In this talk, we argue that programmers should define elastic data structures that stretch automatically to make use of available switch resources. We present P4All, an extension of P4 that supports elastic switch programming. Elastic data structures also make P4All modules reusable across different applications and hardware targets, where resource needs and constraints may vary. Our design is oriented around the use of symbolic primitives (integers that may take on a range of possible values at compile time), arrays, and loops. We can use these primitive mechanisms to build a range of reusable libraries such as hash tables, Bloom filters, sketches, and key-value stores. We will explain the important role that elasticity plays in modular programming, and we will show the P4All development process.
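The idea of an elastic structure can be made concrete with a Bloom filter whose symbolic parameters (number of hash stages, bits per stage) are resolved at compile time to the best values that fit the target's resources. The resource model below (stage and SRAM budgets) is invented for illustration, and P4All's compiler solves this with an ILP rather than the brute-force enumeration sketched here.

```python
import math

def elastic_bloom(max_stages, max_sram_bits, items):
    """Pick the (k, log2 m) for a partitioned Bloom filter that minimizes
    the estimated false-positive rate under toy resource constraints:
    at most `max_stages` hash stages and `max_sram_bits` total bits.
    """
    best = None
    for k in range(1, max_stages + 1):      # symbolic: number of hash stages
        for w in range(4, 21):              # symbolic: log2(bits per stage)
            m = 2 ** w
            if k * m > max_sram_bits:
                continue
            # Standard estimate for k partitioned arrays of m bits each,
            # one bit set per array per inserted item.
            fpp = (1 - math.exp(-items / m)) ** k
            if best is None or fpp < best[0]:
                best = (fpp, k, w)
    return best                             # (est. fpp, stages, log2 bits)
```

The elastic version of the program would carry `k` and `w` as symbolic integers; the same module then "stretches" to whatever stage and memory budget the target switch leaves available.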
Keynote + Q&A: The Accidental SmartNIC
Bruce Davie (VMware)
In 1989, a number of U.S. research institutions and universities started collaborating on a set of Gigabit testbeds - trying to build the first networks that could deliver data to and from applications at the seemingly crazy speed of 1 Gbps. As part of the Aurora testbed, we built a number of flexible "host network interfaces" - flexible because we didn't know which tasks would be done in the host and which should be offloaded. Our 1989 design - a couple of Intel CPUs, some big FPGAs, expensive optics - bore a striking similarity to the SmartNICs of today. And in many ways, we still don't know which tasks should be offloaded, which is why we continue to see CPUs and FPGAs on NICs - although some tasks like TCP header processing and tunneling for network virtualization are now well established as offloadable. This talk will examine the long-lived tradeoff between keeping network functions close to the application (in the host) and offloading them to the NIC in the hope of better performance, and consider some of the implications of recent announcements of running a full hypervisor on the SmartNIC.