The Workshop on Programmable Devices was organized by Prof. Arvind Krishnamurthy (UW), Prof. Aurojit Panda (NYU), and Prof. Scott Shenker (UCB) and conducted remotely on Friday, October 30, 2020.
Agenda - Videos of Presentations
9:00: Welcome – Conference Organizers
Programmable Hardware
9:05: Mothy Roscoe and the Enzian team (ETH-Zurich): The Enzian heterogeneous research computer
9:50: Changhoon Kim (Barefoot/Intel): Our Take on the Future of Programmable Packet Switches
10:20: Andrew Gospodarek (Broadcom), Lionel Pelamourgues (Firebird): Project EOS: Broadcom - Firebird collaboration to offload Envoy Service Mesh Proxy to a Linux SmartNIC
10:45 - 11:00 Break (Gather Online)
SmartNICs and Remote Memory
11:00: Sid Karkare (Marvell): OCTEON DPU for LiquidIO SmartNICs
11:30: Aditya Akella (Univ. of Wisconsin): 1RMA: Re-envisioning Remote Memory Access for Multi-tenant Datacenters
12:00: Matty Kadosh and Liran Liss (Mellanox): The all-programmable accelerated data center platform: from NICs and switches to services and applications
12:45 - 1:30 Lunch break (Gather Online)
Reconfigurable Hardware
1:30: Andrew Putnam (Microsoft): Next-Generation FPGA-based SmartNICs
2:00: David Sidler (Microsoft): StRoM: Smart Remote Memory
2:30: Gordon Brebner (Xilinx): Towards an Open P4 Programmable Hardware Platform
3:00 - 3:30 Break (Gather Online)
In-network Computation
3:30: Ming Liu (VMware/Univ. of Wisconsin): Offloading Distributed Applications onto SmartNICs using iPipe
4:00: Marco Canini (KAUST): In-Network Computation is a Dumb Idea Whose Time Has Come
4:30: Bruce Davie (VMware): The Accidental SmartNIC
5:00: Wrap up
Abstracts
Aditya Akella (Univ of Wisconsin), 1RMA: Re-envisioning Remote Memory Access for Multi-tenant Datacenters
Abstract: Remote Direct Memory Access (RDMA) plays a key role in supporting performance-hungry datacenter applications. However, existing RDMA technologies are ill-suited to multi-tenant datacenters, where applications run at massive scales, tenants require isolation and security, and the workload mix changes over time. The mismatch is rooted in standard RDMA’s two basic design attributes: connection orientedness and complex policies baked into hardware. We describe a new approach to remote memory access – OneShot RMA (1RMA) – suited to the constraints imposed by multi-tenant datacenter settings. The 1RMA NIC is connection-free and fixed-function; it treats each RMA operation independently, assisting software by offering fine-grained delay measurements and fast failure notifications. 1RMA software provides operation pacing, congestion control, failure recovery, and inter-operation ordering, when needed. The 1RMA NIC, deployed in production, supports encryption at line rate (100Gbps and 100M ops/sec) with minimal performance/availability disruption for encryption key rotation.
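The division of labor the abstract describes, a fixed-function connection-free NIC with pacing and congestion control pushed into software, can be illustrated with a toy sketch. All class names, the delay numbers, and the AIMD control law below are illustrative assumptions, not details from the 1RMA paper.

```python
# Hypothetical sketch of 1RMA's split: the NIC executes each RMA operation
# independently (no connections) and reports per-op delay; software does
# pacing, congestion control, and failure recovery.

class OneShotNic:
    """Stand-in for a connection-free, fixed-function NIC."""
    def issue(self, op):
        # Each op is independent; the NIC returns a status and a
        # fine-grained delay measurement (made-up value here).
        return ("ok", 12.5)

class SoftwarePacer:
    """Software-side pacing/congestion control, as 1RMA delegates to hosts."""
    def __init__(self, nic, delay_target_us=20.0):
        self.nic = nic
        self.delay_target_us = delay_target_us
        self.rate_ops_per_s = 1000.0  # current pacing rate

    def submit(self, op):
        status, delay_us = self.nic.issue(op)
        if status != "ok":
            return self.recover(op)       # fast failure notification -> retry
        # Simple AIMD on measured delay (illustrative only).
        if delay_us > self.delay_target_us:
            self.rate_ops_per_s *= 0.5    # multiplicative decrease
        else:
            self.rate_ops_per_s += 10.0   # additive increase
        return status

    def recover(self, op):
        return "retried"
```

The point of the sketch is the inversion relative to standard RDMA: policy (rate, ordering, recovery) lives in host software, so the NIC stays simple and tenant-agnostic.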
Andrew Putnam (Microsoft), Next-Generation FPGA-based SmartNICs
Abstract: Microsoft deploys FPGAs (Field Programmable Gate Arrays) worldwide to accelerate a variety of our cloud services, including AI/ML and networking, and both 1st-party and 3rd-party workloads. More than 1 million FPGAs have been deployed to date across Microsoft Azure and Bing.
This talk will present a brief overview of how and why FPGAs became an integral part of Microsoft’s cloud, how they have evolved, and how we use them today, particularly in Azure Accelerated Networking. In addition, this talk will present the greatest challenges facing FPGA-based SmartNICs and reconfigurable logic in general, and discuss future obstacles that must be addressed for reconfigurable cloud computing to reach its full potential.
Changhoon Kim (Barefoot/Intel/Stanford), Our Take on The Future of Programmable Packet Processing
Abstract: I will share our vision for programmable packet processing; it’s all about making networks and networked systems better. Then I will explain what part in this vision Barefoot and Intel would like to play. Finally I will also briefly introduce a new type of machine that some of us at Stanford -- not Intel -- have been working on to ensure extremely low and predictable RPC latency.
Bruce Davie (VMware), The Accidental SmartNIC
Abstract: In 1989, a number of U.S. research institutions and universities started collaborating on a set of Gigabit testbeds - trying to build the first networks that could deliver data to and from applications at the seemingly crazy speed of 1Gbps. As part of the Aurora testbed, we built a number of flexible “host-network interfaces” - flexible because we didn’t know what tasks would be done in the host and which should be offloaded. Our 1989 design - a couple of Intel CPUs, some big FPGAs, expensive optics - bore a striking similarity to the smartNICs of today. And in many ways we still don’t know which tasks should be offloaded, which is why we continue to see CPUs and FPGAs on NICs - although some tasks like TCP header processing & tunneling for network virtualization are now well established as offloadable. This talk will examine the long-lived tradeoff between keeping network functions close to the application (in the host) and offloading them to the NIC in the hope of better performance, and consider some of the implications of recent announcements of running a full hypervisor on the smartNIC.
Gordon Brebner (Xilinx), Towards an Open P4 Programmable Hardware Platform
Abstract: The P4 community has a well-established open software platform, the BMv2 reference software switch, which has been used by many researchers. To date, P4 hardware platforms have typically been less open, and more proprietary, often making publication and replication of results less easy. One initiative in the direction of greater hardware openness has been the P4→NetFPGA workflow, which allows a P4-programmed pipeline to be inserted into the open reference switch design developed by the NetFPGA community. This allows P4 experiments in FPGA hardware at line rate. In this talk, we introduce a new initiative to build upon the P4→NetFPGA experience, working towards a new ecosystem around ‘Next-Generation NetFPGA’ as the open P4 programmable hardware platform for both NIC and switch research, complementing the BMv2 software platform.
David Sidler (Microsoft), StRoM: Smart Remote Memory
Abstract: Remote Direct Memory Access (RDMA) is being widely adopted due to its high bandwidth and low latency. One tradeoff is the limited expressiveness of RDMA verbs, especially one-sided ones. In fact, it is still a non-trivial task to implement a system that can fully leverage the benefits of RDMA.
StRoM (Smart Remote Memory) is a programmable RDMA NIC that can be extended with one-sided operations by offloading application-level kernels. The offloaded kernels sit on the data path between the network and host memory; as such, they can perform memory access operations, such as traversing remote data structures, as well as data processing on incoming or outgoing RDMA data streams. In essence, StRoM provides the capability of one-sided RPCs that are executed directly on the remote NIC.
This talk will introduce StRoM, showcase operations that can be offloaded, and discuss some of the shortcomings and future challenges.
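A toy model of the StRoM idea: an offloaded kernel on the NIC traverses a remote data structure in host memory, so one one-sided RPC replaces a chain of dependent round trips. The memory layout, function names, and "hops" accounting are hypothetical illustrations, not StRoM's actual interface.

```python
# "Host memory" as a pointer-linked structure: node_id -> (value, next_id).
host_memory = {
    0: ("a", 1),
    1: ("b", 2),
    2: ("c", None),
}

def nic_kernel_traverse(memory, head, key):
    """Runs on the NIC's data path: follows next-pointers locally,
    so the client pays one RPC instead of one round trip per hop."""
    node = head
    hops = 0
    while node is not None:
        value, nxt = memory[node]
        hops += 1
        if value == key:
            return value, hops
        node = nxt
    return None, hops

# A single one-sided RPC from the client triggers the whole traversal.
value, hops = nic_kernel_traverse(host_memory, head=0, key="c")
```

With plain one-sided RDMA, the same lookup would cost one network round trip per pointer dereference; here the three hops happen NIC-side and only the result crosses the network.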
Ming Liu (VMware Research/Univ. of Wisconsin), Offloading Distributed Applications onto SmartNICs using iPipe
Abstract: Emerging multicore SoC SmartNICs, enclosing rich computing resources (e.g., a multicore processor, onboard DRAM, accelerators, programmable DMA engines), hold the potential to offload generic datacenter server tasks. However, it is unclear how to use a SmartNIC efficiently and maximize the offloading benefits, especially for distributed applications. Towards this end, we characterize four commodity SmartNICs and summarize the offloading performance implications from four perspectives: traffic control, computing capability, onboard memory, and host communication. Based on our characterization, we build iPipe, an actor-based framework for offloading distributed applications onto SmartNICs. At the core of iPipe is a hybrid scheduler, combining FCFS and DRR-based processor sharing, which can tolerate tasks with variable execution costs and maximize NIC compute utilization. Using iPipe, we build a real-time data analytics engine, a distributed transaction system, and a replicated key-value store, and evaluate them on commodity SmartNICs. Our evaluations show that when processing 10/25Gbps of application bandwidth, NIC-side offloading can save up to 3.1/2.2 beefy Intel cores and lower application latencies by 23.0/28.0 μs.
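The hybrid scheduling idea, FCFS for short predictable tasks and deficit round-robin (DRR) for tasks with variable cost, can be sketched as follows. The cost threshold, quantum, and task representation are illustrative assumptions, not iPipe's actual parameters.

```python
from collections import deque

class HybridScheduler:
    """Toy FCFS + DRR hybrid: cheap tasks run first-come-first-served;
    expensive tasks share the core via deficit round-robin so no single
    task can monopolize a NIC core."""
    def __init__(self, cost_threshold=10, quantum=20):
        self.fcfs = deque()   # (name, est_cost)
        self.drr = deque()    # [name, est_cost, accrued_deficit]
        self.cost_threshold = cost_threshold
        self.quantum = quantum

    def submit(self, name, est_cost):
        if est_cost <= self.cost_threshold:
            self.fcfs.append((name, est_cost))
        else:
            self.drr.append([name, est_cost, 0])

    def run(self):
        """Drain both queues; return completion order."""
        order = []
        while self.fcfs:                 # short tasks: strict arrival order
            name, _ = self.fcfs.popleft()
            order.append(name)
        while self.drr:                  # long tasks: one quantum per round
            entry = self.drr.popleft()
            entry[2] += self.quantum
            if entry[2] >= entry[1]:     # enough accrued deficit to finish
                order.append(entry[0])
            else:
                self.drr.append(entry)   # re-queue, keeping the deficit
        return order
```

Note how DRR makes completion order track cost rather than arrival order: a cheaper task can finish before an earlier-arriving expensive one, which is how the scheduler tolerates variable execution costs.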
Marco Canini (KAUST), In-Network Computation is a Dumb Idea Whose Time Has Come
Abstract: Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.
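The core primitive the abstract describes, aggregating worker updates inside the network, amounts to an element-wise sum performed at the switch so each worker exchanges one vector instead of communicating with every peer. The function and data below are a minimal illustration (integer "gradients" for clarity), not the actual switch dataplane program.

```python
def switch_aggregate(updates):
    """Element-wise sum of all workers' update vectors, done 'in the switch';
    the single aggregate is then multicast back to every worker."""
    n = len(updates[0])
    assert all(len(u) == n for u in updates)
    return [sum(u[i] for u in updates) for i in range(n)]

# Three workers each contribute a model-update vector.
worker_updates = [
    [1, 2, 3],
    [4, 5, 6],
    [2, 1, 0],
]
aggregate = switch_aggregate(worker_updates)
# Each worker sends and receives one vector, instead of N-1 peer exchanges.
```

This is where the bandwidth reduction comes from: the volume on each worker's link is independent of the number of workers, because the summation happens on the path rather than at the end hosts.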
Sid Karkare (Marvell), OCTEON DPU for LiquidIO SmartNICs
Abstract: We introduce the OCTEON DPU and software platform. The LiquidIO SmartNIC product line has shipped over one million units. The latest solution, LiquidIO III, is a SmartNIC platform that incorporates Marvell's widely deployed OCTEON TX2 DPU with up to 36 Arm® v8-based cores, 5x 100G network connectivity, up to 2 PCI Express Gen 4 x16 host interfaces, and 6 channels of DDR4-3200 controllers. Leveraging the dedicated OCTEON hardware blocks with open platform software APIs, this solution has the unique ability to offload and accelerate crypto operations, packet processing, security protocols, virtual switching, traffic management, and tunneling operations. Marvell's DPDK networking suite supports performance-optimized solutions for crypto, IPsec, TLS, network traffic management, and packet processing.
Mothy Roscoe (ETH Zurich), The Enzian Research Computer
Abstract: As a research computer, Enzian is designed for computer systems software research and deliberately over-engineered. It is not designed for machine learning, databases, computer vision, or bitcoin mining; rather, it is optimized for exploring the design space of custom hardware/software co-design. Enzian is a cache-coherent, 2-node asymmetric NUMA system in which one node is a 48-core server-class CPU and the other is a large FPGA. It has a maximum of 640 GiB of DDR4 RAM and up to 480 Gb/s of network bandwidth, both split between the two nodes. Please see our website for more info.
Andrew Gospodarek (Broadcom), Lionel Pelamourgues (Firebird), Project EOS: Broadcom - Firebird collaboration to offload Envoy Service Mesh Proxy to a Linux SmartNIC
Abstract: Stingray is a state-of-the-art SmartNIC that incorporates a Linux-capable CPU with performance and memory capacity approaching mainstream x86 servers, with better power efficiency. While SmartNICs are typically viewed as function accelerators, Stingray makes it feasible to port networking and security functions from host servers using the same source code. In addition to higher efficiency, offloading enables isolation from the host for role-based management and fault containment.
Envoy is a popular service mesh proxy for microservices architecture distributed applications. Envoy is undergoing rapid evolution and features an expanding set of functions including distributed tracing, service discovery, load balancing, authentication & authorization, traffic shifting, rate limiting, circuit breaking and observability. It also consumes a lot of server resources, making it a candidate to offload. We will discuss our experience in porting it to a Stingray SmartNIC, comparing the performance with a datacenter-class server.
Please note: the University of Washington Cloud Workshop 2021, with a focus on AI Implementations and Applications (ML architecture, systems, and programming environments), is scheduled for February 26, 2021.