The IAP Stanford Workshop on the Future of Cloud Computing was organized by Prof. Christos Kozyrakis and Heiner Litz. It was held on Friday, October 28, 2016, in the Oak West Room of Tresidder Memorial Union on the Stanford campus.
Agenda - Presentation Videos
8:30-9:00AM Ana Klimovic, Stanford, “ReFlex: Remote Flash ≈ Local Flash”
9:00-9:30AM Dr. Shubu Mukherjee, Cavium, “Asim: The Cavium Super Model”
9:30-10:00AM Prof. Sachin Katti, Stanford, “Flashield: Shielding an SSD Cache from Evictions”
10:00-10:30AM Dr. Amin Vahdat, Google, "Cloud 3.0 and Software Defined Networking"
10:30-11:00AM Lightning Round of Student Posters
11:00AM-1:00PM Lunch and Cloud Poster Viewing
1:00-1:30PM Dr. Pankaj Mehra, SanDisk, "Evolutionary Changes with Revolutionary Implications: Persistent Memory in the DC"
1:30-1:50PM Dr. Qi Huang, Facebook, "A Streaming Video Engine for Distributed Encoding at Scale"
1:50-2:10PM Prof. José Martínez, Cornell, "Fine-grain Management of Last-level Caches with Minimal Hardware Support"
2:10-2:30PM Prof. Phil Levis, Stanford, “Securing the Internet of Things”
2:30-3:00PM Dr. Derek Chiou, Microsoft, “Microsoft’s Production Configurable Cloud”
3:00-3:20PM Break - Refreshments and Poster Viewing
3:20-3:50PM Jan Medved, Cisco, "Overcoming Performance Challenges for a Massive Distributed Data Store"
3:50-4:10PM Song Han, Stanford, "Deep Compression and Efficient Inference Engine"
4:10-4:30PM Prof. Timothy Roscoe, ETH Zurich, "Barrelfish: An OS for Real, Modern Hardware"
4:30-5:00PM Prof. Peter Bailis, Stanford, "MacroBase: Prioritizing Attention in Fast Data"
5:00-6:00PM Reception - Refreshments and Poster Awards
Abstracts and Bios
Ana Klimovic, Stanford, "ReFlex: Remote Flash ≈ Local Flash"
Abstract: Remote access to NVMe Flash enables flexible scaling and high utilization of Flash storage capacity and throughput within a datacenter. However, existing systems for remote Flash either introduce significant performance overheads or fail to isolate multiple remote clients sharing each Flash device. We present ReFlex: a software-based system for remote Flash access that provides nearly identical performance to accessing local Flash. ReFlex uses a dataplane kernel to closely integrate networking and storage processing to achieve low latency and high throughput with low resource requirements. ReFlex also uses a novel I/O scheduler to enforce tail latency and throughput service-level objectives (SLOs) for thousands of remote clients. ReFlex allows applications to use remote Flash while maintaining the performance they would achieve with local Flash.
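To make the SLO enforcement concrete, below is a minimal Python sketch of a per-tenant, cost-based admission check in the spirit of ReFlex's I/O scheduler. The token rates, the 10:1 write-to-read cost ratio, and all names are illustrative assumptions, not ReFlex's actual implementation.

import time
from dataclasses import dataclass, field

@dataclass
class Tenant:
    # Per-tenant token bucket; names and defaults are illustrative, not ReFlex's.
    tokens_per_sec: float                     # provisioned share of device throughput
    tokens: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)

    def refill(self, max_burst: float) -> None:
        now = time.monotonic()
        self.tokens = min(max_burst,
                          self.tokens + (now - self.last_refill) * self.tokens_per_sec)
        self.last_refill = now

def admit(tenant: Tenant, is_write: bool,
          write_cost: float = 10.0, read_cost: float = 1.0,
          max_burst: float = 1000.0) -> bool:
    # Writes are charged more than reads, reflecting their higher cost on Flash.
    # Requests beyond a tenant's provisioned rate are rejected here (a real scheduler
    # would queue or defer them), which keeps other tenants within their SLOs.
    tenant.refill(max_burst)
    cost = write_cost if is_write else read_cost
    if tenant.tokens >= cost:
        tenant.tokens -= cost
        return True
    return False

latency_tenant = Tenant(tokens_per_sec=5000.0, tokens=50.0)  # small starting allowance
print(admit(latency_tenant, is_write=False))  # True: a read within the tenant's share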
Bio: Ana Klimovic is a Ph.D. student in the Electrical Engineering Department at Stanford University, advised by Professor Christos Kozyrakis. Her research interests are in computer systems and architecture, with a focus on storage systems and resource management for large-scale datacenters. Ana did her undergraduate studies at the University of Toronto, graduating from the Engineering Science program in 2013. She earned her Master's degree in Electrical Engineering from Stanford in 2015. She has done multiple research internships, most recently at Facebook and Microsoft Research.
Dr. Shubu Mukherjee, Cavium, "Asim: The Cavium Super Model"
Abstract: Simulation models underlie most high-performance architectural designs in the industry. In this talk, I will discuss Cavium’s simulation model infrastructure called Asim. Asim provides a framework for development of fast, accurate, modular, and extensible functional and timing models. Asim’s novelty lies in its ability to boot Linux on a 48-core simulated ThunderX platform and run real applications, such as Apache, to collect detailed performance statistics from live web requests. Asim – Cavium’s Super Model – is helping us design multiple generations of ThunderX and other Cavium architectures.
Bio: Shubu Mukherjee is a Cavium Inc. Distinguished Engineer and the lead architect for Cavium’s ThunderXn ARM cores. Shubu is the winner of the ACM SIGARCH Maurice Wilkes Award, a Fellow of the ACM, a Fellow of the IEEE, and the author of the book “Architecture Design for Soft Errors.” Shubu holds more than 40 patents and has written more than 50 technical papers in top architecture conferences and journals. Before joining Cavium in May 2010, Shubu worked at Intel for 9 years and Compaq for 3 years. He received his MS and PhD from the University of Wisconsin-Madison and his B.Tech. from the Indian Institute of Technology, Kanpur.
Prof. Timothy Roscoe, ETH Zurich, "Barrelfish: An OS for real, modern hardware"
Abstract: Modern computer hardware is radically different from the machines for which OSes like Linux and Windows were architected. Machines today have large numbers of heterogeneous cores, different types of memory distributed throughout the system, and complex interconnects, address spaces, memory hierarchies, network interfaces, etc. Unsurprisingly, Unix-like architectures struggle on platforms like this. I'll talk about Barrelfish - a radically different open-source research OS built at ETH Zurich and elsewhere which targets the interrelated challenges of scaling, dark silicon, system diversity, and the sheer complexity of modern hardware. Barrelfish is a multikernel: it is structured as a distributed system even on a single machine and is agnostic with regard to shared memory, cache coherence, and core heterogeneity. It further adopts a variety of techniques to handle hardware complexity not found in traditional OSes. In the course of building Barrelfish over the last 8 years, our group has started to develop a more solidly principled approach to describing hardware and system software than is used today in both open-source and commercial OS designs. We've also reached some conclusions about the hardware platforms we'd like to see built. I'll briefly touch on both of these other areas in my talk.
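As a rough illustration of the multikernel idea (per-core kernels that share no state and coordinate only through explicit messages), here is a toy Python sketch. The processes, queues, and message strings are invented for illustration; they are not Barrelfish's actual CPU drivers or message-passing interface.

import multiprocessing as mp

def per_core_kernel(core_id, inbox):
    # One "kernel" per core: no shared state; all coordination arrives as messages.
    while True:
        msg = inbox.get()
        if msg == "shutdown":
            break
        print(f"core {core_id} handled: {msg}")

if __name__ == "__main__":
    inboxes = [mp.Queue() for _ in range(4)]
    kernels = [mp.Process(target=per_core_kernel, args=(i, q))
               for i, q in enumerate(inboxes)]
    for k in kernels:
        k.start()
    # A cross-core request is a message, not a shared-memory operation.
    inboxes[2].put("map region 0x1000 on behalf of core 0")
    for q in inboxes:
        q.put("shutdown")
    for k in kernels:
        k.join()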
Bio: Timothy Roscoe is a Full Professor in the Systems Group of the Computer Science Department at ETH Zurich, the Swiss Federal Institute of Technology. He received a PhD from the Computer Laboratory of the University of Cambridge, where he was a principal designer and builder of the Nemesis operating system, as well as working on the Wanda microkernel and Pandora multimedia system. After three years working on web-based collaboration systems at a startup company in North Carolina, Mothy joined Sprint's Advanced Technology Lab in Burlingame, California, working on cloud computing and network monitoring. He then joined Intel Research at Berkeley in April 2002 as a principal architect of PlanetLab, an open, shared platform for developing and deploying planetary-scale services. In September 2006 he spent four months as a visiting researcher in the Embedded and Real-Time Operating Systems group at National ICT Australia in Sydney, before joining ETH Zurich in January 2007. His current research interests include network architecture and the Barrelfish multicore research operating system. He was recently elected a Fellow of the ACM for contributions to operating systems and networking research.
Dr. Amin Vahdat, Google, “Cloud 3.0 and Software Defined Networking”
Abstract: Networking ties together storage, distributed computing and security in the Cloud. However, the demands of distributed processing, along with exponentially increasing data and storage, are forcing a re-think of networking technologies. Today's realization of Moore's Law is also shifting the industry away from the single-server computing model. We will discuss a path where the network itself becomes the engine for performance gains, and how this will support the next wave of advances in compute infrastructure.
Bio: Amin Vahdat is a Google Fellow and Technical Lead for networking at Google. He has contributed to Google’s data center, wide area, edge/CDN, and cloud networking infrastructure, with a particular focus on driving vertical integration across large-scale compute, networking, and storage. In the past, he was the SAIC Professor of Computer Science and Engineering at UC San Diego and the Director of UCSD’s Center for Networked Systems. Vahdat received his PhD from UC Berkeley in Computer Science, is an ACM Fellow and a past recipient of the NSF CAREER award, the Alfred P. Sloan Fellowship, and the Duke University David and Janet Vaughn Teaching Award.
Dr. Derek Chiou, Microsoft, “Microsoft's Production Configurable Cloud”
Abstract: Microsoft has been building data centers with programmable hardware in every server, creating a novel Configurable Cloud. The reconfigurable hardware, in the form of a field programmable gate array (FPGA), is cabled between the NIC and the data center network, as well as being attached to the CPUs via PCIe. This architecture enables an FPGA-centric, rather than CPU-centric, computational model: all communication in and out of the server is first processed by the FPGA, which handles common tasks without CPU involvement and passes uncommon, complex tasks to the CPU, which acts as a "complexity" offload engine. Microsoft has deployed a diverse set of applications, including deep neural networks and software defined networking acceleration, across its Configurable Cloud. I will describe the Cloud, some of its applications, and their performance.
Bio: Derek Chiou is a Partner Architect at Microsoft, where he leads the Azure Cloud Silicon team responsible for FPGAs and ASICs in Microsoft's data centers, and an adjunct professor in the Electrical and Computer Engineering Department at The University of Texas at Austin. Before Microsoft/UT, Dr. Chiou was a system architect and led the performance modeling team at Avici Systems, a manufacturer of terabit core routers.
Dr. Pankaj Mehra, SanDisk/Western Digital, “Evolutionary Changes with Revolutionary Implications: Persistent Memory in the Data Center”
Abstract: With byte-grain persistent memory once again imminent, it helps to consider the history and development of this idea over the past decade and a half. We look at developments at the memory, controller, and filesystem levels, but it is at the database and application level that the promise of this idea lies. I will show certain critical trends that clearly define the nature of the inflection we are about to witness and then outline the implications across a range of workloads and segments of the broader enterprise / data center ecosystem. The speculative portion of the talk will consider the possibility of memory disaggregating across future fabrics.
Bio: Pankaj has over 20 years of technical experience in developing and architecting scalable, intelligent information systems and services. At Western Digital, he is a VP and Senior Fellow who works closely with customers to build accelerated solutions for data centers and applications, and he continues to shape and evangelize memory and storage technologies.
Prior to joining Western Digital and SanDisk through acquisitions, Pankaj was SVP and Chief Technology Officer at Fusion-io, where he was named a top 50 CTO by ExecRank. He has also worked at Hewlett Packard, Compaq, and Tandem, and held academic and visiting positions at IBM, IIT Delhi, and UC Santa Cruz. He founded HP Labs Russia (2006), Whodini, Inc. (2010), and IntelliFabric, Inc. (2001), and is a contributing author to InfiniBand 1.0. He has held TPC-C and Terabyte Sort performance records, and his work has been recognized in awards from NASA and Sandia National Labs, among others. Pankaj was appointed Distinguished Technologist at Hewlett-Packard in 2004.
Pankaj received his Ph.D. in Computer Science from The University of Illinois at Urbana-Champaign.
Dr. Qi Huang, Facebook, “A Streaming Video Engine for Distributed Encoding at Scale”
Abstract: Videos are an increasingly utilized part of the experience of the millions of people that use Facebook. These videos must be uploaded and processed before they can be posted and downloaded. Uploading and processing videos at our scale, and across our many use cases, brings three key challenges. First, we want to provide low latency to support interactive applications. Second, we want to provide a flexible programming model for application developers that is simple to program, enables efficient processing, and improves reliability. Third, we want to be robust to faults and overload.
This talk describes the evolution from our initial monolithic encoding script (MES) system to our current Streaming Video Engine (SVE) that overcomes each of the challenges. SVE provides low latency by increasing end-to-end parallelism. SVE provides a flexible programming model, with granular tasks that are organized as a directed acyclic graph, operating over tracks within videos. SVE achieves robustness through redundancy and a set of escalating mitigations for overload. SVE has been in production since the fall of 2015, provides lower latency than MES, supports many diverse video applications, and has proven to be reliable despite faults and overload.
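As a loose illustration of that programming model, here is a toy Python sketch of a DAG that splits an upload into tracks, encodes segments in parallel, and muxes the results. The task names, track layout, and segment counts are invented for illustration and are far simpler than SVE's production pipeline.

from concurrent.futures import ThreadPoolExecutor

# All task and track names below are invented for illustration.
def split_into_tracks(upload):
    return {"video": upload + ".v", "audio": upload + ".a"}

def encode_segment(segment):
    return segment + ".enc"

def mux(encoded_segments):
    return "+".join(encoded_segments)

def process_upload(upload, segments_per_track=4):
    # Toy DAG: split -> encode each track segment in parallel -> mux.
    # Parallelism across segments is what lets a streaming pipeline start encoding
    # before the whole upload has arrived, cutting end-to-end latency.
    tracks = split_into_tracks(upload)
    segments = [f"{t}:{i}" for t in tracks.values() for i in range(segments_per_track)]
    with ThreadPoolExecutor() as pool:
        encoded = list(pool.map(encode_segment, segments))
    return mux(encoded)

print(process_upload("cat_video"))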
Bio: Qi Huang is a Research Scientist on the Infrastructure team at Facebook. His primary focus is analyzing and building large-scale distributed systems that serve media content efficiently, including processing, storage, and delivery. Prior to joining Facebook full-time, Qi finished his PhD at Cornell and was a Facebook Fellowship recipient.
Prof. Phil Levis, Stanford, "Securing the Internet of Things"
Abstract: Embedded, networked sensors and actuators are everywhere. They are in engines, monitoring combustion and performance. They are in our shoes and on our wrists, helping us exercise enough and measuring our sleep. They are in our phones, our homes, hospitals, offices, ovens, planes, trains, and automobiles. Their streams of data will improve industry, energy consumption, agriculture, business, and our health. Software processes these streams to provide real-time analytics, insights, and notifications, as well as control and actuate the physical world. The emerging Internet of Things has tremendous potential, but also tremendous dangers. Internet threats today steal credit cards. Internet threats tomorrow will disable home security systems, flood fields, and disrupt hospitals.
The Secure Internet of Things Project (SITP) is a 5-year collaboration between Stanford, UC Berkeley, and the University of Michigan. Its goal is to rethink how we design, implement and test the Internet of Things so that it is secure and dependable. I'll give an overview of the project, its research goals, and its participants. I'll talk about two research efforts in the project: Tock, a secure embedded operating system, and a recent ultra-low-bandwidth water sensing network deployed in Stanford dorms.
Bio: Philip Levis is an Associate Professor in the computer science and electrical engineering departments. He’s published some papers and won some awards. He likes his students a lot and so tries to buy them snacks very often. He loves great engineering and has a self-destructive aversion to low-hanging fruit.
Song Han, Stanford, “Deep Compression and EIE: deep neural network model compression and hardware acceleration”
Abstract: Deep neural networks have evolved to be the state-of-the-art technique for machine learning tasks ranging from computer vision and speech recognition to natural language processing. However, deep learning algorithms are both computationally intensive and memory intensive, making them power hungry and increasing the TCO of a datacenter. Accessing memory is more than two orders of magnitude more energy consuming than ALU operations, so it is critical to reduce memory references. To address this problem, this talk first introduces "Deep Compression", which can compress deep neural networks by 10x-49x without loss of prediction accuracy. The talk will then discuss EIE, the "Efficient Inference Engine", which works directly on the deep-compressed DNN model and accelerates inference by taking advantage of weight sparsity, activation sparsity and weight sharing, and is 13x faster and 3000x more energy efficient than a TitanX GPU. Finally, the talk will introduce ESE, the "Efficient Speech Recognition Engine", which is the FPGA prototype of EIE. Implemented on a Xilinx XCKU060 FPGA, ESE delivers 266 GOPS working directly on a compressed LSTM network, a 4.2x speedup over the uncompressed model.
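For readers unfamiliar with the two core steps, the following Python/NumPy sketch shows magnitude pruning plus a simplified form of weight sharing, with uniform centroids standing in for the k-means clustering used in Deep Compression; the sparsity level and bit width are illustrative assumptions.

import numpy as np

def prune(weights, sparsity=0.9):
    # Magnitude pruning: zero out the smallest-magnitude weights until `sparsity` are gone.
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def share_weights(weights, bits=4):
    # Weight sharing: quantize surviving weights onto 2**bits shared values
    # (uniform centroids here; Deep Compression clusters them with k-means).
    nonzero = weights[weights != 0]
    if nonzero.size == 0:
        return weights
    centroids = np.linspace(nonzero.min(), nonzero.max(), 2 ** bits)
    nearest = np.abs(weights[..., None] - centroids).argmin(axis=-1)
    return np.where(weights != 0, centroids[nearest], 0.0)

w = np.random.randn(256, 256).astype(np.float32)
w_compressed = share_weights(prune(w))
print(f"sparsity: {np.mean(w_compressed == 0):.2f}, "
      f"distinct non-zero values: {np.unique(w_compressed[w_compressed != 0]).size}")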
Bio: Song Han is a 5th-year PhD student working with Prof. Bill Dally at Stanford University. His research interest is the intersection of deep learning and computer architecture. Currently he is improving the efficiency of neural networks by changing both the deep learning algorithms and the hardware. He worked on network pruning, which removed 70%-90% of the redundant weights in convolutional neural networks; he then proposed Deep Compression, which can compress state-of-the-art CNNs by 10x-49x. He compressed SqueezeNet to only 470KB, which fits fully in on-chip SRAM. He then designed the EIE accelerator, an ASIC that takes advantage of the compressed model's weight sparsity, activation sparsity, and weight sharing, and is 13x faster and 3000x more energy efficient than a TitanX GPU. Later he led the design of ESE, an FPGA prototype of EIE. Apart from compression and acceleration, he proposed DSD training, which regularizes neural networks and improves prediction accuracy. His work has been covered by TheNextPlatform, TechEmergence, Embedded Vision and O'Reilly. His work on Deep Compression won the Best Paper Award at ICLR'16. Before joining Stanford, Song Han graduated from Tsinghua University.
Prof. José Martínez, Cornell, “Fine-grain Management of Last-level Caches in Multicores with Minimal Hardware Support”
Abstract: Performance isolation is an important goal in server-class environments, and partitioning the last-level cache of a chip multiprocessor (CMP) across co-running applications has proven useful in this regard. Two popular approaches are (a) hardware support for way partitioning, or (b) operating system support for set partitioning, through page coloring. Unfortunately, neither approach by itself scales beyond a handful of cores without incurring significant performance overheads.
I will present SWAP, a scalable and fine-grained cache management technique that seamlessly combines set and way partitioning. By cooperatively managing cache ways and sets, SWAP ("Set and WAy Partitioning") can successfully provide hundreds of fine-grained cache partitions for the manycore era.
SWAP requires no additional hardware beyond way partitioning--in fact, SWAP can be readily implemented in existing commercial servers whose processors already support hardware way partitioning. I will describe how we prototyped SWAP on a 48-core Cavium ThunderX server running Linux, and obtained average speedups over no cache partitioning that are twice as large as those attained with ThunderX's native hardware way partitioning alone.
This work was sponsored in part by Cavium; it will be presented at the upcoming HPCA conference in Austin, TX (http://hpca2017.org/).
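The toy Python allocator below illustrates the combinatorial benefit of pairing page colors (sets) with ways; the application names, share ratios, and contiguous-block policy are invented for illustration and are not SWAP's actual allocation algorithm.

def swap_partitions(n_ways, n_colors, apps, shares):
    # Toy allocator in the spirit of SWAP: carve the LLC into (color, way) cells and
    # give each application a block of cells proportional to its share.
    # Way partitioning alone yields at most n_ways partitions; pairing it with page
    # coloring (sets) yields up to n_ways * n_colors.
    cells = [(c, w) for c in range(n_colors) for w in range(n_ways)]
    total = sum(shares)
    alloc, start = {}, 0
    for app, share in zip(apps, shares):
        count = max(1, len(cells) * share // total)
        alloc[app] = cells[start:start + count]
        start += count
    return alloc

# Example: 16 ways x 64 page colors = up to 1024 fine-grained partitions.
sizes = {app: len(cells) for app, cells in
         swap_partitions(16, 64, ["db", "web", "batch"], [2, 1, 1]).items()}
print(sizes)  # {'db': 512, 'web': 256, 'batch': 256}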
Bio: José Martínez is a faculty member in the Computer Systems Laboratory at Cornell University. His research work has earned several awards, among them: two IEEE Micro Top Picks papers; an HPCA Best Paper Award, and Best Paper Nominations at HPCA and MICRO; an NSF CAREER Award; two IBM Faculty Awards; and the inaugural Computer Science Distinguished Educator Alumnus Award from the University of Illinois. On the teaching side, he has been recognized with two College of Engineering Kenneth A. Goldman '71 Excellence in Teaching awards, a Ruth and Joel Spira Teaching Excellence award, twice as a Merrill Presidential Teacher, and as the 2011 Tau Beta Pi Professor of the Year in the College of Engineering.
Prof. Martínez holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He is the Editor in Chief of IEEE Computer Architecture Letters, and senior member of the ACM and the IEEE. He also serves on the Advisory Board of the Industry-Academia Partnership for architecture, networking, and storage needs of future data centers and cloud computing (IAP).
Jan Medved, Cisco, "Overcoming Performance Challenges for a Massive Distributed Data Store: Network and IoT Data Management using the OpenDaylight SDN Controller"
Abstract: The OpenDaylight SDN controller (ODL) uses the YANG modeling language to describe both its external interfaces and its internal data. YANG-described data can either be stored in ODL’s data store or passed around the system in RPCs or Notifications. This talk will describe the architecture and implementation of ODL’s high-performance, high-scale data store, which was designed specifically for YANG-formatted data.
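As a toy illustration of what a path-addressable store for hierarchical, YANG-modeled data looks like, here is a minimal Python sketch; the class, methods, and example paths are invented and are not the ODL data store's actual API.

class TreeStore:
    # Toy path-addressable tree store for hierarchical data; illustration only.
    def __init__(self):
        self.root = {}

    def write(self, path, value):
        node = self.root
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value

    def read(self, path):
        node = self.root
        for key in path:
            node = node[key]
        return node

store = TreeStore()
store.write(["network-topology", "topology", "flow:1", "node", "openflow:1"],
            {"connected": True})
print(store.read(["network-topology", "topology", "flow:1"]))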
Bio: Jan Medved is a Distinguished Engineer with Cisco Systems, where he works on SDN-related projects. He has been involved in OpenDaylight since its inception and leads the ODL development team at Cisco. Jan has designed multiple OpenDaylight applications and has worked extensively on performance and scale optimizations of ODL infrastructure. Jan holds a Dipl.-Ing. degree from the Technical University of Ilmenau, Germany, and an MASc from the University of Toronto, Canada.
Prof. Sachin Katti, Stanford, “Flashield: Shielding an SSD Cache from Evictions”
Abstract: As its price per bit drops, SSD is increasingly becoming the default storage medium for cloud application databases. However, it has not become the preferred storage medium for key-value caches, even though it offers a much lower price per bit and a sufficiently low latency compared to DRAM. This is because key-value caches need to frequently insert, update and evict small objects. This causes excessive writes and erasures on flash storage, since flash only supports writes and erasures of large chunks of data. These excessive writes and erasures significantly shorten the lifetime of flash, rendering it impractical to use for key-value caches. We present Flashield, an SSD key-value cache, which minimizes the number of writes to the SSD by using DRAM as a filter for objects that are not ideal for SSD. In order to minimize SSD writes, Flashield relies on lightweight machine learning profiling to predict which objects are the likeliest to be read frequently and not be updated in the future, and are therefore prime candidates to be stored on SSD. In order to efficiently utilize the cache's available memory, we design a novel in-memory index for the variable-sized objects stored on flash that requires only 5 bytes per object of DRAM. We implemented Flashield and demonstrate that it incurs a write amplification significantly under 1×, without any overhead in terms of hit rate or throughput, compared to current systems, which can only achieve a write amplification of 3× or more.
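A highly simplified Python sketch of the DRAM-as-filter idea follows; the read-count heuristic is a crude stand-in for Flashield's learned predictor, and the capacities and thresholds are illustrative assumptions.

from collections import OrderedDict

class DramFilteredCache:
    # Toy two-tier cache: DRAM acts as a filter in front of the SSD.
    def __init__(self, dram_capacity=4, min_reads=2):
        self.dram = OrderedDict()          # key -> (value, reads since last update)
        self.ssd = {}
        self.dram_capacity = dram_capacity
        self.min_reads = min_reads
        self.ssd_writes = 0

    def put(self, key, value):
        if key in self.dram:
            self.dram.pop(key)             # an update resets the object's read history
        self.dram[key] = (value, 0)
        if len(self.dram) > self.dram_capacity:
            old_key, (old_value, reads) = self.dram.popitem(last=False)
            if reads >= self.min_reads:    # only proven read-heavy objects earn an SSD write
                self.ssd[old_key] = old_value
                self.ssd_writes += 1

    def get(self, key):
        if key in self.dram:
            value, reads = self.dram.pop(key)
            self.dram[key] = (value, reads + 1)
            return value
        return self.ssd.get(key)

cache = DramFilteredCache()
cache.put("a", 1); cache.get("a"); cache.get("a")   # "a" is read twice in DRAM
for k in "bcde":
    cache.put(k, 0)                                  # evicting "a" writes it to the SSD
print(cache.ssd_writes, cache.get("a"))              # 1 1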
Bio: Sachin Katti is an Associate Professor of Electrical Engineering and Computer Science at Stanford University. His research is on designing novel networked systems with a focus on leveraging ideas from coding theory and statistics and applying it to practical problems. He is also the co-founder of Kumu Networks and Uhana that have commercialized his research in mobile networking.
Prof. Peter Bailis, Stanford, “MacroBase: Prioritizing Attention in Fast Data”
Abstract: As data volumes continue to grow, human attention remains limited. How can data infrastructure help? We are developing MacroBase, a new data analytics engine designed to prioritize attention in fast data streams. MacroBase identifies deviations within streams and generates potential explanations that help contextualize and summarize relevant behaviors. As the first engine to combine streaming outlier detection and streaming explanation operators, MacroBase exploits cross-layer optimizations that deliver order-of-magnitude speedups over existing alternatives while allowing flexible operation across domains including sensor, video, and relational data via extensible feature transform operators. As a result, MacroBase can deliver accurate results at speeds of up to 2M events per second per query on a single core and has begun to deliver meaningful results in production, including at an IoT deployment monitoring hundreds of thousands of vehicles.
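To make the detection half concrete, here is a small Python sketch that flags points far from an exponentially weighted running mean and variance; the detector and its parameters are illustrative and are not MacroBase's actual operators.

import math

class StreamingOutlierDetector:
    # Toy streaming detector: flag points far from an exponentially weighted running
    # mean/variance. MacroBase pairs detection like this with streaming explanation
    # operators; only the detection half is sketched here, with illustrative parameters.
    def __init__(self, alpha=0.01, threshold=3.0, warmup=50):
        self.alpha = alpha
        self.threshold = threshold
        self.warmup = warmup
        self.mean, self.var, self.n = 0.0, 1.0, 0

    def observe(self, x):
        self.n += 1
        z = abs(x - self.mean) / math.sqrt(self.var)
        # Exponentially weighted updates keep the model current as the stream drifts.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return self.n > self.warmup and z > self.threshold

detector = StreamingOutlierDetector()
stream = [10.0] * 500 + [42.0] + [10.0] * 100
print([i for i, x in enumerate(stream) if detector.observe(x)])  # -> [500]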
Bio: Peter Bailis is an assistant professor of Computer Science at Stanford University. Peter's research in the Future Data Systems group (http://futuredata.stanford.edu/) focuses on the design and implementation of next-generation, post-database data-intensive systems. His work spans large-scale data management, distributed protocol design, and architectures for high-volume complex decision support. He is the recipient of an NSF Graduate Research Fellowship, a Berkeley Fellowship for Graduate Study, best-of-conference citations for research appearing in both SIGMOD and VLDB, and the CRA Outstanding Undergraduate Researcher Award. He received a Ph.D. from UC Berkeley in 2015 and an A.B. from Harvard College in 2011, both in Computer Science.