Join our 2020-2021 series of webinars featuring topics in AI.
Thursday, May 13, 2021, 11am-12pm PT
Prof. Manya Ghobadi, MIT
Optimizing AI Systems with Optical Technologies
Pre-registration is required. Please register here.
Abstract: Our society is rapidly becoming reliant on deep neural networks (DNNs). New datasets and models are invented frequently, increasing the memory and computational requirements for training. The explosive growth has created an urgent demand for efficient distributed DNN training systems. In this talk, I will discuss the challenges and opportunities for building next-generation DNN training clusters. In particular, I will propose optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design accelerates the training of popular DNN models using reconfigurable topologies: it partitions the training job across GPUs with hybrid data and model parallelism while ensuring that the communication pattern can be supported efficiently on an optical interconnect. Our results show that compared to similar-cost interconnects, we can improve the training iteration time by up to 5x.
Bio: Manya Ghobadi is an assistant professor in the EECS department at MIT. Before MIT, she was a researcher at Microsoft Research and a software engineer at Google Platforms. Manya is a computer systems researcher with a networking focus and has worked on a broad set of topics, including data center networking, optical networks, transport protocols, and network measurement. Her work has won the best dataset award and best paper award at the ACM Internet Measurement Conference (IMC), as well as a Google research excellent paper award.
Thursday, January 28, 2021, 11am-12pm PT
Prof. Christina Delimitrou, Cornell
Leveraging ML to Handle the Increasing Complexity of the Cloud
Christina has received numerous awards for her research at Stanford and Cornell, most recently the 2020 TCCA Young Computer Architect Award.
Abstract: Cloud services are increasingly adopting new programming models, such as microservices and serverless compute. While these frameworks offer several advantages, such as better modularity and ease of maintenance and deployment, they also introduce new hardware and software challenges.
In this talk, I will briefly discuss the challenges that these new cloud models introduce in hardware and software, and present some of our work on employing ML to improve the cloud’s performance predictability and resource efficiency. I will first discuss Seer, a performance debugging system that identifies root causes of unpredictable performance in multi-tier interactive microservices, and Sage, which improves on Seer by taking a completely unsupervised learning approach to data-driven performance debugging, making it both practical and scalable.
Bio: Christina Delimitrou is an Assistant Professor and the John and Norma Balen Sesquicentennial Faculty Fellow at Cornell University, where she works on computer architecture and computer systems. She specifically focuses on improving the performance predictability and resource efficiency of large-scale cloud infrastructures by revisiting the way these systems are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Scholarship, two Google Research Awards, and a Facebook Faculty Research Award. Her work has also received 4 IEEE Micro Top Picks awards and several best paper awards. Before joining Cornell, Christina received her PhD from Stanford University. She had previously earned an MS, also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at: http://www.csl.cornell.edu/~delimitrou/
Below, Christina presents at the 2018 MIT Cloud Workshop.
Tuesday, September 29, 2020, 11am-12pm PT
Thursday, March 25, 2021, 11am-12pm PT
Prof. Ana Klimovic, ETH Zurich
Ingesting and Processing Data Efficiently for Machine Learning
Abstract: Machine learning applications have sparked the development of specialized software frameworks and hardware accelerators. Yet, in today’s machine learning ecosystem, one important part of the system stack has received far less attention and specialization for ML: how we store and preprocess training data. This talk will describe the key challenges for implementing high-performance ML input data processing pipelines. We analyze millions of ML jobs running in Google's fleet and find that input pipeline performance significantly impacts end-to-end training performance and resource consumption. Our study shows that ingesting and preprocessing data on the fly during training consumes 30% of end-to-end training time, on average. Our characterization of input data pipelines motivates several systems research directions, such as disaggregating input data processing from model training and caching commonly reoccurring input data computation subgraphs. We present the multi-tenant input data processing service that we are building at ETH Zurich, in collaboration with Google, to improve ML training performance and resource usage.
Bio: Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University in 2019. Her dissertation research was on the design and implementation of fast, elastic storage for cloud computing.
Below, Ana receives the Best Poster Award at the 2018 Stanford-UCSC Workshop.
Thursday, November 19, 2020, 11am-12pm PT