The IAP MIT Workshop on the Future of AI and Cloud Computing Applications and Infrastructure was held on Friday, September 29, 2023, at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) in Cambridge, MA.
Venue: Kiva Room in Building 32 (Room 32G-449), Stata Center, 32 Vassar St., MIT, Cambridge, MA
Time: 8:30AM–5PM
This event was co-organized by Professor Christina Delimitrou and the IAP.
Agenda – Videos of Presentations. Please see Abstracts and Speaker Bios below the Agenda.
8:30-8:55 – Badge Pick-up – Coffee/Tea and Breakfast Food/Snacks
8:55-9:00 – Welcome - Prof. Christina Delimitrou, MIT
9:00-9:30 – Dr. Carole-Jean Wu, Meta, "Scaling AI Computing Sustainably"
9:30-10:00 – Prof. Joel Emer, MIT and Nvidia, “Einsums, Fibertrees and Dataflow: Architecture for the Post-Moore Era”
10:00-10:30 – Prof. Vijay Janapa Reddi, Harvard, “Architecture 2.0: Why Architects Need a Data-centric AI Gymnasium”
10:30-11:00 – Dr. Richard Kessler, CTO Security & Advanced Technology, Marvell, “AI, Cloud, and Marvell Semiconductor”
11:00-11:30 – Lightning Round for Student Posters
11:30-12:30 – Lunch and Poster Viewing
12:30-1:00 – Prof. Song Han, MIT, "TinyChat for On-device LLM"
1:00-1:30 – Prof. Manya Ghobadi, MIT, “Next-Generation Optical Networks for Machine Learning Jobs”
1:30-2:00 – Prof. Daniel Sanchez, MIT, "A Hardware and Software Architecture to Accelerate Computation on Encrypted Data"
2:00-2:30 – Break
2:30-3:00 – Sundar Dev, Google, “AI-powered infrastructure for the AI-driven future”
3:00-3:30 – Prof. Christina Delimitrou, MIT, “Designing the Next Generation Cloud Systems: To ML or not to ML”
3:30-4:00 – Dr. Jiaqi Gao, Alibaba, “Towards a 100,000-GPU Machine Learning Infrastructure”
4:00-5:00 – Reception and Best Poster Award
Abstracts and Speaker Bios (listed alphabetically by last name)
Prof. Christina Delimitrou, MIT, “Designing the Next Generation Cloud Systems: To ML or not to ML”
Abstract: Cloud systems are experiencing significant shifts both in their hardware, with an increased adoption of heterogeneity, and their software, with the prevalence of microservices and serverless frameworks. These trends require fundamentally rethinking how the cloud system stack should be designed.
In this talk, I will briefly describe the challenges these hardware and software trends introduce, and discuss how applying machine learning (ML) to hardware design, cluster management, and performance debugging can improve the cloud’s performance, efficiency, predictability, and ease of use, as well as cases where alternative techniques to ML work better. I will first present Sage, a performance debugging system that leverages ML to identify and resolve the root causes of performance issues in cloud microservices. I will then discuss Ursa, an analytically-driven cluster manager for microservices that addresses some of the shortcomings of applying ML to large-scale systems problems.
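As a concrete illustration of the ML-for-performance-debugging idea, here is a minimal, self-contained sketch. This is my own toy example, not Sage's actual design: a classifier is trained on per-microservice latency metrics from labeled incidents, then used to localize the likely root cause of a new QoS violation. The service names and synthetic data are invented for illustration.

```python
# Illustrative sketch (not Sage's actual design): predict the root-cause
# microservice of an end-to-end QoS violation from per-service metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
services = ["frontend", "auth", "cart", "db"]

# Synthetic training data: rows are per-service p99 latencies (ms);
# the label is the index of the service injected with interference.
X, y = [], []
for _ in range(500):
    culprit = rng.integers(len(services))
    lat = rng.normal(10, 1, len(services))
    lat[culprit] += rng.normal(30, 5)   # the culprit's latency inflates
    X.append(lat)
    y.append(culprit)

clf = RandomForestClassifier(n_estimators=50).fit(X, y)

# At runtime: a QoS violation arrives; localize the likely root cause.
sample = np.array([11.0, 9.5, 44.0, 10.2]).reshape(1, -1)
print("likely root cause:", services[clf.predict(sample)[0]])
```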
Bio: Christina Delimitrou is an Associate Professor at MIT, where she works on computer architecture and computer systems. She focuses on improving the performance, predictability, and resource efficiency of large-scale cloud infrastructures by revisiting the way they are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Fellowship, two Google Faculty Research Awards, and a Facebook Faculty Research Award. Her work has also received five IEEE Micro Top Picks awards and several best paper awards. Before joining MIT, Christina was an Assistant Professor at Cornell University. She received her PhD from Stanford University, where she had previously earned an MS, and holds a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at: http://people.csail.mit.edu/delimitrou/
Sundar Dev, Google, “AI-powered infrastructure for the AI-driven future”
Abstract: The field of computer science and engineering is currently experiencing a period of great excitement. We are entering a new era of computing, driven by advances in artificial intelligence and machine learning. These advances have the potential to transform people's lives around the world.
However, this new era also presents challenges. There is an ever-increasing demand for compute, storage, and networking from data-intensive, low-latency, massive-scale online applications and services. At the same time, the underlying hardware layers are reaching the limits of semiconductor physics and are thus unable to keep up with workload demand.
In this talk, I will discuss how Google is tackling the challenges of an AI-driven future by using AI to power our large-scale computing infrastructure. I will cover topics such as:
- The challenges facing large-scale infrastructure providers like Google
- How Google is using AI to improve the efficiency of our data centers
- How we are using AI to automate the management of our infrastructure
- How we are using AI to develop new hardware and software technologies
Bio: Sundar is a performance engineer in the Platforms Infrastructure Engineering Organization at Google. He works on improving the efficiency and increasing the performance of the distributed compute infrastructure that enables Google's user-facing software services, such as Websearch, Gmail, YouTube, Maps, Ads, Workspace, and Google Cloud. His technical interests include computer architecture, distributed and parallel processing systems, hardware/software co-design, and applied machine learning for systems. He joined Google in 2015 after receiving his M.S. in Electrical and Computer Engineering from Georgia Tech.
Prof. Joel Emer, MIT, “Einsums, Fibertrees and Dataflow: Architecture for the Post-Moore Era”
Abstract: Over the past few years, efforts to address the challenges of the end of Moore's Law have led to a significant rise in domain-specific accelerators. Many of these accelerators target tensor algebraic computations, and even more specifically, computations on sparse tensors. To exploit that sparsity, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on sparse accelerators does not systematically express this full range of design features, making it difficult to understand the impact of each design choice and to compare or extend the state of the art.
In an analogous fashion to our prior work that categorized DNN dataflows into patterns like weight stationary and output stationary, this talk will provide a systematic approach to characterizing the range of sparse tensor accelerators. Rather than presenting a single specific combination of a dataflow and concrete data representation, I will present a generalized framework for describing computations, dataflows, the manipulation of sparse tensor operands, and data representation options. The framework's separation of concerns is intended to make designs easier to understand and to facilitate exploration of the wide design space of sparse tensor accelerators. Within this framework, I will present a description of computations using an extension of the Einstein summation, or Einsum, notation, and a format-agnostic abstraction for sparse tensors called fibertrees. Using the fibertree abstraction, one can express a wide variety of concrete data representations, each with its own advantages and disadvantages. Furthermore, by adding a set of operators for activities like traversal and merging of tensors, the fibertree notation can express dataflows independent of the concrete data representation used for the tensor operands. Using this common language, I will show how to describe a variety of previously proposed sparse tensor accelerator designs.
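To make the two abstractions concrete, here is a small sketch in Python (my own illustration, not the talk's actual framework or tools): an Einsum expressing a matrix multiply, a fibertree-style view of a sparse operand as nested coordinate-to-payload maps, and a dataflow that traverses only the nonzero coordinates.

```python
# A small illustration of the two ideas in the talk (my own sketch, not
# the talk's actual framework): an Einsum describing a computation, and a
# fibertree-style, format-agnostic view of a sparse tensor.
import numpy as np

# Einsum: matrix multiply Z[m,n] = sum_k A[m,k] * B[k,n]
A = np.array([[1, 0, 2], [0, 3, 0]])
B = np.array([[1, 1], [0, 2], [4, 0]])
Z = np.einsum("mk,kn->mn", A, B)

# Fibertree view of A: each rank is a fiber mapping coordinates to
# payloads; leaf payloads are values, interior payloads are sub-fibers.
# Only nonzero coordinates appear, independent of any concrete format
# (CSR, COO, ...), which all encode this same abstract tree.
A_fibertree = {
    0: {0: 1, 2: 2},   # row 0: nonzeros at columns 0 and 2
    1: {1: 3},         # row 1: nonzero at column 1
}

# A dataflow expressed as traversal of the fibertree, skipping zeros:
Z2 = np.zeros((2, 2), dtype=int)
for m, row in A_fibertree.items():          # traverse rank M
    for k, a_val in row.items():            # traverse rank K (nonzeros only)
        for n in range(2):
            Z2[m, n] += a_val * B[k, n]
assert (Z == Z2).all()
```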
Bio: For over 45 years, Joel Emer has held various research and advanced development positions investigating processor microarchitecture and developing performance modeling and evaluation techniques. He has made architectural contributions to a number of VAX, Alpha, and x86 processors and is recognized as one of the developers of the widely employed quantitative approach to processor performance evaluation. He is also well known for his contributions to the advancement of deep learning accelerator design, spatial and parallel architectures, processor reliability analysis, cache organization, and simultaneous multithreading. Currently, he is a professor at the Massachusetts Institute of Technology and spends part of his time as a Senior Distinguished Research Scientist in Nvidia's Architecture Research group. Previously, he worked at Intel, where he was an Intel Fellow and Director of Microarchitecture Research. Even earlier, he worked at Compaq and Digital Equipment Corporation. He earned a doctorate in electrical engineering from the University of Illinois in 1979. He received a bachelor's degree with highest honors in electrical engineering in 1974 and his master's degree in 1975, both from Purdue University. Recognition of his contributions includes "Most Influential Paper Awards" for his work on simultaneous multithreading and reliability analysis; six of his papers have been selected as IEEE Micro Top Picks in Computer Architecture, and six have been identified as among the "most significant" of the first 50 years of ISCA. Among his professional honors, he is a Fellow of both the ACM and the IEEE and a member of the NAE. In 2009, he was the recipient of the Eckert-Mauchly Award for lifetime contributions in computer architecture.
Dr. Jiaqi Gao, Alibaba, “Towards a 100,000-GPU Machine Learning Infrastructure”
Abstract: Recent advances in Large Language Models (LLMs) have revolutionized the way people interact with machines and data. Training LLMs has put unprecedented pressure on the infrastructure. In this talk, I will present recent progress in data center infrastructure for LLM training and the new challenges we face in pursuing a 100,000-GPU machine learning infrastructure.
Bio: Jiaqi Gao is a researcher and Senior Engineer at Alibaba, where he works on programmable devices, serverless computing, and large-scale machine learning systems. He received his Ph.D. from Harvard University and his B.S. from Tsinghua University. His work has been published in venues including SIGCOMM, NSDI, SOSP, OSDI, and SIGMOD.
Prof. Manya Ghobadi, MIT, “Next-Generation Optical Networks for Machine Learning Jobs”
Abstract: In this talk, I will explore three elements of designing next-generation machine learning systems: congestion control, network topology, and computation frequency. I will show that fair sharing, the holy grail of congestion control algorithms, is not necessarily desirable for deep neural network training clusters. Then I will introduce a new optical fabric that optimally combines network topology and parallelization strategies for machine learning training clusters. Finally, I will demonstrate the benefits of leveraging photonic computing systems for real-time, energy-efficient inference via analog computing. I will argue that pushing the frontiers of optical networks for machine learning workloads will enable us to fully harness the potential of deep neural networks and achieve improved performance and scalability.
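As a toy illustration of the fair-sharing point (with made-up numbers, not from the talk): because synchronous training stalls until a gradient exchange fully completes, splitting a bottleneck link fairly between two jobs can delay both, while serializing the transfers helps one job without hurting the other.

```python
# A toy illustration (my numbers, not from the talk) of why fair sharing
# can hurt synchronous DNN training: gradient transfers are all-or-nothing,
# so a half-finished transfer gives a training job zero useful progress.
link_gbps = 100
transfer_gb = 50            # per-job gradient exchange per iteration

# Fair sharing: both jobs' transfers get 50 Gbps and finish together.
t_fair = transfer_gb * 8 / (link_gbps / 2)     # seconds, both jobs
print(f"fair sharing: both jobs stall {t_fair:.0f}s per iteration")

# Serializing transfers: job A gets the full link, finishes in half the
# time, and resumes computing while B transfers; B finishes no later.
t_a = transfer_gb * 8 / link_gbps
t_b = 2 * t_a
print(f"serialized: job A stalls {t_a:.0f}s, job B stalls {t_b:.0f}s")
# Average communication stall drops from 8s to 6s, with no job worse off.
```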
Bio: Manya Ghobadi is a faculty member in the EECS department at MIT. Her research spans different areas in computer networks, focusing on optical reconfigurable networks, networks for machine learning, and high-performance cloud infrastructure. Her work has been recognized by the ACM-W Rising Star award, a Sloan Fellowship in Computer Science, the ACM SIGCOMM Rising Star award, an NSF CAREER award, the Optica Simmons Memorial Speakership award, a best paper award at the Machine Learning Systems (MLSys) conference, and the best dataset and best paper awards at the ACM Internet Measurement Conference (IMC). Manya received her Ph.D. from the University of Toronto and spent a few years at Microsoft Research and Google before joining MIT.
Prof. Song Han, MIT, "TinyChat for On-device LLM"
Abstract: Deploying large language models (LLMs) on the edge is in high demand: running copilot services (code completion, office, game chat) locally on laptops, cars, robots, and more. Users get instant responses with better privacy, since the data stays local. Real-time LLM inference is memory-bound. I'll introduce two LLM quantization techniques, SmoothQuant and AWQ (Activation-aware Weight Quantization), which can quantize LLM weights to 4 bits without losing accuracy. They are co-designed with TinyChatEngine, which implements inference for the compressed W4A16 (4-bit weight, 16-bit activation) model, decoding weights from int4 to fp16 at runtime. This is similar to EIE's runtime decode, but uses a linear codebook rather than a K-means codebook to enable faster decoding. TinyChatEngine runs Llama-2-13B on a laptop and on Jetson Orin. It is written in C/C++ from scratch and has no dependencies, making it easy to install and migrate to edge platforms.
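A minimal sketch of the W4A16 idea follows, with assumed details (group size, zero-point handling) rather than the actual AWQ/TinyChatEngine implementation: weights are stored as 4-bit integers plus a per-group linear codebook (scale and offset), and decoded back to fp16 at runtime just before the fp16 matmul.

```python
import numpy as np

def quantize_w4(w, group=128):
    """Quantize fp32 weights to 4-bit ints with a per-group linear codebook."""
    g = w.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / 15.0   # 16 levels in 4 bits
    scale[scale == 0] = 1.0                              # guard constant groups
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), lo.astype(np.float16)

def dequantize_w4(q, scale, lo):
    # Linear codebook: value = q * scale + lo. A multiply-add per weight,
    # cheaper than the per-weight table lookup a K-means codebook requires.
    return q.astype(np.float16) * scale + lo

w = np.random.randn(4096, 128).astype(np.float32)
q, scale, lo = quantize_w4(w)
w16 = dequantize_w4(q, scale, lo).reshape(w.shape)       # decode at runtime
x = np.random.randn(1, 4096).astype(np.float16)
y = x @ w16                                              # fp16 activations: W4A16
print("max reconstruction error:", np.abs(w - w16.astype(np.float32)).max())
```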
Bio: Song Han is an associate professor at MIT EECS. He received his PhD from Stanford University. He proposed the “Deep Compression” technique, including pruning and quantization, that is widely used for efficient AI computing, and the “Efficient Inference Engine” that first brought weight sparsity to modern AI chips. He pioneered TinyML research that brings deep learning to IoT devices, enabling learning on the edge. His team’s work on hardware-aware neural architecture search (once-for-all network) enables users to design, optimize, shrink, and deploy AI models to resource-constrained hardware devices; it won first place in many low-power computer vision contests at flagship AI conferences and appeared on the MIT homepage. Song received best paper awards at ICLR and FPGA, and faculty awards from Amazon, Facebook, NVIDIA, Samsung, and Sony. He was named one of MIT Technology Review’s “35 Innovators Under 35” for his contribution to the “deep compression” technique that “lets powerful artificial intelligence (AI) programs run more efficiently on low-power mobile devices.” Song received the NSF CAREER Award for “efficient algorithms and hardware for accelerated machine learning”, the IEEE “AI’s 10 to Watch: The Future of AI” award, and a Sloan Research Fellowship. He teaches efficientml.ai, an open course on efficient AI computing.
Prof. Vijay Janapa Reddi, Harvard, “Architecture 2.0: Why Architects Need a Data-centric AI Gymnasium”
Abstract: In recent years, computer architecture research has been enriched by the advent of machine learning (ML) techniques. With the increasing complexity and design space of modern computing systems, ML-assisted architecture research has become a popular approach to improving the design and optimization of edge, cloud, heterogeneous, and complex computer systems. ML techniques, such as deep learning and reinforcement learning, have shown promise in optimizing and designing various hardware and software components of computer systems, such as memory controllers, resource allocation, compiler optimization, cache allocation, scheduling, accelerator coherence, cloud resource sharing, power consumption, and security and privacy. This has led to a proliferation of ML-assisted architecture research, with many researchers exploring new methods and algorithms to improve computer systems’ efficiency and learned embeddings for system design. While ML-driven computer architecture tools and methods have the potential to drastically shape the future of computer architecture, a key question remains: what are the foundational building blocks needed for the community to collectively and effectively usher in this new era of “Architecture 2.0”? This talk delves into the major challenges and emphasizes the necessity of establishing a shared ecosystem for ML-aided systems and architecture research. Such an ecosystem would provide researchers with access to public datasets, models, and a unified platform for sharing and comparing results. By facilitating resource sharing, the ecosystem would enhance fairness and reproducibility in ML-driven architecture research, enabling easy replication of work. The ecosystem would also aid in establishing baselines for ML systems research by offering standardized tasks and metrics (benchmarks) for performance comparison. To advance this vision, the talk is a call to action for the community to collaborate in constructing and expanding this shared ecosystem for ML-guided systems and architecture research, so that we can collectively foster advancements in the field of architecture.
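For a sense of what a gym-style task in such an ecosystem might look like, here is a purely hypothetical sketch; the environment, its state model, and its reward are my own inventions for illustration, not anything the talk proposes. It frames one of the problems the abstract lists, cache allocation, as a reinforcement-learning environment with a reset/step interface.

```python
# Hypothetical sketch of an "architecture gymnasium" task (invented for
# illustration): a gym-style environment where an agent reallocates shared
# cache ways among co-running workloads to minimize aggregate miss rate.
import random

class CachePartitionEnv:
    """State: per-workload miss rates. Action: give one workload +1 way."""
    def __init__(self, ways=16, workloads=4):
        self.ways, self.n = ways, workloads

    def reset(self):
        self.alloc = [self.ways // self.n] * self.n
        return self._miss_rates()

    def step(self, action):
        # Move one cache way to `action`'s partition, taken from the largest.
        donor = max(range(self.n), key=lambda i: self.alloc[i])
        if donor != action and self.alloc[donor] > 1:
            self.alloc[donor] -= 1
            self.alloc[action] += 1
        obs = self._miss_rates()
        reward = -sum(obs)                 # minimize aggregate misses
        return obs, reward, False, {}

    def _miss_rates(self):
        # Stand-in performance model: misses fall with allocated ways.
        return [1.0 / (w + 1) + random.uniform(0, 0.01) for w in self.alloc]

env = CachePartitionEnv()
obs = env.reset()
for _ in range(10):                        # random agent as a trivial baseline
    obs, r, done, _ = env.step(random.randrange(env.n))
print("final allocation:", env.alloc, "reward:", round(r, 3))
```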
Bio: Vijay Janapa Reddi is an Associate Professor at Harvard University and a Vice President and Founding Member of MLCommons (mlcommons.org), a nonprofit organization devoted to accelerating machine learning (ML) innovation for all. He co-chairs the MLCommons Research organization and sits on the MLCommons board of directors. He co-led the development of the MLPerf Inference benchmark for IoT, mobile, edge, and datacenter applications. Before moving to Harvard, he was an Associate Professor in The University of Texas at Austin's Department of Electrical and Computer Engineering. He specializes in developing mobile and edge computing platforms, as well as the Internet of Things, drawing on runtime systems, computer architecture, and applied machine learning methods. Dr. Janapa Reddi has received numerous accolades and awards, including the Gilbreth Lecturer Honor from the National Academy of Engineering (NAE) in 2016, the IEEE TCCA Young Computer Architect Award (2016), the Intel Early Career Award (2013), Google Faculty Research Awards in 2012, 2013, 2015, 2017, and 2020, and Best Paper Awards at the 2020 Design Automation Conference (DAC), the 2005 International Symposium on Microarchitecture (MICRO), and the 2009 International Symposium on High-Performance Computer Architecture (HPCA). He has also received IEEE Top Picks in Computer Architecture honors (2006, 2010, 2011, 2016, 2017, 2022, 2023) and was inducted into the MICRO and HPCA Halls of Fame (in 2018 and 2019, respectively). He is strongly devoted to expanding access to applied machine learning for STEM, to diversity, and to the application of AI for social good. To merge embedded systems and machine learning, he developed the Tiny Machine Learning (TinyML) series on edX, a massive open online course (MOOC) that thousands of students worldwide can access and audit for free. He also oversaw the Austin Hands-on Computer Science (HaCS) program, which the Austin Independent School District used to teach CS to students in grades K-12. Dr. Janapa Reddi holds degrees in computer science from Harvard University, electrical and computer engineering from the University of Colorado at Boulder, and computer engineering from Santa Clara University. His life's passion is helping individuals and teams succeed while making the world a better place, one bit at a time.
Dr. Richard Kessler, CTO Security & Advanced Technology, Marvell, “AI, Cloud, and Marvell Semiconductor”
Abstract: Marvell data infrastructure products span the compute, communication, and storage needs of AI & Cloud systems. I'll describe the needs and desires of these systems, and how Marvell's current and future products make for better AI & Cloud systems.
Bio: Richard E. (Rick) Kessler is CTO of Security & Advanced Technology at Marvell, leading Marvell's processor, automotive compute, security, and other products. Rick is a principal creator of Marvell's OCTEON and NITROX products and a pioneer of now-commonplace technologies such as multi-core CPUs, advanced cryptographic and packet processing, and aggressive SoC integration. He holds a Ph.D. from the University of Wisconsin-Madison and a B.S. from the University of Iowa.
Prof. Daniel Sanchez, MIT, "A Hardware and Software Architecture to Accelerate Computation on Encrypted Data"
Abstract: Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE suffers from two key challenges. First, it incurs very high overheads: it is about 10,000x slower than native, unencrypted computation on a CPU. Second, FHE is extremely hard to program: translating even simple applications like neural networks takes months of tedious work by FHE experts.
In this talk, I will describe a hardware and software stack that tackles these challenges and enables the widespread adoption of FHE. First, I will give a systems-level introduction to FHE, describing its programming interface, key characteristics, and performance tradeoffs while abstracting away its complex, cryptography-heavy implementation details. Then, I will introduce a programmable hardware architecture that accelerates FHE programs by 5,000x vs. a CPU with similar area and power, erasing most of the overheads of FHE. Finally, I will introduce a new compiler that abstracts away the details of FHE. This compiler exposes a simple, numpy-like tensor programming interface, and produces FHE programs that match or outperform painstakingly optimized manual versions. Together, these techniques make FHE fast and easy to use across many domains, including deep learning, tensor algebra, and other learning and analytic tasks.
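To make the programming model concrete, here is a purely hypothetical sketch of the workflow such a stack enables. Every function name below (encrypt, decrypt, compiled_model) is invented for illustration and is not the compiler's actual API; the point is the shape of the interface: the programmer writes plain numpy-style tensor code, and encryption appears only at the client boundaries.

```python
import numpy as np

# What the programmer writes: ordinary tensor code, here one ReLU layer.
def model(x, w, b):
    return np.maximum(x @ w + b, 0)

# What the compiled deployment would look like, conceptually:
#
#   client:  ct_x = encrypt(x, public_key)     # data leaves the client encrypted
#   server:  ct_y = compiled_model(ct_x)       # same dataflow as model(), but
#                                              # every op runs on ciphertexts
#   client:  y = decrypt(ct_y, secret_key)     # only the client ever sees y
#
# The compiler's correctness goal: decrypt(compiled_model(encrypt(x))) matches
# the plaintext reference below (up to FHE's approximation error).
x = np.array([[1.0, -2.0, 3.0, -4.0]])
w = np.eye(4)
b = np.zeros(4)
print(model(x, w, b))    # reference plaintext output: [[1. 0. 3. 0.]]
```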
Bio: I am a Professor at MIT's Electrical Engineering and Computer Science Department and a member of the Computer Science and Artificial Intelligence Laboratory. I work in computer architecture and computer systems. My current research focuses on large-scale multicores with hundreds to thousands of cores, scalable and efficient memory hierarchies, architectures with quality-of-service guarantees, and scalable runtimes and schedulers.
Before joining MIT in September 2012, I earned a Ph.D. in Electrical Engineering from Stanford University, where I worked with Professor Christos Kozyrakis. I have also received an M.S. in Electrical Engineering from Stanford (2009) and a B.S. in Telecommunications Engineering from the Technical University of Madrid, UPM (2007).
Dr. Carole-Jean Wu, Meta, "Scaling AI Computing Sustainably"
Abstract: The past 50 years have seen a dramatic increase in the amount of compute per person, in particular compute enabled by AI. Despite the positive societal benefits, AI technologies come with significant environmental implications. I will talk about the carbon footprint of AI computing by examining the model development cycle, spanning data, algorithms, and system hardware, while also considering the life cycle of system hardware from the perspective of hardware architectures and manufacturing technologies. The talk will capture both the operational and the manufacturing carbon footprint of AI computing. Based on industry experience and lessons learned, I will share key challenges and discuss what and how at-scale optimizations can help reduce the overall carbon footprint of AI and computing. The talk will conclude with important development and research directions for advancing the field of computing in an environmentally responsible and sustainable manner.
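As a back-of-envelope sketch of the accounting the talk describes (all numbers below are made up for illustration): the total footprint combines operational carbon, energy use times grid carbon intensity, with embodied carbon from manufacturing amortized over the hardware's lifetime.

```python
# Back-of-envelope carbon accounting with illustrative, made-up numbers:
# total = operational (energy x grid carbon intensity) + embodied
# (manufacturing footprint amortized over hardware lifetime).
gpus            = 1_000
power_kw        = 0.5          # average draw per GPU (assumed)
hours           = 30 * 24      # one month of training
pue             = 1.1          # datacenter overhead (assumed)
grid_kg_per_kwh = 0.4          # grid carbon intensity (assumed)

operational = gpus * power_kw * hours * pue * grid_kg_per_kwh   # kg CO2e

embodied_kg_per_gpu = 150      # manufacturing footprint per GPU (assumed)
lifetime_months     = 48
embodied = gpus * embodied_kg_per_gpu / lifetime_months         # one month's share

print(f"operational: {operational/1000:.1f} t CO2e")   # ~158 t with these numbers
print(f"embodied:    {embodied/1000:.1f} t CO2e")      # ~3 t with these numbers
```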
Bio: Carole-Jean Wu is a Research Director at Meta AI. She is a founding member and a Vice President of MLCommons, a non-profit organization that aims to accelerate machine learning for the benefit of all. Dr. Wu also serves on the MLCommons Board as a Director, chaired the MLPerf Recommendation Benchmark Advisory Board, and co-chaired MLPerf Inference. Prior to Meta/Facebook, she was an Associate Professor at ASU.
Dr. Wu is passionate about pathfinding and tackling system challenges to enable efficient, responsible AI execution. Her work has been recognized with several awards, including IEEE Micro Top Picks and ACM/IEEE Best Paper Awards. Dr. Wu is the recipient of the NSF CAREER Award, the CRA-WP Anita Borg Early Career Award Distinction of Honorable Mention, the IEEE Young Engineer of the Year Award, the Science Foundation Arizona Bisgrove Early Career Scholarship, the Facebook AI Infrastructure Mentorship Award, and induction into the HPCA and IISWC Halls of Fame. She was the Program Co-Chair of the Conference on Machine Learning and Systems (MLSys), the Program Chair of the IEEE International Symposium on Workload Characterization (IISWC), and the Editor for the IEEE MICRO Special Issue on Environmentally Sustainable Computing. She received her M.A. and Ph.D. from Princeton and her B.Sc. from Cornell.