Harvard/MIT 2021

The IAP Harvard/MIT Workshop on the Future of Cloud Computing Applications and Infrastructure was conducted on October 7, 2021. This was the fourth Cloud Workshop hosted by Harvard and MIT.

The Workshop focus was on TinyML, Sustainable AI and Sparse Compute.

October 7, 2021 - 9am-3pm PDT (Online)

Organizers: Prof. Daniel Sanchez (MIT), Prof. Vijay Janapa Reddi (Harvard) and Prof. David Brooks (Harvard).

Agenda: Videos of Presentations

9:00 am PDT: Welcome – Workshop Organizers

Session 1: TinyML (Chair: Daniel Sanchez)

  9:05 - 9:50 (45 mins): Vijay Janapa Reddi, Harvard, "Democratizing TinyML: Generalization, Standardization and
  Automation"
9:50 - 10:30 (40 mins) : Meng Li, Facebook Reality Lab, "Efficient Audio-Visual Understanding on AR Devices"
  10:30 - 11:15 (45 mins) : Song Han, MIT, "Today's AI is Too Big"
  11:15 - 11:45 (30 mins): Evgeni Gousev, Qualcomm, "The TinyML Phenomenon: Current Progress and Opportunities
  Ahead"

11:45 - 12:30 Lunch Break (Gather Online)

Session 2: Sparse Compute and Sustainable AI (Chair: Vijay Janapa Reddi)
12:30 - 1:15 (45 mins): Joel Emer, MIT and Nvidia, "Exploiting Sparsity in Deep Neural Network Accelerator Hardware"
1:15 - 1:45 (30 mins): David Brooks, Harvard, "Architecting Systems for Sustainable AI Computing"
1:45 - 2:15 (30 mins): Fredrik Kjolstad, Stanford, "Compiling Sparse Array Programming Languages"
2:15 - 2:45 (30 mins): Daniel Sanchez, MIT, "Architectural Support for Efficient Sparse Computation"
2:45 - 2:50 (5 mins): Wrap Up

Abstracts and Bios (alphabetically listed by last name) - Please check back later for updates.

David Brooks, Harvard, "Architecting Systems for Sustainable AI Computing"

Abstract: The past decade has seen incredible advances in AI largely driven by improved algorithms and models that can harness large amounts of training data. However, these advances are underpinned by enormous consumption of computational resources for training and inference at scale. As society embraces the benefits of AI across nearly all industries, researchers must provide a path toward sustainable AI computing. This talk illuminates the sources of carbon footprint in modern computer systems and provides a research vision towards improved sustainability. While the energy consumption of computing devices is an important factor, efforts are underway to offset these costs by leveraging renewable energy sources. Significantly, the imputed carbon from manufacturing computing devices consumes a growing share of the total footprint for many companies. This dichotomy in the source of carbon footprint suggests multiple distinct research threads spanning the hardware/software system stack should be explored to provide more sustainable AI systems.

Bio: David Brooks is the Haley Family Professor of Computer Science in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. After completing his PhD from Princeton, he joined the Harvard faculty in 2002. Brooks' research interests include computer design at the hardware-software interface, with a focus on computing for machine learning applications. Professor Brooks is a Fellow of the ACM and IEEE and a recipient of the ACM Maurice Wilkes Award.

Joel Emer, MIT and Nvidia, "Exploiting Sparsity in Deep Neural Network Accelerator Hardware"

Abstract: Recently it has increasingly been observed that exploiting sparsity in hardware for linear algebra computations can result in significant performance improvements. This is because for data that has many zeros compression can reduce reduce both storage space and data movement. In addition, it is possible to take advantage of the simple mathematical equality that anything times zero equals zero because it results in what is commonly referred to as an ineffectual operation. Eliminating spending time do ineffectual operations and the data accesses associated with them can result in a considerable performance and energy improvements over hardware that performs all computations both effectual and ineffectual. One especially popular domain for exploiting sparsity is in deep neural network (DNN) computations, where the operands are often sparse because the input activations have zeros in them introduced by the non-linear RELU operation and the weights may have been explictly pruned such that many of them are zero. Previously proposed deep neural network accelerators have employed a variety of computational dataflows and techniques to compress data to optimize performance and energy efficiency. In an analogous fashion to our prior work that categorized DNN dataflows into patterns like weight stationary and output stationary, this talk will try to characterize the range of sparse DNN accelerators. Thus, rather than presenting a single specific combination of a dataflow and concrete data representation, I will present a generalized framework for describing dataflows and their manipulation of sparse tensor operands. In this framework, the dataflow and the representation of the operands are expressed independently in order to better facilitate the exploration of the wide design space of sparse DNN accelerators. Therefore, I will begin by presenting a format-agnostic abstraction for sparse tensors, called fibertrees. Using the fibertree abstraction, one can express a wide variety of concrete data representations, each with its own advantages and disadvantages. Furthermore by adding a set of operators for activities, like traversal and merging of tensors, the fibertree notation can be used to express dataflows independent of the concrete data representation used for the tensor operands. Thus, using this common language, I will describe a variety of previously proposed sparse neural network accelerator designs, highlighting the choices they made. Finally, I will present the some work on how this framework can be used as the basis of an analytic framework for evaluating the effectiveness of various sparse optimizations in accelerator designs.

Bio: For over 40 years, Joel Emer held various research and advanced development positions investigating processor microarchitecture and developing performance modeling and evaluation techniques. He has made architectural contributions to a number of VAX, Alpha and X86 processors and is recognized as one of the developers of the widely employed quantitative approach to processor performance evaluation. He is also well known for his contributions to the advancement of deep learning accelerator design, spatial and parallel architectures, processor reliability analysis, cache organization and simultaneous multithreading. Currently he is a professor at the Massachusetts Institute of Technology and spends part time as a Senior Distinguished Research Scientist in Nvidia's Architecture Research group. Previously, he worked at Intel where he was an Intel Fellow and Director of Microarchitecture Research. Even earlier, he worked at Compaq and Digital Equipment Corporation. He earned a doctorate in electrical engineering from the University of Illinois in 1979. He received a bachelor's degree with highest honors in electrical engineering in 1974, and his master's degree in 1975 -- both from Purdue University. Recognitions of his contributions include an ACM/SIGARCH-IEEE-CS/TCCA Most Influential Paper Award for his work on simultaneous multithreading, and six other papers that were selected as IEEE Micro's Top Picks in Computer Architecture. Among his professional honors, he is a Fellow of both the ACM and IEEE, and a member of the NAE. In 2009 he was recipient of the Eckert-Mauchly award for lifetime contributions in computer architecture.

Evgeni Gousev, Qualcomm, "The TinyML Phenomenon: Current Progress and Opportunities Ahead"

Abstract: Data fuels digital revolution. Is there a reliable, fast, energy efficient, privacy preserving and scalable way to produce real-time data from the physical world and make it actionable ? And what about the social impact it can create, at the scale ? Fast growing field of tinyML technologies offers such opportunity. Dedicated hardware becomes tiny, more sophisticated and very energy efficient (with mW or less power consumption), algorithms and models - smaller (down to 10s of kB of memory requirements), software – lighter down to deployment on deeply embedded platforms. This presentation will review the state-of-the-art of tinyML (including hardware, algorithmic and software framework aspects) with always-on vision as a case study, describe some examples of technologies and products, illustrate use cases, and discuss near-term trends and opportunities. Technology innovations in this interdisciplinary field is only one the cornerstones of tinyML. This enormous technology innovative wave and fast growing ecosystem create a strong momentum towards new applications and business opportunities. When these tech breakthroughs are fused with the talent, the energy and the passion of the fast growing tinyML global ecosystem, the end result is a transformational power towards creating a new world with trillions of intelligent devices enabled by tinyML technologies that sense, analyze and autonomously act together to create a healthier and more sustainable environment for all – something we are going to witness into the decade, the tinyML Phenomenon.

Bio: Evgeni Gousev is a Senior Director of Qualcomm AI Research. He leads Qualcomm R&D organization in the Bay area and is also responsible for developing ultra low power embedded computing platform, including always on machine vision. He serves as the Chairman of the Board of Directors of tinyML Foundation (www.tinyML.org), a non-profit organization of 6000+ professionals worldwide. The Foundation is focused on supporting and nurturing the fast-growing branch of ultra-low power machine learning technologies and approaches dealing with machine intelligence at the very edge. Evgeni joined Qualcomm in 2005 and led Technology R&D in the MEMS Research and Innovation Center commercializing mirasol display technology. He earned a Ph.D. degree in Solid-State Physics and M.S. in Applied Physics at Moscow Engineering Physics Institute. After graduation, Evgeni joined Rutgers University first as a Postdoctoral Fellow and then as a Research Assistant Professor. While at Rutgers, he performed fundamental research in the area advanced gate dielectric for CMOS devices which, a decade later, became industry wide standards on every modern device. In 1997, he was a Visiting Professor with the Center for Nanodevices and Systems, Hiroshima University, Japan. Shortly after, he joined IBM, where he led projects in the field of advanced silicon technologies in Semiconductor Research and Development Center in East Fishkill and T.J. Watson Research Center in Yorktown Heights, NY.   He has co-edited 26 books and published more than 166 papers (with over 10k citations and h-index of 46: Google Scholar). He is a holder of more than 100 issued and filed patents. Dr. Gousev is a member of several professional boards, committees, panels, and societies. In 2020, Evgeni was inducted into the “Hall of Fame” of SEMI MEMS and Sensors Industry Group.

Song Han, MIT, "Today's AI is Too Big"

Abstract: Today’s AI is too big. Deep neural networks demand extraordinary levels of data and computation, and therefore power, for training and inference. In the global shortage of silicon, this severely limits the practical deployment of AI applications. I will present techniques to improve the efficiency of neural network by model compression, neural architecture search, and new design primitives. I’ll present MCUNet that enables ImageNet-scale inference on micro-controllers that have only 1MB of Flash. Next I will introduce Once-for-All Network, an efficient neural architecture search approach, that can elastically grow and shrink the model capacity according to the target hardware resource and latency constraints. Finally I’ll present new primitives for video understanding and point cloud recognition, which is the winning solution in the 3rd/4th/5th Low-Power Computer Vision Challenges and AI Driving Olympics NuScenes Segmentation Challenge. I will also discuss AI for EDA applications. We hope such TinyML techniques can make AI greener, faster, and more accessible to everyone.

Bio:  Song Han is an assistant professor at MIT’s EECS. He received his PhD degree from Stanford University. His research focuses on efficient deep learning computing. He proposed “deep compression” technique that can reduce neural network size by an order of magnitude without losing accuracy, and the hardware implementation “efficient inference engine” that first exploited pruning and weight sparsity in deep learning accelerators. His team’s work on hardware-aware neural architecture search that bring deep learning to IoT devices was highlighted by MIT News, Wired, Qualcomm News, VentureBeat, IEEE Spectrum, integrated in PyTorch and AutoGluon, and received many low-power computer vision contest awards in flagship AI conferences (CVPR’19, ICCV’19 and NeurIPS’19). Song received Best Paper awards at ICLR’16 and FPGA’17, Amazon Machine Learning Research Award, SONY Faculty Award, Facebook Faculty Award, NVIDIA Academic Partnership Award. Song was named “35 Innovators Under 35” by MIT Technology Review for his contribution on “deep compression” technique that “lets powerful artificial intelligence (AI) programs run more efficiently on low-power mobile devices.” Song received the NSF CAREER Award for “efficient algorithms and hardware for accelerated machine learning” and the IEEE “AIs 10 to Watch: The Future of AI” award.

Fredrik Kjolstad, Stanford, "Compiling Sparse Array Programming Languages"

Abstract: We present the first compiler for the general class of sparse array programming languages (i.e., sparse NumPy). A sparse array programming language supports element-wise operations, reduction, and broadcasting of arbitrary functions over both dense and sparse arrays. Such languages have great expressive power and can express sparse/dense tensor algebra, functions over images, exclusion and inclusion filters, and even graph algorithms. Our compiler generalizes prior work on sparse tensor algebra compilation, which assumes additions and multiplications only, to any function over sparse arrays. We thus generalize the notion of sparse iteration spaces beyond intersections and unions and automatically derive them from how the algebraic properties of the functions interact with the compressed out values of the arrays. We then show for the first time how to compile these iteration spaces to efficient code. The resulting bespoke code performs 1.5–70x (geometric mean of 13.7x) better than the Pydata/Sparse Python library, which implements the alternative approach that reorganizes sparse data and calls pre-written dense functions.

Bio: Fredrik Kjolstad is an Assistant Professor at Stanford University. He works on topics in compilers and programming models. In particular, he is interested in fast compilation and compilers and programming models for sparse computing problems where we separate the algorithms from data representation. His and his group’s research includes the TACO sparse tensor algebra compiler, the Simit language for computing on sparse systems, and the Copy-and-Patch fast compilation technique. He received his PhD from MIT, his master’s degree from the UIUC, and his bachelor’s degree from the Norwegian University of Science and Technology in Gjøvik. He has won the Rosing Award, the Adobe Fellowship, the Google Research Fellowship, MIT EECS Sprowls best dissertation award, and two best/distinguished paper awards.

Meng Li, Facebook AI Research Lab, "Efficient Audio-Visual Understanding on AR Devices"

Abstract: Augmented reality (AR) is a set of technologies that will fundamentally change the way we interact with our environment. It represents a merging of the physical and the digital worlds into a rich, context aware user interface delivered through a socially acceptable form factor such as eyeglasses. The majority of these novel experiences in AR systems will be powered by AI because of their superior ability to handle in-the-wild scenarios. A key AR use case is a personalized, proactive and context-aware Assistant that can understand the user’s activity and their environment using audio-visual understanding models. In this presentation, we will discuss the challenges and opportunities in both training and deployment of efficient audio-visual understanding on AR glasses. We will discuss enabling always-on experiences within a constrained power budget using cascaded multimodal models, and co-designing them with the target hardware platforms. We will present our early work to demonstrate the benefits and potential of such a co-design approach and discuss open research areas that are promising for the
research community to explore.

Bio: Meng Li is currently a Staff AI Research Scientist and Tech Lead with the On Device AI Research in Facebook Reality Lab. Previously, he received the Ph.D. degree from the University of Texas at Austin, Austin, TX, USA, in 2018, under the supervision of Dr. David Z. Pan. His research interests include neural architecture search, AI software/hardware co-design, privacy-preserving ML, etc. Dr. Li was a recipient of the UT Austin Margarida Jacome Outstanding Dissertation Prize in 2019, the EDAA Outstanding Dissertation Award in 2019, the First Place in the Grand Final of ACM Student Research Competition in 2018, as well as the Best Paper Award in HOST’2017 and GLSVLSI’2018.

Vijay Janapa Reddi, Harvard, "Democratizing TinyML: Generalization, Standardization and Automation"

Abstract: Tiny machine learning (ML) is poised to drive enormous growth within the IoT hardware and software industry. Measuring the performance of these rapidly proliferating systems, and comparing them in a meaningful way presents a considerable challenge; the complexity and dynamicity of the field obscure the measurement of progress and make embedded ML application and system design and deployment intractable. To foster more systematic development, while enabling innovation, a fair, replicable, and robust method of evaluating tinyML systems is required. A reliable and widely accepted tinyML benchmark is needed. To fulfill this need, tinyMLPerf is a community-driven effort to extend the scope of the existing MLPerf benchmark suite (mlperf.org) to include tinyML systems. With the broad support of over 75 member organizations, the tinyMLPerf group has begun the process of creating a benchmarking suite for tinyML systems. The talk presents the goals, objectives, and lessons learned (thus far).

Bio: Prof. Janapa Reddi is an Associate Professor in John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to joining Harvard, he was an Associate Professor at The University of Texas at Austin in the Department of Electrical and Computer Engineering. He is a founding member of MLCommons, a non-profit organization focused on accelerating AI innovation, and serves on the MLCommons Board of Directors. He is a Co-Chair of MLPerf Inference that is responsible for fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services. He works closely with the industry. He spent his academic sabbatical at Google from 2017 to early 2019 and over the years he has consulted for other companies such as Facebook, Intel and AMD.

His primary research interests include computer architecture and system-software design to enable mobile computing and autonomus machines. His secondary research interests include building high-performance, energy-efficient and resilient computer systems. Dr. Janapa Reddi is a recipient of multiple honors and awards, including the National Academy of Engineering (NAE) Gilbreth Lecturer Honor (2016), IEEE TCCA Young Computer Architect Award (2016), Intel Early Career Award (2013), Google Faculty Research Awards (2012, 2013, 2015, 2017, 2020), Best Paper at the 2005 International Symposium on Microarchitecture (MICRO), Best Paper at the 2009 International Symposium on High Performance Computer Architecture (HPCA), MICRO and HPCA Hall of Fame (2018 and 2019, respectively), and IEEE’s Top Picks in Computer Architecture awards (2006, 2010, 2011, 2016, 2017).

Beyond his technical research contributions, Dr. Janapa Reddi is passionate about STEM education. He is responsible for the Austin Independent School District’s “hands-on” computer science (HaCS) program, which teaches sixth- and 7th-grade students programming and the general principles that govern a computing system using open-source electronic prototyping platforms. He received a B.S. in computer engineering from Santa Clara University, an M.S. in electrical and computer engineering from the University of Colorado at Boulder, and a Ph.D. in computer science from Harvard University.

Daniel Sanchez, MIT, "Architectural Support for Efficient Sparse Computation"

Abstract: Computer systems have long been designed and optimized for regular computations, i.e., those that operate on dense and structured data, like dense linear algebra. Over time, hardware architectures have evolved many optimizations tailored to regular computations, including vector units and GPUs, prefetchers, and memory hierarchies optimized for wide transfers and little synchronization. As a result, current systems are inefficient on irregular computations, i.e., those that operate on sparse and unstructured data, like graph analytics, sparse deep learning, and sparse linear algebra. This mismatch causes poor hardware utilization and cripples performance on these emerging applications. Moreover, as systems become more specialized, the lack of support for irregularity becomes more limiting: it stymies algorithm progress by forcing the use of inefficient regular computations, and eventually renders specialized architectures obsolete. We are already seeing this at play with the shift from dense to sparse deep learning.

In this talk, I will describe a new set of hardware techniques that bridge the performance gap of irregular, sparse computations. First, many of the challenges in these applications stem from the complexity of traversing sparse data structures. I will describe new, programmable hardware support that accelerates these traversals and enables support for new data movement optimizations, such as locality-aware traversals and application-specific compression. Second, irregular applications suffer from complex synchronization and use existing coherent cache
hierarchies poorly. I will present new techniques that avoid synchronization and reduce data movement further by exploiting the commutativity of scatter updates and by adopting a data-centric execution model that sends compute to data instead of ping-ponging data among cores. Third, irregular applications suffer from load imbalance
that limits utilization. I will describe new hardware techniques that allow using fine-grain pipeline parallelism to avoid load imbalance and attain high compute utilization in both general-purpose cores and specialized architectures like CGRAs. Overall, these techniques reduce data movement and improve performance on key irregular applications (including graph analytics, sparse linear algebra, and databases) by over an order of magnitude.

Bio:  I am an Associate Professor at MIT's Electrical Engineering and Computer Science Department and a member of the Computer Science and Artificial Intelligence Laboratory. I work in computer architecture and computer systems. My current research focuses on large-scale multicores with hundreds to thousands of cores, scalable and efficient memory hierarchies, architectures with quality-of-service guarantees, and scalable runtimes and schedulers.Before joining MIT in September 2012, I earned a Ph.D. in Electrical Engineering from Stanford University, where I worked with Professor Christos Kozyrakis. I have also received an M.S. in Electrical Engineering from Stanford (2009) and a B.S. in Telecommunications Engineering from the Technical University of Madrid, UPM (2007). You can access my Curriculum Vitae here.