The IAP UCI Workshop on the Future of AI and Cloud Computing Applications and Infrastructure was conducted on Thursday, May 2, 2024 at UC Irvine.
Venue: Room 1010, Interdisciplinary Science & Engineering Building (ISEB), 419 Physical Sciences Quad, Irvine, CA 92697
Time: 8:30AM–3PM
This event was co-organized by Professor Hyoukjun Kwon and the IAP.
Agenda – Videos of Presentations – Please see the Speaker Abstracts and Bios below the Testimonials.
8:30-8:55 – Badge Pick-up – Coffee/Tea and Breakfast Food/Snacks
8:55-9:00 – Welcome – Prof. Hyoukjun Kwon, UCI
9:00-9:30 – Prof. Jason Cong, UCLA, Volgenau Chair for Engineering Excellence, "Can We Automate Chip Design with Deep Learning?"
9:30-10:00 – Dr. Bilge Acun, FAIR @ Meta, "Towards a Sustainable and Efficient LLM Infrastructure"
10:00-10:30 – Prof. Hyoukjun Kwon, UCI, "ML Workloads in AR/VR and their Implication to the ML System Design"
10:30-11:00 – Prof. Quanquan Gu, UCLA and Head of AIDD at ByteDance, "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models"
11:00-11:30 – Dr. Ian Colbert, AMD, "Quantizing Neural Networks for Efficient AI Inference"
11:30-12:30 – Lunch and Poster Viewing
12:30-1:00 – Dr. Somdeb Majumdar, Director of Intel AI Lab, "The Era of Foundation Models – What Lies Beyond LLMs"
1:00-1:30 – Prof. Miryung Kim, UCLA, Vice Chair of Graduate Studies and Amazon Scholar at AWS, "Software Engineering for Data Intensive Scalable Computing and Heterogeneous Computing"
1:30-2:00 – Prof. Aparna Chandramowlishwaran, UCI, "Domain Decomposition meets Neural Operator: AI4Science at Scale"
2:00-2:30 – Dr. Ramyad Hadidi, Rain AI, "On-Device Computing: Rain AI’s Mission for Energy-Efficient AI Hardware"
2:30-3:00 – Prof. Nikil Dutt, UCI, Chancellor's Professor of Computer Science, "Adaptive Computer Systems through Computational Self-Awareness"
3:00-3:30 – Best Poster Award
Participants included faculty, postdocs, students, and industry scientists and engineers. The student poster session, held during the lunch break, was open to all students. The Best Poster Award carried a $300 prize.
Testimonials from Previous Workshops
Professor David Patterson, the Pardee Professor of Computer Science, UC Berkeley, “I saw strong participation at the Cloud Workshop, with some high energy and enthusiasm; and I was delighted to see industry engineers bring and describe actual hardware, representing some of the newest innovations in the data center.”
Professor Christos Kozyrakis, Professor of Electrical Engineering & Computer Science, Stanford University, “As a starting point, I think of these IAP workshops as an intersection of industry’s newest solutions in hardware with academic research in computer architecture; but more so, these workshops additionally cover new subsystems and applications, and in a smaller venue where it is easy to discuss ideas and cross-cutting approaches with colleagues.”
Professor Hakim Weatherspoon, Professor of Computer Science, Cornell University, “I have participated in three IAP Workshops since the first one at Cornell in 2013 and it is great to see that the IAP premise is as much a success now as it was then, bringing together industry and academia in a focused all-day exchange of ideas. It was a fantastic experience and I look forward to the next one!”
Dr. Carole-Jean Wu, Research Scientist, AI Infrastructure, Facebook Research, and Professor of CSE, Arizona State University, “IAP Workshops provide valuable interactions among faculty, students and industry. The smaller venue and the poster session foster an interactive environment for in-depth discussions and spark new collaborative opportunities. Thank you for organizing this wonderful event! It was very well run.”
Dr. Pankaj Mehra, VP Product Planning, Samsung (currently CEO Elephance Memory), "Terrifically organized Workshops that give all parties -- students, faculty, industry -- valuable insights to take back."
Speaker Abstracts and Bios (listed alphabetically by last name)
Dr. Bilge Acun, Meta, "Towards a Sustainable and Efficient LLM Infrastructure"
Abstract
Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive: a single request can require multiple GPUs and tens of gigabytes of memory. Attention is one of the key components of LLMs and can account for over 50% of an LLM's memory and compute requirements.
In this talk, I will discuss the sustainability and efficiency challenges of LLMs, and I will introduce CHAI, an inference-time dynamic attention-pruning method that reduces the compute and memory requirements of multi-head attention. We observe a high degree of redundancy across heads in which tokens they attend to. Based on this insight, CHAI combines highly correlated heads for self-attention at runtime, reducing both memory and compute.
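To make the idea concrete, here is a minimal, hedged sketch of runtime head clustering in plain PyTorch. It is not the CHAI implementation; the function name, greedy clustering rule, and correlation threshold are illustrative assumptions, and a real implementation would skip the redundant query/key computation for non-representative heads rather than compute it first.

```python
import torch

def clustered_self_attention(q, k, v, corr_threshold=0.95):
    """q, k, v: [num_heads, seq_len, head_dim]. Toy single-sequence example."""
    num_heads, seq_len, head_dim = q.shape
    scores = torch.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)

    # Measure how similarly heads distribute attention over tokens.
    flat = scores.reshape(num_heads, -1)
    flat = (flat - flat.mean(dim=1, keepdim=True)) / (flat.std(dim=1, keepdim=True) + 1e-6)
    corr = (flat @ flat.T) / flat.shape[1]

    # Greedy clustering: a head joins the first earlier cluster it correlates with.
    rep = list(range(num_heads))
    for h in range(num_heads):
        for r in range(h):
            if rep[r] == r and corr[h, r] > corr_threshold:
                rep[h] = r
                break

    # Share one attention-score matrix per cluster of correlated heads.
    out = torch.stack([scores[rep[h]] @ v[h] for h in range(num_heads)])
    return out  # [num_heads, seq_len, head_dim]

q = k = v = torch.randn(8, 16, 64)   # 8 heads, 16 tokens, 64-dim heads
out = clustered_self_attention(q, k, v)
```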
Bio
Bilge Acun is a Research Scientist on the SysML team at Meta AI / FAIR. She works on making large-scale machine learning systems more sustainable and efficient through algorithmic and system optimizations. She received her Ph.D. in 2017 from the Department of Computer Science at the University of Illinois at Urbana-Champaign. Her dissertation received the 2018 ACM SIGHPC Dissertation Award Honorable Mention.
Prof. Aparna Chandramowlishwaran, UCI, "Domain Decomposition meets Neural Operator: AI4Science at Scale"
Abstract
Partial differential equations (PDEs) form the bedrock for modeling a wide variety of scientific phenomena. Traditional numerical solvers can be computationally expensive, especially when high-fidelity solutions are required. Enter the concept of neural operators. Unlike conventional neural networks that map functions between finite-dimensional spaces, neural operators extend this idea to learn operators between infinite-dimensional spaces. Essentially, they aim to directly learn the solution operator of PDEs. Despite their promise, neural operators encounter two significant challenges: the data-driven dilemma and geometric generalization. Neural operators need substantial training data to learn effectively. However, the very data needed for training often come from computationally intensive numerical solvers. This creates a “chicken-and-egg” problem: neural operators need data, but obtaining that data is costly. On the other hand, PDE solvers must handle diverse and complex geometries—irregular shapes, non-periodic boundaries, large domains—a formidable challenge for neural operators.
In this talk, I’ll discuss how the idea of domain decomposition can be applied to neural operators to tackle these twin challenges. We will explore various strategies of partitioning the domain to model both local and global interactions. By partitioning the integration domain into smaller subdomains, neural operators learn local interactions within each subdomain. This approach allows generalization to arbitrary subdomain structures. In addition to local decomposition, global decomposition can be used to construct a function-space analog of the vision transformer. It considers interactions across subdomains and has complexity quadratic in the number of subdomains. By combining domain decomposition with neural operators, we can transcend the boundaries of scale and geometry.
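As a rough sketch of how local and global decomposition can be combined (not the speaker's architecture; the class and parameter names below are hypothetical), a shared module can act within each subdomain while an attention layer mixes information across subdomains, with cost quadratic in the number of subdomains:

```python
import torch
import torch.nn as nn

class DecomposedOperator(nn.Module):
    def __init__(self, points_per_subdomain, width=64, num_heads=4):
        super().__init__()
        # Local interactions: a shared map applied within each subdomain.
        self.local = nn.Sequential(
            nn.Linear(points_per_subdomain, width), nn.GELU(), nn.Linear(width, width))
        # Global interactions: attention across subdomain embeddings (transformer-style).
        self.global_mix = nn.MultiheadAttention(width, num_heads, batch_first=True)
        self.decode = nn.Linear(width, points_per_subdomain)

    def forward(self, u):
        # u: [batch, num_subdomains, points_per_subdomain], the input field split into subdomains
        z = self.local(u)                 # [batch, num_subdomains, width]
        z, _ = self.global_mix(z, z, z)   # cross-subdomain interactions
        return self.decode(z)             # predicted field per subdomain

# Usage: 8 subdomains of 32 sample points each
model = DecomposedOperator(points_per_subdomain=32)
u = torch.randn(4, 8, 32)
pred = model(u)                           # [4, 8, 32]
```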
Bio
Aparna Chandramowlishwaran is an Associate Professor at the University of California, Irvine, in the Department of Electrical Engineering and Computer Science. Her research lab, HPC Forge, aims to advance science using machine learning and high-performance computing. She received her PhD in Computational Science and Engineering from Georgia Tech and was a research scientist at the MIT Computer Science and Artificial Intelligence Laboratory prior to joining UCI. She is a recipient of the NSF CAREER Award, a Google Faculty Research Award, the ACM Gordon Bell Prize, an Intel PhD Fellowship, the ACM/IEEE George Michael Memorial HPC Fellowship, and several best paper awards and finalist selections, among other honors.
Dr. Ian Colbert, AMD, "Quantizing Neural Networks for Efficient AI Inference"
Abstract
The quality of deep neural networks has scaled with the size of their training datasets and model architectures. To reduce the rising cost of querying these increasingly large networks, researchers and practitioners have explored a handful of techniques in both hardware and software. One of the most impactful developments has been low-precision quantization, in which a neural network is constrained to require narrower data formats during storage and/or computation. In this talk, I will present an overview of quantization as it pertains to neural networks and introduce Brevitas, AMD’s PyTorch quantization library.
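As a generic illustration of low-precision quantization (this is not Brevitas itself; the function below is a hypothetical stand-in), a weight tensor can be mapped onto a narrow integer grid and dequantized, which is the simulated-quantization step that libraries such as Brevitas automate inside PyTorch layers:

```python
import torch

def quantize_dequantize(w: torch.Tensor, bit_width: int = 8) -> torch.Tensor:
    # Symmetric, per-tensor uniform quantization: round to a signed integer grid
    # of the given bit width, then map back to floating point for comparison.
    qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), qmin, qmax)
    return q * scale

w = torch.randn(256, 256)
w_int8 = quantize_dequantize(w, bit_width=8)
w_int4 = quantize_dequantize(w, bit_width=4)
# Quantization error grows as the data format narrows.
print((w - w_int8).abs().mean(), (w - w_int4).abs().mean())
```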
Bio
Ian Colbert is an applied AI/ML researcher on the Software Architecture team at Advanced Micro Devices (AMD). He received his Ph.D. in Machine Learning and Data Science from the Electrical and Computer Engineering Department at UC San Diego. During his six years at AMD, he has led various applied AI/ML research projects related to neural network inference optimization.
Prof. Jason Cong, UCLA, Volgenau Chair for Engineering Excellence, "Can We Automate Chip Design with Deep Learning?"
Abstract
Deep learning has shown promising results on many applications, such as image recognition, natural language processing, and protein folding. Recently, we began investigating the use of deep learning to automate chip design, with a focus on design creation, leveraging our multi-decade research experience on high-level synthesis (HLS). In this talk, I present our latest progress on this topic. By coupling high-level synthesis with a set of deep learning techniques, such as graph-based neural networks, transfer learning, and large language models, we achieved promising results on both HLS quality prediction and design space exploration for general applications. When coupled with microarchitecture-guided optimization for regular structures, such as systolic arrays and stencil computation, we show that it is possible to automate IC design so that most software programmers can design their own chips for a wide range of applications. This is an encouraging development, as there is a great need to design various kinds of customized hardware accelerators for better performance and energy efficiency as we approach the end of Moore’s Law scaling.
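As a toy, hedged illustration of learning-assisted design space exploration (not the speaker's tools; the surrogate model, pragma space, and names are made up), a learned predictor of HLS quality can rank candidate pragma settings without invoking the full synthesis flow for each one:

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical surrogate that predicts latency from two pragma settings;
# a trained model would replace these randomly initialized weights.
surrogate = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

def encode(unroll: int, pipeline: int) -> torch.Tensor:
    return torch.tensor([[float(unroll), float(pipeline)]])

# Exhaustive search over a toy pragma space: unroll factor x pipeline on/off.
space = itertools.product([1, 2, 4, 8], [0, 1])
best = min(space, key=lambda cfg: surrogate(encode(*cfg)).item())
print("predicted-best pragmas:", best)
```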
Bio
Jason Cong is the Volgenau Chair for Engineering Excellence Professor in the UCLA Computer Science Department (and a former department chair), with a joint appointment in the Electrical and Computer Engineering Department. He is the director of the Center for Domain-Specific Computing (CDSC) and of the VLSI Architecture, Synthesis, and Technology (VAST) Laboratory. Dr. Cong’s research interests include novel architectures and compilation for customizable computing, synthesis of VLSI circuits and systems, and quantum computing. He has over 500 publications in these areas, including 18 best paper awards and 4 papers in the FPGA and Reconfigurable Computing Hall of Fame. He and his former students co-founded AutoESL, which developed the most widely used high-level synthesis tool for FPGAs (renamed Vivado HLS and Vitis HLS after Xilinx’s acquisition). He is a member of the National Academy of Engineering and a Fellow of the ACM, the IEEE, and the National Academy of Inventors. He is a recipient of the SIA University Research Award, the EDAA Achievement Award, and the IEEE Robert N. Noyce Medal for “fundamental contributions to electronic design automation and FPGA design methods”.
Prof. Nikil Dutt, UCI, "Adaptive Computer Systems through Computational Self-Awareness"
Abstract
We are seeing an explosion in new classes of applications (e.g., autonomous systems) enabled by machine learning to ingest large volumes of data to support analytics, prediction, and optimization. However, these applications still exhibit brittle and even unsafe behaviors in the face of adaptivity in the underlying models, the data, and the usage context. Furthermore, the fast-evolving landscape of both computer architectures and applications requires a holistic software/hardware strategy to facilitate safe and efficient system design. To systematically address this challenge, we deploy computational self-awareness principles to enable adaptivity in the face of dynamic changes in the application, environment, and computational platform. I will discuss how self-awareness properties can be applied through an adaptive, reflective middleware layer that takes a holistic approach to resource allocation and power management by leveraging concepts from reflective software. Reflection enables dynamic adaptation based on both external feedback and introspection (i.e., self-assessment). In our context, this translates into performing resource-management actuation that considers both sensing information (e.g., readings from performance counters, power sensors, etc.) to assess the current system state and models that predict the behavior of other system components before performing an action. I will also describe how we can design an energy-efficient memory subsystem through a cross-layer approach that straddles multiple abstraction levels. I will outline two example case studies: end-to-end computational pipelines for autonomous systems, and optimizing data center memory behavior at scale. I believe computational self-awareness is a rich area for research, and I will outline some future opportunities for using it in emerging applications that require adaptive, energy-efficient architectures.
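A minimal sketch of the sense-predict-actuate pattern described above, under made-up names and a deliberately simplified power model (this is not the middleware presented in the talk):

```python
from dataclasses import dataclass

@dataclass
class SensorReadings:
    ips: float          # instructions per second from performance counters
    power_watts: float  # reading from a power sensor

def predict_power(freq_ghz: float, base_watts: float = 1.0) -> float:
    # Stand-in model: power grows roughly with the cube of frequency (DVFS rule of thumb).
    return base_watts * freq_ghz ** 3

def choose_frequency(readings: SensorReadings, power_budget: float,
                     candidates=(0.8, 1.2, 1.6, 2.0)) -> float:
    # Introspection: if the current state already exceeds the budget, back off.
    if readings.power_watts > power_budget:
        return min(candidates)
    # Prediction before actuation: pick the highest frequency the model
    # expects to stay within the power budget.
    feasible = [f for f in candidates if predict_power(f) <= power_budget]
    return max(feasible) if feasible else min(candidates)

freq = choose_frequency(SensorReadings(ips=2.1e9, power_watts=3.2), power_budget=6.0)
```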
Bio
Nikil Dutt is a Distinguished Professor of CS, Cognitive Sciences, and EECS at the University of California, Irvine. He received his PhD from the University of Illinois at Urbana-Champaign in 1989. His research interests are in embedded systems, EDA, computer architecture and compilers, distributed systems, healthcare IoT, and brain-inspired architectures and computing. He has received numerous best paper awards and is a coauthor of seven books. Professor Dutt has served as Editor-in-Chief of ACM TODAES and as Associate Editor for ACM TECS and IEEE TVLSI. He serves on the steering, organizing, and program committees of several premier EDA and embedded system design conferences and workshops, and has been on the advisory boards of ACM SIGBED, ACM SIGDA, ACM TECS, and IEEE ESL. He is an ACM Fellow, an IEEE Fellow, and a recipient of the IFIP Silver Core Award.
Prof. Quanquan Gu, UCLA, "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models"
Abstract
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
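The following is a hedged sketch of the self-play objective as described above (not the authors' code; names and numbers are illustrative): the previous-iteration model acts as the opponent, and the current model is trained to assign relatively higher probability to the human-annotated response than to its own earlier generation.

```python
import torch
import torch.nn.functional as F

def spin_loss(current_logp, opponent_logp, beta=0.1):
    """
    current_logp / opponent_logp: summed log-probabilities (tensors) of the
    human-annotated response ('human') and the self-generated response ('self')
    under the current model and the previous-iteration ("opponent") model.
    """
    margin_human = current_logp["human"] - opponent_logp["human"]
    margin_self = current_logp["self"] - opponent_logp["self"]
    # Logistic loss: push the current model to prefer the human response
    # relative to what its previous iteration generated.
    return -F.logsigmoid(beta * (margin_human - margin_self))

# Toy numbers only; in training these would be log-probs of full responses.
loss = spin_loss(
    {"human": torch.tensor(-12.0), "self": torch.tensor(-9.0)},
    {"human": torch.tensor(-14.0), "self": torch.tensor(-8.0)},
)
```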
Bio
Quanquan Gu is an Associate Professor of Computer Science at UCLA. His research is in artificial intelligence and machine learning, with a focus on nonconvex optimization, deep learning, reinforcement learning, large language models, and deep generative models. Recently, he has been utilizing AI to enhance scientific discovery in domains such as biology, medicine, chemistry, and public health. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2014. He is a recipient of the Sloan Research Fellowship, the NSF CAREER Award, and the Simons-Berkeley Research Fellowship, among other honors and industry research awards.
Dr. Ramyad Hadidi, Rain AI, "On-Device Computing: Rain AI’s Mission for Energy-Efficient AI Hardware"
Abstract
Today's AI systems struggle to achieve peak performance on low-power devices, precisely where real-time processing is most essential. This disconnect is a significant hurdle in harnessing the full potential of AI, from autonomous systems to large language model (LLM) based agents. The goal is the seamless operation of advanced AI models on local devices. The prevalent separation of memory and computation, along with the prohibitive cost of information processing on current hardware, stands as a barrier to the future of AI. Rain AI's mission is to break these chains by developing the most energy-efficient AI hardware in the industry. In this talk, I will give an overview of Rain-1 and some of the techniques that demonstrate our commitment to this mission. I will focus on our state-of-the-art approach to hardware-software co-design, emphasizing our breakthroughs in in-memory computing and AI fine-tuning techniques designed to revolutionize efficient on-device computing.
Bio
Ramyad Hadidi is an applied machine learning researcher at Rain AI, where he develops sophisticated artificial intelligence systems for edge computing. With a Ph.D. in Computer Science from Georgia Institute of Technology, Ramyad's expertise spans edge computing, computer architecture, and machine learning. His doctoral thesis centered on deploying deep neural networks efficiently at the edge. At Rain AI, Ramyad is advancing the field of hardware/software co-design for AI, concentrating on optimizing in-memory computing architectures and enhancing their hardware-software synergy for resource-constrained environments.
Prof. Miryung Kim, UCLA, "Software Engineering for Data Intensive Scalable Computing and Heterogeneous Computing"
Abstract
With the development of big data, machine learning, and AI, existing software engineering techniques must be re-imagined to provide the productivity gains that developers desire. Furthermore, specialized hardware accelerators like GPUs and FPGAs have become a prominent part of the current computing landscape. However, developing heterogeneous applications is limited to a small subset of programmers with specialized hardware knowledge. To improve productivity and performance for data-intensive and compute-intensive development, it is time for the software engineering community to design new waves of refactoring, testing, and debugging tools for big data analytics and heterogeneous application development.
In this talk, we overview software development challenges in this new data-intensive scalable computing and heterogeneous computing domain. We describe examples of automated software engineering (debugging, testing, and refactoring) techniques that target this new domain and share lessons learned from building these techniques.
Bio
Miryung Kim is a Professor and Vice Chair of Graduate Studies in UCLA Computer Science, and she is also an Amazon Scholar at Amazon Web Services. Her current research focuses on software developer tools for data-intensive scalable computing and heterogeneous computing. Her group created automated testing and debugging tools for Apache Spark and conducted the largest-scale study of data scientists in industry. Her group's Java bytecode debloating tool, JDebloat, achieved technology transfer to the Navy.
Six of her former students have become professors, at Columbia, Purdue, Virginia Tech (two), and elsewhere. For her impact on nurturing the next generation of academics, she received the ACM SIGSOFT Influential Educator Award. She was a Program Co-Chair of FSE 2022 and a keynote speaker at ASE 2019 and ISSTA 2022. She has given Distinguished Lectures at CMU, UIUC, UMN, UC Irvine, and elsewhere.
She is a recipient of the 10 Year Most Influential Paper Award from ICSME twice, an NSF CAREER award, a Microsoft Software Engineering Innovation Foundation Award, an IBM Jazz Innovation Award, a Google Faculty Research Award, an Okawa Foundation Research Award, and a Humboldt Fellowship from the Alexander von Humboldt Foundation. She is an ACM Distinguished Member.
Prof. Hyoukjun Kwon, UCI, "ML Workloads in AR/VR and Their Implication to the ML System Design"
Abstract
Augmented and virtual reality (AR/VR) combines many machine learning (ML) models to implement complex applications. Unlike traditional ML workloads, those in AR/VR involve (1) multiple concurrent ML pipelines (cascaded ML models with control/data dependencies), (2) highly heterogeneous modalities and corresponding model structures, and (3) heavy dynamic behavior driven by user context and inputs. In addition, AR/VR requires (4) real-time execution of those ML workloads on (5) energy-constrained, wearable form-factor devices. Altogether, these characteristics create significant challenges for ML system design targeting AR/VR.
In this talk, I will first demystify the ML workloads in AR/VR via a recent open benchmark, XRBench, which was developed with industry collaborators at Meta to reflect real use cases. Using these workloads, I will outline the challenges AR/VR ML workloads pose and their implications for ML system design. Based on that, I will present hardware and software system design examples tailored for AR/VR ML workloads. Finally, I will discuss research opportunities in the AR/VR ML system design domain.
Bio
Hyoukjun Kwon is an assistant professor in EECS at the University of California, Irvine (UCI). His primary research area is computer architecture, and his research focuses on AI accelerator HW/SW co-design for emerging workloads such as augmented and virtual reality (AR/VR). He was a research scientist at Meta Reality Labs before joining UCI, and he received his Ph.D. in computer science from the Georgia Institute of Technology in 2020. His research has been recognized by IEEE Micro Top Picks from Computer Architecture Conferences (MAERI and MAESTRO) and by an honorable mention for the ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Award.
Dr. Somdeb Majumdar, Intel, "The Era of Foundation Models – What Lies Beyond LLMs"
Abstract
Large Language Models (LLMs) have transformed the public perception of AI. At Intel Labs, we are already thinking about the next big frontiers. On one hand we explore the scaling behavior of LLMs and how we can bring them to consumer grade hardware. On the other hand, we tackle open questions like how to better ground complex reasoning systems in knowledge and build disruptive solutions for traditional areas like computational chemistry, chip design and video understanding. This talk will touch upon some of these research vectors at Intel Labs that are pushing the state of the art in AI.
Bio
Dr. Somdeb Majumdar is the Director of the AI Lab, a research organization within Intel Labs. He received his Ph.D. from the University of California, Los Angeles and spent several years developing ultra-low-power communication systems, wearable medical devices, and deep learning systems. He has published in top-tier journals and conferences and holds 27 US patents. At Intel Labs, he leads a multi-disciplinary team developing foundational AI algorithms, scalable open-source software tools, and disruptive applications in computer vision, chip design, graph learning, scientific computing, and other emerging areas.