The IAP UW Workshop on the Future of AI in the Cloud was held on Friday, May 9, 2025, at the University of Washington.
Venue: Husky Union Building, Room 334, UW, Seattle, WA
This workshop was hosted and co-organized by Prof. Stephanie Wang and the IAP.
Shihang Vic Li and Matthew Giordano won the Best Poster Award for "NEMO: Flexible and High-Fidelity Telemetry on Programmable Memory Controllers." Congratulating them (left to right) are Ulf Hanebutte (Marvell), Prof. Stephanie Wang, Mats Oberg (Marvell), Prof. Tom Anderson, Prof. Baris Kasikci, Prof. Simon Peter, Liguang Xie (ByteDance), Victor Cao (Futurewei) and Brad Beckmann (AMD).
Agenda – Videos of Presentations – Please see the Speaker Abstracts and Bios below, along with Testimonials from previous Workshops.
8:30-8:55 – Badge Pick-up – Coffee/Tea and Breakfast Food/Snacks
8:55-9:00 – Welcome – Prof. Stephanie Wang, University of Washington
9:00-9:30 – Keynote: Dr. Ricardo Bianchini, Technical Fellow and Corporate Vice President at Microsoft, “Challenges and Opportunities in Datacenter Power and Sustainability in the AI Era”
9:30-10:00 – Prof. Ratul Mahajan, University of Washington, "Application-defined Networking”
10:00-10:30 – Dr. Ulf Hanebutte, Distinguished Engineer, Marvell, "Towards a Flexible Infrastructure Supporting Diverse AI Workloads of Today and Tomorrow”
10:30-11:00 – Prof. Natasha Jaques, University of Washington, "Reinforcement Learning Fine-tuning of Large Language Models"
11:00-11:30 – Lightning Session for Student Posters
11:30-12:30 – Lunch and Poster Viewing
12:30-1:00 – Keynote: Vinod Grover, Senior Distinguished Engineer, Nvidia, "The Essence of CUDA C++ : Past, Present, and Future"
1:00-1:30 – Prof. Stephanie Wang, University of Washington, "Towards ML System Extensibility"
1:30-2:00 – Prof. Arvind Krishnamurthy, University of Washington, "Optimizing Data Movement for Machine Learning"
2:00-2:30 – Dr. Brad Beckmann, Fellow in Research and Advanced Development, AMD, "Advancing Energy Efficient AI Communication"
2:30-3:00 – Prof. Baris Kasikci, University of Washington, "The Quest For Blazingly Fast LLM Serving"
3:00-4:00 – Best Poster Award and Reception
ABSTRACTS and BIOS (alphabetical order by last name)
Dr. Brad Beckmann, Fellow in Research and Advanced Development, AMD, "Advancing Energy Efficient AI Communication."
Abstract: Reducing power consumption is the dominant challenge for ML system designs. AMD has achieved tremendous scalability in accelerator throughput by leveraging chiplet technology, but this improvement is not free. Much like the rise of multi-core processors two decades ago required software to embrace multi-threaded programming to achieve high performance, tomorrow’s processors will force software to optimize for intra-chip locality to achieve high performance. This talk will highlight how to partition future GPU programs within the chip for power efficiency and how to optimize the subsequent collective communication for the on-chip memory hierarchy.
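The locality argument above can be made concrete with a toy model. The sketch below is illustrative only (it is not AMD's implementation, and the function names are invented): a two-level all-reduce first reduces within each chiplet over cheap local links, so that only one partial sum per chiplet crosses the more expensive inter-chiplet fabric.

```python
def hierarchical_allreduce(values_per_chiplet):
    """values_per_chiplet: list of lists, one inner list per chiplet."""
    # Stage 1: local reduction inside each chiplet (cheap on-chip links).
    local_sums = [sum(vals) for vals in values_per_chiplet]
    # Stage 2: exchange just one partial sum per chiplet across the fabric.
    global_sum = sum(local_sums)
    # Stage 3: broadcast the result back to every element.
    return [[global_sum] * len(vals) for vals in values_per_chiplet]

def cross_chiplet_messages(values_per_chiplet, hierarchical=True):
    """Count values crossing the inter-chiplet fabric during the reduce."""
    if hierarchical:
        return len(values_per_chiplet)          # one partial sum per chiplet
    return sum(len(v) for v in values_per_chiplet)  # every element crosses
```

With 2 chiplets of 2 values each, the hierarchical scheme sends 2 values across the fabric instead of 4; the gap widens with more values per chiplet, which is the locality win the talk describes.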
Bio: Brad Beckmann is a Fellow in the AMD Research and Advanced Development group. Brad leads a team of researchers pursuing next-generation hardware and software technologies for scale-up/scale-out GPU networking. Brad joined AMD in 2007 and has led projects innovating in GPU memory consistency models, GPU cache coherence, simulation, and on-chip networks. He also co-led the initial development and release of the gem5 simulator in 2011. He has published over 30 conference and journal papers and co-authored over 40 granted patents. Prior to AMD, Brad was a software developer for Microsoft’s Windows Server Performance team. Brad has a PhD in Computer Science from the University of Wisconsin-Madison.
Dr. Ricardo Bianchini, Technical Fellow and Corporate Vice President, Microsoft, "Challenges and opportunities in datacenter power and sustainability in the AI era.”
Abstract: As society's interest in generative AI models and their capabilities continues to soar, we are witnessing an unprecedented surge in compute demand. This surge is stressing every aspect of the cloud ecosystem at a time when hyperscale providers are striving to become carbon-neutral. In this talk, I will address the challenges in managing the power, energy, and sustainability of this expanding AI infrastructure. I will also quickly overview some of my team's early efforts to tackle these challenges and explore potential research avenues going forward. Ultimately, we will need a large research and development effort to create a more sustainable and efficient future for AI.
Short bio: Dr. Ricardo Bianchini is a Technical Fellow and Corporate Vice President at Microsoft Azure, where he leads the team responsible for managing Azure’s Compute workload, server capacity, and datacenter infrastructure with a strong focus on efficiency and sustainability. Before joining Azure in 2022, Ricardo led the Systems Research Group and the Cloud Efficiency team at Microsoft Research (MSR). During his tenure at MSR, he created research projects in power efficiency and intelligent resource management that resulted in large-scale production systems across Microsoft. Prior to joining Microsoft in 2014, he was a Professor at Rutgers University, where he conducted research in datacenter power and energy management, cluster-based systems, and other cloud-related topics. Ricardo is a Fellow of both the ACM and IEEE.
Vinod Grover, Senior Distinguished Engineer, Nvidia, "The Essence of CUDA C++ : Past, Present, and Future"
Abstract: CUDA began as a way to harness GPU power for general-purpose computing. Over time, NVIDIA developed a vision of virtualized GPU architecture, built around C++ integration and the SIMT programming model. This approach enabled breakthroughs in high-performance computing and deep learning, culminating in innovations like Tensor Cores. Looking ahead, CUDA is evolving toward tile-based programming and mega-kernels, with large language models (LLMs) assisting in development for distributed systems.
Bio: Vinod Grover is a Senior Distinguished Engineer at NVIDIA. Since 2007, he has led the development of CUDA C++, a foundational technology for GPU programming. His recent work focuses on improving performance and productivity in deep learning using language and compiler technologies. Prior to NVIDIA, he held roles at Sun Microsystems and Microsoft. Vinod earned a B.S. in Physics from IIT Delhi and an M.S. in Computer Science from Syracuse University.
Dr. Ulf Hanebutte, Distinguished Engineer, Marvell, "Towards a flexible infrastructure supporting diverse AI workloads of today and tomorrow.”
Abstract: Accelerating the AI workloads of today while enabling a flexible infrastructure that can accelerate the AI workloads of tomorrow is a fundamental computer science challenge. To this end, concepts like Near-Memory Computing and Data Acceleration and Offload, not long ago purely research, are now product offerings. This talk will explore CXL-based Near-Memory-Compute Accelerators in the context of AI workloads and provide an introduction to Marvell’s DAO (Data Acceleration Offload) high-performance open-source solution framework and the recently established DAO research facility, which fosters academic research and collaborations.
Bio: Dr. Ulf Hanebutte is a Distinguished Engineer at Marvell with a focus on HW/SW co-design within AI/ML architecture. In this role he has contributed to multiple generations of ML inference accelerator HW and their SW stacks. Collaborating and solving big problems together has marked his extensive career, both at the National Labs and in the private sector, with projects ranging from HPC at exascale to IoT for energy-efficient buildings. He holds a Ph.D. from Northwestern University and a Dipl.-Ing. in Aerospace Engineering from the University of Stuttgart.
Prof. Natasha Jaques, University of Washington, "Reinforcement Learning Fine-tuning of Large Language Models"
Abstract: Reinforcement Learning (RL) Fine-Tuning of Large Language Models has shown incredible promise, starting with RL from human feedback, and continuing into recent results using verifiable rewards, including DeepSeek R1. I will argue that multi-agent reinforcement learning can take us even further, opening up the potential to provide formal guarantees that LLM outputs will be safe no matter their input. However, RL fine-tuning of a single LLM is already so computationally expensive that it is prohibitive for some academic labs; how can we achieve online RL training of multiple LLMs simultaneously? We introduce a multi-node system capable of online self-play RL training, where an attacker and defender LLM co-evolve by playing a zero-sum adversarial game. The attacker attempts to find prompts which elicit an unsafe response from the defender, as judged by a reward model. Both agents use a hidden chain-of-thought to reason about how to develop and defend against attacks. Using well-known game theoretic results, we can show that if this game converges to the Nash equilibrium, the defender will output a safe response for any string input. Empirically, we show that while conventional red-teaming approaches that train against a fixed defender continuously discover the same, limited number of exploits, our self-play approach discovers a diverse set of attacks. The resulting defender is safer than models trained with RLHF (the predominant approach to model safety), while retaining core chatting and reasoning capabilities. Our results advocate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
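The co-evolution dynamic can be illustrated with a deliberately tiny toy, far removed from the actual multi-node LLM system: here "prompts", the "reward model" judgment, and the defender's "patching" are all stand-ins invented for this sketch. The point it demonstrates is why self-play yields diverse attacks: every exploit the attacker finds gets patched, so replaying an old exploit never pays off again.

```python
import random

random.seed(0)
PROMPTS = [f"exploit-{i}" for i in range(20)]  # hypothetical attack space

def run_self_play(rounds):
    blocked = set()      # the defender's learned refusals (its "policy")
    discovered = []      # every attack that ever succeeded
    for _ in range(rounds):
        # Attacker: search for a prompt the defender has not yet patched.
        candidates = [p for p in PROMPTS if p not in blocked]
        if not candidates:
            break        # Nash-like endpoint: no input elicits unsafe output
        attack = random.choice(candidates)
        # "Reward model": any un-patched exploit elicits an unsafe reply.
        discovered.append(attack)
        # Defender co-evolves: the discovered exploit is patched immediately.
        blocked.add(attack)
    return discovered

attacks = run_self_play(rounds=50)
```

Against a *fixed* defender the attacker could score by repeating one exploit forever; with co-evolution, `attacks` contains each exploit at most once, mirroring the diversity result described in the abstract.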
Bio: Natasha Jaques is an Assistant Professor of Computer Science and Engineering at the University of Washington, and a Staff Research Scientist at Google DeepMind. Her research focuses on Social Reinforcement Learning in multi-agent and human-AI interactions. During her PhD at MIT, she developed foundational techniques for training language models with Reinforcement Learning from Human Feedback (RLHF). In the multi-agent space, she has developed techniques for improving coordination through social influence, and unsupervised environment design. Natasha’s work has received various awards, including Best Demo at NeurIPS, an honourable mention for Best Paper at ICML, and the Outstanding PhD Dissertation Award from the Association for the Advancement of Affective Computing. Her work has been featured in Science Magazine, MIT Technology Review, Quartz, IEEE Spectrum, Boston Magazine, and on CBC radio, among others. Natasha earned her Masters degree from the University of British Columbia, undergraduate degrees in Computer Science and Psychology from the University of Regina, and was a postdoctoral fellow at UC Berkeley.
Prof. Baris Kasikci, University of Washington, "The Quest For Blazingly Fast LLM Serving"
Abstract: Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Recent developments have pushed LLM serving into a compute-bound regime for most common workloads. Alas, most existing serving engines fall short of optimal compute utilization, because the heterogeneous operations that comprise LLM serving—compute, memory, networking—are executed sequentially within a device.
In this talk, I’ll introduce Nanoflow, a novel serving framework that exploits intra-device parallelism, overlapping the usage of heterogeneous resources within a single device. Nanoflow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. Nanoflow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time, while considering the interference of concurrent operations. We evaluate Nanoflow’s end-to-end serving throughput on several popular models, such as LLaMA-2-70B, Mixtral 8×7B, and LLaMA-3-8B. With practical workloads, Nanoflow provides a 1.91× throughput boost compared to state-of-the-art serving systems, achieving between 50% and 72% of optimal throughput across popular models.
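A back-of-the-envelope timing model shows why nano-batching helps. The numbers below are hypothetical (not Nanoflow's measurements): a monolithic batch runs its compute, memory, and network phases back to back, whereas nano-batches let the three resources overlap like a three-stage pipeline, paced by the bottleneck stage.

```python
def sequential_time(compute, memory, network):
    # Monolithic batch: the three resource phases run back to back.
    return compute + memory + network

def pipelined_time(compute, memory, network, n_nano):
    # Split into n nano-batches, each needing 1/n of every phase.
    # Total = fill time for the first nano-batch, then the bottleneck
    # stage paces the remaining n-1 nano-batches.
    per_stage = [compute / n_nano, memory / n_nano, network / n_nano]
    return sum(per_stage) + (n_nano - 1) * max(per_stage)

# Hypothetical per-batch phase times (ms) on one device.
seq = sequential_time(6.0, 3.0, 3.0)            # 12.0 ms, resources idle in turn
pipe = pipelined_time(6.0, 3.0, 3.0, n_nano=3)  # 4.0 ms fill + 2 * 2.0 ms = 8.0 ms
assert pipe < seq
```

The real system must additionally account for interference between concurrent operations and choose nano-batch sizes and orderings automatically, which this two-line model ignores.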
Bio: Baris Kasikci is an associate professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research focuses on building large-scale computer systems that are efficient, reliable, and secure. Previously, he was a Morris Wellman assistant professor in the EECS Department at the University of Michigan and before that, a researcher at Microsoft Research. He has a PhD in Computer Science from EPFL and has held roles at Google, Intel, and VMware. He is the recipient of an NSF CAREER award, a Microsoft Faculty Fellowship, an Intel Rising Star Award, Google Faculty Awards, Intel Faculty Awards, IEEE MICRO Top Picks Awards, the Jay Lepreau Best Paper Award at OSDI, a SIGCOMM Best Paper Award, a MICRO Best Paper Award, a VMware fellowship, the Roger Needham PhD Award for the best PhD thesis in computer systems in Europe, and the Patrick Denantes Memorial Prize for the best PhD thesis at EPFL. More details can be found on his webpage: https://homes.cs.washington.edu/~baris/
Prof. Arvind Krishnamurthy, University of Washington, "Optimizing Data Movement for Machine Learning"
Abstract: Over the past decade, advances in machine learning algorithms and models have enabled some remarkable applications. However, these applications place considerable demands on our computing infrastructure, incurring significant equipment costs and processing delays. These developments have renewed the focus on optimizing the underlying systems that dictate how efficiently models can be trained and deployed. A vital aspect of system efficiency is the efficient movement of data. ML workloads are intensely data-driven, requiring vast amounts of data to be fed to accelerators (GPUs, TPUs, etc.) for processing. Bottlenecks in data transfer severely limit performance. Minimizing latency and maximizing bandwidth through optimized interconnects and efficient scheduling of compute and communication are crucial for improving ML efficiency. In this talk, I will describe our recent work on efficient orchestration of data movement in training and inference settings and demonstrate how new ways of utilizing the hardware infrastructure can yield significant performance gains.
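One classic instance of the scheduling idea above is double buffering: prefetch the next batch onto the accelerator while the current one is being processed. The model below is a simple sketch with made-up timings, not the specific orchestration techniques of the talk, but it captures why overlapping transfer and compute pays off.

```python
def sequential_steps(t_load, t_compute, n_batches):
    # Naive loop: load a batch onto the accelerator, compute, repeat.
    return n_batches * (t_load + t_compute)

def overlapped_steps(t_load, t_compute, n_batches):
    # Double buffering: while batch i computes, batch i+1 loads, so after
    # the first load each step costs only the slower of the two phases.
    return t_load + (n_batches - 1) * max(t_load, t_compute) + t_compute

# Hypothetical times (ms): 2 ms transfer, 3 ms compute, 4 batches.
naive = sequential_steps(2.0, 3.0, 4)    # 4 * 5 = 20 ms
overlap = overlapped_steps(2.0, 3.0, 4)  # 2 + 3*3 + 3 = 14 ms
assert overlap < naive
```

When transfer time equals or exceeds compute time, the pipeline becomes transfer-bound, which is exactly when faster interconnects or smarter data placement, the subjects of this talk, matter most.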
Bio: Arvind Krishnamurthy is the Short-Dooley Professor in the Paul G. Allen School of Computer Science & Engineering. His research interests are in building effective and robust computer systems in the context of both data centers and Internet-scale systems. More recently, his research has focused on programmable networks and systems for machine learning. He is an ACM Fellow, a past program chair of ACM SIGCOMM and Usenix NSDI, a former Vice President of Usenix, and has served on the CRA board.
Prof. Ratul Mahajan, University of Washington, "Application-defined networking.”
Abstract: Many new physical and virtual networks are built today to serve a handful of known applications, unlike the Internet, which was built to support unknown applications. We argue that the implementation of such networks should be completely application-specific and not layered on top of general-purpose network abstractions from the Internet age. Such layering tends to more than double the latency of applications or makes it difficult to support application-specific handling.
We propose application-defined networking in which application developers specify network functionality in a high-level language and a controller generates a custom distributed implementation that runs across available hardware and software resources. We have instantiated this approach for microservices and service meshes. Our language can express common application network functions in only 7-28 lines of code, and the generated implementation lowers RPC processing latency by up to 82%.
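The actual ADN language and controller are not reproduced here, but the flavor can be sketched with a hypothetical Python mini-DSL (all names below — `logging`, `acl`, `load_balance`, `compile_chain` — are invented for illustration): the developer declares RPC processing as a short chain of element functions, and a compile step fuses them into a single custom handler with no general-purpose layers in between.

```python
def logging(rpc, state):
    # Record which methods were called.
    state.setdefault("log", []).append(rpc["method"])
    return rpc

def acl(rpc, state):
    # Drop RPCs from unknown callers.
    return rpc if rpc["caller"] in {"frontend", "billing"} else None

def load_balance(rpc, state):
    # Round-robin across hypothetical backend replicas.
    replicas = ["backend-1", "backend-2"]
    rpc["dest"] = replicas[state.setdefault("rr", 0) % len(replicas)]
    state["rr"] += 1
    return rpc

# The entire "network program": a few lines of application-level intent.
CHAIN = [logging, acl, load_balance]

def compile_chain(chain):
    # Stand-in for the controller: fuse the chain into one handler.
    state = {}
    def handler(rpc):
        for element in chain:
            rpc = element(rpc, state)
            if rpc is None:
                return None  # dropped by an element
        return rpc
    return handler

handler = compile_chain(CHAIN)
```

In the real system the generated implementation is distributed across hardware and software resources rather than a single in-process loop; the sketch only shows how a handful of declarative lines can specify a complete RPC-processing pipeline.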
Bio: Ratul Mahajan is an Associate Professor at the University of Washington (Paul G. Allen School of Computer Science). He is also the co-director of UW FOCI (Future of Cloud Infrastructure) and an Amazon Scholar. Prior to that, he was a Co-founder and CEO of Intentionet (acquired by Amazon), a company that pioneered intent-based networking and network verification, and a Principal Researcher at Microsoft Research.
Ratul is a computer systems researcher with a networking focus and has worked on a broad set of topics, including network verification, connected homes, network programming, optical networks, Internet routing and measurements, and mobile systems. He has published over fifty papers in top venues such as SIGCOMM, SOSP, MobiCom, CHI, and PLDI, and many of the technologies that he has helped develop are part of real-world systems at Microsoft and other companies.
Ratul has been recognized as an ACM Distinguished Scientist, an ACM SIGCOMM Rising Star, and a Microsoft Research Graduate Fellow. His papers have won the ACM SIGCOMM Test-of-Time Award, the IEEE William R. Bennett Prize, the ACM SIGCOMM Best Paper Awards (twice), and the HVC Best Paper Award. He got his PhD at the University of Washington and B.Tech at Indian Institute of Technology, Delhi, both in Computer Science and Engineering.
Prof. Stephanie Wang, University of Washington, "Towards ML System Extensibility"
Abstract: With the rise of large language models, distributed execution across multiple accelerators has become commonplace. Current ML systems must adopt complex distributed execution strategies for efficiency, but do so at the cost of extensibility. We believe that it is time to introduce a general-purpose distributed runtime for programming clusters of accelerators that enables: (1) placement flexibility, and (2) interoperability, without sacrificing (3) codesign. We propose using the DAFT API: distributed actors, futures, and tasks. To enable a smooth tradeoff between flexibility vs. performance, we introduce two execution modes: interpreted vs. compiled. We show how current applications in LLM inference and training can be executed as interpreted and compiled DAFT programs, using a prototype built on Ray, and discuss open questions and challenges.
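The DAFT primitives can be illustrated with a minimal single-machine analogy. This is not the Ray API the prototype is built on; it mimics the programming model using `concurrent.futures`, with the `Actor` wrapper and `ParamServer` class invented for this sketch: tasks are stateless calls returning futures, while actors hold state and serialize their method calls.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def task(fn, *args):
    """A stateless task: schedule fn and return a future immediately."""
    return pool.submit(fn, *args)

class Actor:
    """A stateful worker; method calls return futures and run serially."""
    def __init__(self, cls, *args):
        self._obj = cls(*args)
        self._pool = ThreadPoolExecutor(max_workers=1)  # one call at a time
    def call(self, method, *args):
        return self._pool.submit(getattr(self._obj, method), *args)

class ParamServer:
    """Toy shared state updated by many tasks."""
    def __init__(self):
        self.weights = 0.0
    def apply_gradient(self, g):
        self.weights += g
    def get_weights(self):
        return self.weights

# Tasks compute "gradients" in parallel; an actor holds shared state.
ps = Actor(ParamServer)
grad_futures = [task(lambda x: x * 0.1, x) for x in (1, 2, 3)]
for f in grad_futures:
    ps.call("apply_gradient", f.result())
final = ps.call("get_weights").result()
```

In the actual proposal, the same program could run in interpreted mode for flexibility or be compiled into an efficient distributed execution plan across accelerators; this sketch only shows the API shape that makes both modes possible.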
Bio: Stephanie is an assistant professor at the University of Washington, a creator of the open-source project Ray, and a founding engineer at Anyscale. Previously, she completed her PhD at UC Berkeley. Her research is in distributed systems, cloud computing, and systems for machine learning and data. Previous projects include Exoshuffle, which broke the Cloudsort record for cost-efficient distributed sorting, and Ray Core, the distributed compute engine that was used to train GPT-4.
Testimonials from Previous Workshops
Professor David Patterson, the Pardee Professor of Computer Science, UC Berkeley, Turing Award Laureate, “I saw strong participation at the Cloud Workshop, with some high energy and enthusiasm; and I was delighted to see industry engineers bring and describe actual hardware, representing some of the newest innovations in the data center.”
Professor Christos Kozyrakis, Professor of Electrical Engineering & Computer Science, Stanford University, “As a starting point, I think of these IAP workshops as ‘Hot Chips meets ISCA’, i.e., an intersection of industry’s newest solutions in hardware (Hot Chips) with academic research in computer architecture (ISCA); but more so, these workshops additionally cover new subsystems and applications, and in a smaller venue where it is easy to discuss ideas and cross-cutting approaches with colleagues.”
Professor Hakim Weatherspoon, Professor of Computer Science, Cornell University, “I have participated in three IAP Workshops since the first one at Cornell in 2013 and it is great to see that the IAP premise was a success now as it was then, bringing together industry and academia in a focused workshop and an all-day exchange of ideas. It was a fantastic experience and I look forward to the next IAP Workshop.”
Professor Ken Birman, the N. Rama Rao Professor of Computer Science, Cornell University, “I actually thought it was a fantastic workshop, an unquestionable success, starting from the dinner the night before, through the workshop itself, to the post-event reception for the student Best Poster Awards.”
Dr. Carole-Jean Wu, Research Scientist, AI Infrastructure, Meta Research, and Professor of CSE, Arizona State University, “The IAP Cloud Computing workshop provides a great channel for valuable interactions between faculty/students and the industry participants. I truly enjoyed the venue learning about research problems and solutions that are of great interest to Meta, as well as the new enabling technologies from the industry representatives. The smaller venue and the poster session fostered an interactive environment for in-depth discussions on the proposed research and approaches and sparked new collaborative opportunities. Thank you for organizing this wonderful event! It was very well run.”
Nathan Pemberton, PhD student, UC Berkeley (currently Applied Scientist at AWS), "IAP workshops provide a valuable chance to explore emerging research topics with a focused group of participants, and without all the time/effort of a full-scale conference. Instead of rushing from talk to talk, you can slow down and dive deep into a few topics with experts in the field."
Dr. Pankaj Mehra, VP Product Planning, Samsung (currently Professor at Ohio State University and Founder at Elephance Memory), "Terrifically organized Workshops that give all parties -- students, faculty, industry -- valuable insights to take back"
Professor Vishal Shrivastav, Purdue University, “Attending the IAP workshops as a PhD student at Cornell was a great experience and very rewarding. I really enjoyed the many amazing talks from both the industry and academia. My personal conversations with several industry leaders at the workshop will definitely guide some of my future research."
Professor Ana Klimovic, ETH Zurich, “I attended three IAP workshops as a PhD student at Stanford, and I am consistently impressed by the quality of the talks and the breadth of the topics covered. These workshops bring top-tier industry and academia together to discuss cutting-edge research challenges. It is a great opportunity to exchange ideas and get inspiration for new research opportunities."
Dr. Richard New, VP Research, Western Digital, “IAP workshops provide a great opportunity to meet with professors and students working at the cutting edge of their fields. It was a pleasure to attend the event – lots of very interesting presentations and posters.”
8:30-8:55 – Badge Pick-up – Coffee/Tea and Breakfast Food/Snacks
8:55-9:00 – Welcome – Prof. Stephanie Wang, University of Washington
9:00-9:30 – Keynote: Dr. Ricardo Bianchini, Technical Fellow and Corporate Vice President at Microsoft, “Challenges and Opportunities in Datacenter Power and Sustainability in the AI Era”
9:30-10:00 – Prof. Ratul Mahajan, University of Washington, "Application-defined Networking”
10:00-10:30 – Dr. Ulf Hanebutte, Distinguished Engineer, Marvell, "Towards a Flexible Infrastructure Supporting Diverse AI Workloads of Today and Tomorrow”
10:30-11:00 – Prof. Natasha Jaques, University of Washington, "Reinforcement Learning Fine-tuning of Large Language Models"
11:00-11:30 – Lightning Session for Student Posters
11:30-12:30 – Lunch and Poster Viewing
12:30-1:00 – Keynote: Vinod Grover, Senior Distinguished Engineer, Nvidia, "The Essence of CUDA C++ : Past, Present, and Future"
1:00-1:30 – Prof. Stephanie Wang, University of Washington, "Towards ML System Extensibility"
1:30-2:00 – Prof. Arvind Krishnamurthy, University of Washington, "Optimizing Data Movement for Machine Learning"
2:00-2:30 – Dr. Brad Beckmann, Fellow in Research and Advanced Development, AMD, "Advancing Energy Efficient AI Communication"
2:30-3:00 – Prof. Baris Kasikci, University of Washington, "The Quest For Blazingly Fast LLM Serving"
3:00-4:00 – Best Poster Award and Reception
ABSTRACTS and BIOS (alphabetical order by last name)
Dr. Brad Beckmann, Fellow in Research and Advanced Development, AMD, "Advancing Energy Efficient AI Communication."
Abstract: Reducing power consumption is the dominant challenge for ML system designs. AMD has achieved tremendous scalability in accelerator throughput by leveraging chiplet technology, but this improvement is not free. Much like the rise of multi-core processors two decades ago required software to embrace multi-threaded programming to achieve high performance, tomorrow’s processors will force software to optimize for intra-chip locality to achieve high performance. This talk will highlight how to partition future GPU programs within the chip for power efficiency and how to optimize the subsequent collective communication for the on-chip memory hierarchy.
Bio: Brad Beckmann is a Fellow in AMD Research and Advanced Development group. Brad leads a team of researcher pursuing next-generation hardware and software technologies for scale-up/scale-out GPU networking. Brad joined AMD in 2007 and has led projects innovating in GPU memory consistency models, GPU cache coherence, simulation, and on-chip networks. He also co-led the initial development and release of the gem5 simulator in 2011. He has published over 30 conference and journal papers and co-authored over 40 granted patents. Prior to AMD, Brad was a software developer for Microsoft’s Windows Server Performance team. Brad has a PhD in Computer Science from University of Wisconsin-Madison.
Dr. Ricardo Bianchini, Technical Fellow and Corporate Vice President, Microsoft, "Challenges and opportunities in datacenter power and sustainability in the AI era.”
Abstract: As society's interest in generative AI models and their capabilities continues to soar, we are witnessing an unprecedented surge in compute demand. This surge is stressing every aspect of the cloud ecosystem at a time when hyperscale providers are striving to become carbon-neutral. In this talk, I will address the challenges in managing the power, energy, and sustainability of this expanding AI infrastructure. I will also quickly overview some of my team's early efforts to tackle these challenges and explore potential research avenues going forward. Ultimately, we will need a large research and development effort to create a more sustainable and efficient future for AI.
Short bio: Dr. Ricardo Bianchini is a Technical Fellow and Corporate Vice President at Microsoft Azure, where he leads the team responsible for managing Azure’s Compute workload, server capacity, and datacenter infrastructure with a strong focus on efficiency and sustainability. Before joining Azure in 2022, Ricardo led the Systems Research Group and the Cloud Efficiency team at Microsoft Research (MSR). During his tenure at MSR, he created research projects in power efficiency and intelligent resource management that resulted in large-scale production systems across Microsoft. Prior to joining Microsoft in 2014, he was a Professor at Rutgers University, where he conducted research in datacenter power and energy management, cluster-based systems, and other cloud-related topics. Ricardo is a Fellow of both the ACM and IEEE.
Vinod Grover, Senior Distinguished Engineer, Nvidia, "The Essence of CUDA C++ : Past, Present, and Future"
Abstract: CUDA began as a way to harness GPU power for general-purpose computing. Over time, NVIDIA developed a vision of virtualized GPU architecture, built around C++ integration and the SIMT programming model. This approach enabled breakthroughs in high-performance computing and deep learning, culminating in innovations like Tensor Cores. Looking ahead, CUDA is evolving toward tile-based programming and mega-kernels, with large language models (LLMs) assisting in development for distributed systems.
Bio: Vinod Grover is a Senior Distinguished Engineer at NVIDIA. Since 2007, he has led the development of CUDA C++, a foundational technology for GPU programming. His recent work focuses on improving performance and productivity in deep learning using language and compiler technologies. Prior to NVIDIA, he held roles at Sun Microsystems and Microsoft. Vinod earned a B.S. in Physics from IIT Delhi and an M.S. in Computer Science from Syracuse University.
Dr. Ulf Hanebutte, Distinguished Engineer, Marvell, "Towards a flexible infrastructure supporting diverse AI workloads of today and tomorrow.”
Abstract: Accelerating AI workload of today while enabling a flexible infrastructure that provides opportunities to accelerate the AI workloads of tomorrow is a fundamental computer science challenge. To this end, concepts like Near-Memory-Computing and Data Acceleration and Offload, not long ago only research, are now product offerings. This talk will explore CXL based Near-Memory-Compute Accelerators in the context of AI workloads and provide an introduction to Marvell’s DAO (Data Acceleration Offload) high-performance open-source solution framework and the recently established DAO research facility to foster academic research and collaborations.
Bio: Dr. Ulf Hanebutte is a Distinguished Engineer at Marvell with focus on HW/SW co-design within AI/ML architecture. In this role he has contributed to multiple generations of ML inference accelerator HW and their SW stacks. Collaborating and solving big problems together has marked his extensive career, both at the National Labs and in the private sector, with projects ranging from HPC at Exa-scale to IoT for energy efficient buildings. He holds a Ph.D. from Northwestern University and a Dipl. Ing. in Aero Space Engineering from the University of Stuttgart.
Prof. Natasha Jaques, University of Washington, "Reinforcement Learning Fine-tuning of Large Language Models"
Abstract: Reinforcement Learning (RL) Fine-Tuning of Large Language Models has shown incredible promise, starting with RL from human feedback, and continuing into recent results using verifiable rewards, including DeepSeek R1. I will argue that multi-agent reinforcement learning can take us even further, opening up the potential to provide formal guarantees that LLM outputs will be safe no matter their input. However, RL fine-tuning of a single LLM is already so computationally expensive it is prohibitive for some academic labs; how can we achieve online RL training of multiple LLMs simultaneously? We introduce a multi-node system capable of online self-play RL training, where an attacker and defender LLM co-evolve by playing a zero-sum adversarial game. The attacker attempts to find prompts which elicit an unsafe response from the defender, as judged by a reward model. Both agents use a hidden chain-of-thought to reason about how to develop and defend against attacks. Using well-known game theoretic results, we can show that if this game converges to the Nash equilibrium, the defender’s will output a safe response for any string input. Empirically, we show that while conventional red-teaming approaches which train against a fixed defender continuously discover the same, limited number of exploits, our self-play approach discovers a diverse set of attacks. The resulting defender is safer than models trained with RLHF (the predominant approach to model safety), while retaining core chatting and reasoning capabilities. Our results advocate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
Bio: Natasha Jaques is an Assistant Professor of Computer Science and Engineering at the University of Washington, and a Staff Research Scientist at Google DeepMind. Her research focuses on Social Reinforcement Learning in multi-agent and human-AI interactions. During her PhD at MIT, she developed foundational techniques for training language models with Reinforcement Learning from Human Feedback (RLHF). In the multi-agent space, she has developed techniques for improving coordination through social influence, and unsupervised environment design. Natasha’s work has received various awards, including Best Demo at NeurIPS, an honourable mention for Best Paper at ICML, and the Outstanding PhD Dissertation Award from the Association for the Advancement of Affective Computing. Her work has been featured in Science Magazine, MIT Technology Review, Quartz, IEEE Spectrum, Boston Magazine, and on CBC radio, among others. Natasha earned her Masters degree from the University of British Columbia, undergraduate degrees in Computer Science and Psychology from the University of Regina, and was a postdoctoral fellow at UC Berkeley.
Prof. Baris Kasikci, University of Washington, "The Quest For Blazingly Fast LLM Serving"
Abstract: Large Language Models (LLMs) have resulted in surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Recent developments have pushed LLM serving into a compute-bound regime for most common workloads. Alas, most existing serving engines fall short of optimal compute utilization, because the heterogeneous operations that comprise LLM serving (compute, memory, networking) are executed sequentially within a device.
In this talk I’ll introduce Nanoflow, a novel serving framework that exploits intra-device parallelism, overlapping the usage of heterogeneous resources within a single device. Nanoflow splits inputs into smaller nano-batches and duplicates operations to process each portion independently, enabling overlap. Nanoflow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time, while accounting for interference between concurrent operations. We evaluate Nanoflow’s end-to-end serving throughput on several popular models, such as LLaMA-2-70B, Mixtral 8×7B, and LLaMA-3-8B. On practical workloads, Nanoflow provides a 1.91× throughput boost over state-of-the-art serving systems, achieving between 50% and 72% of optimal throughput across popular models.
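The intuition behind nano-batching can be sketched with a simple pipeline-timing calculation. This is not the Nanoflow implementation (the stage times below are made-up numbers): it only shows why splitting a batch into k nano-batches, so that the compute, memory, and network stages of different nano-batches overlap, beats running the stages back-to-back on one big batch:

```python
# Illustrative pipeline timing: three heterogeneous stages per batch
# (compute, memory, network), with hypothetical per-stage times in ms.

def sequential_time(stage_times):
    # one big batch: stages run one after another on the device
    return sum(stage_times)

def pipelined_time(stage_times, k):
    # k equal nano-batches flow through the stages as a pipeline:
    # total = time to fill the pipeline with the first nano-batch
    #       + (k - 1) steps of the slowest (bottleneck) stage
    per_nano = [t / k for t in stage_times]
    return sum(per_nano) + (k - 1) * max(per_nano)

stages = [30.0, 20.0, 10.0]        # compute / memory / network (assumed)
seq = sequential_time(stages)      # 60.0 ms
pipe = pipelined_time(stages, k=4) # 15.0 + 3 * 7.5 = 37.5 ms
```

The real system must additionally pick k, the nano-batch sizes, and the operation ordering while modeling interference between concurrent operations, which is what Nanoflow's automatic search is for.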
Bio: Baris Kasikci is an associate professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research focuses on building large-scale computer systems that are efficient, reliable, and secure. Previously, he was a Morris Wellman assistant professor in the EECS Department at the University of Michigan and, before that, a researcher at Microsoft Research. He has a PhD in Computer Science from EPFL and has held roles at Google, Intel, and VMware. He is the recipient of an NSF CAREER award, a Microsoft Faculty Fellowship, an Intel Rising Star Award, Google Faculty Awards, Intel Faculty Awards, IEEE MICRO Top Picks Awards, the Jay Lepreau Best Paper Award at OSDI, a SIGCOMM Best Paper Award, a MICRO Best Paper Award, a VMware fellowship, the Roger Needham PhD Award for the best PhD thesis in computer systems in Europe, and the Patrick Denantes Memorial Prize for the best PhD thesis at EPFL. More details can be found on his webpage: https://homes.cs.washington.edu/~baris/
Prof. Arvind Krishnamurthy, University of Washington, "Optimizing Data Movement for Machine Learning"
Abstract: Over the past decade, advances in machine learning algorithms and models have enabled some remarkable applications. However, these applications place considerable demands on our computing infrastructure, incurring significant equipment costs and processing delays. These developments have renewed the focus on optimizing the underlying systems that dictate how efficiently models can be trained and deployed. A vital aspect of system efficiency is the efficient movement of data. ML workloads are intensely data-driven, requiring vast amounts of data to be fed to accelerators (GPUs, TPUs, etc.) for processing, and bottlenecks in data transfer severely limit performance. Minimizing latency and maximizing bandwidth through optimized interconnects and efficient scheduling of compute and communication are crucial for improving ML efficiency. In this talk, I will describe our recent work on efficient orchestration of data movement in training and inference settings and demonstrate how new ways of utilizing the hardware infrastructure can yield significant performance gains.
Bio: Arvind Krishnamurthy is the Short-Dooley Professor in the Paul G. Allen School of Computer Science & Engineering. His research interests are in building effective and robust computer systems in the context of both data centers and Internet-scale systems. More recently, his research has focused on programmable networks and systems for machine learning. He is an ACM Fellow, a past program chair of ACM SIGCOMM and Usenix NSDI, a former Vice President of Usenix, and has served on the CRA board.
Prof. Ratul Mahajan, University of Washington, "Application-defined Networking"
Abstract: Many new physical and virtual networks are built today to serve a handful of known applications, unlike the Internet, which was built to support unknown applications. We argue that the implementation of such networks should be completely application-specific and not layered on top of general-purpose network abstractions from the Internet age. Such layering tends to more than double the latency of applications or make it difficult to support application-specific handling.
We propose application-defined networking in which application developers specify network functionality in a high-level language and a controller generates a custom distributed implementation that runs across available hardware and software resources. We have instantiated this approach for microservices and service meshes. Our language can express common application network functions in only 7-28 lines of code, and the generated implementation lowers RPC processing latency by up to 82%.
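To make the idea concrete, here is a hypothetical sketch (the talk's actual language, element names, and generated targets differ): the application network is declared as a chain of RPC-processing elements, and a "compiler" folds the chain into a single function rather than layering each step as a separate general-purpose hop:

```python
# Hypothetical elements an application developer might declare.
# Each takes an RPC (a dict here) plus shared state, and returns the
# RPC to forward or None to drop it.

def logging(rpc, state):
    # record every method name that passes through
    state.setdefault("log", []).append(rpc["method"])
    return rpc

def acl(rpc, state):
    # only the frontend service may call this backend
    return rpc if rpc["caller"] == "frontend" else None

def compile_chain(elements):
    # stand-in for the controller: fuse the declared elements into one
    # processing function instead of separate sidecar layers
    def run(rpc, state):
        for elem in elements:
            rpc = elem(rpc, state)
            if rpc is None:
                return None  # dropped
        return rpc
    return run

chain = compile_chain([logging, acl])
state = {}
ok = chain({"method": "GetCart", "caller": "frontend"}, state)   # forwarded
dropped = chain({"method": "GetCart", "caller": "mallory"}, state)  # None
```

In the real system the high-level specification is compiled to a distributed implementation across available hardware and software resources, which is where the reported latency reductions come from.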
Bio: Ratul Mahajan is an Associate Professor at the University of Washington (Paul G. Allen School of Computer Science). He is also the co-director of UW FOCI (Future of Cloud Infrastructure) and an Amazon Scholar. Prior to that, he was a Co-founder and CEO of Intentionet (acquired by Amazon), a company that pioneered intent-based networking and network verification, and a Principal Researcher at Microsoft Research.
Ratul is a computer systems researcher with a networking focus and has worked on a broad set of topics, including network verification, connected homes, network programming, optical networks, Internet routing and measurements, and mobile systems. He has published over fifty papers in top venues such as SIGCOMM, SOSP, MobiCom, CHI, and PLDI, and many of the technologies that he has helped develop are part of real-world systems at Microsoft and other companies.
Ratul has been recognized as an ACM Distinguished Scientist, an ACM SIGCOMM Rising Star, and a Microsoft Research Graduate Fellow. His papers have won the ACM SIGCOMM Test-of-Time Award, the IEEE William R. Bennett Prize, the ACM SIGCOMM Best Paper Award (twice), and the HVC Best Paper Award. He received his PhD from the University of Washington and his B.Tech. from the Indian Institute of Technology, Delhi, both in Computer Science and Engineering.
Prof. Stephanie Wang, University of Washington, "Towards ML System Extensibility"
Abstract: With the rise of large language models, distributed execution across multiple accelerators has become commonplace. Current ML systems must adopt complex distributed execution strategies for efficiency, but do so at the cost of extensibility. We believe that it is time to introduce a general-purpose distributed runtime for programming clusters of accelerators that enables: (1) placement flexibility, and (2) interoperability, without sacrificing (3) codesign. We propose using the DAFT API: distributed actors, futures, and tasks. To enable a smooth tradeoff between flexibility vs. performance, we introduce two execution modes: interpreted vs. compiled. We show how current applications in LLM inference and training can be executed as interpreted and compiled DAFT programs, using a prototype built on Ray, and discuss open questions and challenges.
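The DAFT primitives (distributed actors, futures, and tasks) can be sketched with the standard library alone. This is not the talk's Ray-based prototype; it is a single-process stand-in, with made-up names like `KVActor`, showing how the three primitives compose: stateless tasks return futures immediately, while an actor serializes method calls against its private state:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def task(fn, *args):
    # stateless task: may run anywhere; returns a future immediately
    return pool.submit(fn, *args)

class KVActor:
    # stateful actor: methods go through a single-threaded executor,
    # so calls are serialized against the actor's state
    def __init__(self):
        self._exec = ThreadPoolExecutor(max_workers=1)
        self._store = {}

    def put(self, key, value):
        return self._exec.submit(self._store.__setitem__, key, value)

    def get(self, key):
        return self._exec.submit(self._store.get, key)

actor = KVActor()
f1 = task(lambda a, b: a + b, 2, 3)     # future for a task's result
actor.put("sum", f1.result()).result()  # futures compose across calls
value = actor.get("sum").result()       # -> 5
```

An interpreted DAFT runtime would dispatch such calls dynamically, as here; a compiled mode could instead analyze the whole program of tasks, futures, and actor calls ahead of time to plan placement and communication, which is the flexibility-versus-performance tradeoff the abstract describes.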
Bio: Stephanie is an assistant professor at the University of Washington, a creator of the open-source project Ray, and a founding engineer at Anyscale. Previously, she completed her PhD at UC Berkeley. Her research is in distributed systems, cloud computing, and systems for machine learning and data. Previous projects include Exoshuffle, which broke the CloudSort record for cost-efficient distributed sort, and Ray Core, the distributed compute engine that was used to train GPT-4.
Testimonials from Previous Workshops
Professor David Patterson, the Pardee Professor of Computer Science, UC Berkeley, Turing Award Laureate, “I saw strong participation at the Cloud Workshop, with some high energy and enthusiasm; and I was delighted to see industry engineers bring and describe actual hardware, representing some of the newest innovations in the data center.”
Professor Christos Kozyrakis, Professor of Electrical Engineering & Computer Science, Stanford University, “As a starting point, I think of these IAP workshops as ‘Hot Chips meets ISCA’, i.e., an intersection of industry’s newest solutions in hardware (Hot Chips) with academic research in computer architecture (ISCA); but more so, these workshops additionally cover new subsystems and applications, and in a smaller venue where it is easy to discuss ideas and cross-cutting approaches with colleagues.”
Professor Hakim Weatherspoon, Professor of Computer Science, Cornell University, “I have participated in three IAP Workshops since the first one at Cornell in 2013 and it is great to see that the IAP premise was a success now as it was then, bringing together industry and academia in a focused workshop and an all-day exchange of ideas. It was a fantastic experience and I look forward to the next IAP Workshop.”
Professor Ken Birman, the N. Rama Rao Professor of Computer Science, Cornell University, “I actually thought it was a fantastic workshop, an unquestionable success, starting from the dinner the night before, through the workshop itself, to the post-event reception for the student Best Poster Awards.”
Dr. Carole-Jean Wu, Research Scientist, AI Infrastructure, Meta Research, and Professor of CSE, Arizona State University, “The IAP Cloud Computing workshop provides a great channel for valuable interactions between faculty/students and the industry participants. I truly enjoyed the venue learning about research problems and solutions that are of great interest to Meta, as well as the new enabling technologies from the industry representatives. The smaller venue and the poster session fostered an interactive environment for in-depth discussions on the proposed research and approaches and sparked new collaborative opportunities. Thank you for organizing this wonderful event! It was very well run.”
Nathan Pemberton, PhD student, UC Berkeley (currently Applied Scientist at AWS), "IAP workshops provide a valuable chance to explore emerging research topics with a focused group of participants, and without all the time/effort of a full-scale conference. Instead of rushing from talk to talk, you can slow down and dive deep into a few topics with experts in the field."
Dr. Pankaj Mehra, VP Product Planning, Samsung (currently Professor at Ohio State University and Founder at Elephance Memory), "Terrifically organized Workshops that give all parties -- students, faculty, industry -- valuable insights to take back."
Professor Vishal Shrivastav, Purdue University, “Attending the IAP workshops as a PhD student at Cornell was a great experience and very rewarding. I really enjoyed the many amazing talks from both the industry and academia. My personal conversations with several industry leaders at the workshop will definitely guide some of my future research."
Professor Ana Klimovic, ETH Zurich, “I attended three IAP workshops as a PhD student at Stanford, and I am consistently impressed by the quality of the talks and the breadth of the topics covered. These workshops bring top-tier industry and academia together to discuss cutting-edge research challenges. It is a great opportunity to exchange ideas and get inspiration for new research opportunities."
Dr. Richard New, VP Research, Western Digital, “IAP workshops provide a great opportunity to meet with professors and students working at the cutting edge of their fields. It was a pleasure to attend the event – lots of very interesting presentations and posters.”