Michael Goin @mgoin
systems engineer making inference fast
About
I've been working in ML inference since 2019, currently focused on making SOTA open-source LLMs run fast on various hardware accelerators in vLLM.
I like working across the stack wherever the bottleneck is: CPU or GPU; compute-bound, memory-bound, or IO-bound, using Python, PyTorch, C++, and CUDA. Most of my time goes into profiling, benchmarking, and figuring out why things are slow.
Before that, my background was in HPC where I worked on robotics, materials science simulations, and neuromorphic computing.
I'm currently working at Red Hat on vLLM to power the open-source AI ecosystem with fast and easy inference. Before the acquisition by Red Hat, I was at Neural Magic, where I worked on vLLM and originally built a sparsity-aware inference compiler that optimized CNNs, Transformers, and other models for CPUs.
If you want to reach me, the best way is to ping me @mgoin on vLLM Slack. I'm always happy to collaborate on projects or ideas related to inference performance!
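A tiny flavor of the benchmarking loop I spend my days in: a minimal, stdlib-only Python sketch of timing a workload by best-of-N repeats (the function and workload here are illustrative placeholders, not from vLLM):

```python
import timeit

def bench(fn, repeats=5, number=100):
    """Return the best per-call time in seconds.

    Runs fn `number` times per repeat and takes the minimum repeat;
    the minimum filters out noise from other processes on the machine.
    """
    times = timeit.repeat(fn, repeat=repeats, number=number)
    return min(times) / number

# Stand-in workload: summing a list (imagine a kernel or forward pass here).
data = list(range(10_000))
per_call = bench(lambda: sum(data))
print(f"{per_call * 1e6:.1f} us per call")
```

In real profiling work the same idea applies, just with better tools (torch.profiler, Nsight, perf) and real kernels.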
Work
- 2025-01 -> now Red Hat (acq Neural Magic), Principal Engineer
- 2024-01 -> now vLLM, Core Maintainer
- 2019-09 -> 2024-12 Neural Magic, Engineering Tech Lead
Changelog
Things I've shipped or helped ship.
- 2025-10-09 vLLM + NVIDIA Blackwell Optimized Inference
- 2025-01-27 vLLM V1: A Major Upgrade to vLLM's Core Architecture
- 2025-01-14 Structured Decoding Optimizations in vLLM
- 2024-08-31 Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization (arXiv)
- 2024-06-20 Won a bounty converting nvidia/Nemotron-4-340B-Instruct to work with vLLM (twitter)
- 2024-05-06 Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment (arXiv)
- 2023-10-10 Sparse Fine-tuning for LLM Inference Acceleration (arXiv)
- 2023-10-10 The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models (arXiv)
Talks
- 2024 -> now vLLM Office Hours, ~bi-weekly vLLM update I host with guests from the community. Slides are available in each video's description.
- 2025-11-06 vLLM Zurich Meetup, Slides, Recording
- 2025-11-01 vLLM Beijing Meetup, Slides
- 2025-10-21 PyTorch Conf 2025 - Accelerating Open-Source RL and Agentic Inference with vLLM, Slides, Recording
- 2025-10-09 vLLM Tokyo Meetup, Slides
- 2025-09-18 vLLM Boston Meetup, Slides
- 2025-05-07 NYC vLLM Meetup, Slides