🔎 OptiML Overview

OptiML: Drop-in inference engine for AI agents.

OptiML is a high-speed Large Language Model (LLM) inference engine for personal computers (PCs) equipped with a single consumer-grade GPU. The key insight underlying OptiML's design is the high locality inherent in LLM inference: neuron activation follows a power-law distribution, so a small subset of neurons accounts for most activations.
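To make the locality claim concrete, here is a minimal sketch that measures how small a share of neurons covers most activations. The activation counts are synthetic (a Zipf-like series chosen for illustration, not measurements from any model), and `hot_fraction` is a hypothetical helper, not part of OptiML.

```python
def hot_fraction(activation_counts, coverage=0.8):
    """Return the fraction of neurons needed to cover `coverage` of all activations."""
    total = sum(activation_counts)
    running = 0
    # Walk neurons from most to least frequently activated.
    for rank, count in enumerate(sorted(activation_counts, reverse=True), start=1):
        running += count
        if running >= coverage * total:
            return rank / len(activation_counts)
    return 1.0


# Synthetic, assumed data: the neuron at rank r fires about 1/r as often
# as the most active neuron (a simple power-law shape).
counts = [10_000 // r for r in range(1, 1001)]
print(f"{hot_fraction(counts):.1%} of neurons cover 80% of activations")
```

With uniform counts the same function returns a fraction equal to the coverage target, which is the case where offloading by activation frequency would give no benefit.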

Get Started

OptiML Demo

Neuron-Aware Offloading

Hot neurons run on the GPU, cold neurons on the CPU. Breaks the shackles of VRAM size.
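The hot/cold split can be sketched as a greedy placement under a VRAM budget. This is an illustrative assumption about how such a split could work, not OptiML's actual placement policy; `partition_neurons`, its parameters, and the sample numbers are all hypothetical.

```python
def partition_neurons(activation_freq, bytes_per_neuron, vram_budget_bytes):
    """Greedily assign the most frequently activated neurons to the GPU.

    Returns (hot, cold): sets of neuron indices placed on GPU and CPU.
    Assumes every neuron costs the same number of bytes.
    """
    # Neuron indices ordered from most to least frequently activated.
    order = sorted(
        range(len(activation_freq)),
        key=lambda i: activation_freq[i],
        reverse=True,
    )
    capacity = vram_budget_bytes // bytes_per_neuron  # neurons that fit in VRAM
    hot = set(order[:capacity])
    cold = set(order[capacity:])
    return hot, cold


# Assumed example: four neurons, room for two of them on the GPU.
hot, cold = partition_neurons(
    activation_freq=[0.9, 0.1, 0.5, 0.05],
    bytes_per_neuron=100,
    vram_budget_bytes=250,
)
print("GPU:", sorted(hot), "CPU:", sorted(cold))
```

A real placement would also have to weigh transfer costs and per-layer balance; the sketch only shows why the model no longer has to fit in VRAM as a whole.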

Adaptive Activation Predictors

Learns your model’s activation patterns dynamically and adaptively. Ensures maximum compatibility.

Sparse Neuron Operators

Computes only the neurons that activate. No work is wasted on inactive ones.
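As a rough illustration of a sparse operator, the sketch below evaluates a ReLU feed-forward layer over a given set of active neurons and compares it with the dense computation. The layer layout (one weight row per neuron), the function names, and the sample weights are assumptions made for this example and do not reflect OptiML's kernels.

```python
def relu(value):
    return value if value > 0.0 else 0.0


def dot(row, x):
    return sum(w * xi for w, xi in zip(row, x))


def dense_ffn(x, w_up, w_down):
    """Reference path: every neuron is computed."""
    out = [0.0] * len(w_down[0])
    for up_row, down_row in zip(w_up, w_down):
        h = relu(dot(up_row, x))
        for j, w in enumerate(down_row):
            out[j] += h * w
    return out


def sparse_ffn(x, w_up, w_down, active):
    """Sparse path: only neurons listed in `active` are computed."""
    out = [0.0] * len(w_down[0])
    for n in active:
        h = relu(dot(w_up[n], x))
        if h == 0.0:
            continue  # a mispredicted neuron contributes nothing
        for j, w in enumerate(w_down[n]):
            out[j] += h * w
    return out


# Assumed toy layer: 3 neurons, 2 inputs, 2 outputs.
x = [1.0, -1.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]    # one row per neuron
w_down = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one row per neuron
active = [0, 2]  # neuron 1 has a negative pre-activation, so ReLU zeroes it

print(dense_ffn(x, w_up, w_down))
print(sparse_ffn(x, w_up, w_down, active))
```

The two paths agree whenever `active` contains every neuron with a positive pre-activation, so the quality of the result depends on how well the active set is predicted.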

Drop-in Integration

Can be applied to all transformer-based models. Accelerates your model effortlessly.

Multi-language Bindings

Supports major programming languages. Front-end libraries available for quick access.

Corporate Collaboration

Teamed up with leading tech companies. Continuously adopts bleeding-edge technology.

Acknowledgements

The OptiML project was initiated at the QRG lab, Northwestern University. In the project’s early stages, we received contributions from top minds at leading institutions around the world. Special thanks to everyone who made this project possible!

NVIDIA
Gaurav Juvekar	Rajesh Gandham	Akif Corduk	Ihar Hrachyshka
Meta
Ashwin Bharambe	Dalton Flanagan
Red Hat
Sebastien Han	Charlie Doern
Purdue University
David Bernal	Jihun Hwang
UCSD
Yutong Huang
Community
Udit Gupta	Sirena Yu	Tomas Janda	Martino Mensio
Emanuel Gerber	Nour Taqatqa	Sung Min Cho	Dennis Wu
Henry Buron	Glen Koundry	Jimmy Kuhlman	Devanshu Desai
Brian Lee	Marko Sterbentz	Sam Leeman	Cameron Barrie
k8sify