🔎 OptiML Overview


OptiML: Drop-in inference engine for AI agents.

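OptiML is a high-speed Large Language Model (LLM) inference engine for personal computers (PCs) equipped with a single consumer-grade GPU. The key insight underlying OptiML's design is to exploit the high locality inherent in LLM inference, which is characterized by a power-law distribution in neuron activation: a small subset of neurons (hot neurons) is activated consistently across inputs, while the majority (cold neurons) are activated only for specific inputs.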
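To get a feel for what a power-law distribution in neuron activation implies, the sketch below estimates how much of the total activation mass the most active neurons hold. It assumes an idealized model in which the neuron of rank r is activated with frequency proportional to r^(-exponent); the function name, the exponent and the neuron count are illustrative choices, not measurements from OptiML.

```python
def top_fraction_coverage(num_neurons: int, exponent: float, top_fraction: float) -> float:
    """Share of all activations attributable to the most active neurons.

    Assumption: the neuron of rank r (1 = most active) fires with a
    frequency proportional to r ** -exponent (an idealized power law).
    """
    weights = [rank ** -exponent for rank in range(1, num_neurons + 1)]
    top_k = max(1, int(num_neurons * top_fraction))
    # weights is already sorted from most to least active.
    return sum(weights[:top_k]) / sum(weights)


# Hypothetical layer with 10,000 neurons and a power-law exponent of 1.
coverage = top_fraction_coverage(num_neurons=10_000, exponent=1.0, top_fraction=0.2)
print(f"top 20% of neurons account for {coverage:.0%} of activations")
```

Under this toy model a minority of neurons accounts for most activations, which is the locality that makes a hot/cold split worthwhile. With an exponent of 0 (a uniform distribution) the top 20% would account for exactly 20%, and there would be nothing to exploit.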


Neuron-Aware Offloading

Hot neurons run on the GPU, cold neurons on the CPU. Breaks the shackles of VRAM size.
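A minimal sketch of the hot/cold split described above, assuming per-neuron activation rates have already been profiled. The `Neuron` record, `partition_neurons` and the greedy policy are hypothetical illustrations, not OptiML's actual placement algorithm, which may weigh other factors.

```python
from dataclasses import dataclass


@dataclass
class Neuron:
    index: int
    activation_rate: float  # fraction of tokens on which the neuron fires
    size_bytes: int         # memory needed to hold this neuron's weights


def partition_neurons(neurons, vram_budget_bytes):
    """Greedy placement: visit neurons from hottest to coldest and put each
    one on the GPU if it still fits in the remaining VRAM budget; every
    neuron that does not fit is assigned to the CPU."""
    gpu, cpu = [], []
    remaining = vram_budget_bytes
    for neuron in sorted(neurons, key=lambda n: n.activation_rate, reverse=True):
        if neuron.size_bytes <= remaining:
            gpu.append(neuron.index)
            remaining -= neuron.size_bytes
        else:
            cpu.append(neuron.index)
    return gpu, cpu


# Toy example: four equally sized neurons, VRAM budget for two of them.
neurons = [
    Neuron(0, 0.90, 100),
    Neuron(1, 0.05, 100),
    Neuron(2, 0.60, 100),
    Neuron(3, 0.01, 100),
]
gpu_ids, cpu_ids = partition_neurons(neurons, vram_budget_bytes=250)
print("GPU:", gpu_ids, "CPU:", cpu_ids)
```

Because the model no longer has to fit entirely in VRAM, the budget only determines how many of the hottest neurons get the fast path.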

Adaptive Activation Predictors

Learns your model’s activation patterns dynamically and adaptively. Ensures maximum compatibility.
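The overview does not specify how the predictors are built, so the following is only a simplified stand-in for the idea of learning activation patterns online: a tiny logistic model that predicts whether a single neuron will fire for a given input and is updated after every observation. The class name, learning rate and toy data are all assumptions made for illustration.

```python
import math


class OnlineActivationPredictor:
    """Logistic model for one neuron: predicts the probability that the
    neuron fires for an input vector, and adapts with one SGD step per
    observed (input, fired) pair."""

    def __init__(self, dim: int, lr: float = 0.5):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def probability(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, fired: bool):
        # Gradient of the log loss with respect to the logit.
        err = self.probability(x) - (1.0 if fired else 0.0)
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


# Toy neuron that fires exactly when the first input feature is positive.
samples = [([1.0, 0.3], True), ([-1.0, 0.2], False),
           ([0.5, -0.4], True), ([-0.7, -0.1], False)]
predictor = OnlineActivationPredictor(dim=2)
for _ in range(50):
    for x, fired in samples:
        predictor.update(x, fired)

print(predictor.probability([0.8, 0.0]) > 0.5)   # predicted active
print(predictor.probability([-0.8, 0.0]) > 0.5)  # predicted inactive
```

A predictor of this kind lets the engine decide which neurons to compute before running the layer, at the cost of occasional mispredictions.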

Sparse Neuron Operators

Only calculate what’s needed. No wasted work on useless data.
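To illustrate "only calculate what's needed", here is a sketch of a feed-forward block with ReLU in which only a given set of neurons is evaluated. It is written in plain Python for clarity and is not OptiML's operator implementation; the `ffn` function and the weight layout (one row of `w_up` and one row of `w_down` per neuron) are assumptions.

```python
def ffn(x, w_up, w_down, active=None):
    """Compute sum over neurons n of relu(w_up[n] . x) * w_down[n].

    If `active` is given, only those neuron indices are evaluated; all
    other neurons are assumed to produce zero after ReLU and are skipped.
    """
    d_out = len(w_down[0])
    out = [0.0] * d_out
    indices = range(len(w_up)) if active is None else active
    for n in indices:
        pre = sum(w * xi for w, xi in zip(w_up[n], x))
        if pre > 0.0:  # ReLU: non-positive pre-activations contribute nothing
            for j in range(d_out):
                out[j] += pre * w_down[n][j]
    return out


x = [1.0, -2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.5]]    # 4 neurons, 2 inputs
w_down = [[1.0, 0.0], [5.0, 5.0], [3.0, 3.0], [0.0, 2.0]]  # 4 neurons, 2 outputs

dense_out = ffn(x, w_up, w_down)                  # evaluates all 4 neurons
sparse_out = ffn(x, w_up, w_down, active=[0, 3])  # evaluates only 2 neurons
print(dense_out, sparse_out)
```

For this input, neurons 1 and 2 have negative pre-activations, so skipping them leaves the result unchanged while halving the work. The result is exact only when the active set contains every neuron that would actually fire; a missed neuron would introduce an error.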

Drop-in Integration

Can be applied to all transformer-based models. Accelerates your model effortlessly.

Multi-language Bindings

Supports major programming languages. Front-end libraries available for quick access.

Corporate Collaboration

Teamed up with leading tech companies. Continuously adopts bleeding-edge technology.


Acknowledgements

The OptiML project was initiated at the QRG lab, Northwestern University. In the project’s early stage, we received contributions from top minds at leading institutions around the world. Special thanks to those who made this project possible!

NVIDIA

Gaurav Juvekar

Rajesh Gandham

Akif Corduk

Ihar Hrachyshka

Meta

Ashwin Bharambe

Dalton Flanagan

Red Hat

Sebastien Han

Charlie Doern

Purdue University

David Bernal

Jihun Hwang

UCSD

Yutong Huang

Community

Udit Gupta

Sirena Yu

Tomas Janda

Martino Mensio

Emanuel Gerber

Nour Taqatqa

Sung Min Cho

Dennis Wu

Henry Buron

Glen Koundry

Jimmy Kuhlman

Devanshu Desai

Brian Lee

Marko Sterbentz

Sam Leeman

Cameron Barrie

k8sify