⚡ OptiML Demo

OptiML accelerates local inference by exploiting activation locality: a compact set of “hot” neurons fires frequently across inputs, while the long tail of “cold” neurons fires only for particular inputs. OptiML places the hot subset on the GPU and schedules the cold subset on the CPU, which yields strong throughput with low VRAM usage on everyday hardware.
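The hot/cold split can be illustrated with a small sketch. This is not OptiML's actual implementation: the function names, the `gpu_budget` parameter, and the toy activation data are all assumptions made for illustration. It estimates how often each neuron fires over a set of profiling inputs, then assigns the most frequently firing neurons to the GPU up to a fixed budget and leaves the rest for the CPU.

```python
from typing import List, Sequence, Tuple


def activation_frequency(activations: Sequence[Sequence[float]]) -> List[float]:
    """Return, per neuron, the fraction of inputs on which it fires (value > 0)."""
    num_inputs = len(activations)
    num_neurons = len(activations[0])
    counts = [0] * num_neurons
    for row in activations:
        for j, value in enumerate(row):
            if value > 0.0:
                counts[j] += 1
    return [c / num_inputs for c in counts]


def partition_neurons(
    freq: Sequence[float], gpu_budget: int
) -> Tuple[List[int], List[int]]:
    """Split neuron indices into (hot, cold) by firing frequency.

    gpu_budget is a hypothetical cap on how many neurons fit in VRAM.
    """
    # Rank neurons from most to least frequently active.
    ranked = sorted(range(len(freq)), key=lambda j: freq[j], reverse=True)
    hot = sorted(ranked[:gpu_budget])   # assumed to be placed on the GPU
    cold = sorted(ranked[gpu_budget:])  # assumed to be scheduled on the CPU
    return hot, cold


# Toy profiling data (assumed): 4 inputs x 5 neurons, post-ReLU activations.
activations = [
    [0.9, 0.0, 0.3, 0.0, 0.0],
    [0.7, 0.0, 0.5, 0.2, 0.0],
    [0.8, 0.1, 0.4, 0.0, 0.0],
    [0.6, 0.0, 0.2, 0.0, 0.0],
]

freq = activation_frequency(activations)
hot, cold = partition_neurons(freq, gpu_budget=2)
print("hot (GPU):", hot)
print("cold (CPU):", cold)
```

In this toy data, neurons 0 and 2 fire on every input and land in the hot set, while the remaining neurons fire rarely or never and fall into the cold set. A real system would derive the budget from available VRAM and per-neuron weight sizes rather than a fixed count.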

In the demo, llama.cpp (left) and OptiML (right) ran on the same hardware, and each fully utilized the VRAM of a single RTX 5080.