🔬 OptiML Architecture

1.  Overview

The figure below presents an architectural overview of OptiML, which consists of an offline and an online component. Because locality properties vary across LLMs, the offline component profiles each model's activation sparsity to distinguish hot neurons from cold ones. During the online phase, the inference engine loads hot neurons into GPU memory and cold neurons into CPU memory, enabling low-latency LLM inference at runtime.

Figure: The architecture overview and inference workflow of OptiML.

2.  LLM Profiler and Policy Solver (Offline)

This component incorporates an LLM profiler that collects activation data during inference, using input requests sampled from general-purpose datasets. It monitors neuron activations across all layers and employs a policy solver to classify neurons as either hot or cold. The solver aims to assign frequently activated (hot) neurons to the GPU, while less active (cold) neurons are offloaded to the CPU. To balance the computational workload, it leverages a neuron impact metric in conjunction with hardware specifications, formulating the problem as an integer linear programming (ILP) task to maximize the total impact of neurons allocated to the GPU.
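The offline flow described above can be sketched in a few lines. This is a minimal illustration, not OptiML's actual solver: the names `profile_activations` and `place_neurons` are hypothetical, the impact metric is assumed to be the raw activation count, and every neuron is assumed to occupy the same number of bytes. Under that uniform-size assumption, a single GPU memory constraint reduces the ILP to picking the highest-impact neurons until the budget is full, so a sort stands in for a real ILP solver here. Any further constraints the real solver may apply (for example, hardware-specific ones) are not modeled.

```python
from typing import Dict, Sequence, Set, Tuple

NeuronId = Tuple[int, int]  # (layer index, neuron index within the layer)


def profile_activations(
    traces: Sequence[Sequence[Set[int]]],
    neurons_per_layer: Sequence[int],
) -> Dict[NeuronId, int]:
    """Count how often each neuron fired across the sampled requests.

    traces[r][l] is the set of neuron indices active in layer l for request r.
    """
    counts = {
        (layer, n): 0
        for layer, width in enumerate(neurons_per_layer)
        for n in range(width)
    }
    for request in traces:
        for layer, active in enumerate(request):
            for n in active:
                counts[(layer, n)] += 1
    return counts


def place_neurons(
    impact: Dict[NeuronId, int],
    bytes_per_neuron: int,
    gpu_budget_bytes: int,
) -> Tuple[Set[NeuronId], Set[NeuronId]]:
    """Split neurons into (hot, cold) sets under a GPU memory budget.

    Assumption: all neurons have the same size, so taking the top-k by
    impact maximizes the total impact placed on the GPU.
    """
    capacity = gpu_budget_bytes // bytes_per_neuron
    # Sort by descending impact; break ties by neuron id for determinism.
    ranked = sorted(impact, key=lambda nid: (-impact[nid], nid))
    return set(ranked[:capacity]), set(ranked[capacity:])
```

With non-uniform neuron sizes or additional constraints, the greedy step would need to be replaced by a genuine ILP or knapsack solver.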

3.  Neuron-aware LLM Inference Engine (Online)

Before processing user requests, the online engine assigns the two types of neurons to their respective processing units, based on the offline solver’s output. At runtime, the engine creates GPU and CPU executors, which are threads running on the CPU side, to manage concurrent CPU-GPU computation. The engine also predicts which neurons will activate and skips those that are not expected to be active. Neurons preloaded in GPU memory are processed directly on the GPU, while the CPU computes the results for its assigned neurons and transfers them back to the GPU for integration. To handle sparsity efficiently, the engine employs sparse-neuron-aware operators on both CPU and GPU, which operate on individual neuron rows or columns within the computation matrices.
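To make the row-and-column view concrete, the sketch below walks one feed-forward block through the hybrid path in plain Python. It is an illustration under stated assumptions rather than the engine's implementation: `sparse_ffn_partial` and `hybrid_ffn` are hypothetical names, the activation function is assumed to be ReLU, the activation predictor is replaced by a precomputed `predicted_active` set, and both "executors" are ordinary threads running the same Python function, since no real GPU kernel is involved. Neuron `i` is represented by row `i` of the up-projection and column `i` of the down-projection.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Sequence, Set

Vector = List[float]


def sparse_ffn_partial(
    x: Sequence[float],
    w_up: Sequence[Sequence[float]],
    w_down: Sequence[Sequence[float]],
    neuron_ids: Iterable[int],
) -> Vector:
    """Output contribution of the selected neurons only.

    w_up[i] is neuron i's row of the up-projection; w_down[i] is neuron i's
    column of the down-projection, stored as a list of output weights.
    """
    out = [0.0] * len(w_down[0])
    for i in neuron_ids:
        pre = sum(w * v for w, v in zip(w_up[i], x))
        act = max(pre, 0.0)  # ReLU (assumed activation function)
        if act == 0.0:
            continue  # an inactive neuron contributes nothing
        for j, w in enumerate(w_down[i]):
            out[j] += act * w
    return out


def hybrid_ffn(
    x: Sequence[float],
    w_up: Sequence[Sequence[float]],
    w_down: Sequence[Sequence[float]],
    hot: Set[int],
    predicted_active: Set[int],
) -> Vector:
    """Split predicted-active neurons between two executors and merge."""
    gpu_ids = sorted(predicted_active & hot)  # hot neurons: "GPU" executor
    cpu_ids = sorted(predicted_active - hot)  # cold neurons: "CPU" executor
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_future = pool.submit(sparse_ffn_partial, x, w_up, w_down, gpu_ids)
        cpu_future = pool.submit(sparse_ffn_partial, x, w_up, w_down, cpu_ids)
        gpu_out, cpu_out = gpu_future.result(), cpu_future.result()
    # Integration step: the CPU partial result is added to the GPU's.
    return [g + c for g, c in zip(gpu_out, cpu_out)]
```

Because each neuron's contribution is independent and additive, the merged result matches the dense computation whenever the predicted set covers every neuron that actually activates; a missed neuron would show up as an approximation error.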