FAQ
Which models work best?
Decoder-only transformer models distributed in GGUF format, and for which kernels are available, generally perform well.
Do I need a high-end GPU?
Not necessarily. The hybrid layout reduces VRAM pressure by keeping the long tail on the CPU, making consumer GPUs practical.
How is this different from pure-GPU engines?
OptiML co-designs placement and scheduling around activation locality, trading a modest amount of CPU work for the ability to serve larger models efficiently on a PC.
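As a rough illustration of the hybrid placement idea, the sketch below greedily assigns the most frequently activated neurons to the GPU until a VRAM budget is exhausted and leaves the long tail on the CPU. It is a toy model and not OptiML's actual placement algorithm: the function name `split_hot_cold`, its parameters, and the activation frequencies are all hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def split_hot_cold(
    activation_freq: List[float],
    bytes_per_neuron: int,
    vram_budget_bytes: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical greedy split: hottest neurons on the GPU, the rest on the CPU.

    Returns (gpu_indices, cpu_indices), each sorted by neuron index.
    """
    if bytes_per_neuron <= 0:
        raise ValueError("bytes_per_neuron must be positive")

    # How many neurons fit in the VRAM budget (assumes equal-sized neurons).
    capacity = max(vram_budget_bytes, 0) // bytes_per_neuron

    # Rank neurons from most to least frequently activated.
    order = sorted(
        range(len(activation_freq)),
        key=lambda i: activation_freq[i],
        reverse=True,
    )

    gpu = sorted(order[:capacity])   # hot neurons
    cpu = sorted(order[capacity:])   # long tail
    return gpu, cpu


# Made-up activation frequencies for six neurons; the budget fits three of them.
freqs = [0.90, 0.02, 0.75, 0.01, 0.05, 0.60]
gpu, cpu = split_hot_cold(freqs, bytes_per_neuron=4096, vram_budget_bytes=3 * 4096)

# Share of total activation mass that is served from the GPU.
covered = sum(freqs[i] for i in gpu) / sum(freqs)
# gpu -> [0, 2, 5], cpu -> [1, 3, 4]
```

In this toy input, half of the neurons account for most of the activation mass, which is the kind of skew that lets a small VRAM budget absorb most of the work.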
Does OptiML support Mistral, original Llama, GPT…
OptiML is designed to integrate easily with any model that uses the transformer architecture, so these models can be supported in principle. At the moment, however, this repository only provides an implementation for Llama 2 and Llama 3. More models will follow in the future.
What if…
Issues are welcome! Feel free to open one and include your runtime environment and the parameters you used. We will do our best to help.