💡 FAQ

Which models work best?

Decoder-only transformer models in GGUF format generally perform well, provided kernels are available for them.

Do I need a high-end GPU?

Not necessarily. The hybrid layout reduces VRAM pressure by keeping the long tail on the CPU, making consumer GPUs practical.

How is this different from pure-GPU engines?

OptiML co-designs placement and scheduling around activation locality, trading a modest amount of CPU work for the ability to serve larger models efficiently on a PC.
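
To make the placement idea concrete, here is a minimal sketch of a locality-aware split under a VRAM budget. It is an illustration only, not OptiML's actual algorithm or API: `NeuronGroup`, `place_groups`, the greedy hottest-first policy, and all the numbers are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class NeuronGroup:
    # Hypothetical unit of placement; not an OptiML type.
    name: str
    size_mb: float          # memory footprint of the group's weights
    activation_rate: float  # fraction of tokens for which the group is active


def place_groups(groups, vram_budget_mb):
    """Greedy sketch: put the most frequently activated groups on the GPU
    until the VRAM budget is used up, and leave the rest on the CPU."""
    placement = {}
    remaining = vram_budget_mb
    # Visit groups from hottest to coldest.
    for group in sorted(groups, key=lambda g: g.activation_rate, reverse=True):
        if group.size_mb <= remaining:
            placement[group.name] = "gpu"
            remaining -= group.size_mb
        else:
            placement[group.name] = "cpu"
    return placement


if __name__ == "__main__":
    # Made-up sizes and activation rates for illustration.
    groups = [
        NeuronGroup("hot", size_mb=400, activation_rate=0.90),
        NeuronGroup("warm", size_mb=300, activation_rate=0.50),
        NeuronGroup("cold_a", size_mb=500, activation_rate=0.05),
        NeuronGroup("cold_b", size_mb=200, activation_rate=0.02),
    ]
    print(place_groups(groups, vram_budget_mb=800))
```

With an 800 MB budget, the `hot` and `warm` groups land on the GPU and both cold groups stay on the CPU. A real engine would also have to account for transfer costs and scheduling, which this sketch ignores.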

Does OptiML support Mistral, original Llama, GPT…?

OptiML is designed to integrate easily with any model that uses the transformer architecture, so these models can be supported in principle. However, this repository currently provides support only for Llama 2 and Llama 3. More models will follow.

What if…

Issues are welcome! Feel free to open an issue and include your runtime environment and the parameters you used. We will do our best to help.