💡 FAQ

Which models work best?

  • Decoder-only transformer models in GGUF format generally perform well, provided kernels are available for them.

Do I need a high-end GPU?

  • Not necessarily. The hybrid layout reduces VRAM pressure by keeping the long tail of rarely activated weights on the CPU, which makes consumer GPUs practical.

How is this different from pure-GPU engines?

  • OptiML co-designs placement and scheduling around activation locality, trading a modest amount of CPU work for the ability to serve larger models efficiently on a PC.
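
To make the placement idea concrete, here is a minimal sketch of an activation-aware split. It is not OptiML's actual algorithm or API: the names `Placement`, `place_by_activation`, and `gpu_hit_rate`, the per-neuron activation counts, and the fixed per-neuron size are all assumptions made for illustration. The sketch greedily keeps the most frequently activated neurons in VRAM until a budget is exhausted and leaves the rest on the CPU.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    gpu: list[int]  # neuron indices kept in VRAM (hypothetical layout)
    cpu: list[int]  # neuron indices left in host memory


def place_by_activation(counts, bytes_per_neuron, vram_budget_bytes):
    """Greedy split: the hottest neurons go to the GPU until the budget is used.

    `counts[i]` is how often neuron i fired in some profiling run (assumed input).
    """
    capacity = max(0, vram_budget_bytes // bytes_per_neuron)
    # Rank neurons from most to least frequently activated.
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return Placement(gpu=sorted(order[:capacity]), cpu=sorted(order[capacity:]))


def gpu_hit_rate(counts, placement):
    """Fraction of all activations that land on GPU-resident neurons."""
    total = sum(counts)
    return sum(counts[i] for i in placement.gpu) / total if total else 0.0


# Toy profile: three neurons account for most activations, three form the long tail.
counts = [500, 20, 300, 10, 150, 20]
placement = place_by_activation(counts, bytes_per_neuron=4096, vram_budget_bytes=3 * 4096)
print(placement, gpu_hit_rate(counts, placement))
```

In this toy profile, half of the neurons (indices 0, 2, and 4) fit in the budget yet cover 95% of activations, which is the kind of skew that makes a CPU/GPU split worthwhile. A real engine would also need to account for transfer costs and scheduling, which this sketch ignores.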

Does OptiML support Mistral, original Llama, GPT…?

  • OptiML is designed to integrate easily with any model built on the transformer architecture, so these models can be supported in principle. However, this repository currently provides the solution only for Llama 2 and Llama 3. More models will follow.

What if…

  • Issues are welcome! Feel free to open an issue and include your runtime environment and the parameters you used. We will do our best to help.