🚀 Quickstart

1.  Build from source

git clone https://github.com/NU-QRG/optiml.git
cd optiml
mkdir build && cd build
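# Back-end switches: OPTIML_CUBLAS = CUDA (NVIDIA), OPTIML_METAL = Apple Silicon,
# OPTIML_OPENCL = other GPU back-ends. Toggle as needed.
# (Comments are kept off the continued lines: text after a trailing "\" breaks the command.)
cmake .. -DCMAKE_BUILD_TYPE=Release \
  -DOPTIML_CUBLAS=ON \
  -DOPTIML_METAL=OFF \
  -DOPTIML_OPENCL=OFF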
cmake --build . -j

Tip: Enable only the back-ends that match your machine (e.g., set OPTIML_METAL=ON and OPTIML_CUBLAS=OFF on Apple Silicon).
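The tip above can be sketched as a small helper that prints a configure line for the current machine. This is only an illustration: the three flag names come from step 1, but the detection heuristic (macOS on arm64 means Metal, `nvcc` on the PATH means CUDA, otherwise all GPU back-ends off) and the function name `backend_flags` are assumptions, not OptiML's own logic. Whether a build with every GPU back-end disabled is supported is also an assumption.

```python
import platform
import shutil


def backend_flags(system: str, machine: str, has_nvcc: bool) -> list[str]:
    """Return the -D back-end flags from step 1 for one machine profile."""
    # Heuristic assumptions (hypothetical, not part of OptiML):
    #   macOS on arm64      -> Metal
    #   nvcc found on PATH  -> cuBLAS (CUDA)
    #   anything else       -> all GPU back-ends off
    metal = system == "Darwin" and machine == "arm64"
    cublas = has_nvcc and not metal

    def on_off(enabled: bool) -> str:
        return "ON" if enabled else "OFF"

    return [
        f"-DOPTIML_CUBLAS={on_off(cublas)}",
        f"-DOPTIML_METAL={on_off(metal)}",
        # OpenCL is never auto-detected here; switch it on by hand if needed.
        "-DOPTIML_OPENCL=OFF",
    ]


if __name__ == "__main__":
    flags = backend_flags(
        platform.system(),
        platform.machine(),
        shutil.which("nvcc") is not None,
    )
    print("cmake .. -DCMAKE_BUILD_TYPE=Release " + " ".join(flags))
```

Run it from the build directory and paste the printed line in place of the `cmake ..` command in step 1.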

2.  (Optional) Python bindings

# Run from the repository root (step 1 leaves you in build/)
cd bindings/python
pip install -e .

3.  Prepare a model

OptiML works well with standard GGUF models. If you have the original weights, convert them to GGUF first, then optionally quantize. The command below uses a path relative to the repository root:

# Example: quantize a GGUF model to Q4_K
./build/optiml-quantize --input <model path> --output model-q4_k.gguf --type q4_k

4.  Run text generation (CLI)

./build/optiml-cli --model model-q4_k.gguf --prompt "Explain activation locality in one paragraph." --n-predict 128
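Steps 3 and 4 can be chained from a script. The following is a minimal sketch that uses only the binaries and flags shown above. The helper names (`quantize_cmd`, `generate_cmd`, `quantize_and_generate`) are hypothetical, and it assumes the CLI writes the generated text to standard output, which the commands above do not confirm.

```python
import subprocess


def quantize_cmd(src: str, dst: str, qtype: str = "q4_k") -> list[str]:
    """Argument list for the quantize step (flags as shown in step 3)."""
    return ["./build/optiml-quantize", "--input", src, "--output", dst, "--type", qtype]


def generate_cmd(model: str, prompt: str, n_predict: int = 128) -> list[str]:
    """Argument list for the generation step (flags as shown in step 4)."""
    return [
        "./build/optiml-cli",
        "--model", model,
        "--prompt", prompt,
        "--n-predict", str(n_predict),
    ]


def quantize_and_generate(src: str, dst: str, prompt: str) -> str:
    """Quantize src into dst, then generate from dst.

    Run from the repository root. check=True raises CalledProcessError
    if either tool exits with a non-zero status.
    """
    subprocess.run(quantize_cmd(src, dst), check=True)
    result = subprocess.run(
        generate_cmd(dst, prompt),
        check=True,
        capture_output=True,
        text=True,
    )
    # Assumption: the generated text is printed to stdout.
    return result.stdout
```

Passing the arguments as a list rather than a single shell string means the prompt needs no extra quoting, even when it contains spaces or quotation marks.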

5.  Start the HTTP demo server

./examples/server/optiml-server --model model-q4_k.gguf --host 127.0.0.1 --port 8080

Open the provided minimal web UI in your browser and chat locally (with the flags above, the server listens on 127.0.0.1, port 8080). The server also exposes a simple REST API that you can call from any HTTP client.
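As a rough illustration of calling the server from a client, the sketch below posts a JSON request using only the Python standard library. The endpoint path `/completion` and the JSON fields `prompt` and `n_predict` are assumptions made for this example; this quickstart does not document the actual routes or payload, so check the server's own documentation and adjust them. Only the host and port match the command above.

```python
import json
import urllib.request


def build_request(
    prompt: str,
    n_predict: int = 128,
    host: str = "127.0.0.1",
    port: int = 8080,
    path: str = "/completion",  # hypothetical endpoint, not confirmed by the docs
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the demo server."""
    # Hypothetical payload shape: field names are assumptions.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> dict:
    """Send the request and decode the response, assuming it is a JSON object."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the server from step 5 to be running.
    print(complete("Explain activation locality in one paragraph.", n_predict=64))
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before pointing the client at a live server.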