3BChat
Llama 3.2 3B Instruct
Meta's small, fast Llama 3.2 in instruct mode. Best for low-latency chat, RAG-style completions, and lightweight tool use. 128K context.
$2.00/hr
Inference Engine
Click Create Endpoint on any model to get a persistent, OpenAI-API-compatible URL serving that model. Works with cURL, the OpenAI Python SDK, and any OpenAI client unchanged. Billed per hour while the endpoint is running.
Meta's small, fast Llama 3.2 in instruct mode. Best for low-latency chat, RAG-style completions, and lightweight tool use. 128K context.
BAAI's multilingual embedding model. Returns 1024-dim vectors usable as drop-in replacements for OpenAI embeddings. Common pick for retrieval (RAG) over African-language corpora.
Alibaba's mid-size Qwen 2.5 in instruct mode. Strong reasoning + multilingual performance, including African languages via the broader Qwen pretraining mix. 32K context.