sourc.dev
Claude 3.5 Sonnet input $3.00/1M ↓ -50%
GPT-4o input $2.50/1M
Gemini 1.5 Pro input $1.25/1M
Mistral Large input $2.00/1M ↓ -33%
DeepSeek V3 input $0.27/1M
synced 2026-04-05
#45 of 50

Quantisation

Run a 70B model on a laptop

What is quantisation

Quantisation reduces the numerical precision of a model's weights — from 16- or 32-bit floating point down to 8-bit, 4-bit, or even 2-bit integers. The model gets smaller, runs faster, and uses less memory. The trade-off is a small reduction in output quality.
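The core idea can be sketched as a simple round-to-nearest mapping. This is a minimal illustration, not the scheme GGUF or any production quantiser actually uses (those apply per-block scales and smarter rounding), but it shows where the quality loss comes from:

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest quantisation: map floats to small signed ints."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit
    scale = np.abs(weights).max() / qmax  # one scale for the whole tensor (toy choice)
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floats; the rounding error is the quality trade-off."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.83, 0.47, -0.05], dtype=np.float32)
q, s = quantize(w, bits=4)
w_hat = dequantize(q, s)
# w_hat is close to w but not identical: that small gap, accumulated across
# billions of weights, is the 1-2 point quality difference on benchmarks
```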

Here is the scale. Llama 3.1 70B in its native 16-bit precision: 140 GB. At 4-bit quantisation: 35 GB. At 2-bit: 18 GB. The 4-bit version runs on a single GPU that costs about $1/hour; the 16-bit version needs four such GPUs at $4/hour. Output quality difference on MMLU: approximately 1–2 percentage points.
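Those sizes are straightforward arithmetic, ignoring a small overhead for scales and metadata: bits per weight times parameter count.

```python
def model_size_gb(params_billion: float, bits: int) -> float:
    """Raw weight storage: bits / 8 bytes per parameter, 1 GB = 1e9 bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"70B at {bits}-bit: {model_size_gb(70, bits):.0f} GB")
# 16-bit gives 140 GB, 4-bit gives 35 GB, 2-bit about 18 GB,
# matching the figures above
```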

Why it matters

Quantisation is how open-weights models become practical for self-hosting. Without it, running a 70B model requires enterprise hardware. With it, the same model runs on consumer GPUs or cloud instances at a fraction of the cost. Ollama and LM Studio both serve quantised models by default. sourc.dev tracks open-weights support — quantisation is why that attribute matters for cost.

Verified March 2026 · Source: GGUF format docs, Ollama library

Related terms
Open weights · What does 70B mean · LLM
← All terms
← Model family · Batch API →