The memory formula that matters
Most people estimate only the model weights. Real local inference also needs context memory, runtime overhead, and sometimes memory for more than one loaded model. That is why a model can download successfully but still fail or slow down at a long context length.
- Start with parameter count multiplied by quantization bytes per parameter.
- Add KV cache for the context window and expected batch or concurrency level.
- Reserve a safety buffer for the runtime, graphics driver, OS, and other apps.