Cost model
Cost model
← index
What this is
Colibri tracks every token that passes through an agent session and meters cost
against a configurable budget. The key insight: cache-hit tokens cost 10×
less than fresh tokens on DeepSeek — so the prompt prefix is engineered to be
byte-stable across requests, maximizing cache hits. Three cost modes (fast,
smart, max) represent different points on the speed/cost trade-off, and the
model auto-escalates when a cheaper mode can't keep up.
Decisions
Byte-stable prompt prefix → cache-hit metering
The system prompt and early context blocks are byte-for-byte identical
across consecutive requests to the same DeepSeek endpoint. DeepSeek's cache-hit
pricing discounts these by ~90%. Colibri's colibri-deepseek probe determines
the exact token-count split between cached and fresh tokens per request, and
the cost tracker records both so the session budget reflects the actual
discounted cost, not the nominal token count.
Why not just count tokens: token counting with an offline tokenizer givesyou an upper bound but not the real cost. DeepSeek's API sometimes re-caches and
sometimes doesn't — the probe measures what actually happened. The discount is
too large (10×) to leave unmeasured.
COLIBRI-TOKENOMICS-TRIFECTA.md,
crates/colibri-deepseek/src/lib.rs
Three cost modes (fast → smart → max)
| Mode | Budget (tokens) | Behavior |
|---|---|---|
| Fast | 16K | Maximum cache-hits, minimum fresh tokens. Rejects large expansions early. |
| Smart | 64K | Default. Balances cache reuse with room for follow-up turns. |
| Max | 256K | Almost never hits budget. For one-shot deep tasks where cost is secondary. |
The daemon auto-escalates when a session exhausts its budget in a lower
mode: fast → smart → max. Escalation is one-way (never downgrades mid-session).
Why three modes, not a continuous slider: simplicity wins here. Threewell-understood points cover the space — operators pick by risk appetite, not
by fine-tuning a number. The escalation chain means "start cheap, pay more only
if it works."
→ COLIBRI-TOKENOMICS-TRIFECTA.md,
crates/colibri-daemon/src/cost.rs
T14 compaction (budget trim, not truncate)
When a session is about to exceed its budget, Colibri compacts the tool results
in the volatile region — it sends them through the headroom sidecar for
summarization, then trims the oldest volatile blocks until the prompt fits
within budget. The prefix (system prompt, static context) is never trimmed
— only the volatile suffix.
If compaction is insufficient and auto-escalation is enabled, the mode steps up
before truncating.
Why not just truncate: truncating mid-conversation loses context the agentneeds to continue. Compaction preserves the semantic content at lower token cost.
The headroom sidecar is optional (off by default); without it, the fallback is
simple truncation.
crates/colibri-daemon/src/session.rs
Cache-hit probe (DeepSeek-specific)
The colibri-deepseek crate sends a preflight request with a known prompt to
the DeepSeek API and parses the response headers to determine the cache-hit
split (prompt_cache_hit_tokens / prompt_cache_miss_tokens). This is
provider-specific — DeepSeek is the only provider that exposes this granularity.
The probe runs once per session configuration change, not per request.
Why a probe and not a hook: middleware that intercepts every API responsewould couple cost tracking to the HTTP layer. A probe decouples it — the cost
tracker asks "what was the cache ratio?" and the probe answers, independently of
how the request was made.
→ crates/colibri-deepseek/src/lib.rs
See also
- task-board — the scheduler that dispatches tasks within session budgets
- mother-hive — MCP architecture (different cost domain)
- quality-gates — the gate that validates cost-mode parsing