Cost model

← index

What this is

Colibri tracks every token that passes through an agent session and meters cost

against a configurable budget. The key insight: cache-hit tokens cost 10×

less than fresh tokens on DeepSeek — so the prompt prefix is engineered to be

byte-stable across requests, maximizing cache hits. Three cost modes (fast,

smart, max) represent different points on the speed/cost trade-off, and the

model auto-escalates when a cheaper mode can't keep up.

Decisions

Byte-stable prompt prefix → cache-hit metering

The system prompt and early context blocks are byte-for-byte identical

across consecutive requests to the same DeepSeek endpoint. DeepSeek's cache-hit

pricing discounts these by ~90%. Colibri's colibri-deepseek probe determines

the exact token-count split between cached and fresh tokens per request, and

the cost tracker records both so the session budget reflects the actual

discounted cost, not the nominal token count.

Why not just count tokens: token counting with an offline tokenizer gives

you an upper bound but not the real cost. DeepSeek's API sometimes re-caches and

sometimes doesn't — the probe measures what actually happened. The discount is

too large (10×) to leave unmeasured.

→ headroom-sidecar,

COLIBRI-TOKENOMICS-TRIFECTA.md, crates/colibri-deepseek/src/lib.rs

Three cost modes (fast → smart → max)

Mode	Budget (tokens)	Behavior
Fast	16K	Maximum cache-hits, minimum fresh tokens. Rejects large expansions early.
Smart	64K	Default. Balances cache reuse with room for follow-up turns.
Max	256K	Almost never hits budget. For one-shot deep tasks where cost is secondary.

The daemon auto-escalates when a session exhausts its budget in a lower

mode: fast → smart → max. Escalation is one-way (never downgrades mid-session).

Why three modes, not a continuous slider: simplicity wins here. Three

well-understood points cover the space — operators pick by risk appetite, not

by fine-tuning a number. The escalation chain means "start cheap, pay more only

if it works."

→ COLIBRI-TOKENOMICS-TRIFECTA.md,

crates/colibri-daemon/src/cost.rs

T14 compaction (budget trim, not truncate)

When a session is about to exceed its budget, Colibri compacts the tool results

in the volatile region — it sends them through the headroom sidecar for

summarization, then trims the oldest volatile blocks until the prompt fits

within budget. The prefix (system prompt, static context) is never trimmed

— only the volatile suffix.

If compaction is insufficient and auto-escalation is enabled, the mode steps up

before truncating.

Why not just truncate: truncating mid-conversation loses context the agent

needs to continue. Compaction preserves the semantic content at lower token cost.

The headroom sidecar is optional (off by default); without it, the fallback is

simple truncation.

→ headroom-sidecar,

crates/colibri-daemon/src/session.rs

Cache-hit probe (DeepSeek-specific)

The colibri-deepseek crate sends a preflight request with a known prompt to

the DeepSeek API and parses the response headers to determine the cache-hit

split (prompt_cache_hit_tokens / prompt_cache_miss_tokens). This is

provider-specific — DeepSeek is the only provider that exposes this granularity.

The probe runs once per session configuration change, not per request.

Why a probe and not a hook: middleware that intercepts every API response

would couple cost tracking to the HTTP layer. A probe decouples it — the cost

tracker asks "what was the cache ratio?" and the probe answers, independently of

how the request was made.

→ crates/colibri-deepseek/src/lib.rs

Cost model

Cost model

What this is

Decisions

Byte-stable prompt prefix → cache-hit metering

Three cost modes (fast → smart → max)

T14 compaction (budget trim, not truncate)

Cache-hit probe (DeepSeek-specific)

See also