The full picture: semantic envelopes, bandit learning, telemetry, and why this approach beats heuristics.
Most teams pick a single capable model — usually Claude Sonnet or GPT-4o — and use it for everything. It's the safe choice. But it's also expensive.
The reality is that the vast majority of AI calls are not complicated. A quick code comment, a data format conversion, a simple Q&A — these tasks are handled equally well by a model that costs 71% less.
The hard part is knowing which tasks need the expensive model and which don't. Heuristics don't scale. You can't write rules for every scenario. Tally learns it for you.
Before each LLM call, you build a semantic envelope — a lightweight description of what kind of task this is. Think of it as structured metadata about the shape of the work, not the content.
The envelope captures: task type (code-debug, architecture-design, data-analysis…), expected structure (prose, JSON, code, list), context length bucket, whether tools are involved, and how time-sensitive the response is.
This is what Tally reasons over. Not the raw prompt — just its semantic shape. That's what matters for routing, and it's much lighter to transmit.
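As a rough illustration, an envelope might look like the sketch below. The field names and types are hypothetical, not Tally's actual API; the point is that only structural metadata is captured, never prompt content.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical envelope shape -- illustrative field names, not Tally's real schema.
@dataclass(frozen=True)
class Envelope:
    task_type: str                                        # e.g. "code-debug", "architecture-design"
    structure: Literal["prose", "json", "code", "list"]   # expected output structure
    context_bucket: Literal["short", "medium", "long"]    # context length bucket
    uses_tools: bool                                      # whether tool calls are involved
    latency_sensitive: bool                               # how time-sensitive the response is

# Built before each LLM call, from metadata you already have:
envelope = Envelope(
    task_type="code-debug",
    structure="code",
    context_bucket="short",
    uses_tools=False,
    latency_sensitive=True,
)
```

Because the envelope is a small, fixed-shape record, it is cheap to compute and transmit regardless of how large the underlying prompt is.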
Tally uses a contextual multi-armed bandit to pick the best model for each envelope. Each "arm" is a model. Each envelope shape is a context. The bandit balances two goals:
Exploitation — when Tally is confident about which model wins for this task shape, it routes there directly to save money.
Exploration — when confidence is low, or after enough time has passed that model quality might have changed, Tally tries other models to keep its knowledge current.
The result: routing quality improves continuously, even as new models are added to your pool or existing ones are updated by providers.
High confidence → route directly to the proven winner. Maximum savings, zero guessing.
Low confidence or new model pool → try other models and observe outcomes. Continuous learning.
New installation or model pool change → balanced exploration across all options to build an initial picture of model performance.
After every LLM call, you fire telemetry() with the outcome.
This is the feedback loop that makes the bandit smarter.
Tally records: which model was used, whether it was an exploration or exploitation, the token count (for cost estimation), a success/fail signal, and optionally a quality score and quality slugs for nuanced signal.
Telemetry is fire-and-forget — it doesn't block your response to the user. If a listener is unreachable, the SDK queues and retries automatically.
Over time, you get a complete picture of your AI usage: which models handle which tasks, where costs are concentrated, and how quality has trended.
Every event lands in your Tally dashboard. See model distribution, cost trends, exploration vs. exploitation ratios, and quality scores — all broken down by task type, time, and org.
Set per-org token budgets to track which teams or products are driving costs. Get alerted when a model's quality drops or a cost anomaly occurs.
The Inspector shows live call feeds, cluster analysis, and adoption metrics — so you always know exactly what Tally is doing and why.
Daily and hourly cost breakdown by model, task type, and org. Compare actual spend vs. baseline (what you'd pay always routing to your most expensive model).
Track how often Tally is still learning vs. exploiting. As the model matures, exploitation rate rises and costs fall.
See how your real workload maps to task types and which model clusters have emerged as winners for each context.
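The spend-vs-baseline comparison works out to simple arithmetic. The per-token prices and call volumes below are made up for illustration; the baseline is what the same traffic would cost if every call went to the most expensive model.

```python
# Hypothetical per-1M-token prices -- illustrative numbers, not real rates.
PRICE_PER_MTOK = {"opus": 15.00, "sonnet": 3.00, "haiku": 0.25}

# (model actually routed to, tokens consumed)
calls = [("haiku", 2_000_000), ("haiku", 1_500_000), ("sonnet", 500_000)]

actual = sum(PRICE_PER_MTOK[m] * toks / 1_000_000 for m, toks in calls)
baseline = max(PRICE_PER_MTOK.values()) * sum(t for _, t in calls) / 1_000_000
savings = 1 - actual / baseline   # fraction saved vs. always using the top model
```

With these made-up numbers, actual spend is $2.375 against a $60.00 baseline, i.e. routing cheap-shaped tasks to cheap models captures most of the savings.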
Each topic below has its own full page. Start with the LLM Primer if you're new to API-mode language models.
How LLMs work in API mode — statelessness, tokens, context windows, and the property that makes per-turn routing possible.
When to stream LLM responses, when not to, and how Tally reads task shape to recommend the right delivery mode.
What a shape is, how it's computed entirely on your side from structural metadata, and why your prompts and content never leave your application.
Routing across providers — Claude, GPT, Gemini and beyond. The bandit algorithm, quality floors, and the telemetry feedback loop.
Tool calls have a unique cost profile. How Tally routes MCP workloads, the invocation vs. synthesis distinction, and why Tally is always a witness.
The obvious alternative to Tally is writing your own routing logic. Here's why that doesn't scale.
Providers update models, add capabilities, and adjust pricing. A rule that made sense six months ago may now be wrong. Tally adapts; static rules don't.
Generic guidance ("use Haiku for simple tasks") doesn't know your users, your prompts, or your quality bar. Tally learns your specific patterns.
A model might score well on benchmarks but underperform on your exact use case. Tally measures quality on your actual outputs, not synthetic evals.
The harness generates synthetic workloads and streams routing decisions in real time.
Next up
LLM Primer →