Technical Walkthrough

How Tally Works

The full picture: semantic envelopes, bandit learning, telemetry, and why this approach beats heuristics.

The Problem

One model doesn't fit all tasks

Most teams pick a single capable model — usually Claude Sonnet or GPT-4o — and use it for everything. It's the safe choice. But it's also expensive.

The reality is that the vast majority of AI calls are not complicated. A quick code comment, a data format conversion, a simple Q&A — these tasks are handled equally well by a model that costs 71% less.

The hard part is knowing which tasks need the expensive model and which don't. Heuristics don't scale. You can't write rules for every scenario. Tally learns it for you.

Cost per 1M tokens

claude-haiku     $0.80
gpt-4o-mini      $1.50
gemini-flash     $0.75
claude-sonnet    $3.00
gpt-4o           $5.00
claude-opus     $15.00
⚠️ Using the wrong model for a simple task costs up to 20× more than necessary ($15.00 for claude-opus vs. $0.75 for gemini-flash).
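To make that multiplier concrete, here is a quick sketch using the prices above. It treats each figure as a single blended per-token rate, which is a simplification; real providers price input and output tokens separately.

```typescript
// Prices per 1M tokens, from the table above (treated as one blended rate)
const pricePer1M: Record<string, number> = {
  "claude-haiku": 0.80,
  "gpt-4o-mini": 1.50,
  "gemini-flash": 0.75,
  "claude-sonnet": 3.00,
  "gpt-4o": 5.00,
  "claude-opus": 15.00,
};

// Cost in dollars of one call consuming `tokens` tokens on `model`
function callCost(model: string, tokens: number): number {
  return (pricePer1M[model] / 1_000_000) * tokens;
}

// A 3k-token task on opus vs. gemini-flash: a 20x price gap for the same work
const ratio = callCost("claude-opus", 3000) / callCost("gemini-flash", 3000);
```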
Step 1 — Before the call

The Semantic Envelope

Before each LLM call, you build a semantic envelope — a lightweight description of what kind of task this is. Think of it as structured metadata about the shape of the work, not the content.

The envelope captures: task type (code-debug, architecture-design, data-analysis…), expected structure (prose, JSON, code, list), context length bucket, whether tools are involved, and how time-sensitive the response is.

This is what Tally reasons over. Not the raw prompt — just its semantic shape. That's what matters for routing, and it's much lighter to transmit.

semantic envelope
{
  "task_type": "code-debug",
  "structure_type": "code",
  "context_length": "long",
  "estimated_tokens": "2k-8k",
  "tools_used": 2,
  "time_sensitive": false,
  "protocol_version": "2"
}
// Built by buildEnvelope() — never manually constructed
// Content stays in your app. Only shape goes to Tally.
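The envelope is produced by `buildEnvelope()`, as the comment above notes. Its internals are not public, so the following is only an illustrative stand-in: it derives shape from the prompt locally, and every bucketing heuristic in it is an assumption, not Tally's actual classifier.

```typescript
type Envelope = {
  task_type: string;
  structure_type: "prose" | "json" | "code" | "list";
  context_length: "short" | "medium" | "long";
  estimated_tokens: string;
  tools_used: number;
  time_sensitive: boolean;
  protocol_version: string;
};

// Illustrative stand-in for the SDK's buildEnvelope().
// The real classifier is internal; these heuristics are assumptions.
function buildEnvelope(prompt: string, toolsUsed: number, timeSensitive: boolean): Envelope {
  const approxTokens = Math.ceil(prompt.length / 4); // rough chars-per-token heuristic
  return {
    task_type: /stack trace|error|bug/i.test(prompt) ? "code-debug" : "general-qa",
    structure_type: /\bfunction\b|\bclass\b|=>/.test(prompt) ? "code" : "prose",
    context_length: approxTokens > 2000 ? "long" : approxTokens > 500 ? "medium" : "short",
    estimated_tokens: approxTokens > 8000 ? "8k+" : approxTokens > 2000 ? "2k-8k" : "<2k",
    tools_used: toolsUsed,
    time_sensitive: timeSensitive,
    protocol_version: "2",
  };
}
```

The key property is that only this shape object crosses the wire; the prompt text itself never does.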
Step 2 — The Route Decision

Multi-Armed Bandit Routing

Tally uses a contextual multi-armed bandit to pick the best model for each envelope. Each "arm" is a model. Each envelope shape is a context. The bandit balances two goals:

Exploitation — when Tally is confident about which model wins for this task shape, it routes there directly to save money.

Exploration — when confidence is low, or after enough time has passed that model quality might have changed, Tally tries other models to keep its knowledge current.

The result: routing quality improves continuously, even as new models are added to your pool or existing ones are updated by providers.

Exploit confidence ≥ 0.85

High confidence → route directly to the proven winner. Maximum savings, zero guessing.

Explore confidence < 0.85

Low confidence or new model pool → try other models and observe outcomes. Continuous learning.

Calibrate warm-up mode

New installation or model pool change → balanced exploration across all options to build an initial model.
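The three modes above reduce to a single decision function. The sketch below uses the 0.85 threshold from the text, but the confidence proxy and the ε-greedy exploration step are deliberate simplifications of a real contextual bandit, not Tally's actual policy.

```typescript
type ArmStats = { pulls: number; meanReward: number };

const EXPLOIT_THRESHOLD = 0.85; // from the docs above

// One routing decision for a single envelope context.
// `stats` maps model name -> observed outcomes for this task shape.
function route(stats: Map<string, ArmStats>, calibrating: boolean): string {
  const models = [...stats.keys()];

  // Calibrate: fresh install or changed pool -> balanced exploration
  if (calibrating) return models[Math.floor(Math.random() * models.length)];

  // Find the arm with the best observed mean reward
  let best = models[0];
  for (const m of models) {
    if (stats.get(m)!.meanReward > stats.get(best)!.meanReward) best = m;
  }

  // Toy confidence proxy: share of traffic the leader has already absorbed
  const totalPulls = models.reduce((sum, m) => sum + stats.get(m)!.pulls, 0);
  const confidence = stats.get(best)!.pulls / Math.max(totalPulls, 1);

  if (confidence >= EXPLOIT_THRESHOLD) return best; // exploit the proven winner

  // Explore: sometimes try a non-leading arm to keep knowledge current
  const others = models.filter((m) => m !== best);
  return Math.random() < 0.5 || others.length === 0
    ? best
    : others[Math.floor(Math.random() * others.length)];
}
```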

Step 3 — After the call

Telemetry & Learning

After every LLM call, you fire telemetry() with the outcome. This is the feedback loop that makes the bandit smarter.

Tally records: which model was used, whether it was an exploration or exploitation, the token count (for cost estimation), a success/fail signal, and optionally a quality score and quality slugs for nuanced signal.

Telemetry is fire-and-forget — it doesn't block your response to the user. If a listener is unreachable, the SDK queues and retries automatically.

Over time, you get a complete picture of your AI usage: which models handle which tasks, where costs are concentrated, and how quality has trended.

telemetry event
tally.telemetry({
  semantic_envelope: envelope,

  // What actually happened
  model_used: 'claude-haiku-3-5',
  recommended_model: 'claude-haiku-3-5',
  outcome: 'success',

  // Usage for cost tracking
  ntok_input: 4200,
  ntok_output: 380,

  // Optional quality signal
  quality_score: 0.92,
  session_id: 'sess_abc123'
})
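On the receiving end, each event folds into the statistics of the arm that served it. Below is a minimal sketch of an incremental-mean update; how Tally actually blends `outcome` and `quality_score` into a reward is not documented here, so the 50/50 weighting is an assumption.

```typescript
type ArmStats = { pulls: number; meanReward: number };

// Fold one telemetry outcome into an arm's running statistics.
// Incremental mean: new = old + (reward - old) / pulls
function updateArm(
  stats: ArmStats,
  outcome: "success" | "fail",
  qualityScore?: number
): ArmStats {
  const base = outcome === "success" ? 1 : 0;
  // Blend the binary outcome with the optional quality signal (assumed 50/50)
  const reward = qualityScore !== undefined ? (base + qualityScore) / 2 : base;
  const pulls = stats.pulls + 1;
  return { pulls, meanReward: stats.meanReward + (reward - stats.meanReward) / pulls };
}
```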
Always On

Full Observability

Every event lands in your Tally dashboard. See model distribution, cost trends, exploration vs. exploitation ratios, and quality scores — all broken down by task type, time, and org.

Set per-org token budgets to track which teams or products are driving costs. Get alerted when a model's quality drops or a cost anomaly occurs.

The Inspector shows live call feeds, cluster analysis, and adoption metrics — so you always know exactly what Tally is doing and why.

📈 Cost Trends

Daily and hourly cost breakdown by model, task type, and org. Compare actual spend vs. baseline (what you'd pay always routing to your most expensive model).

🎲 Exploration Rate

Track how often Tally is still learning vs. exploiting. As the model matures, exploitation rate rises and costs fall.

🔍 Cluster Analysis

See how your real workload maps to task types and which model clusters have emerged as winners for each context.
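The spend-vs-baseline comparison from the Cost Trends card is easy to reproduce from telemetry token counts. A sketch, borrowing prices from the table at the top and using a single blended rate for simplicity:

```typescript
type CostEvent = { model: string; ntok_input: number; ntok_output: number };

// Per-1M-token prices from the table above (one blended rate for simplicity)
const pricePer1M: Record<string, number> = {
  "claude-haiku": 0.80,
  "claude-sonnet": 3.00, // the baseline: what you'd pay routing everything here
};

// Total dollar spend over a batch of telemetry events
function spend(events: CostEvent[], priceOf: (e: CostEvent) => number): number {
  return events.reduce(
    (sum, e) => sum + (priceOf(e) / 1_000_000) * (e.ntok_input + e.ntok_output),
    0
  );
}

const events: CostEvent[] = [{ model: "claude-haiku", ntok_input: 4200, ntok_output: 380 }];

const actual = spend(events, (e) => pricePer1M[e.model]);
const baseline = spend(events, () => pricePer1M["claude-sonnet"]);
const savings = 1 - actual / baseline; // fraction saved vs. always-expensive routing
```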

Go deeper

Learn the details

Each topic below has its own full page. Start with the LLM Primer if you're new to API-mode language models.

Philosophy

Why not write rules?

The obvious alternative to Tally is writing your own routing logic. Here's why that doesn't scale.

Models change constantly

Providers update models, add capabilities, and adjust pricing. A rule that made sense six months ago may now be wrong. Tally adapts; static rules don't.

Your workload is unique

Generic guidance ("use Haiku for simple tasks") doesn't know your users, your prompts, or your quality bar. Tally learns your specific patterns.

Quality signals are noisy

A model might score well on benchmarks but underperform on your exact use case. Tally measures quality on your actual outputs, not synthetic evals.

See it running live

The harness generates synthetic workloads and streams routing decisions in real time.

Next up

LLM Primer