Core Routing

Tally + LLMs

The full picture: how Tally routes across language model providers, the bandit algorithm that learns your workload, and the telemetry loop that keeps it current.

The landscape

Many models. One workload. The wrong default.

The major providers now offer a range of models spanning roughly 20x in price and a much smaller range in quality for most real-world tasks. Claude Haiku, GPT-4o mini, and Gemini Flash sit at the affordable end. Claude Opus, o1, and Gemini Ultra sit at the expensive end. In between are the mid-tier workhorses — Sonnet, GPT-4o, Gemini Pro — that most teams default to because they're good enough for everything and not frighteningly expensive.

The default of "one model for everything" isn't irrational — it's simple. Simplicity has value. But it leaves a significant amount of money on the table, because the expensive model is doing work the cheap model could handle just as well.

Anthropic Claude

  • claude-haiku-3-5: $0.80
  • claude-sonnet-4: $3.00
  • claude-opus-4: $15.00

OpenAI GPT

  • gpt-4o-mini: $1.50
  • gpt-4o: $5.00
  • o1: $15.00

Google Gemini

  • gemini-flash: $0.75
  • gemini-pro: $3.50
  • gemini-ultra: $10.00+

Prices per million input tokens, approximate. These change — which is exactly why hard-coded routing rules decay. Tally tracks provider pricing and incorporates it into routing decisions automatically.
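Per-call cost follows directly from these per-million-token rates. A minimal sketch, using the approximate prices above (the table and function here are illustrative, not part of the SDK):

```python
# Illustrative only: approximate prices per million input tokens, as listed above.
# A hard-coded routing rule bakes these numbers in; they go stale as providers reprice.
PRICE_PER_MTOK = {
    "claude-haiku-3-5": 0.80,
    "claude-sonnet-4": 3.00,
    "claude-opus-4": 15.00,
}

def call_cost(model: str, input_tokens: int) -> float:
    """Approximate input cost of a single call, in dollars."""
    return input_tokens / 1_000_000 * PRICE_PER_MTOK[model]

# The same 8k-token prompt differs almost 19x in cost across the Claude tiers:
print(call_cost("claude-haiku-3-5", 8_000))  # 0.0064
print(call_cost("claude-opus-4", 8_000))     # 0.12
```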

Step 1 — Before the call

Describing the shape of a task.

Before calling an LLM, you call tally.route(envelope) with a semantic envelope — a lightweight description of what kind of task this is. The envelope describes the shape of the work without revealing the content. Your prompts never leave your app.

task_type
What kind of work is this? Drives the primary routing decision.
Values: code-debug, code-generation, data-analysis, creative-writing, summarisation, qa-simple, architecture-design

structure_type
What format is the output? Informs streaming recommendation and model selection.
Values: prose, code, json, list, mixed

context_length
How large is the context being sent? Filters models that can't fit the payload.
Values: short, medium, long, very-long

tools_used
Number of tools or MCP servers in scope. Higher values favour models with strong tool-use capabilities.

time_sensitive
Is this on the critical path to a human response? true biases toward lower-latency models even at higher cost.

quality_floor
Minimum acceptable quality score (0–1). Tally will not route to a model whose observed quality for this task type falls below this floor.

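Putting the fields together, a routing call might look like this. A sketch only: the field names follow the spec above, but the client call shape is an assumption, not the published SDK surface.

```python
# Semantic envelope: describes the shape of the task, never its content.
# Field names come from the envelope spec above; values are illustrative.
envelope = {
    "task_type": "code-debug",
    "structure_type": "code",
    "context_length": "medium",
    "tools_used": 2,
    "time_sensitive": True,
    "quality_floor": 0.85,
}

# decision = tally.route(envelope)        # hypothetical call: returns a model recommendation
# response = llm_client.call(decision.model, prompt)  # your prompt never leaves your app
```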
Step 2 — The routing decision

The multi-armed bandit: confidence-driven routing.

Tally uses a contextual multi-armed bandit algorithm to pick the best model for each envelope. Each model in your pool is an "arm." The algorithm maintains a confidence score per (task_type, model) pair, updated continuously by incoming telemetry.

When confidence for a model is high, the bandit exploits — it routes directly to the proven winner, minimising cost. When confidence is low — because the task shape is new, a model was recently updated, or exploration time has elapsed — the bandit explores, trying other models to gather signal and keep its knowledge current.

Bandit confidence — task: code-debug

  • claude-haiku-3-5: 0.91 (exploit)
  • claude-sonnet-4: 0.73 (explore)
  • gpt-4o-mini: 0.58 (explore)
  • gemini-flash: 0.22 (calibrate)

Haiku has earned high confidence for this task type. Gemini Flash is new to the pool — calibration phase.

The calibration phase runs when a model is first added to your pool. Tally distributes traffic across all models proportionally until it has enough signal to form reliable confidence scores. This typically takes 50–200 calls per task type, depending on variance.

Exploration is not waste. Every exploration event is an investment — it keeps the model honest as providers update their LLMs and as your workload evolves. A bandit that never explores will drift out of date. Tally keeps a small, controlled exploration budget running at all times.
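The exploit/explore trade-off above can be sketched with a UCB-style score. This is a deliberate simplification under assumed statistics, not Tally's actual algorithm: each arm keeps an observed quality mean and a pull count, an uncertainty bonus keeps a small exploration budget running, and never-tried arms are pulled first (the calibration phase).

```python
import math

# Minimal exploit/explore sketch, not Tally's production algorithm.
# arms: model -> {"mean_quality": observed 0-1 score, "pulls": call count, "price": $/Mtok}
def pick_model(arms: dict, total_pulls: int, c: float = 0.3) -> str:
    def score(stats):
        if stats["pulls"] == 0:
            return float("inf")  # calibration: always try a brand-new arm first
        # Exploit quality-per-dollar, plus a UCB bonus that shrinks as evidence accumulates.
        bonus = c * math.sqrt(math.log(total_pulls) / stats["pulls"])
        return stats["mean_quality"] / stats["price"] + bonus
    return max(arms, key=lambda m: score(arms[m]))

arms = {
    "claude-haiku-3-5": {"mean_quality": 0.91, "pulls": 400, "price": 0.80},
    "claude-sonnet-4":  {"mean_quality": 0.95, "pulls": 120, "price": 3.00},
    "gemini-flash":     {"mean_quality": 0.0,  "pulls": 0,   "price": 0.75},
}
print(pick_model(arms, total_pulls=520))  # gemini-flash: uncalibrated arm is tried first
```

Once every arm has signal, the bonus term decays and the bandit settles into exploiting the best quality-per-dollar arm, while still occasionally revisiting the others.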

Step 3 — The feedback loop

Telemetry: how the bandit keeps learning.

After every LLM call, you fire tally.telemetry() with the outcome. This is the signal that updates the bandit's confidence scores and drives continuous improvement. It's fire-and-forget — it does not block your response, and if a listener is temporarily unreachable, the SDK queues and retries automatically.

The quality signal in telemetry can be as simple as a success/failure boolean or as nuanced as a 0–1 quality score with slugs (e.g., ["hallucination", "incomplete"] for a poor response). The richer the signal, the faster the bandit learns.
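A telemetry payload might look like the following. The field names and helper are illustrative assumptions, not the published SDK spec; the point is that a bare success/failure boolean and a rich scored signal share one shape.

```python
# Hypothetical telemetry payload builder; field names are assumptions.
def make_telemetry(call_id, model, success, quality=None, slugs=None):
    """Build the outcome signal: quality may be a bare boolean or a 0-1 score."""
    return {
        "call_id": call_id,
        "model": model,
        # Fall back to a coarse 0/1 score when only success/failure is known.
        "quality": quality if quality is not None else (1.0 if success else 0.0),
        "slugs": slugs or [],
    }

payload = make_telemetry("req-123", "claude-haiku-3-5",
                         success=False, quality=0.3,
                         slugs=["hallucination", "incomplete"])
# tally.telemetry(payload)  # fire-and-forget; queued and retried if unreachable
```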

Safety net

Cost optimisation, bounded by your quality floor.

Tally optimises for cost — but never at the expense of quality below your threshold. Every model in your pool has an observed quality score per task type, derived from the telemetry you send back. When you specify a quality_floor in an envelope, Tally filters out any model whose observed quality for that task type falls below it, then picks the cheapest from what remains.


Quality floor is a hard constraint, not a preference.

If no model in your pool clears the quality floor for a given task type, Tally returns your configured fallback model — typically your highest-quality option. It will never recommend a model it has observed failing your quality bar. You set the floor. Tally respects it absolutely.

This means Tally routes aggressively to cheap models when quality data supports it, and conservatively falls back to reliable options when it doesn't have enough signal — or when signal indicates the cheaper model doesn't meet your bar.
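The floor-then-cheapest rule is simple to state precisely. A sketch of the selection logic under assumed data structures (not Tally's internals): filter out models below the floor for this task type, pick the cheapest survivor, and fall back when nothing clears the bar.

```python
# Sketch of the quality-floor rule: filter, then cheapest, else hard fallback.
# models: name -> {"price": $/Mtok, "quality": {task_type: observed 0-1 score}}
def choose(models: dict, task_type: str, quality_floor: float, fallback: str) -> str:
    eligible = {
        name: m for name, m in models.items()
        if m["quality"].get(task_type, 0.0) >= quality_floor  # hard constraint
    }
    if not eligible:
        return fallback  # no model clears the floor: use configured fallback
    return min(eligible, key=lambda name: eligible[name]["price"])

pool = {
    "claude-haiku-3-5": {"price": 0.80, "quality": {"code-debug": 0.91}},
    "claude-sonnet-4":  {"price": 3.00, "quality": {"code-debug": 0.95}},
}
print(choose(pool, "code-debug", quality_floor=0.85, fallback="claude-opus-4"))
# claude-haiku-3-5: the cheapest model clearing the floor
print(choose(pool, "code-debug", quality_floor=0.97, fallback="claude-opus-4"))
# claude-opus-4: nothing clears the floor, so the fallback wins
```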

The cheap model wins more than you'd expect. Tally proves it.

Most teams are surprised to discover that 60–75% of their real workload is handled at equal quality by their cheapest model. Tally surfaces this with hard data, not estimates. Run it on your actual calls and see.

Paid Features

Routing Recommendations.

Telemetry is free, forever. What you pay for is routing recommendations — Tally returning the right model for each call, backed by the bandit and your own quality history.

Always free

Telemetry

Every LLM call recorded in full. No caps, no expiry, no sampling on the data side.

  • Full call log — model, tokens, latency, outcome
  • Cost breakdown per call and rolled up by model, org, and time
  • Quality history — slugs and scores you report back
  • Model distribution over time — see your workload evolving
  • Export at any time — CSV, JSON, or API
  • Routing recommendations on every call — paid only

One cent per recommendation. Founding pricing — subject to revision. Free accounts get recommendations on a 10% sample of calls at no charge. Full details on the Pricing page.

Start routing smarter.

Your real workload, intelligently routed from day one.

Next section

Pricing

See exactly what we charge — and why it is embarrassingly simple.