Core Routing

Tally + LLMs

The full picture: how Tally routes across language model providers, the bandit algorithm that learns your workload, and the telemetry loop that keeps it current.

The landscape

Many models. One workload. The wrong default.

The major providers now offer a range of models spanning roughly 20x in price and a much smaller range in quality for most real-world tasks. Claude Haiku, GPT-4o mini, and Gemini Flash sit at the affordable end. Claude Opus, o1, and Gemini Ultra sit at the expensive end. In between are the mid-tier workhorses — Sonnet, GPT-4o, Gemini Pro — that most teams default to because they're good enough for everything and not frighteningly expensive.

The default of "one model for everything" isn't irrational — it's simple. Simplicity has value. But it leaves a significant amount of money on the table, because the expensive model is doing work the cheap model could handle just as well.

Anthropic Claude

  • claude-haiku-3-5: $0.80
  • claude-sonnet-4: $3.00
  • claude-opus-4: $15.00

OpenAI GPT

  • gpt-4o-mini: $1.50
  • gpt-4o: $5.00
  • o1: $15.00

Google Gemini

  • gemini-flash: $0.75
  • gemini-pro: $3.50
  • gemini-ultra: $10.00+

Prices per million input tokens, approximate. These change — which is exactly why hard-coded routing rules decay. Tally tracks provider pricing and incorporates it into routing decisions automatically.
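Per-call cost follows directly from these per-million-token rates. A minimal sketch, using the approximate prices above (the table and function here are illustrative, not part of the SDK):

```python
# Illustrative only: approximate prices per million input tokens, as listed above.
# A hard-coded routing rule bakes these numbers in; they go stale as providers reprice.
PRICE_PER_MTOK = {
    "claude-haiku-3-5": 0.80,
    "claude-sonnet-4": 3.00,
    "claude-opus-4": 15.00,
}

def call_cost(model: str, input_tokens: int) -> float:
    """Approximate input cost of a single call, in dollars."""
    return input_tokens / 1_000_000 * PRICE_PER_MTOK[model]

# The same 8k-token prompt differs almost 19x in cost across the Claude tiers:
print(call_cost("claude-haiku-3-5", 8_000))  # 0.0064
print(call_cost("claude-opus-4", 8_000))     # 0.12
```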

Step 1 — Before the call

Describing the shape of a task.

Before calling an LLM, you call tally.route(envelope) with a semantic envelope — a lightweight description of what kind of task this is. The envelope describes the shape of the work without revealing the content. Your prompts never leave your app.

task_type
What kind of work is this? Drives the primary routing decision.
Values: code-debug, code-generation, data-analysis, creative-writing, summarisation, qa-simple, architecture-design

structure_type
What format is the output? Informs streaming recommendation and model selection.
Values: prose, code, json, list, mixed

context_length
How large is the context being sent? Filters models that can't fit the payload.
Values: short, medium, long, very-long

tools_used
Number of tools or MCP servers in scope. Higher values favour models with strong tool-use capabilities.

time_sensitive
Is this on the critical path to a human response? true biases toward lower-latency models even at higher cost.

quality_floor
Minimum acceptable quality score (0–1). Tally will not route to a model whose observed quality for this task type falls below this floor.

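Putting the fields together, a routing call might look like this. A sketch only: the field names follow the spec above, but the client call shape is an assumption, not the published SDK surface.

```python
# Semantic envelope: describes the shape of the task, never its content.
# Field names come from the envelope spec above; values are illustrative.
envelope = {
    "task_type": "code-debug",
    "structure_type": "code",
    "context_length": "medium",
    "tools_used": 2,
    "time_sensitive": True,
    "quality_floor": 0.85,
}

# decision = tally.route(envelope)        # hypothetical call: returns a model recommendation
# response = llm_client.call(decision.model, prompt)  # your prompt never leaves your app
```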
Step 2 — The routing decision

The multi-armed bandit: confidence-driven routing.

Tally uses a contextual multi-armed bandit algorithm to pick the best model for each envelope. Each model in your pool is an "arm." The algorithm maintains a confidence score per (task_type, model) pair, updated continuously by incoming telemetry.

When confidence for a model is high, the bandit exploits — it routes directly to the proven winner, minimising cost. When confidence is low — because the task shape is new, a model was recently updated, or exploration time has elapsed — the bandit explores, trying other models to gather signal and keep its knowledge current.

Bandit confidence — task: code-debug

  • claude-haiku-3-5: 0.91 (exploit)
  • claude-sonnet-4: 0.73 (explore)
  • gpt-4o-mini: 0.58 (explore)
  • gemini-flash: 0.22 (calibrate)

Haiku has earned high confidence for this task type. Gemini Flash is new to the pool — calibration phase.

The calibration phase runs when a model is first added to your pool. Tally distributes traffic across all models proportionally until it has enough signal to form reliable confidence scores. This typically takes 50–200 calls per task type, depending on variance.

Exploration is not waste. Every exploration event is an investment — it keeps the model honest as providers update their LLMs and as your workload evolves. A bandit that never explores will drift out of date. Tally keeps a small, controlled exploration budget running at all times.
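The exploit/explore trade-off above can be sketched with a UCB-style score. This is a deliberate simplification under assumed statistics, not Tally's actual algorithm: each arm keeps an observed quality mean and a pull count, an uncertainty bonus keeps a small exploration budget running, and never-tried arms are pulled first (the calibration phase).

```python
import math

# Minimal exploit/explore sketch, not Tally's production algorithm.
# arms: model -> {"mean_quality": observed 0-1 score, "pulls": call count, "price": $/Mtok}
def pick_model(arms: dict, total_pulls: int, c: float = 0.3) -> str:
    def score(stats):
        if stats["pulls"] == 0:
            return float("inf")  # calibration: always try a brand-new arm first
        # Exploit quality-per-dollar, plus a UCB bonus that shrinks as evidence accumulates.
        bonus = c * math.sqrt(math.log(total_pulls) / stats["pulls"])
        return stats["mean_quality"] / stats["price"] + bonus
    return max(arms, key=lambda m: score(arms[m]))

arms = {
    "claude-haiku-3-5": {"mean_quality": 0.91, "pulls": 400, "price": 0.80},
    "claude-sonnet-4":  {"mean_quality": 0.95, "pulls": 120, "price": 3.00},
    "gemini-flash":     {"mean_quality": 0.0,  "pulls": 0,   "price": 0.75},
}
print(pick_model(arms, total_pulls=520))  # gemini-flash: uncalibrated arm is tried first
```

Once every arm has signal, the bonus term decays and the bandit settles into exploiting the best quality-per-dollar arm, while still occasionally revisiting the others.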

Step 3 — The feedback loop

Telemetry: how the bandit keeps learning.

After every LLM call, you fire tally.telemetry() with the outcome. This is the signal that updates the bandit's confidence scores and drives continuous improvement. It's fire-and-forget — it does not block your response, and if a listener is temporarily unreachable, the SDK queues and retries automatically.

The quality signal in telemetry can be as simple as a success/failure boolean or as nuanced as a 0–1 quality score with slugs (e.g., ["hallucination", "incomplete"] for a poor response). The richer the signal, the faster the bandit learns.
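A telemetry payload might look like the following. The field names and helper are illustrative assumptions, not the published SDK spec; the point is that a bare success/failure boolean and a rich scored signal share one shape.

```python
# Hypothetical telemetry payload builder; field names are assumptions.
def make_telemetry(call_id, model, success, quality=None, slugs=None):
    """Build the outcome signal: quality may be a bare boolean or a 0-1 score."""
    return {
        "call_id": call_id,
        "model": model,
        # Fall back to a coarse 0/1 score when only success/failure is known.
        "quality": quality if quality is not None else (1.0 if success else 0.0),
        "slugs": slugs or [],
    }

payload = make_telemetry("req-123", "claude-haiku-3-5",
                         success=False, quality=0.3,
                         slugs=["hallucination", "incomplete"])
# tally.telemetry(payload)  # fire-and-forget; queued and retried if unreachable
```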

Safety net

Cost optimisation, bounded by your quality floor.

Tally optimises for cost — but never at the expense of quality below your threshold. Every model in your pool has an observed quality score per task type, derived from the telemetry you send back. When you specify a quality_floor in an envelope, Tally filters out any model whose observed quality for that task type falls below it, then picks the cheapest from what remains.


Quality floor is a hard constraint, not a preference.

If no model in your pool clears the quality floor for a given task type, Tally returns your configured fallback model — typically your highest-quality option. It will never recommend a model it has observed failing your quality bar. You set the floor. Tally respects it absolutely.

This means Tally routes aggressively to cheap models when quality data supports it, and conservatively falls back to reliable options when it doesn't have enough signal — or when signal indicates the cheaper model doesn't meet your bar.
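The floor-then-cheapest rule is simple to state precisely. A sketch of the selection logic under assumed data structures (not Tally's internals): filter out models below the floor for this task type, pick the cheapest survivor, and fall back when nothing clears the bar.

```python
# Sketch of the quality-floor rule: filter, then cheapest, else hard fallback.
# models: name -> {"price": $/Mtok, "quality": {task_type: observed 0-1 score}}
def choose(models: dict, task_type: str, quality_floor: float, fallback: str) -> str:
    eligible = {
        name: m for name, m in models.items()
        if m["quality"].get(task_type, 0.0) >= quality_floor  # hard constraint
    }
    if not eligible:
        return fallback  # no model clears the floor: use configured fallback
    return min(eligible, key=lambda name: eligible[name]["price"])

pool = {
    "claude-haiku-3-5": {"price": 0.80, "quality": {"code-debug": 0.91}},
    "claude-sonnet-4":  {"price": 3.00, "quality": {"code-debug": 0.95}},
}
print(choose(pool, "code-debug", quality_floor=0.85, fallback="claude-opus-4"))
# claude-haiku-3-5: the cheapest model clearing the floor
print(choose(pool, "code-debug", quality_floor=0.97, fallback="claude-opus-4"))
# claude-opus-4: nothing clears the floor, so the fallback wins
```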

The cheap model wins more than you'd expect. Tally proves it.

Most teams are surprised to discover that 60–75% of their real workload is handled at equal quality by their cheapest model. Tally surfaces this with hard data, not estimates. Run it on your actual calls and see.

Paid Features

Routing Recommendations.

Telemetry is free, forever. What you pay for is routing recommendations — Tally returning the right model for each call, backed by the bandit and your own quality history.

Always free

Telemetry

Every LLM call recorded in full. No caps, no expiry, no sampling on the data side.

  • Full call log — model, tokens, latency, outcome
  • Cost breakdown per call and rolled up by model, org, and time
  • Quality history — slugs and scores you report back
  • Model distribution over time — see your workload evolving
  • Export at any time — CSV, JSON, or API
  • Routing recommendations on every call — paid only

One cent per recommendation. Founding pricing — subject to revision. Free accounts get recommendations on a 10% sample of calls at no charge. Full details on the Pricing page.

Start routing smarter.

Your real workload, intelligently routed from day one.

Next section

Pricing

See exactly what we charge — and why it is embarrassingly simple.