The full picture: how Tally routes across language model providers, the bandit algorithm that learns your workload, and the telemetry loop that keeps it current.
The major providers now offer a range of models spanning roughly 20x in price, with a much smaller spread in quality for most real-world tasks. Claude Haiku, GPT-4o mini, and Gemini Flash sit at the affordable end; Claude Opus and the other frontier flagships sit at the expensive end. In between are the mid-tier workhorses, Sonnet, GPT-4o, and Gemini Pro, that most teams default to because they're good enough for everything and not frighteningly expensive.
The default of "one model for everything" isn't irrational — it's simple. Simplicity has value. But it leaves a significant amount of money on the table, because the expensive model is doing work the cheap model could handle just as well.
Prices per million input tokens, approximate. These change — which is exactly why hard-coded routing rules decay. Tally tracks provider pricing and incorporates it into routing decisions automatically.
Before calling an LLM, you call tally.route(envelope) with a semantic envelope —
a lightweight description of what kind of task this is. The envelope describes
the shape of the work without revealing the content. Your prompts never leave your app.
In the envelope, setting a latency-sensitive flag to true biases routing toward lower-latency models, even at higher cost.
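Concretely, an envelope might look like this. The field names below are illustrative assumptions, not a confirmed schema:

```python
# A sketch of a semantic envelope. Field names (task_type,
# expected_output_tokens, quality_floor, latency_sensitive) are
# assumptions for illustration, not the SDK's confirmed schema.
envelope = {
    "task_type": "summarisation",       # the shape of the work
    "expected_output_tokens": 300,      # rough size, not the content
    "quality_floor": 0.8,               # minimum acceptable quality
    "latency_sensitive": False,         # True biases toward faster models
}
# Note what is absent: no prompt text. The envelope describes the
# task's shape, so your prompts never leave your app.
```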
Tally uses a contextual multi-armed bandit algorithm to pick the best model for each envelope. Each model in your pool is an "arm." The algorithm maintains a confidence score per (task_type, model) pair, updated continuously by incoming telemetry.
When confidence for a model is high, the bandit exploits — it routes directly to the proven winner, minimising cost. When confidence is low — because the task shape is new, a model was recently updated, or exploration time has elapsed — the bandit explores, trying other models to gather signal and keep its knowledge current.
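The explore/exploit decision above can be sketched in a few lines. The threshold value and the choice of a uniformly random exploration arm are assumptions for illustration, not Tally's actual internals:

```python
import random

# Sketch of the bandit's explore/exploit step. `confidence` holds the
# score for each model on the current task type; CONFIDENCE_THRESHOLD
# is an assumed knob, not a documented Tally parameter.
CONFIDENCE_THRESHOLD = 0.9

def pick_model(confidence: dict, cost: dict) -> str:
    """Exploit the cheapest high-confidence model, else explore."""
    proven = [m for m, c in confidence.items() if c >= CONFIDENCE_THRESHOLD]
    if proven:
        # Exploit: route directly to the cheapest proven winner.
        return min(proven, key=lambda m: cost[m])
    # Explore: try another arm to gather signal and stay current.
    return random.choice(list(confidence))
```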
The calibration phase runs when a model is first added to your pool. Tally distributes traffic across all models proportionally until it has enough signal to form reliable confidence scores. This typically takes 50–200 calls per task type, depending on variance.
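A minimal sketch of the calibration idea, assuming a per-model observation count and a hypothetical MIN_CALLS knob (the source cites 50–200 calls per task type, depending on variance):

```python
# Calibration sketch: spread traffic until every model has enough
# observations for this task type. MIN_CALLS is an assumed knob.
MIN_CALLS = 50

def needs_calibration(observations: dict) -> bool:
    """True while any model lacks enough signal for reliable scores."""
    return any(n < MIN_CALLS for n in observations.values())

def calibration_pick(observations: dict) -> str:
    """Send the next call to the least-observed model."""
    return min(observations, key=observations.get)
```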
Exploration is not waste. Every exploration event is an investment — it keeps the model honest as providers update their LLMs and as your workload evolves. A bandit that never explores will drift out of date. Tally keeps a small, controlled exploration budget running at all times.
After every LLM call, you fire tally.telemetry() with the outcome.
This is the signal that updates the bandit's confidence scores and drives continuous improvement.
It's fire-and-forget — it does not block your response, and if a listener is temporarily
unreachable, the SDK queues and retries automatically.
The quality signal in telemetry can be as simple as a success/failure boolean or as
nuanced as a 0–1 quality score with slugs (e.g., ["hallucination", "incomplete"]
for a poor response). The richer the signal, the faster the bandit learns.
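A telemetry event might carry fields like these. The exact schema is an assumption based on the description above; only the boolean/score/slugs distinction comes from the docs:

```python
# Sketch of a telemetry payload; field names are assumptions.
# Quality can be a simple success boolean, or a richer 0-1 score
# with slugs describing what went wrong.
telemetry_event = {
    "model": "claude-haiku",
    "task_type": "summarisation",
    "success": False,
    "quality": 0.4,                            # optional richer signal
    "slugs": ["hallucination", "incomplete"],  # why quality was low
    "latency_ms": 820,
}
# tally.telemetry(telemetry_event)  # fire-and-forget; never blocks
```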
Tally optimises for cost — but never at the expense of quality below your threshold.
Every model in your pool has an observed quality score per task type, derived from
the telemetry you send back. When you specify a quality_floor in an envelope,
Tally filters out any model whose observed quality for that task type falls below it,
then picks the cheapest from what remains.
If no model in your pool clears the quality floor for a given task type, Tally returns your configured fallback model — typically your highest-quality option. It will never recommend a model it has observed failing your quality bar. You set the floor. Tally respects it absolutely.
This means Tally routes aggressively to cheap models when quality data supports it, and conservatively falls back to reliable options when it doesn't have enough signal — or when signal indicates the cheaper model doesn't meet your bar.
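The floor-then-cheapest selection described above reduces to a few lines. This is a sketch of the documented behaviour, not the production router:

```python
# Quality-floor routing sketch: drop any model whose observed quality
# for this task type is below the floor, pick the cheapest survivor,
# and fall back to the configured model if nothing clears the bar.
def route_with_floor(quality: dict, cost: dict,
                     quality_floor: float, fallback: str) -> str:
    eligible = [m for m, q in quality.items() if q >= quality_floor]
    if not eligible:
        return fallback  # no model clears the floor
    return min(eligible, key=lambda m: cost[m])
```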
Most teams are surprised to discover that 60–75% of their real workload is handled at equal quality by their cheapest model. Tally surfaces this with hard data, not estimates. Run it on your actual calls and see.
Telemetry is free, forever. What you pay for is routing recommendations — Tally returning the right model for each call, backed by the bandit and your own quality history.
Every LLM call recorded in full. No caps, no expiry, no sampling on the data side.
Every call to /route returns a model recommendation:
which model, a confidence score, and a streaming mode. Free accounts
get a recommendation on 10% of calls at no charge —
enough to see it working. Paid accounts get one on every call.
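A /route response, then, might be shaped like this; the field names are assumptions based on the three items described (model, confidence score, streaming mode):

```python
# Illustrative /route response shape; field names are assumptions.
route_response = {
    "model": "claude-haiku",   # which model to call
    "confidence": 0.93,        # the bandit's confidence score
    "streaming": True,         # recommended streaming mode
}
```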
One cent per recommendation. Founding pricing — subject to revision. Free accounts get 10% sampled at no charge. Full details on the Pricing page.
Your real workload, intelligently routed from day one.