The full picture: semantic envelopes, bandit learning, telemetry, and why this approach beats heuristics.
Most teams pick a single capable model — usually Claude Sonnet or GPT-4o — and use it for everything. It's the safe choice. But it's also expensive.
The reality is that the vast majority of AI calls are not complicated. A quick code comment, a data format conversion, a simple Q&A — these tasks are handled equally well by a model that costs 71% less.
The hard part is knowing which tasks need the expensive model and which don't. Heuristics don't scale. You can't write rules for every scenario. Tally learns it for you.
Before each LLM call, you build a semantic envelope — a lightweight description of what kind of task this is. Think of it as structured metadata about the shape of the work, not the content.
The envelope captures: task type (code-debug, architecture-design, data-analysis…), expected structure (prose, JSON, code, list), context length bucket, whether tools are involved, and how time-sensitive the response is.
This is what Tally reasons over. Not the raw prompt — just its semantic shape. That's what matters for routing, and it's much lighter to transmit.
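As a rough illustration, an envelope might look like the sketch below. The field names and types are hypothetical, not Tally's actual API; the point is that only structural metadata is captured, never prompt content.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical envelope shape -- illustrative field names, not Tally's real schema.
@dataclass(frozen=True)
class Envelope:
    task_type: str                                        # e.g. "code-debug", "architecture-design"
    structure: Literal["prose", "json", "code", "list"]   # expected output structure
    context_bucket: Literal["short", "medium", "long"]    # context length bucket
    uses_tools: bool                                      # whether tool calls are involved
    latency_sensitive: bool                               # how time-sensitive the response is

# Built before each LLM call, from metadata you already have:
envelope = Envelope(
    task_type="code-debug",
    structure="code",
    context_bucket="short",
    uses_tools=False,
    latency_sensitive=True,
)
```

Because the envelope is a small, fixed-shape record, it is cheap to compute and transmit regardless of how large the underlying prompt is.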
Tally uses a contextual multi-armed bandit to pick the best model for each envelope. Each "arm" is a model. Each envelope shape is a context. The bandit balances two goals:
Exploitation — when Tally is confident about which model wins for this task shape, it routes there directly to save money.
Exploration — when confidence is low, or after enough time has passed that model quality might have changed, Tally tries other models to keep its knowledge current.
The result: routing quality improves continuously, even as new models are added to your pool or existing ones are updated by providers.
High confidence → route directly to the proven winner. Maximum savings, zero guessing.
Low confidence or new model pool → try other models and observe outcomes. Continuous learning.
New installation or model pool change → balanced exploration across all options to build an initial picture of model performance.
After every LLM call, you fire telemetry() with the outcome.
This is the feedback loop that makes the bandit smarter.
Tally records: which model was used, whether it was an exploration or exploitation, the token count (for cost estimation), a success/fail signal, and optionally a quality score and quality slugs for nuanced signal.
Telemetry is fire-and-forget — it doesn't block your response to the user. If a listener is unreachable, the SDK queues and retries automatically.
Over time, you get a complete picture of your AI usage: which models handle which tasks, where costs are concentrated, and how quality has trended.
Every event lands in your Tally dashboard. See model distribution, cost trends, exploration vs. exploitation ratios, and quality scores — all broken down by task type, time, and org.
Set per-org token budgets to track which teams or products are driving costs. Get alerted when a model's quality drops or a cost anomaly occurs.
The Inspector shows live call feeds, cluster analysis, and adoption metrics — so you always know exactly what Tally is doing and why.
Daily and hourly cost breakdown by model, task type, and org. Compare actual spend vs. baseline (what you'd pay always routing to your most expensive model).
Track how often Tally is still learning vs. exploiting. As the model matures, exploitation rate rises and costs fall.
See how your real workload maps to task types and which model clusters have emerged as winners for each context.
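The spend-vs-baseline comparison works out to simple arithmetic. The per-token prices and call volumes below are made up for illustration; the baseline is what the same traffic would cost if every call went to the most expensive model.

```python
# Hypothetical per-1M-token prices -- illustrative numbers, not real rates.
PRICE_PER_MTOK = {"opus": 15.00, "sonnet": 3.00, "haiku": 0.25}

# (model actually routed to, tokens consumed)
calls = [("haiku", 2_000_000), ("haiku", 1_500_000), ("sonnet", 500_000)]

actual = sum(PRICE_PER_MTOK[m] * toks / 1_000_000 for m, toks in calls)
baseline = max(PRICE_PER_MTOK.values()) * sum(t for _, t in calls) / 1_000_000
savings = 1 - actual / baseline   # fraction saved vs. always using the top model
```

With these made-up numbers, actual spend is $2.375 against a $60.00 baseline, i.e. routing cheap-shaped tasks to cheap models captures most of the savings.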
Each topic below has its own full page. Start with the LLM Primer if you're new to API-mode language models.
How LLMs work in API mode — statelessness, tokens, context windows, and the property that makes per-turn routing possible.
When to stream LLM responses, when not to, and how Tally reads task shape to recommend the right delivery mode.
What a shape is, how it's computed entirely on your side from structural metadata, and why your prompts and content never leave your application.
Routing across providers — Claude, GPT, Gemini and beyond. The bandit algorithm, quality floors, and the telemetry feedback loop.
Tool calls have a unique cost profile. How Tally routes MCP workloads, the invocation vs. synthesis distinction, and why Tally is always a witness.
The obvious alternative to Tally is writing your own routing logic. Here's why that doesn't scale.
Providers update models, add capabilities, and adjust pricing. A rule that made sense six months ago may now be wrong. Tally adapts; static rules don't.
Generic guidance ("use Haiku for simple tasks") doesn't know your users, your prompts, or your quality bar. Tally learns your specific patterns.
A model might score well on benchmarks but underperform on your exact use case. Tally measures quality on your actual outputs, not synthetic evals.
The harness generates synthetic workloads and streams routing decisions in real time.
Next up
LLM Primer →