Tally
Intelligent AI Routing

Stop overpaying
for AI models

Tally sits between your application and your LLM providers. It learns which model is best for each type of task — then routes automatically, cutting costs without sacrificing quality.

tally route · live
ROUTE code-debug · context:long
model claude-haiku-3-5
savings ↓ 71% vs sonnet
reason high exploit confidence
 
ROUTE architecture-design · tools:6
model claude-sonnet-4-5
reason high complexity — exploring
 
ROUTE data-analysis · structure:json
model claude-haiku-3-5
savings ↓ 71%
 
─────────────────────────────
calls 2,847 today
saved $4.23 vs always-sonnet
quality 97.4% success rate
Why Tally

The smart layer between
your app and the LLMs

Most teams pick one model and use it for everything. Tally learns the shape of each task and routes to the most cost-effective model that will still get the job done.

💰

Cut costs without guesswork

Tally's multi-armed bandit learns which model handles each task type well — then exploits that knowledge to save you money on every call.
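Tally's production algorithm isn't published here, but the exploit/explore tradeoff it describes can be sketched with a minimal epsilon-greedy bandit. Everything below (class name, reward signal, model names) is illustrative, not Tally's actual implementation:

```typescript
// Minimal epsilon-greedy bandit over candidate models.
// reward is the running sum of success signals (1 = success, 0 = failure).
type Arm = { model: string; pulls: number; reward: number };

class EpsilonGreedyRouter {
  constructor(private arms: Arm[], private epsilon = 0.1) {}

  // Usually exploit the best-known arm; occasionally explore to keep learning.
  route(): string {
    const untried = this.arms.filter((a) => a.pulls === 0);
    if (untried.length > 0 || Math.random() < this.epsilon) {
      const pool = untried.length > 0 ? untried : this.arms;
      return pool[Math.floor(Math.random() * pool.length)].model;
    }
    // Exploit: highest average reward so far.
    const best = this.arms.reduce((a, b) =>
      a.reward / a.pulls >= b.reward / b.pulls ? a : b
    );
    return best.model;
  }

  // Feed an outcome back into the bandit's estimates.
  telemetry(model: string, reward: number): void {
    const arm = this.arms.find((a) => a.model === model);
    if (!arm) return;
    arm.pulls += 1;
    arm.reward += reward;
  }
}
```

With epsilon set to 0 the router always exploits; the epsilon slice is what keeps it re-checking cheaper models whose quality may have shifted.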

🎯

Quality never compromised

Routing decisions are driven by real success signals. If a cheaper model starts underperforming, Tally detects it and adjusts automatically.

📊

Full observability

Every call is tagged with semantic metadata — task type, complexity, tools, structure. See exactly where your AI budget is going.
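A tagged call record might look like the following. The field names are assumptions drawn from the dimensions named above (task type, complexity, tools, structure, context length), not Tally's documented schema:

```typescript
// Hypothetical shape of a tagged call record; field names are
// illustrative, based on the metadata dimensions described above.
interface CallRecord {
  taskType: string;                 // e.g. 'code-debug', 'data-analysis'
  complexity: "low" | "medium" | "high";
  tools: number;                    // tools available to the call
  structure?: "json" | "text";      // expected output structure
  contextLength: "short" | "long";
  modelUsed: string;
  costUsd: number;
  outcome: "success" | "failure";
}

const record: CallRecord = {
  taskType: "code-debug",
  complexity: "low",
  tools: 0,
  contextLength: "long",
  modelUsed: "claude-haiku-3-5",
  costUsd: 0.0003,
  outcome: "success",
};
```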

🧠

Gets smarter over time

Each telemetry event feeds the bandit. The longer Tally runs on your workload, the more precisely it can exploit model strengths.

🔌

Drop-in SDK

Two API calls — route() before and telemetry() after. No infrastructure changes, no proxy servers, no rewrites.

🏢

Team-aware billing

Organize by org, set per-team token budgets, and track which products or users are driving costs. Multi-org support built in.
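As a sketch only, a per-team budget configuration could look like this. Tally's actual billing API is not shown on this page, so the org name, field names, and numbers are hypothetical:

```typescript
// Hypothetical org/team budget configuration (illustrative only).
const orgConfig = {
  org: "example-org",
  teams: [
    { name: "search", monthlyTokenBudget: 50_000_000 },
    { name: "support-bot", monthlyTokenBudget: 20_000_000 },
  ],
};

// Total token budget across all teams in the org.
const totalBudget = orgConfig.teams.reduce(
  (sum, t) => sum + t.monthlyTokenBudget,
  0
);
```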

How It Works

Three steps to smarter routing

Tally wraps your existing LLM calls. No infrastructure changes required.

1

Describe the task

Before each LLM call, build a semantic envelope describing the task — its type, complexity, structure, tools needed, and context length. Takes one line of code.

2

Ask Tally which model

Call route() with the envelope and your available models. Tally's bandit returns the recommended model — either exploiting what it knows or exploring to keep learning.

3

Report the outcome

After the LLM responds, fire telemetry() with the result — tokens used, success/fail, quality score. Tally updates its model and the routing gets smarter.

example.ts
import { TallyClient, buildEnvelope } from '@tally/sdk'

const tally = new TallyClient({
  apiKey: process.env.TALLY_API_KEY
})

// 1. Describe the task
const envelope = buildEnvelope({
  taskType: 'code-debug',
  contextLength: 'long'
})

// 2. Get a route recommendation
const models = ['claude-haiku-3-5', 'claude-sonnet-4-5']
const { recommended_model } =
  await tally.route(envelope, models)

// 3. Call the LLM, then report the outcome
const result = await callLLM(recommended_model)
tally.telemetry({
  model_used: recommended_model,
  outcome: 'success',
  ntok: result.tokens
})
Full technical walkthrough →
71%
average cost reduction on routable tasks
6
supported model providers
<5ms
typical route decision latency
gets smarter with every call
Live Demo

Watch the bandit learn

The harness generates realistic workloads — code debugging, architecture design, data analysis, content writing — and streams live routing decisions as Tally learns which model handles each scenario best.

Watch exploration vs. exploitation play out in real time. See cost savings accumulate with every correctly routed call.

Run the harness →
harness · live diagnostics
Events: 412 · Rate: 5/s · Mode: random
 
[412] haiku code-debug $0.0003 exploit
[411] sonnet arch-design $0.0021 explore
[410] haiku data-analysis $0.0002 exploit
[409] haiku code-review $0.0004 exploit
[408] gpt4o creative-writing $0.0018 explore
 
Model distribution (last 100):
haiku ████████████████████ 67%
sonnet ████████ 24%
gpt-4o ███ 9%
 
Total cost: $0.2847 vs $0.9103 baseline
Savings: $0.6256 (68.7%)

Ready to stop overpaying for AI?

Accounts are now open. Get started for free.