Tally never reads your prompts. It reads the shape of your calls — a structural fingerprint computed entirely on your side, from metadata you control.
Every AI call has two layers. The content layer — the actual prompt text, user messages, system instructions, tool results — is yours. It's sensitive. It stays in your application and never touches Tally.
The structural layer is different. What kind of task is this? How much context is involved? Are tools in scope? Is the response time-sensitive? These properties describe the shape of the work without revealing anything about what the work actually is. And shape is what determines which model will handle it best.
A semantic shape is that structural fingerprint. It's built from a handful of fields that your SDK derives from the call parameters you already know — task type, context length, output format, tools in scope. Together, those fields form a point in "task space." Calls that land at the same point behave similarly across models. That's what Tally routes on.
The key insight: the properties that determine which model you need are structural, not semantic. You don't need to send a word of your actual content to route well.
Before making an LLM call, you build a semantic envelope using the SDK's buildEnvelope() helper. This takes parameters you already know — the task type, context size, expected output — and wraps them into a structured object. No prompt text, no message content, no user data.
task_type: code-debug, architecture-design, data-analysis, summarisation, qa-simple, creative-writing. This is the strongest routing signal — different task types have dramatically different model requirements.

structure_type: prose, code, json, list, mixed. Structured outputs (JSON, typed code) require models with stronger schema adherence; free-form prose is more tolerant of lighter models.

context_length: short (<2k tokens), medium (2k–8k), long (8k–32k), extended (32k+). Long-context calls have different capability requirements and cost profiles.

Latency sensitivity: true means the user is waiting for a fast response — optimise for latency. false allows slower, more cost-effective models for background or batch tasks.

Expected output size: tiny (<500 tokens), small (500–2k), medium (2k–8k), large (8k–32k). This informs cost estimation and model selection for token-sensitive workloads.
That's the full envelope. Every Tally routing decision and every piece of telemetry is built on top of exactly these fields. Nothing else flows out of your application.
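To make the envelope concrete, here is a minimal TypeScript sketch of what the structured object might look like. The task_type, structure_type, and context_length names come from the buildEnvelope() signature shown later; the remaining fields and the exact SDK types are illustrative assumptions, not Tally's published interface.

```typescript
// Hypothetical envelope shape — field names beyond task_type,
// structure_type, and context_length are illustrative assumptions.
type SemanticEnvelope = {
  task_type:
    | "code-debug" | "architecture-design" | "data-analysis"
    | "summarisation" | "qa-simple" | "creative-writing";
  structure_type: "prose" | "code" | "json" | "list" | "mixed";
  context_length: "short" | "medium" | "long" | "extended";
  // ...plus the latency flag, expected output size, and
  // tools-in-scope fields described above.
};

// Note what is absent: there is no field for prompt text, messages,
// or user data — the type cannot carry content even by accident.
const envelope: SemanticEnvelope = {
  task_type: "code-debug",
  structure_type: "code",
  context_length: "long",
};
```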
Here is the complete journey from your call to Tally's routing decision. Notice where the content boundary is — and notice that Tally never crosses it.
You know what kind of task this is, how big the context is, whether tools are involved, whether the user is waiting. This knowledge lives in your application code.
buildEnvelope({ task_type, structure_type, context_length, ... }) runs entirely in your application. It constructs the structural descriptor from the parameters you pass. Your prompt text, system message, user message — none of that is an input to this function. It is never touched.
The envelope fields are normalised into discrete buckets and hashed into a stable cluster ID. Two calls with identical structural properties produce the same cluster ID, regardless of what they're actually asking. This computation also happens entirely in your application — no network call yet.
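A minimal sketch of how a stable cluster ID could be derived, assuming canonical field ordering and an FNV-1a hash. Tally's actual bucketing and hash function are not specified here, so treat this purely as an illustration of the determinism guarantee.

```typescript
// Sketch: hash bucketed envelope fields into a stable cluster ID.
// The field subset and the FNV-1a hash are illustrative assumptions.
type ShapeFields = {
  task_type: string;
  structure_type: string;
  context_length: string;
};

function clusterId(shape: ShapeFields): string {
  // Canonical ordering: identical shapes always serialise identically.
  const canonical = [shape.task_type, shape.structure_type, shape.context_length].join("|");
  // FNV-1a: a simple, stable, non-cryptographic 32-bit hash.
  let h = 0x811c9dc5;
  for (let i = 0; i < canonical.length; i++) {
    h ^= canonical.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}
```

The point the sketch demonstrates: the ID is a pure function of the structural buckets, so two calls with the same shape collide by construction, and no prompt text can leak through it.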
The SDK sends the envelope to Tally's routing API. This is the first and only network call before your LLM request. The envelope is the full payload — six structural fields plus the cluster ID. No text. No content.
Tally looks up the cluster, consults its bandit state, and returns the recommended model for this shape — plus a confidence score and exploration flag. Sub-millisecond. Your app now has a routing decision.
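Assuming the routing API is a plain request/response exchange, the interaction can be sketched as below. The field names and response schema are hypothetical; only the envelope-in, decision-out contract is taken from the description above. The transport is injected so the sketch stays self-contained.

```typescript
// Hypothetical routing exchange — names are assumptions, not the real API.
type RouteRequest = { cluster_id: string; task_type: string /* ...other envelope fields */ };
type RouteResponse = { model: string; confidence: number; explore: boolean };

async function getRoute(
  req: RouteRequest,
  post: (body: string) => Promise<string>, // e.g. a fetch() wrapper
): Promise<RouteResponse> {
  // The serialised envelope is the entire payload — no prompt text.
  const raw = await post(JSON.stringify(req));
  return JSON.parse(raw) as RouteResponse;
}
```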
You call your chosen LLM provider directly — Anthropic, OpenAI, Google, or any other. Tally is not in this call. It is not a proxy. Your full prompt and context go directly from your application to your LLM provider.
After you have the result, the SDK asynchronously sends the outcome to Tally: which model was used, whether it succeeded, input/output token counts, optional quality score. This is fire-and-forget — it does not block your response. The actual LLM output is never sent.
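The fire-and-forget property can be sketched like this: the telemetry send is started but deliberately not awaited, so the response path never waits on it. All names here are illustrative, not the SDK's real API.

```typescript
// Hypothetical outcome payload — numeric and categorical fields only,
// mirroring the description above. No LLM output is ever included.
type Outcome = {
  cluster_id: string;
  model: string;
  success: boolean;
  input_tokens: number;
  output_tokens: number;
  quality_score?: number; // optional
};

function reportOutcome(outcome: Outcome, send: (body: string) => Promise<void>): void {
  // Fire-and-forget: start the send, never await it, swallow failures.
  // Losing a telemetry event is acceptable; blocking the response is not.
  void send(JSON.stringify(outcome)).catch(() => {});
}
```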
Tally sees two things: a structural descriptor before the call, and a numeric outcome after it. Your data — prompts, responses, user messages, tool results — is never in the picture.
This is not a privacy policy promise. It's a description of what the protocol physically transmits. Tally cannot see the content layer because the content layer is never sent.
Think of it like mailing a package. The courier knows the dimensions and weight of the box — enough to choose the right delivery method. They never open it. They never need to.
Tally knows the dimensions of your AI call: how complex, how long, what format, what tools. That's the shape. The contents of the call are your business and go directly from your application to your LLM provider — Anthropic, OpenAI, or whoever you've chosen. Tally is never in that path.
This is not a policy stance born of legal caution. It is the architecture of the system. The routing decision requires no knowledge of your content. So we don't ask for it. And because we don't ask for it, we can't mishandle it.
Every time a call with a given shape completes, Tally records the outcome — which model was used, whether it succeeded, the token cost, the optional quality score. These outcomes accumulate against the cluster ID for that shape.
As samples build up, the routing confidence for each cluster increases. The bandit converges from exploration (trying different models to learn what works) toward exploitation (routing directly to the proven winner for that shape).
High-confidence clusters route directly to the best model. Low-confidence clusters still explore — trying different models to update their beliefs. This exploration rate naturally drops over time as clusters accumulate evidence, which is why routing cost-efficiency improves the longer Tally runs on your workload.
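As a rough illustration of that convergence, here is a minimal epsilon-greedy sketch in which the exploration rate shrinks as a cluster accumulates outcomes. Tally's actual bandit algorithm and decay schedule are not stated in this document, so this is a stand-in for the general mechanism, not the real policy.

```typescript
// Minimal per-cluster epsilon-greedy bandit. The decay schedule
// (1 / sqrt(total pulls)) is an illustrative assumption.
type Arm = { model: string; pulls: number; rewardSum: number };

function selectModel(arms: Arm[], rand: () => number): { model: string; explore: boolean } {
  const totalPulls = arms.reduce((sum, a) => sum + a.pulls, 0);
  // Exploration probability falls as evidence accumulates.
  const epsilon = 1 / Math.sqrt(1 + totalPulls);
  if (rand() < epsilon) {
    // Explore: sample a model uniformly to refine beliefs.
    return { model: arms[Math.floor(rand() * arms.length)].model, explore: true };
  }
  // Exploit: route to the arm with the best mean reward so far.
  const best = arms.reduce((a, b) =>
    a.rewardSum / Math.max(1, a.pulls) >= b.rewardSum / Math.max(1, b.pulls) ? a : b,
  );
  return { model: best.model, explore: false };
}
```

With many recorded pulls, epsilon is tiny, so mature clusters almost always exploit — which matches the cost-efficiency curve described above.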
Cross-account learning: clusters are shared across all Tally users (subject to privacy opt-out). A shape that others have seen thousands of times arrives already confident, even on day one for a new account.
The Tally live demo generates synthetic workloads and shows routing decisions in real time. Every decision includes the semantic envelope, the shape cluster ID, the model selected, and the confidence score driving that decision.
It's the fastest way to see how Tally maps task types to clusters, and how the bandit responds to different shapes with different confidence levels. No account required.
Synthetic calls. Real routing logic. See how the bandit assigns envelopes to clusters, picks models, and updates as outcomes arrive.
Open the Live Demo →

Smart model selection from structural metadata alone. Start free, no credit card.
Next up
Tally + LLMs →