Privacy-First Routing

Semantic Shapes

Tally never reads your prompts. It reads the shape of your calls — a structural fingerprint computed entirely on your side, from metadata you control.

The core concept

A fingerprint of structure,
not content.

Every AI call has two layers. The content layer — the actual prompt text, user messages, system instructions, tool results — is yours. It's sensitive. It stays in your application and never touches Tally.

The structural layer is different. What kind of task is this? How much context is involved? Are tools in scope? Is the response time-sensitive? These properties describe the shape of the work without revealing anything about what the work actually is. And shape is what determines which model will handle it best.

A semantic shape is that structural fingerprint. It's built from six fields that your SDK derives from the call parameters you already know — task type, context length, output format, tools in scope. Together, those fields form a point in "task space." Calls that land at the same point behave similarly across models. That's what Tally routes on.

The key insight: the properties that determine which model you need are structural, not semantic. You don't need to send a word of your actual content to route well.

The semantic envelope

Six fields that describe any AI call.

Before making an LLM call, you build a semantic envelope using the SDK's buildEnvelope() helper. This takes parameters you already know — the task type, context size, expected output — and wraps them into a structured object. No prompt text, no message content, no user data.

task_type
The category of cognitive work. Examples: code-debug, architecture-design, data-analysis, summarisation, qa-simple, creative-writing. This is the strongest routing signal — different task types have dramatically different model requirements.
structure_type
The expected output format: prose, code, json, list, mixed. Structured outputs (JSON, typed code) require models with stronger schema adherence. Free-form prose is more tolerant of lighter models.
context_length
How much input context is involved: short (<2k tokens), medium (2k–8k), long (8k–32k), extended (32k+). Long context calls have different capability requirements and cost profiles.
tools_used
The number of MCP tools available in scope. Even tools that aren't called still consume input tokens (schema overhead). High tool counts bias toward models with strong tool-calling support. Zero means a pure text completion call.
time_sensitive
Is this a user-blocking, real-time call? true means the user is waiting for a fast response — optimise for latency. false allows slower, more cost-effective models for background or batch tasks.
estimated_tokens
A rough token budget bucket: tiny (<500), small (500–2k), medium (2k–8k), large (8k–32k). This informs cost estimation and model selection for token-sensitive workloads.
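The bucket boundaries above can be made concrete with a small helper. This is an illustrative sketch only — the SDK derives these buckets for you, and the function name here is hypothetical:

```typescript
// Hypothetical helper mapping a raw input-token count to the
// context_length buckets listed above (short / medium / long / extended).
function contextLengthBucket(
  tokens: number
): "short" | "medium" | "long" | "extended" {
  if (tokens < 2_000) return "short";
  if (tokens < 8_000) return "medium";
  if (tokens < 32_000) return "long";
  return "extended";
}
```

Bucketing matters for privacy too: Tally sees "long", never the exact token count of your context.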

That's the full envelope. Every Tally routing decision and every piece of telemetry is built on top of exactly these fields. Nothing else flows out of your application.
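As a sketch, the envelope might look like this in TypeScript. The type and the stand-in for `buildEnvelope()` below are assumptions for illustration — field names come from the list above, but the real SDK's signatures may differ:

```typescript
// A minimal sketch of the semantic envelope, using the six fields
// described above. No prompt text is ever an input to this structure.
type SemanticEnvelope = {
  task_type: string;        // e.g. "code-debug"
  structure_type: "prose" | "code" | "json" | "list" | "mixed";
  context_length: "short" | "medium" | "long" | "extended";
  tools_used: number;       // MCP tools in scope; 0 = pure completion
  time_sensitive: boolean;  // true = a user is waiting on this call
  estimated_tokens: "tiny" | "small" | "medium" | "large";
};

// Hypothetical stand-in for the SDK's buildEnvelope() helper: it only
// wraps structural parameters you already know.
function buildEnvelope(fields: SemanticEnvelope): SemanticEnvelope {
  return { ...fields };
}

const envelope = buildEnvelope({
  task_type: "code-debug",
  structure_type: "code",
  context_length: "long",
  tools_used: 0,
  time_sensitive: true,
  estimated_tokens: "medium",
});
```

Note what is absent: there is no field that could carry a prompt, a message, or a tool result.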

Computed on your side

The SDK does the work.
Your content never moves.

Here is the complete journey from your call to Tally's routing decision. Notice where the content boundary is — and notice that Tally never crosses it.

1. Your app knows the call parameters

You know what kind of task this is, how big the context is, whether tools are involved, whether the user is waiting. This knowledge lives in your application code.

2. SDK builds the envelope — locally, in your process

buildEnvelope({ task_type, structure_type, context_length, ... }) runs entirely in your application. It constructs the structural descriptor from the parameters you pass. Your prompt text, system message, user message — none of that is an input to this function. It is never touched.

3. SDK computes the shape hash — locally, in your process

The envelope fields are normalised into discrete buckets and hashed into a stable cluster ID. Two calls with identical structural properties produce the same cluster ID, regardless of what they're actually asking. This computation also happens entirely in your application — no network call yet.
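The step above can be sketched as follows. The canonicalisation order and the hash function (FNV-1a here) are assumptions for illustration — the document doesn't specify Tally's actual algorithm, only that the hash is stable and computed locally:

```typescript
// Illustrative local shape hash: envelope fields are serialised in a
// fixed order and hashed into a stable cluster ID. FNV-1a 32-bit is
// used here purely as an example of a deterministic hash.
function clusterId(env: {
  task_type: string; structure_type: string; context_length: string;
  tools_used: number; time_sensitive: boolean; estimated_tokens: string;
}): string {
  const canonical = [
    env.task_type, env.structure_type, env.context_length,
    String(env.tools_used), String(env.time_sensitive), env.estimated_tokens,
  ].join("|");
  let h = 0x811c9dc5;                    // FNV-1a offset basis
  for (const ch of canonical) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 0x01000193) >>> 0;  // FNV prime, kept in 32 bits
  }
  return "shape_" + h.toString(16).padStart(8, "0");
}

// Two calls with identical structure produce the same cluster ID,
// regardless of what they are actually asking.
const a = clusterId({ task_type: "code-debug", structure_type: "code",
  context_length: "long", tools_used: 0, time_sensitive: true,
  estimated_tokens: "medium" });
const b = clusterId({ task_type: "code-debug", structure_type: "code",
  context_length: "long", tools_used: 0, time_sensitive: true,
  estimated_tokens: "medium" });
```

Because the hash runs over bucketed fields only, it is not reversible into anything content-bearing: there is nothing content-bearing in its input.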

4. SDK asks Tally for a routing recommendation

The SDK sends the envelope to Tally's routing API. This is the first and only network call before your LLM request. The envelope is the full payload — six structural fields plus the cluster ID. No text. No content.

5. Tally returns a model recommendation

Tally looks up the cluster, consults its bandit state, and returns the recommended model for this shape — plus a confidence score and exploration flag. Sub-millisecond. Your app now has a routing decision.
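A sketch of the request/response pair for this exchange. The field names, model name, and payload shape are assumptions for illustration — the point is what the wire format can and cannot carry:

```typescript
// Hypothetical shapes for the routing exchange in steps 4-5.
// The request is the envelope plus the cluster ID, nothing else.
type RoutingRequest = {
  cluster_id: string;
  envelope: {
    task_type: string; structure_type: string; context_length: string;
    tools_used: number; time_sensitive: boolean; estimated_tokens: string;
  };
};

type RoutingDecision = {
  model: string;       // recommended model for this shape (illustrative)
  confidence: number;  // 0..1: how settled this cluster's routing is
  exploring: boolean;  // true when the bandit is still sampling models
};

const request: RoutingRequest = {
  cluster_id: "shape_4a1b2c3d",
  envelope: {
    task_type: "code-debug", structure_type: "code", context_length: "long",
    tools_used: 0, time_sensitive: true, estimated_tokens: "medium",
  },
};

// The entire pre-call payload fits in a few hundred bytes.
const payloadBytes = new TextEncoder().encode(JSON.stringify(request)).length;
```

Whatever the exact schema, there is no field in which a prompt could travel: the payload is structurally incapable of carrying content.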

6. Your app calls the LLM directly

You call your chosen LLM provider directly — Anthropic, OpenAI, Google, or any other. Tally is not in this call. It is not a proxy. Your full prompt and context go directly from your application to your LLM provider.

7. SDK fires telemetry after the call

After you have the result, the SDK asynchronously sends the outcome to Tally: which model was used, whether it succeeded, input/output token counts, optional quality score. This is fire-and-forget — it does not block your response. The actual LLM output is never sent.
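The telemetry record for step 7 might look like this. Field names are implied by the text but not specified, so treat this as a sketch:

```typescript
// Hypothetical outcome record: only numeric and categorical outcome
// data travels. The LLM's actual output is never part of it.
type CallOutcome = {
  cluster_id: string;
  model: string;          // which model actually handled the call
  success: boolean;
  input_tokens: number;
  output_tokens: number;
  quality_score?: number; // optional 0..1 rating from your own evals
};

// Fire-and-forget: serialise and dispatch without awaiting, so the
// user's response is never blocked on telemetry.
function reportOutcome(outcome: CallOutcome): string {
  const payload = JSON.stringify(outcome);
  // In a real SDK this would POST asynchronously to Tally's telemetry
  // endpoint; here we just return the payload to show what it contains.
  return payload;
}

const sent = reportOutcome({
  cluster_id: "shape_4a1b2c3d",
  model: "claude-sonnet",   // illustrative model name
  success: true,
  input_tokens: 11_240,
  output_tokens: 830,
  quality_score: 0.92,
});
```

Making telemetry asynchronous is what keeps Tally off the latency path: the routing lookup happens before the call, and everything else happens after the user already has their answer.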

Tally sees two things: a structural descriptor before the call, and a numeric outcome after it. Your data — prompts, responses, user messages, tool results — is never in the picture.

The privacy boundary

What flows to Tally.
What stays with you.

This is not a privacy policy promise. It's a description of what the protocol physically transmits. Tally cannot see the content layer because the content layer is never sent.

🔒 Stays in your application

Your prompt text and system instructions
User messages and conversation history
MCP tool call parameters
MCP tool results and retrieved data
LLM responses and completions
Any business data or PII in context

📤 Sent to Tally

Task type (e.g. code-debug)
Output format type (e.g. code)
Context length bucket (e.g. long)
Number of tools in scope
Time-sensitivity flag
Token count + success/fail outcome (after the call)

🔒 The content boundary is architectural, not contractual.

Think of it like mailing a package. The courier knows the dimensions and weight of the box — enough to choose the right delivery method. They never open it. They never need to.

Tally knows the dimensions of your AI call: how complex, how long, what format, what tools. That's the shape. The contents of the call are your business and go directly from your application to your LLM provider — Anthropic, OpenAI, or whoever you've chosen. Tally is never in that path.

This is not a policy stance born of legal caution. It is the architecture of the system. The routing decision requires no knowledge of your content. So we don't ask for it. And because we don't ask for it, we can't mishandle it.

Learning over time

Clusters grow smarter
with every call.

Every time a call with a given shape completes, Tally records the outcome — which model was used, whether it succeeded, the token cost, the optional quality score. These outcomes accumulate against the cluster ID for that shape.

As samples build up, the routing confidence for each cluster increases. The bandit converges from exploration (trying different models to learn what works) toward exploitation (routing directly to the proven winner for that shape).

Routing confidence by cluster — example state

code-debug · code · long — 91%
summarisation · prose · short — 87%
qa-simple · json · short — 79%
architecture-design · prose · extended — 62%
data-analysis · json · medium · tools:3 — 31% (exploring)

High-confidence clusters route directly to the best model. Low-confidence clusters still explore — trying different models to update their beliefs. This exploration rate naturally drops over time as clusters accumulate evidence, which is why routing cost-efficiency improves the longer Tally runs on your workload.
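The exploration/exploitation trade-off described above can be sketched with a simple per-cluster bandit. This is not Tally's actual algorithm — it is an epsilon-greedy illustration, with the hypothetical property the text describes: exploration decays as a cluster accumulates evidence:

```typescript
// Illustrative per-cluster bandit. Each "arm" is a candidate model
// with its observed outcomes for this shape.
type Arm = { model: string; successes: number; trials: number };

function pickModel(arms: Arm[], rand: () => number = Math.random): string {
  const totalTrials = arms.reduce((n, a) => n + a.trials, 0);
  // Exploration rate shrinks as the cluster accumulates samples.
  const epsilon = 1 / Math.sqrt(totalTrials + 1);
  if (rand() < epsilon) {
    // Explore: try a random model to keep beliefs up to date.
    return arms[Math.floor(rand() * arms.length)].model;
  }
  // Exploit: route to the model with the best observed success rate.
  return arms.reduce((best, a) =>
    a.successes / (a.trials || 1) > best.successes / (best.trials || 1)
      ? a : best
  ).model;
}

const arms: Arm[] = [
  { model: "fast-small", successes: 60, trials: 100 },   // illustrative
  { model: "strong-large", successes: 95, trials: 100 }, // illustrative
];
```

With 200 samples, epsilon is already below 8%, so this cluster routes to its proven winner the vast majority of the time — the convergence behaviour the confidence figures above are showing.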

Cross-account learning: clusters are shared across all Tally users (subject to privacy opt-out). A shape that others have seen thousands of times arrives already confident, even on day one for a new account.

See it yourself

Shapes are live in the demo.

The Tally live demo generates synthetic workloads and shows routing decisions in real time. Every decision includes the semantic envelope, the shape cluster ID, the model selected, and the confidence score driving that decision.

It's the fastest way to see how Tally maps task types to clusters, and how the bandit responds to different shapes with different confidence levels. No account required.

Watch shapes get routed live.

Synthetic calls. Real routing logic. See how the bandit assigns envelopes to clusters, picks models, and updates as outcomes arrive.

Open the Live Demo →

Routing that never reads your data.

Smart model selection from structural metadata alone. Start free, no credit card.

Next up

Tally + LLMs