Tally exists because the world has a finite number of GPUs, we all share them, and right now we are wasting most of them.
Most software is sold to you. You pay a subscription, you get features, someone else's roadmap shapes what you can build. That is SaaS. Tally is not that.
Tally is infrastructure — closer in spirit to TCP/IP or a public road than to another dashboard you log into once a month. We exist to solve one problem for the entire industry, not to sell you features that keep you locked in.
Our success is not measured in ARR. It is measured in GPU-hours saved. Every correctly routed call — every time a simple task goes to a lightweight model instead of burning a flagship — is a win for everyone using these systems. Including you. Including your competitors. Including us. That is what infrastructure looks like.
The world has only so many GPUs. We all want to use them. The only sane response is to use them wisely.
— The founding premise of Tally

Language models exist on a spectrum from genuinely small and fast to genuinely massive and powerful. The massive ones are extraordinary. They can reason across domains, hold vast context, write code that compiles the first time, and do things that felt like science fiction five years ago.
They are also incredibly expensive to run — in money, in energy, in raw GPU time. And the uncomfortable truth is that the vast majority of what we actually ask them to do does not require any of that power.
Summarise this email. Convert this JSON. Answer this FAQ. Format this address. Check this spelling. A model that costs 71% less handles all of it just as well. When you route every task to a flagship model out of habit or fear, you are not being safe. You are being wasteful.
Tally learns the difference. That is its entire purpose.
Anthropic recently published something remarkable: they used Claude to build a fully functional C compiler. The compute bill came to roughly $10,000. That is a stunning demonstration — a task that would take a team of engineers months, compressed into a single AI-powered run.
We looked at that kind of workload closely. Our estimate: with intelligent routing — directing the repetitive, mechanical steps to smaller models and reserving the flagship for genuine reasoning — you save approximately 30% without meaningful quality loss. That is $3,000 back. For a single run. Multiply that by the number of times your team runs something like it this month.
Compute costs are no longer a rounding error. For many engineering teams they now rival the cost of the engineers themselves.
This is not an edge case. Analysts at Goldman Sachs, Sequoia, and elsewhere have flagged the same concern: AI compute expenditure is growing faster than the value being captured from it. For mid-size companies it has grown from a line item to a budget category. For larger ones it is now a constraint. The question is no longer "can we afford AI?" — it is "can we afford to waste it?"
If a $10,000 compute bill surprises you, there is a prior question worth asking: was the workload structured for AI from the start? Breaking a task into well-scoped steps, giving each step a clear contract, right-sizing context at every stage — these decisions happen before routing and matter just as much. We will help with that in time. For now, even perfectly-structured workloads benefit from routing. The savings are additive, not either/or.
The answer to rising compute costs is not to spend less on AI. It is to direct that spend to the right model for every task. A lightweight model costs 85–95% less than a flagship and handles the majority of what you actually ask AI to do. Tally learns which is which, automatically, on every call.
Here is something that gets lost in the excitement around conversational AI: the underlying inference is stateless. Each turn in a conversation is a new API call. The model does not remember the previous turn — you reconstruct context by passing history in the prompt.
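To make that concrete, here is a minimal sketch of what statelessness means for the caller. The `complete` stub stands in for any chat-completions endpoint; none of these names come from a specific vendor SDK.

```python
# Stateless inference in miniature: the model keeps no memory between
# turns, so the client resends the full history on every call.
# `complete` is a placeholder for a provider's chat endpoint; all names
# here are illustrative, not a real SDK.

History = list[dict[str, str]]  # [{"role": "user" | "assistant", "content": ...}]

def complete(model: str, messages: History) -> str:
    """Stub for a provider call, e.g. POST /v1/chat/completions."""
    return f"({model} reply to {len(messages)} messages)"

def chat_turn(model: str, history: History, user_message: str) -> History:
    history = history + [{"role": "user", "content": user_message}]
    reply = complete(model, history)  # the only context the model ever sees
    return history + [{"role": "assistant", "content": reply}]

# Three turns are three independent API calls; the "conversation"
# lives entirely client-side, in `history`.
history: History = []
for msg in ["Design a sharding scheme.", "As bullet points?", "Never mind."]:
    history = chat_turn("some-model", history, msg)
```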
That matters enormously for routing. Turn one might be a complex architectural question that genuinely warrants a flagship model. Turn two might be "can you write that as bullet points?" Turn three might be "actually, never mind."
Locking a conversation to a single model for its entire lifetime is an architectural assumption baked in at the wrong level. The right model for a conversation is the right model for each turn of that conversation. Tally evaluates the shape of every request independently — because that is what the stateless architecture actually asks for.
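A sketch of what per-turn routing looks like from the caller's side. The keyword heuristic below is a deliberately crude stand-in for Tally's learned routing, purely to show where the decision happens; the model names are hypothetical.

```python
# Per-turn routing sketch: the model is chosen fresh for every request,
# never pinned for the life of the conversation. The heuristic is a toy
# stand-in for a learned router; model names are made up.

FLAGSHIP = "flagship-xl"
LIGHTWEIGHT = "light-mini"

REASONING_HINTS = ("architect", "design", "prove", "trade-off", "why")

def route_turn(user_message: str) -> str:
    text = user_message.lower()
    needs_reasoning = len(text) > 300 or any(h in text for h in REASONING_HINTS)
    return FLAGSHIP if needs_reasoning else LIGHTWEIGHT

turns = [
    "How should we architect multi-region failover?",  # -> flagship
    "Can you write that as bullet points?",            # -> lightweight
    "Actually, never mind.",                           # -> lightweight
]
for turn in turns:
    print(route_turn(turn), "<-", turn)
```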
AI has an energy problem. Training large models produces extraordinary amounts of CO₂. Inference — running them day after day at scale — compounds that cost many times over. The industry talks about this in abstract terms. Tally addresses it concretely, on every single call.
We call her the Red Dragon — because that is her colour. But she runs green. Every exploitation event, every correctly downsized call, is a small act of conservation multiplied across millions of requests. This is the most powerful greening technology AI could ask for — not because it is dramatic, but because it is continuous, automatic, and cumulative.
The GPU you did not waste today is the GPU someone else gets to use tomorrow. That is not marketing. That is arithmetic.
Tally's routing intelligence is crowd-sourced. Every telemetry event from every integration makes the bandit smarter — not just for you, but for everyone who routes similar task shapes. The more people use Tally, the better Tally gets, for all of them simultaneously. This is how infrastructure behaves. This is how TCP/IP learned. This is how roads get paved.
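One way to picture that shared learning: a bandit keeping one estimate per (task shape, model) pair, updated by outcome reports from every integration. Epsilon-greedy stands in here for whatever policy Tally actually runs; every name in this sketch is an assumption, not the documented algorithm.

```python
# Crowd-sourced routing in miniature: one running estimate per
# (task shape, model) pair, fed by outcome reports from all integrations.
import random
from collections import defaultdict

class SharedBandit:
    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.pulls = defaultdict(int)    # (shape, model) -> observation count
        self.mean = defaultdict(float)   # (shape, model) -> running mean reward

    def recommend(self, shape: str) -> str:
        if random.random() < self.epsilon:           # explore occasionally
            return random.choice(self.models)
        return max(self.models, key=lambda m: self.mean[(shape, m)])  # exploit

    def report(self, shape: str, model: str, reward: float) -> None:
        # Reports from ANY integration land in the same estimates, so one
        # team's telemetry sharpens routing for everyone with this shape.
        key = (shape, model)
        self.pulls[key] += 1
        self.mean[key] += (reward - self.mean[key]) / self.pulls[key]
```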
And unlike most systems that benefit from shared data, Tally is transparent about what it does with it. We do not hide the algorithm. We do not lock the routing logic behind an API you cannot inspect.
The SDK is open. The code is yours. Read every line. Understand exactly what gets sent, when, and why. We will never ask you to take our routing decisions on faith.
This is not just a privacy stance. It is an architectural commitment. Tally has no business being in the critical path of your inference calls. We give you a recommendation before the call. You report an outcome after it. Everything in between belongs entirely to you and your provider.
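The same out-of-band shape, in code: a recommendation before the call, an outcome report after it, and the inference itself going straight from you to your provider. This reuses the `SharedBandit` sketch above; `call_provider` and `score` are stubs for your own client and quality signal, not Tally APIs.

```python
# The out-of-band flow described above: Tally is consulted before and
# after the call, never during it.

def call_provider(model: str, prompt: str) -> str:
    return f"({model} response)"          # your provider, your network path

def score(response: str) -> float:
    return 1.0                            # e.g. task success, user feedback

def handle_request(tally: "SharedBandit", shape: str, prompt: str) -> str:
    model = tally.recommend(shape)                       # 1. advice, pre-call
    response = call_provider(model, prompt)              # 2. you <-> provider only
    tally.report(shape, model, reward=score(response))   # 3. outcome, post-call
    return response
```

Because steps 1 and 3 sit outside the request path, an unreachable recommender degrades to a default model choice rather than a failed inference call.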
Compute is finite. Every wasted cycle is a cycle someone else cannot use. We optimise for the whole, not just your instance.
The best model is the smallest one that gets the job done. Anything else is waste — your money, their energy, everyone's time.
We see shapes, not content. We observe outcomes, not responses. Your data never passes through us. That is not a feature — it is a promise.
The SDK is open source. The algorithm is documented. You should understand exactly what Tally does before you trust it with a single call.
Every integration makes the routing better for all. This is crowd-sourced intelligence — the more who participate, the more everyone benefits.
AI's environmental cost is real. Routing efficiently is the most immediate, most scalable response available. We make it automatic.
And in case you were wondering…
Tally counts tokens all day. She sits on her pile and she hoards. Dragons hoard a lot — and today there is no larger prize.
Every call you route through Tally makes the system smarter for everyone — and wastes a little less of a finite world.