Delivery Mode

Streaming vs Non-Streaming

Two ways to receive a model's response — each with distinct trade-offs. Tally reads your task shape and tells you which to use before you make the call.

The basics

What is streaming?

When you make a non-streaming API call, your code blocks and waits. The model generates the entire response token-by-token, then hands it to you all at once. If the response is 2,000 tokens, you wait for all 2,000 before seeing any of it.

When you enable streaming, the API uses Server-Sent Events (SSE) — an HTTP connection that stays open and delivers tokens (or small groups of tokens) as they are generated. Your code receives a stream of chunks you can process and display incrementally.
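The shape of that chunk-handling loop can be sketched in a few lines. This is a simulation, not a real SSE client: `fake_stream` stands in for the open HTTP connection, and the chunking is invented for illustration.

```python
def fake_stream(text, chunk_size=6):
    """Simulate an SSE response by yielding the text in small chunks."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]

def consume(chunks, on_chunk=lambda c: None):
    """Handle each chunk as it arrives (e.g. render it), keeping the
    partial state needed to reassemble the full response."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)      # incremental display happens here
        parts.append(chunk)
    return "".join(parts)

reply = consume(fake_stream("Tally routes each API call."))
```

The key point is that `consume` must hold partial state (`parts`) for the whole lifetime of the stream; a non-streaming call skips that bookkeeping entirely.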

Streaming response: tokens arrive in chunks as they are generated.

    "Tally " → "routes " → "each " → "API " → "call " → …

Non-streaming response: all tokens arrive at once, after full generation.

    ⏳ waiting… waiting… waiting…
    "Tally routes each API call to the most cost-effective model."

The underlying computation is the same: the model generates tokens sequentially either way. The only difference is when those tokens are handed to you. Streaming brings time to first token (TTFT) down to near zero; with non-streaming, TTFT equals the full generation time.

The trade-offs

When each mode wins.

✓ Use streaming when…

  • A human is watching and waiting for the response
  • The output is long — articles, code files, analysis
  • You want progressive UI rendering (typewriter effect)
  • Perceived latency matters more than total latency
  • You are building a conversational interface
  • Output format is prose or freeform text

✓ Use non-streaming when…

  • Output is structured JSON — you can't parse partial JSON
  • A machine consumes the result, not a human
  • You are using tool calls or MCP — you need the full result before acting
  • Output is short — TTFT savings are negligible
  • You need accurate token counts mid-pipeline
  • You need to retry on failure — streams are harder to replay

The two modes are not interchangeable for every use case. Streaming forces you into a pipeline architecture — your code must handle partial state. Non-streaming gives you a complete, validated response you can act on atomically. The wrong choice creates either a poor user experience or fragile engineering.

Engineering reality

What streaming makes harder.

Streaming is the right choice for user-facing interfaces, but it comes with genuine engineering complexity that's easy to underestimate:

Token counting. With non-streaming, the response includes a usage object with exact input and output token counts. With streaming, you typically don't get this until the final chunk, and some providers deliver it in a separate terminal event. If you're doing cost tracking mid-stream, you need to count chunks yourself — and chunk boundaries don't align with tokens precisely.
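A minimal sketch of draining a stream while capturing usage from the terminal event. The chunk shape here (`{"delta": ...}` for content, `{"usage": ...}` as the final event) is hypothetical; real providers each use their own event names and structures.

```python
def drain_stream(chunks):
    """Collect text deltas and capture exact usage from the terminal event."""
    parts, usage = [], None
    for chunk in chunks:
        if "delta" in chunk:
            parts.append(chunk["delta"])
        if "usage" in chunk:       # typically present only on the final chunk
            usage = chunk["usage"]
    return "".join(parts), usage

text, usage = drain_stream([
    {"delta": "Hello, "},
    {"delta": "world."},
    {"usage": {"input_tokens": 9, "output_tokens": 4}},
])
```

Note that `usage` stays `None` until the very end, which is exactly why mid-stream cost tracking has to fall back on counting chunks yourself.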

Structured output parsing. If you've asked the model to respond in JSON, a streaming response gives you fragments of JSON that aren't parseable until the stream closes. Attempting to parse partial JSON mid-stream requires special tooling and is a common source of bugs. Non-streaming is strongly recommended for any structured output.
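The safe pattern, sketched below, is to buffer every chunk and parse only once the stream has closed; a fragment like `'{"model": "mi'` is not valid JSON on its own and would raise mid-stream.

```python
import json

def parse_streamed_json(chunks):
    """Buffer the whole stream, then parse once, after it closes."""
    return json.loads("".join(chunks))

result = parse_streamed_json(['{"model": ', '"mini", ', '"stream": false}'])
```

Buffering until close throws away the one benefit of streaming, which is why non-streaming is the simpler choice for structured output in the first place.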

Error handling and retries. If a stream fails midway, you have a partial response — which may be worse than no response. Retry logic for streams is substantially more complex than for atomic responses. You need to decide: discard and retry from the start, or attempt to resume? Most implementations just discard and retry, which wastes work.
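The discard-and-retry strategy mentioned above can be sketched as a small wrapper; `make_stream` is any callable returning a fresh chunk iterator, and `ConnectionError` stands in for whatever mid-stream failure your transport raises.

```python
def stream_with_retry(make_stream, max_attempts=3):
    """Discard-and-retry: on a mid-stream failure, throw away the
    partial output and restart the whole request from scratch."""
    last_error = None
    for _ in range(max_attempts):
        parts = []
        try:
            for chunk in make_stream():
                parts.append(chunk)
            return "".join(parts)      # only a complete stream is returned
        except ConnectionError as exc:
            last_error = exc           # partial output in `parts` is wasted
    raise RuntimeError("stream failed on every attempt") from last_error
```

The wasted work is visible in the code: everything accumulated in `parts` before the failure is simply dropped, and the model regenerates it on the next attempt.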

Tool / MCP calls always use non-streaming internally. Even if you request a streaming response, the model must complete its tool-use reasoning before the tool is invoked, and the tool result must be fully received before the model can synthesise a final response. Piping this through a stream adds complexity with no benefit.

Tally's role

The shape of the task tells Tally which to recommend.

When you call route(), Tally returns a routing recommendation that includes not just which model to use, but whether to use streaming. This recommendation is derived entirely from the semantic envelope — the task shape — not from the content of the prompt.

The signals Tally reads from the envelope:

  • 📄 Output format. structure_type: "json" → non-streaming; "prose" or "code" → streaming preferred for long outputs.
  • 👤 Downstream consumer. consumer: "human" → streaming; "machine" or "pipeline" → non-streaming.
  • 📏 Expected output length. Short completions (under ~200 tokens) gain little from streaming; long outputs gain substantially from progressive rendering.
  • 🔧 Tool use. tools_used > 0 → non-streaming recommendation for the tool-call phases. The final synthesis turn may still stream.

The response from route() includes a stream_recommended boolean. You don't have to think about it — just use what Tally tells you.
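Acting on that boolean might look like the sketch below. `FakeTally` is a stand-in, not the real client; the envelope fields `consumer`, `structure_type`, and `tools_used` come from this page, while `expected_tokens` and the model name are assumptions for illustration.

```python
class FakeTally:
    """Stand-in for a Tally client; the real route() surface may differ.
    The rules below simply restate the envelope signals described above."""

    def route(self, envelope):
        stream = (
            envelope.get("consumer") == "human"          # humans watch streams
            and envelope.get("structure_type") != "json" # JSON parses atomically
            and envelope.get("tools_used", 0) == 0       # tool phases don't stream
            and envelope.get("expected_tokens", 1000) >= 200  # assumed threshold
        )
        return {"model": "example-model", "stream_recommended": stream}

client = FakeTally()
chat = client.route({"consumer": "human", "structure_type": "prose"})
extract = client.route({"consumer": "machine", "structure_type": "json"})
```

In application code you would branch on `stream_recommended` when making the completion call, rather than re-deriving the decision yourself.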

Quick reference

Common scenarios at a glance.

| Task | Output type | Consumer | Recommendation |
| --- | --- | --- | --- |
| Conversational chat UI | Prose | Human | Stream |
| Generate a long article | Prose | Human | Stream |
| Generate a large code file | Code | Human (editor) | Stream |
| Extract structured data to JSON | JSON | Machine | Don't Stream |
| Function / tool call | Tool args | Machine | Don't Stream |
| Classify or label input | Short text | Machine | Don't Stream |
| Summarise an email (short) | Short prose | Either | Either |
| MCP tool invocation + synthesis | Mixed | Human | Don't Stream tool phase, then Stream synthesis |

One recommendation.
Model + streaming mode, together.

Tally doesn't just pick the cheapest model. It returns a complete routing decision: which model to use, whether to stream the response, and the confidence behind both. You implement the recommendation; Tally does the thinking.

Let Tally make the call.

Route smarter — model, streaming mode, and confidence. All from one API.

Next up

Shapes