Delivery Mode

Streaming vs Non-Streaming

Two ways to receive a model's response — each with distinct trade-offs. Tally reads your task shape and tells you which to use before you make the call.

The basics

What is streaming?

When you make a non-streaming API call, your code blocks and waits. The model generates the entire response token-by-token, then hands it to you all at once. If the response is 2,000 tokens, you wait for all 2,000 before seeing any of it.

When you enable streaming, the API uses Server-Sent Events (SSE) — an HTTP connection that stays open and delivers tokens (or small groups of tokens) as they are generated. Your code receives a stream of chunks you can process and display incrementally.
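The shape of that chunk-handling loop can be sketched in a few lines. This is a simulation, not a real SSE client: `fake_stream` stands in for the open HTTP connection, and the chunking is invented for illustration.

```python
def fake_stream(text, chunk_size=6):
    """Simulate an SSE response by yielding the text in small chunks."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]

def consume(chunks, on_chunk=lambda c: None):
    """Handle each chunk as it arrives (e.g. render it), keeping the
    partial state needed to reassemble the full response."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)      # incremental display happens here
        parts.append(chunk)
    return "".join(parts)

reply = consume(fake_stream("Tally routes each API call."))
```

The key point is that `consume` must hold partial state (`parts`) for the whole lifetime of the stream; a non-streaming call skips that bookkeeping entirely.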

Streaming response: tokens arrive in chunks as they are generated.

    "Tally " → "routes " → "each " → "API " → "call " → …

Non-streaming response: all tokens arrive at once, after full generation.

    ⏳ waiting… waiting… waiting…
    "Tally routes each API call to the most cost-effective model."

The underlying computation is the same: the model generates tokens sequentially either way. The only difference is when those tokens are handed to you. Streaming brings time to first token (TTFT) down to near zero; with non-streaming, TTFT equals the full generation time.

The trade-offs

When each mode wins.

✓ Use streaming when…

  • A human is watching and waiting for the response
  • The output is long — articles, code files, analysis
  • You want progressive UI rendering (typewriter effect)
  • Perceived latency matters more than total latency
  • You are building a conversational interface
  • Output format is prose or freeform text

✓ Use non-streaming when…

  • Output is structured JSON — you can't parse partial JSON
  • A machine consumes the result, not a human
  • You are using tool calls or MCP — you need the full result before acting
  • Output is short — TTFT savings are negligible
  • You need accurate token counts mid-pipeline
  • You need to retry on failure — streams are harder to replay

The two modes are not interchangeable for every use case. Streaming forces you into a pipeline architecture — your code must handle partial state. Non-streaming gives you a complete, validated response you can act on atomically. The wrong choice creates either a poor user experience or fragile engineering.

Engineering reality

What streaming makes harder.

Streaming is the right choice for user-facing interfaces, but it comes with genuine engineering complexity that's easy to underestimate:

Token counting. With non-streaming, the response includes a usage object with exact input and output token counts. With streaming, you typically don't get this until the final chunk, and some providers deliver it in a separate terminal event. If you're doing cost tracking mid-stream, you need to count chunks yourself — and chunk boundaries don't align with tokens precisely.
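A minimal sketch of draining a stream while capturing usage from the terminal event. The chunk shape here (`{"delta": ...}` for content, `{"usage": ...}` as the final event) is hypothetical; real providers each use their own event names and structures.

```python
def drain_stream(chunks):
    """Collect text deltas and capture exact usage from the terminal event."""
    parts, usage = [], None
    for chunk in chunks:
        if "delta" in chunk:
            parts.append(chunk["delta"])
        if "usage" in chunk:       # typically present only on the final chunk
            usage = chunk["usage"]
    return "".join(parts), usage

text, usage = drain_stream([
    {"delta": "Hello, "},
    {"delta": "world."},
    {"usage": {"input_tokens": 9, "output_tokens": 4}},
])
```

Note that `usage` stays `None` until the very end, which is exactly why mid-stream cost tracking has to fall back on counting chunks yourself.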

Structured output parsing. If you've asked the model to respond in JSON, a streaming response gives you fragments of JSON that aren't parseable until the stream closes. Attempting to parse partial JSON mid-stream requires special tooling and is a common source of bugs. Non-streaming is strongly recommended for any structured output.
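The safe pattern, sketched below, is to buffer every chunk and parse only once the stream has closed; a fragment like `'{"model": "mi'` is not valid JSON on its own and would raise mid-stream.

```python
import json

def parse_streamed_json(chunks):
    """Buffer the whole stream, then parse once, after it closes."""
    return json.loads("".join(chunks))

result = parse_streamed_json(['{"model": ', '"mini", ', '"stream": false}'])
```

Buffering until close throws away the one benefit of streaming, which is why non-streaming is the simpler choice for structured output in the first place.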

Error handling and retries. If a stream fails midway, you have a partial response — which may be worse than no response. Retry logic for streams is substantially more complex than for atomic responses. You need to decide: discard and retry from the start, or attempt to resume? Most implementations just discard and retry, which wastes work.
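The discard-and-retry strategy mentioned above can be sketched as a small wrapper; `make_stream` is any callable returning a fresh chunk iterator, and `ConnectionError` stands in for whatever mid-stream failure your transport raises.

```python
def stream_with_retry(make_stream, max_attempts=3):
    """Discard-and-retry: on a mid-stream failure, throw away the
    partial output and restart the whole request from scratch."""
    last_error = None
    for _ in range(max_attempts):
        parts = []
        try:
            for chunk in make_stream():
                parts.append(chunk)
            return "".join(parts)      # only a complete stream is returned
        except ConnectionError as exc:
            last_error = exc           # partial output in `parts` is wasted
    raise RuntimeError("stream failed on every attempt") from last_error
```

The wasted work is visible in the code: everything accumulated in `parts` before the failure is simply dropped, and the model regenerates it on the next attempt.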

Tool / MCP calls always use non-streaming internally. Even if you request a streaming response, the model must complete its tool-use reasoning before the tool is invoked, and the tool result must be fully received before the model can synthesise a final response. Piping this through a stream adds complexity with no benefit.

Tally's role

The shape of the task tells Tally which to recommend.

When you call route(), Tally returns a routing recommendation that includes not just which model to use, but whether to use streaming. This recommendation is derived entirely from the semantic envelope — the task shape — not from the content of the prompt.

The signals Tally reads from the envelope:

  • 📄 Output format. structure_type: "json" → non-streaming; "prose" or "code" → streaming preferred for long outputs.
  • 👤 Downstream consumer. consumer: "human" → streaming; "machine" or "pipeline" → non-streaming.
  • 📏 Expected output length. Short completions (under ~200 tokens) gain little from streaming; long outputs gain substantially from progressive rendering.
  • 🔧 Tool use. tools_used > 0 → non-streaming recommendation for the tool-call phases. The final synthesis turn may still stream.

The response from route() includes a stream_recommended boolean. You don't have to think about it — just use what Tally tells you.
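Acting on that boolean might look like the sketch below. `FakeTally` is a stand-in, not the real client; the envelope fields `consumer`, `structure_type`, and `tools_used` come from this page, while `expected_tokens` and the model name are assumptions for illustration.

```python
class FakeTally:
    """Stand-in for a Tally client; the real route() surface may differ.
    The rules below simply restate the envelope signals described above."""

    def route(self, envelope):
        stream = (
            envelope.get("consumer") == "human"          # humans watch streams
            and envelope.get("structure_type") != "json" # JSON parses atomically
            and envelope.get("tools_used", 0) == 0       # tool phases don't stream
            and envelope.get("expected_tokens", 1000) >= 200  # assumed threshold
        )
        return {"model": "example-model", "stream_recommended": stream}

client = FakeTally()
chat = client.route({"consumer": "human", "structure_type": "prose"})
extract = client.route({"consumer": "machine", "structure_type": "json"})
```

In application code you would branch on `stream_recommended` when making the completion call, rather than re-deriving the decision yourself.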

Quick reference

Common scenarios at a glance.

| Task | Output type | Consumer | Recommendation |
| --- | --- | --- | --- |
| Conversational chat UI | Prose | Human | Stream |
| Generate a long article | Prose | Human | Stream |
| Generate a large code file | Code | Human (editor) | Stream |
| Extract structured data to JSON | JSON | Machine | Don't Stream |
| Function / tool call | Tool args | Machine | Don't Stream |
| Classify or label input | Short text | Machine | Don't Stream |
| Summarise an email (short) | Short prose | Either | Either |
| MCP tool invocation + synthesis | Mixed | Human | Don't Stream tool phase, then Stream synthesis |

One recommendation.
Model + streaming mode, together.

Tally doesn't just pick the cheapest model. It returns a complete routing decision: which model to use, whether to stream the response, and the confidence behind both. You implement the recommendation; Tally does the thinking.

Let Tally make the call.

Route smarter — model, streaming mode, and confidence. All from one API.

Next up

Shapes