How to Stream LLM Responses in Your Application
Streaming LLM responses reduces perceived latency and improves UX. Here's how server-sent events work, how to implement streaming with OpenAI and Anthropic, and what to watch for in production.
Without streaming, an LLM request works like a traditional API call: you send a prompt, wait for the model to generate the entire response, and receive it all at once. For a 500-token response from GPT-5.4, that might mean waiting 3-8 seconds staring at a blank screen before anything appears.
Streaming changes this. The model sends tokens as they’re generated, so the first word appears in milliseconds. The user starts reading while the model is still thinking. The total generation time is the same, but the perceived latency drops dramatically.
Every production LLM application should stream by default. Here’s how it works.
Why Streaming Matters
Two metrics define the responsiveness of an LLM application:
Time to First Token (TTFT) is how long the user waits before seeing any output. Without streaming, TTFT equals the full generation time. With streaming, TTFT is typically 100-500ms regardless of response length.
Time to Last Token (TTLT) is when the complete response is available. Streaming doesn’t change this. A 500-token response takes the same total time whether you stream it or not.
The UX difference is significant. Users perceive a streaming response as faster even when the total time is identical, because they’re reading content as it arrives rather than waiting for a blank screen to fill.
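To make the two metrics concrete, here's a small simulation. No real model is involved; a hypothetical generator stands in for the model, emitting tokens with a fixed per-token delay:

```python
import time

def fake_token_stream(n_tokens=20, delay=0.01):
    """Hypothetical stand-in for a model: yields tokens with a fixed delay."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"token{i} "

start = time.perf_counter()
ttft = None
tokens = []
for token in fake_token_stream():
    if ttft is None:
        ttft = time.perf_counter() - start  # time to first token
    tokens.append(token)
ttlt = time.perf_counter() - start  # time to last token

print(f"TTFT: {ttft:.3f}s, TTLT: {ttlt:.3f}s")
```

TTFT here is roughly one token's delay; TTLT is roughly twenty of them. Streaming doesn't shrink TTLT, it just changes which number the user experiences as waiting.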
Server-Sent Events
Most LLM APIs use Server-Sent Events (SSE) for streaming. SSE is a simple protocol: the server sends a stream of lines prefixed with data: over a long-lived HTTP connection, and each line carries one chunk of the response.
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":" world"}}]}
data: [DONE]
SSE is unidirectional (server to client), which is all you need for streaming completions. The client sends one request, then reads chunks as they arrive. No WebSocket complexity required.
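A minimal client-side parser for this format fits in a few lines of Python. This sketch handles only the data: lines shown above; a real SSE parser must also handle the event:, id:, and retry: fields and multi-line data payloads:

```python
def parse_sse(raw: bytes) -> list[str]:
    """Return the payload of each `data: ` line in an SSE byte stream."""
    payloads = []
    for line in raw.decode("utf-8").splitlines():
        if line.startswith("data: "):
            payloads.append(line[len("data: "):])
    return payloads

chunks = parse_sse(
    b'data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n'
    b'data: {"choices":[{"delta":{"content":" world"}}]}\n\n'
    b'data: [DONE]\n\n'
)
```

The [DONE] sentinel arrives as an ordinary data line, so the consumer is responsible for recognizing it and stopping.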
Streaming with OpenAI
Add stream=True to any chat completion request:
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Explain embeddings in two paragraphs."}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
Each chunk contains a delta object with the incremental content. The delta replaces the message field you’d see in a non-streaming response. When generation is complete, the final chunk signals the end of the stream.
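If you also need the complete text at the end, for logging or storage, accumulate the deltas as they arrive. A sketch with simulated delta values (real chunks are SDK objects; the None mimics the role-only first chunk, which is why the content check above matters):

```python
# Simulated content deltas: None stands in for chunks without content.
deltas = [None, "Stream", "ing ", "works."]

full_text = "".join(d for d in deltas if d)
```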
Streaming with Anthropic
Anthropic’s API uses a similar pattern with event types:
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain embeddings in two paragraphs."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Anthropic’s streaming emits typed events: message_start, content_block_start, content_block_delta, content_block_stop, and message_stop. The text_stream helper abstracts over these, but you’ll need the raw events if you’re handling tool use or multiple content blocks.
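A sketch of what raw-event handling looks like, using plain dicts to mimic the event sequence (the real SDK delivers typed objects, so you'd use attribute access rather than key lookup):

```python
# Hypothetical simplified events mirroring Anthropic's event sequence.
events = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": " world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_stop"},
]

text = ""
for event in events:
    # Only text deltas contribute to the visible output; other event
    # types mark block boundaries (useful for tool use and multi-block replies).
    if (event["type"] == "content_block_delta"
            and event["delta"]["type"] == "text_delta"):
        text += event["delta"]["text"]
```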
Streaming Tool Calls
Streaming gets more complex when the model uses function calling. Instead of receiving complete JSON arguments at once, you receive them incrementally:
delta: {"tool_calls":[{"function":{"arguments":"{\"ci"}}]}
delta: {"tool_calls":[{"function":{"arguments":"ty\": "}}]}
delta: {"tool_calls":[{"function":{"arguments":"\"London\"}"}}]}
You need to accumulate the argument string across chunks and parse it only when the tool call is complete. Attempting to parse partial JSON on every chunk will fail. Buffer the argument fragments, detect completion (the stream moves to the next tool call or ends), then parse and execute.
```python
import json

tool_args = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        tool_args += delta.tool_calls[0].function.arguments or ""

# Parse only after the stream ends, once the JSON is complete.
args = json.loads(tool_args)
```
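To see why buffering matters, here's the same accumulation run against the fragments from the deltas shown earlier, hardcoded for illustration. Parsing any partial prefix raises json.JSONDecodeError; only the full concatenation parses:

```python
import json

# The argument fragments as they arrived across chunks.
fragments = ['{"ci', 'ty": ', '"London"}']

buffer = ""
for frag in fragments:
    buffer += frag
    # json.loads(buffer) would fail here for every partial prefix.

args = json.loads(buffer)  # only the complete string parses cleanly
```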
Forwarding Streams to the Browser
In a web application, you typically proxy the LLM stream through your backend to the client. The simplest approach is forwarding SSE directly:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat")
async def chat(q: str):
    def generate():
        stream = client.chat.completions.create(
            model="gpt-5.4",
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content or ""
            yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
On the client side, use fetch with a readable stream to consume the events and append tokens to the DOM as they arrive:
```javascript
const response = await fetch('/chat?q=Hello');
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value, { stream: true });
  const lines = text.split('\n').filter(line => line.startsWith('data: '));
  for (const line of lines) {
    const content = line.replace('data: ', '');
    if (content === '[DONE]') return;
    document.getElementById('output').textContent += content;
  }
}
```
The EventSource API also works but doesn’t support POST requests or custom headers, which limits its usefulness for LLM applications where you need to send conversation history in the request body. The fetch approach is more flexible.
Streaming with the Vercel AI SDK
If you’re building with Next.js or another JavaScript framework, the Vercel AI SDK (v5) abstracts away most of the streaming plumbing. It handles SSE parsing, React state updates, and connection management through a transport-based architecture:
```tsx
import { useState } from 'react';
import { useChat } from '@ai-sdk/react';

export default function Chat() {
  // In AI SDK v5, useChat no longer manages input state; hold it yourself.
  const [input, setInput] = useState('');
  const { messages, sendMessage } = useChat();
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          {m.role}: {m.parts.filter(p => p.type === 'text').map(p => p.text).join('')}
        </div>
      ))}
      <form onSubmit={e => { e.preventDefault(); sendMessage({ text: input }); setInput(''); }}>
        <input value={input} onChange={e => setInput(e.target.value)} />
      </form>
    </div>
  );
}
```
The useChat hook manages the streaming connection, accumulates tokens into messages, and re-renders the component as tokens arrive. It supports OpenAI, Anthropic, and other providers through a unified interface. For many web applications, this is the fastest path to a working streaming UI.
Production Considerations
Error handling. Streams can disconnect mid-response. Your client should detect incomplete responses and either retry or show the partial result with an error indicator. Don’t silently swallow dropped connections.
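A sketch of the detection logic, with a hypothetical generator standing in for a connection that drops mid-response:

```python
def read_stream(lines):
    """Collect streamed text; report whether the stream ended cleanly."""
    parts, complete = [], False
    try:
        for line in lines:
            if line == "[DONE]":
                complete = True
                break
            parts.append(line)
    except ConnectionError:
        pass  # dropped mid-stream: fall through with the partial text
    return "".join(parts), complete

def dropped_stream():
    """Hypothetical stream that disconnects before finishing."""
    yield "Hel"
    yield "lo"
    raise ConnectionError

partial, ok = read_stream(dropped_stream())
```

The caller can then decide whether to retry, discard, or render the partial text with an error indicator, rather than presenting a truncated answer as complete.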
Token counting. With streaming, you don’t get the usage summary until the stream ends (or in a final chunk). If you need to track token usage for cost monitoring, capture the final chunk’s usage field or count tokens client-side as they arrive.
Rate limits. A streaming request holds a connection open for the duration of generation. If your model takes 10 seconds to produce a response, that’s 10 seconds of connection time. Plan your connection pool and timeout settings accordingly.
Buffering. Some reverse proxies (Nginx, Cloudflare) buffer responses by default, which defeats streaming. Configure X-Accel-Buffering: no or equivalent to ensure chunks flow through immediately.
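One common mitigation is to set the anti-buffering headers on the response itself, in addition to (or instead of) changing proxy configuration. A sketch of the headers typically used on SSE responses (pass them to your framework's streaming response; X-Accel-Buffering is Nginx-specific and ignored elsewhere):

```python
# Headers commonly set on SSE responses to keep chunks flowing.
sse_headers = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",   # discourage intermediary caching
    "X-Accel-Buffering": "no",     # Nginx: disable proxy buffering
    "Connection": "keep-alive",
}
```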
Cancellation. If the user navigates away mid-stream, close the connection. Most SDKs support aborting a stream, which also stops the model from generating further tokens, saving you money on output tokens you’ll never use.
When Not to Stream
Streaming adds complexity. Skip it when:
- You need the complete response before doing anything (e.g., parsing structured JSON, running validation)
- The response is short enough that TTFT doesn’t matter (under 50 tokens)
- You’re processing responses in batch, not showing them to a user
For structured output workflows where you need valid JSON before proceeding, non-streaming requests are simpler and less error-prone. For anything user-facing, stream by default.