Lesson 9

Handling API Errors and Streaming Responses

~15 min · 125 XP

Introduction

Welcome to the frontier of AI integration, where robust applications are defined by how they handle the unpredictability of language models. You will learn how to transition from simple request-response loops to sophisticated architectures that handle connectivity failures and deliver a fluid, "live-typed" user experience.

Understanding the Request-Response Lifecycle

When you interact with an API like OpenAI's, your application sends an HTTPS request to a remote server. The server then runs your prompt through a Transformer model, performing a long sequence of matrix multiplications to generate a response. This process can take several seconds, or even minutes for long outputs, during which your application might appear to hang.

A common pitfall is synchronous waiting. If your server waits for the entire completion before displaying anything, the user perceives the application as "frozen." Furthermore, external APIs are subject to transient network issues, rate limits, and server-side timeouts. Relying on a single request means that if a connection drops at 99% generation, all that compute time and user patience is lost. Understanding that these connections are volatile is the first step toward building resilient AI interfaces.
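The pitfall above can be sketched in a few lines. This is a minimal simulation, not a real API call: `generateCompletion` is a hypothetical stand-in for a model request, and the 300 ms delay stands in for generation time.

```typescript
// Hypothetical stand-in for a real model API call; the delay simulates
// the time the server spends generating the full completion.
async function generateCompletion(prompt: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 300));
  return `Answer to: ${prompt}`;
}

// The naive pattern: the UI has nothing to show until the entire
// response has arrived, so the user stares at a frozen screen.
async function naiveChat(prompt: string): Promise<string> {
  const started = Date.now();
  const answer = await generateCompletion(prompt); // blocks until 100% done
  console.log(`Waited ${Date.now() - started}ms before showing anything`);
  return answer;
}

naiveChat("What is SSE?").then((answer) => console.log(answer));
```

If the connection drops anywhere inside that single `await`, the partial generation is lost entirely, which is exactly the failure mode streaming addresses next.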

Exercise 1 (Multiple Choice)
Why is synchronous waiting (waiting for the full response) detrimental in AI chatbot applications?

Implementing Server-Side Streaming

To solve the "frozen" UI problem, we use Server-Sent Events (SSE) or streaming chunks. Instead of the API returning one massive JSON object at the end of the generation, the server pushes "tokens" (or fragments of text) as they are computed.

In your code, you need to iterate over this stream. Each chunk must be processed asynchronously. The logic typically involves a ReadableStream reader that processes incoming buffers and appends them to your chat interface in real-time. This creates the "typewriter effect," which is not just an aesthetic choice, but a usability feature that allows users to start reading the AI response immediately.
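The reader loop described above can be sketched as follows. The token stream here is simulated; in a real application it would come from something like `fetch(...).body`. Function names and the token list are illustrative.

```typescript
// Simulated token stream; a real one would be the body of a fetch()
// response from a streaming API endpoint.
function fakeTokenStream(tokens: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let i = 0;
  return new ReadableStream({
    pull(controller) {
      if (i < tokens.length) {
        controller.enqueue(encoder.encode(tokens[i++]));
      } else {
        controller.close();
      }
    },
  });
}

// Read chunks as they arrive and append them immediately:
// the "typewriter effect".
async function renderStream(
  stream: ReadableStream<Uint8Array>,
): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let rendered = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    rendered += chunk; // in a UI, append this to the chat element instead
    process.stdout.write(chunk);
  }
  return rendered;
}

renderStream(fakeTokenStream(["Hello", ", ", "world", "!"]));
```

Note that `decoder.decode(value, { stream: true })` matters: a multi-byte UTF-8 character can be split across two chunks, and streaming mode buffers the incomplete bytes instead of emitting garbage.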

Resilience Through Exponential Backoff

APIs often return a 429 Too Many Requests or 503 Service Unavailable error when their systems are under high load. A naive approach would be to retry immediately, which often results in a "thundering herd" problem where your application inadvertently helps crash the API server.

Instead, you should implement Exponential Backoff. This strategy dictates that after a failed request, the application waits for a period t before trying again. If it fails again, it waits 2t, then 4t, then 8t, and so on. Mathematically, the wait time W after n failed attempts can be represented as:

W = t × 2^(n−1) + jitter

The inclusion of jitter (a random small delay) is critical; it prevents multiple instances of your application from retrying at the exact same millisecond, spreading the load and increasing the success rate.
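A minimal sketch of this retry strategy follows. The base delay of 500 ms and the cap of 5 attempts are illustrative choices, not values from any particular SDK.

```typescript
// W = t * 2^(n-1) + jitter, where t is the base delay and n the attempt.
function backoffDelay(attempt: number, baseDelayMs = 500): number {
  const exponential = baseDelayMs * 2 ** (attempt - 1); // t, 2t, 4t, 8t, ...
  const jitter = Math.random() * baseDelayMs; // de-synchronizes clients
  return exponential + jitter;
}

// Retry a failing async operation with exponentially growing waits.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // out of retries: surface error
      const delay = backoffDelay(attempt);
      console.log(`Attempt ${attempt} failed; retrying in ${Math.round(delay)}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

In production you would also check the error type before retrying, since (as the next section covers) some errors are terminal and retrying them only wastes time.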

Exercise 2 (True or False)
Adding 'jitter' to an exponential backoff algorithm helps prevent the Thundering Herd problem.

Advanced Error Handling and Graceful Degradation

Even with retries, some errors are terminal. For instance, a 401 Unauthorized error signifies an invalid or expired API key, which no amount of retrying will fix. Your application needs graceful degradation: the ability to provide a fallback experience when the AI fails.

Common strategies include:

  1. Fallback to Cache: If the AI fails, show the user a previously saved answer for similar queries.
  2. User Feedback Loops: Ask the user to "Regenerate" manually.
  3. Graceful Error States: Replace the spinning loader with a clear message: "The model is currently busy. Please try again in a moment."

Note: Never expose raw stack traces or API error details to the end user. Always map internal error codes to user-friendly notifications in your application's UI layer.
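One way to apply this note is a small lookup layer between your API client and your UI. The status codes and wording below are illustrative, not a complete mapping.

```typescript
// Map internal/API status codes to user-facing copy, so stack traces
// and raw error payloads never reach the UI. Wording is illustrative.
const USER_MESSAGES: Record<number, string> = {
  401: "Your session has expired. Please sign in again.",
  429: "The model is currently busy. Please try again in a moment.",
  503: "The service is temporarily unavailable. Please try again shortly.",
};

function toUserMessage(status: number): string {
  return USER_MESSAGES[status] ?? "Something went wrong. Please try again.";
}

// 429/503 are transient and worth retrying with backoff;
// 401 needs a new credential, so retrying cannot help.
function isRetryable(status: number): boolean {
  return status === 429 || status === 503;
}
```

Pairing `isRetryable` with the backoff logic from the previous section keeps terminal errors (like 401) from burning retry budget.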

Exercise 3 (Fill in the Blank)
___ is the practice of retrying a failed API request by increasing the delay between attempts to reduce server load.

Key Takeaways

  • Use streaming instead of synchronous waiting to prevent UI freezes and improve user engagement.
  • Always implement jitter in your retry logic to prevent synchronization spikes across your user base.
  • Implement exponential backoff to automatically handle transient network or rate-limiting errors without human intervention.
  • Design for failure by implementing graceful degradation strategies, such as offering a "regenerate" button rather than just displaying a generic error message.