"Works for 10" and "works for 10k" are two different worlds.
At low traffic, a lot of things never go wrong. Race conditions are rare. Edge cases don't happen often enough to show up in monitoring. You can ship and move on. Then traffic grows, and suddenly you're chasing 5xx errors that nobody saw before.
We ran into one of those. I'm writing it down because the problem is generic — keep-alive connections, connection reuse, and what happens when the server closes a socket at the wrong moment — and the way we fixed it might be useful if you hit something similar.
What we saw
We had a spike of 5xx errors in our monitoring. Annoying, and bad for our SLI. After some digging, we found that a lot of them were coming from ECONNRESET on HTTP requests.
ECONNRESET means the other side of the TCP connection sent a RST packet and closed the connection abruptly — no normal TCP teardown. From the client's point of view, the connection was there one moment and gone the next.
That can happen for several reasons: idle socket timeout, buffer overflow, server restart, something in the network killing the connection, and so on. In our case, the important detail was when it happened: we were reusing keep-alive connections. So we'd send a request on a socket we thought was still good, and the server had already decided to close that socket (e.g. because it was idle). By the time our packet arrived, the connection was gone. We got RST. The HTTP spec even describes this: the server may close an "idle" connection while the client thinks a request is in progress. It doesn't say how to fix it — that's left to the client.
So: race condition on connection reuse. At low QPS you rarely hit the bad timing. At higher QPS and with more connection reuse, it shows up more often.
One solution (but not enough)
We made sure our idle timeout (client) was shorter than the upstream's keep-alive timeout where we could. That reduces the race — if our client drops idle connections before the server does, we're less likely to reuse a socket the server has already closed. But it doesn't remove the problem completely. The server or the network can still close the connection at an unfortunate time. Configuration alone wasn't enough; we needed to introduce retry logic as well.
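For a Node client, one common way to set that client-side idle timeout is the agentkeepalive package, which adds a free-socket timeout on top of the standard http.Agent. The numbers below are illustrative, not our actual values — the point is only that the client timeout sits below the server's:

```javascript
const Agent = require('agentkeepalive');

// Close our side of an idle socket after 4s, comfortably below an assumed
// 5s keep-alive timeout on the upstream (check your server's real value --
// e.g. Node's server.keepAliveTimeout defaults to 5s).
const keepaliveAgent = new Agent({
  keepAlive: true,
  freeSocketTimeout: 4000, // drop idle sockets before the server does
});

// Pass the agent to outgoing requests:
// http.get('http://upstream.internal/things', { agent: keepaliveAgent }, cb);
```

This narrows the race window but, as noted above, cannot eliminate it.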
Why "just retry" isn't trivial
The obvious answer is: when you get ECONNRESET, retry the request.
The catch is safe retry. For idempotent requests (e.g. GET), retrying is usually fine. For non-idempotent ones (e.g. POST that creates something), retrying might double the side effect. So you need a policy: what do we retry, and who decides?
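The method-based half of that policy is mechanical. A minimal sketch of the check, using the methods RFC 9110 defines as idempotent (the helper name is ours, not part of any API):

```javascript
// Idempotent methods per RFC 9110: the safe methods plus PUT and DELETE.
// Repeating any of these has the same effect on the server as sending it once.
const IDEMPOTENT_METHODS = new Set([
  'GET', 'HEAD', 'OPTIONS', 'TRACE', 'PUT', 'DELETE',
]);

// Returns true when a request with this method can be resent blindly.
function isSafeToRetry(method) {
  return IDEMPOTENT_METHODS.has(method.toUpperCase());
}
```

Note that "idempotent by spec" and "idempotent in your system" can differ — a PUT handler with side effects breaks the assumption — which is why the policy question of "who decides" still matters.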
We had a few options and ended up combining them:
- Retry only on reused sockets. In Node, the request object tells you whether it went out on a reused keep-alive socket (req.reusedSocket, Node 13+). We retry only in that case — the failure then means "we thought the connection was alive but it wasn't," not "the server processed our request and then died," so we're confident the server never saw the request.
- Retry only idempotent requests. For GET and other safe methods, retry automatically. Non-idempotent requests aren't retried by default; we later added support for the Idempotency-Key header so upstream services can mark requests as safe to retry and handle them in a fault-tolerant way.
- Return 502/503/504 instead of 500. For this class of error we return codes that allow the load balancer to retry on our behalf, so GETs and other idempotent requests get a retry from the LB even when we don't retry ourselves.
- Let the service decide. Add something like a "retry on ECONNRESET" flag per route so each team can mark their endpoint as safe to retry. Start with ECONNRESET; later you can extend to other transient errors.
This combination got our 5xx rate back under control and aligned with how other clients (e.g. Chromium) handle the same situation: when you see this race, the client retries.
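The status-code part of the policy can be sketched as a small mapping from Node's transport error codes to gateway statuses. This is an illustration, not our exact table — which codes your load balancer retries is configuration you'd have to check:

```javascript
// Map transport-level upstream failures (we never received a response) to
// gateway statuses a load balancer will typically retry, instead of a
// generic 500. Error codes are Node/libuv error codes.
function statusForUpstreamError(err) {
  switch (err.code) {
    case 'ECONNRESET':   // connection reset mid-request
    case 'ECONNREFUSED': // upstream not accepting connections
      return 502; // Bad Gateway
    case 'ETIMEDOUT':
      return 504; // Gateway Timeout
    default:
      return 500;
  }
}
```

The distinction matters because a 500 usually means "the application failed" and won't be retried, while 502/504 signal "the hop in front of you failed," which retry-capable infrastructure knows how to handle.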
The takeaway
Nothing here is magic. The lesson is the same one as always: at small scale, many bugs are invisible. They depend on timing, load, or connection reuse. When you grow, they show up in your metrics.
So if you're building something that you expect to scale, it's worth thinking a bit ahead: what happens when connections get reused a lot? When do we retry, and for which methods? What do we return when we didn't get a response — 500 or something that allows retries (502/503/504)? You don't have to implement everything on day one, but having a mental model and a direction for "transient connection errors" will save you a late-night debugging session when the graph goes red.