While working on a load testing tool in Go, I ran into a situation where I was seeing tens of thousands of sockets in the TIME_WAIT
state.
Here are a few ways to get into this situation and how to fix each one.
Repro #1: Create excessive TIME_WAIT connections by forgetting to read the response body
Run the following code on a Linux machine:
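(This is a minimal sketch: the handler, URL, and log messages are assumptions; the key detail is that the load-test loop closes the response body without ever reading it.)

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// startWebserver starts a trivial HTTP server on port 8080.
func startWebserver() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello")
	})
	go http.ListenAndServe(":8080", nil)
}

// startLoadTest hammers the local server in a tight loop.
// Note that the response body is closed but never read: that is the bug.
func startLoadTest() {
	count := 0
	for {
		resp, err := http.Get("http://localhost:8080/")
		if err != nil {
			panic(fmt.Sprintf("Got error: %v", err))
		}
		resp.Body.Close() // closed without being read
		log.Printf("Finished GET request #%v", count)
		count++
	}
}

func main() {
	startWebserver()
	time.Sleep(100 * time.Millisecond) // give the listener a moment to come up
	startLoadTest()
}
```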
and in a separate terminal while the program is running, run:
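```bash
# count the sockets on port 8080 that are in the TIME_WAIT state
netstat -n | grep -i 8080 | grep -i time_wait | wc -l
```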
and you will see this number constantly growing:
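```bash
# illustrative output: the exact counts will vary, but they keep climbing
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
1400
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
2750
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
4300
```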
Fix: Read Response Body
Update the startLoadTest() function to add a line of code that reads the response body before closing it, along with the related import.
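A minimal sketch of the updated function, assuming the body is drained with io.Copy into io.Discard (on Go versions older than 1.16, use ioutil.Discard), which requires adding io to the imports:

```go
// startLoadTest now reads the response body to completion before closing it,
// so the Transport can return the connection to its idle pool for re-use.
func startLoadTest() {
	count := 0
	for {
		resp, err := http.Get("http://localhost:8080/")
		if err != nil {
			panic(fmt.Sprintf("Got error: %v", err))
		}
		// Read and discard the body: this is the fix.
		if _, err := io.Copy(io.Discard, resp.Body); err != nil {
			panic(fmt.Sprintf("Got error reading body: %v", err))
		}
		resp.Body.Close()
		log.Printf("Finished GET request #%v", count)
		count++
	}
}
```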
Now when you re-run it, calling netstat -n | grep -i 8080 | grep -i time_wait | wc -l
while it’s running will return 0.
Repro #2: Create excessive TIME_WAIT connections by exceeding connection pool
Another way to end up with excessive connections in the TIME_WAIT
state is to consistently exceed the connection pool, causing many short-lived connections to be opened.
Here’s some code that starts up 100 goroutines, all trying to make requests concurrently, where each request has a 50 ms delay:
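(A sketch of such a program: the handler sleeps 50 ms to simulate the per-request delay, while the goroutine bookkeeping and log messages are illustrative assumptions. This time each goroutine reads the body before closing it.)

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// startWebserver starts an HTTP server on port 8080 that sleeps 50 ms
// before responding, simulating a slow-ish backend.
func startWebserver() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Millisecond * 50)
		fmt.Fprintf(w, "Hello")
	})
	go http.ListenAndServe(":8080", nil)
}

// startLoadTest spawns 100 goroutines that all hammer the server concurrently.
// Each goroutine reads and closes the response body, so the body handling is correct.
func startLoadTest() {
	for i := 0; i < 100; i++ {
		go func(id int) {
			count := 0
			for {
				resp, err := http.Get("http://localhost:8080/")
				if err != nil {
					panic(fmt.Sprintf("Got error: %v", err))
				}
				if _, err := io.Copy(io.Discard, resp.Body); err != nil {
					panic(fmt.Sprintf("Got error reading body: %v", err))
				}
				resp.Body.Close()
				log.Printf("Goroutine %d finished GET request #%d", id, count)
				count++
			}
		}(i)
	}
}

func main() {
	startWebserver()
	time.Sleep(100 * time.Millisecond) // give the listener a moment to come up
	startLoadTest()
	select {} // block forever while the goroutines run
}
```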
In another shell, run the same netstat command and note that the number of connections in the TIME_WAIT state is growing again, even though the response body is being read:
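```bash
# illustrative output: TIME_WAIT sockets pile up again
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
2100
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
5300
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
9800
```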
To understand what’s going on, we’ll need to dig a little deeper into the TIME_WAIT state.
What is the socket TIME_WAIT
state anyway?
So what’s going on here? We are creating lots of short-lived TCP connections, and the Linux kernel networking stack keeps tabs on recently closed connections to prevent certain problems.
From The TIME-WAIT state in TCP and Its Effect on Busy Servers:
The purpose of TIME-WAIT is to prevent delayed packets from one connection being accepted by a later connection. Concurrent connections are isolated by other mechanisms, primarily by addresses, ports, and sequence numbers[1].
Why so many TIME_WAIT sockets? What about connection re-use?
By default, the Go HTTP client does connection pooling. Rather than closing a socket connection after an HTTP request, it adds it to an idle connection pool, and if you make another HTTP request to the same host before the idle connection timeout (90 seconds by default), it re-uses that existing connection rather than creating a new one.
This will keep the number of total socket connections low, as long as the pool doesn’t fill up. If the pool is full of established socket connections, then it will just create a new socket connection for the HTTP request and use that.
So how big is the connection pool? A quick look into transport.go tells us:
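(The following is excerpted from net/http/transport.go; the exact set of fields varies between Go versions.)

```go
var DefaultTransport RoundTripper = &Transport{
	Proxy: ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}).DialContext,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
}

// DefaultMaxIdleConnsPerHost is the default value of Transport's
// MaxIdleConnsPerHost.
const DefaultMaxIdleConnsPerHost = 2
```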
- The MaxIdleConns: 100 setting sets the size of the connection pool to 100 connections, but with one major caveat: idle connections are also capped on a per-host basis. See the notes on DefaultMaxIdleConnsPerHost below for the implications of this.
- The IdleConnTimeout is set to 90 seconds, meaning that after a connection sits unused in the pool for 90 seconds, it is removed from the pool and closed.
- The DefaultMaxIdleConnsPerHost = 2 constant just below it means that even though the entire connection pool is set to 100, there is a per-host cap of only 2 connections!
In the above example, there are 100 goroutines making concurrent requests to the same host, but the idle connection pool can only hold 2 sockets for that host. So in the first “round” of goroutines finishing their HTTP requests, 2 of the sockets will remain open in the pool, while the remaining 98 connections will be closed and end up in the TIME_WAIT state.
Since this is happening in a loop, you will quickly accumulate thousands or tens of thousands of connections in the TIME_WAIT
state. Eventually, for that particular host at least, you will run out of ephemeral ports and not be able to open new client connections. For a load testing tool, this is bad news.
Fix: Tune the HTTP client to increase the connection pool size
Here’s how to fix this issue.
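A sketch of the tuned version follows; it is the same program as in Repro #2, except that requests go through an http.Client whose Transport has MaxIdleConns and MaxIdleConnsPerHost raised to 100:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// A client whose Transport is tuned so that the idle connection pool can
// actually hold the 100 concurrent connections we intend to use.
var httpClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
	},
}

func startWebserver() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Millisecond * 50)
		fmt.Fprintf(w, "Hello")
	})
	go http.ListenAndServe(":8080", nil)
}

func startLoadTest() {
	for i := 0; i < 100; i++ {
		go func(id int) {
			count := 0
			for {
				resp, err := httpClient.Get("http://localhost:8080/")
				if err != nil {
					panic(fmt.Sprintf("Got error: %v", err))
				}
				if _, err := io.Copy(io.Discard, resp.Body); err != nil {
					panic(fmt.Sprintf("Got error reading body: %v", err))
				}
				resp.Body.Close()
				log.Printf("Goroutine %d finished GET request #%d", id, count)
				count++
			}
		}(i)
	}
}

func main() {
	startWebserver()
	time.Sleep(100 * time.Millisecond) // give the listener a moment to come up
	startLoadTest()
	select {} // block forever while the goroutines run
}
```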
This bumps the total maximum idle connections (connection pool size) and the per-host connection pool size to 100.
Now when you run this and check the netstat output, the number of TIME_WAIT connections stays at 0:
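```bash
# the count stays at 0 while the load test runs
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
0
$ netstat -n | grep -i 8080 | grep -i time_wait | wc -l
0
```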
The problem is now fixed!
If you have higher concurrency requirements, you may want to bump this number to something higher than 100.