Headscale stops accepting connections after ~500 nodes (likely 512) (0.23-alpha2) #599

Closed
opened 2025-12-29 02:21:01 +01:00 by adam · 24 comments

Originally created by @jwischka on GitHub (Dec 16, 2023).

Bug description

I have a large headscale instance (~550 nodes). 0.22.3 has extreme CPU usage (around 30 cores) but is able to handle all clients. 0.23-alpha2 has substantially better CPU performance, but stops accepting connections after about 500 nodes. The CLI shows "context deadline exceeded" for all queries (e.g. "headscale nodes/users/routes list") and new clients are unable to join.

CPU usage after connections stop is relatively modest (<50%), and connected clients appear to be able to access each other (e.g. ping/login) as expected.

Environment

Headscale: 0.23-alpha2
Tailscale: various versions, mostly 1.54+; mostly Linux, with some macOS and Windows.
OS: Headscale installed in privileged Proxmox LXC container (Ubuntu 20.04.6), reverse proxied behind nginx per official docs
Kernel: 6.2.16
Resources: 36 cores, 8GB RAM, 4GB swap (ram usage is quite small)
DB: Using postgres backend (sqlite would die after about 100 clients on 0.22.X with similar symptoms)
Config: ACLs are in use, based on users (a rough sketch follows the checklist below). Two users have access to all nodes; most nodes have access only to a relatively small set of other nodes via ACL. Specifically, there are 7 nodes that have access to everything, but on almost all other nodes "tailscale status" will return only 8 devices.

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container
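For illustration only: a minimal sketch of the user-based policy shape described above, with hypothetical user names. Headscale's documented policy format is Tailscale-style HuJSON; the YAML below is only meant to sketch the structure, not to be a drop-in policy file.

```yaml
# Hypothetical sketch of the described ACL shape (user names are invented):
acls:
  # Two "admin" users can reach every node.
  - action: accept
    src: ["admin-user-1", "admin-user-2"]
    dst: ["*:*"]
  # A typical restricted user only reaches the handful of nodes for its own site,
  # so "tailscale status" on those nodes shows only ~8 peers.
  - action: accept
    src: ["site-user"]
    dst: ["site-user:*"]
```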

To Reproduce

Difficult to reproduce, I suspect, but: have a large number of clients connect in.

Given that the behavior is a direct change from 0.22.3 -> 0.23-alpha2, I don't think the container or reverse proxy have anything to do with it. I can, with some effort, bypass the reverse proxy and do a direct port-forward from the firewall instead, but running on bare metal would be substantially more difficult.

Because I'm in a container, I can easily snapshot/test/revert possible fixes.

adam added the bug label 2025-12-29 02:21:01 +01:00
adam closed this issue 2025-12-29 02:21:01 +01:00

@jwischka commented on GitHub (Dec 19, 2023):

Update to this: I have two headscale instances behind my nginx proxy, and in splitting out the connection status, it looks like this is cutting off at 512 nodes. I'm not sure if that's helpful, but since it's a magic number of sorts, it might be.

@TotoTheDragon commented on GitHub (Feb 5, 2024):

@jwischka Have you made sure you are not hitting your file descriptor limit? If `ulimit -n` is still 1024 (awfully close to double the node count at which your issues start), try raising it with `ulimit -n 4096` and see if that fixes the issue.

@jwischka commented on GitHub (Feb 5, 2024):

@TotoTheDragon Negative, unfortunately. I just tried raising the ulimit to 16384 on alpha3; the same issue persists. Almost immediately I get:

root@headscale:/home/user# headscale nodes list
2024-02-05T21:00:18Z TRC DNS configuration loaded dns_config={"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Cannot get nodes: context deadline exceeded

@TotoTheDragon commented on GitHub (Feb 12, 2024):

@jwischka Would you be able to test this with the current version of main?

@jwischka commented on GitHub (Feb 12, 2024):

@TotoTheDragon Forgive my ignorance, but is there a snapshot build available? I don't have a build environment set up and won't be able to create one on short notice.

@kradalby commented on GitHub (Feb 15, 2024):

@jwischka I just released a new alpha4; please give that a go (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4) and report back.

@jwischka commented on GitHub (Feb 15, 2024):

@kradalby

After updating and fixing the database config section, I'm getting the following error:

FTL Migration failed: LastInsertId is not supported by this driver error="LastInsertId is not supported by this driver"

(postgres 12)
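For anyone else hitting "fixing the database config section": the 0.23 alphas move the old top-level `db_*` keys into a nested `database:` block. A rough sketch, assuming the layout in the project's config-example.yaml (placeholder values; exact key names may differ between alphas):

```yaml
# Sketch of the restructured database section (placeholder values):
database:
  type: postgres        # previously db_type
  postgres:
    host: 127.0.0.1     # previously db_host
    port: 5432          # previously db_port
    name: headscale     # previously db_name
    user: headscale     # previously db_user
    pass: secret        # previously db_pass
    ssl: false          # previously db_ssl
```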

@kradalby commented on GitHub (Feb 16, 2024):

Yep, on it. It was a regression in an upstream dependency, see #1755.

@kradalby commented on GitHub (Feb 19, 2024):

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

The postgres issues should now be resolved.

@jwischka commented on GitHub (Feb 21, 2024):

@kradalby I think this may be resolved. I had some issues updating the config file, but rebuilt it. Memory usage is way up, but overall processor usage is substantially lower than 0.22.2. It takes a while for a client to connect, but clients do appear to reliably connect even when there are a lot of them. I'll monitor and report back.

@kradalby commented on GitHub (Feb 21, 2024):

Thank you. Could you elaborate on how much "way up" is in terms of memory? I would expect it to be up, since we now keep more state in memory, but it would still be nice to compare some numbers.

@jwischka commented on GitHub (Feb 22, 2024):

@kradalby I ran for months at about ~700MB of the 4GB of memory in my container instance; it would vary between 500-900MB. After installing alpha5 I'm at 8GB, after bumping the available memory to 8GB. It's actually consuming so much memory I can't log in. Disk I/O also increased precipitously, probably (possibly) from swapping?

Initial usage doesn't appear to be that bad, but something blows up at some point. It goes from using about 200MB to about 2.4GB after 2-3 minutes. It jumps another 500MB or so a couple of minutes later. Periodically I get massive CPU spikes (2500%) with an accompanying 6GB or so usage. Every time this happens it seems to grow another 500MB or so.

As mentioned in another thread, I'm getting a ton of errors like the following in the logs:

Feb 22 00:53:04 headscale-server-name headscale[407]: 2024-02-22T00:53:04Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientXXXX mkey=mkey:c691d490de423e2daccdd980f217e827b1431788c388ed0baf2c1c0c40413637 origin=poll-nodeupdate-onlinestatus
Feb 22 00:53:05 headscale-server-name headscale[407]: 2024-02-22T00:53:05Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientYYYY mkey=mkey:ff1cf9ccc345d336d3c41b47f060bfeafb38ec8a19fe0f5f97243f17c01ea77a origin=poll-nodeupdate-peers-patch

Also a lot of:

Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa205016 node_key=[n+gDM] omitPeers=false readOnly=false stream=true
Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa215180 node_key=[vIPnF] omitPeers=false readOnly=false stream=true

I may have to revert if this continues, since this is a semi-production machine.

@jwischka commented on GitHub (Feb 22, 2024):

@kradalby Further update: it looks like once the memory runs out there are a lot of other connection issues. I go from ~500 nodes connected to between 100 and 250. Let me know if there's something else I can do to help debug this for you.

@kradalby commented on GitHub (Mar 4, 2024):

@jwischka I've started some experimental work in https://github.com/juanfont/headscale/pull/1791, which should both improve performance and add some tunables for high-traffic usage, but it isn't done yet, so I would not recommend trying it in a prod env. If you have a similar non-prod env, you can.

@jwischka commented on GitHub (Mar 4, 2024):

@kradalby Sounds good. Let me know when you think it's semi-ready and I can give it a go. I've got the ability to snapshot the container and roll back, but obviously I don't want to do that a ton if there's a chance of breaking things.

@ananthb commented on GitHub (Apr 10, 2024):

I'm looking to run a large cluster in about a year or so, and I can contribute dev time to this effort. Anything I can help with?

@kradalby commented on GitHub (Apr 17, 2024):

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@ananthb commented on GitHub (Apr 17, 2024):

I can't find release binaries for alpha6, so I'm running alpha7. I'm already seeing reduced memory usage a couple of hours in. Anything in particular I should look for?

I see new log lines talking about partial updates.

@kradalby commented on GitHub (Apr 17, 2024):

Sorry, 6 quickly got replaced with 7 because of an error. In principle I would say that "nodes stay connected over time" is the main goal, i.e. that none of them lose connection. The main change for performance and resource usage is a change in how updates are batched and sent to the clients.

@jwischka commented on GitHub (Apr 17, 2024):

@kradalby Is alpha 7 working with postgres? I'm getting errors.

Also, it's showing that Alpha 8 is out but not posted?

Thanks

@jwischka commented on GitHub (Apr 17, 2024):

@kradalby I got alpha 8 to work.

At least so far things look stable, but it looks like the issue (~500 nodes) still persists. When I try to join a client machine that requests the entire tailnet (i.e., a machine that should be able to connect to all 500 nodes), I get the following errors after things have settled down:

Apr 17 14:59:11 headscale headscale[21834]: 2024-04-17T14:59:11Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > a node sending a MapRequest with Noise protocol node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:11 headscale headscale[21834]: 2024-04-17T14:59:11Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > aquiring lock to check stream node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:14 headscale headscale[21834]: 2024-04-17T14:59:14Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > a node sending a MapRequest with Noise protocol node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:14 headscale headscale[21834]: 2024-04-17T14:59:14Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > aquiring lock to check stream node=x13s node.id=582 omitPeers=false readOnly=false stream=true

If I kill the process and the node is one of the first to try to join the tailnet, it joins quite quickly. You may understand these errors better than I do, but it looks to me like the client is timing out and requesting things again. These messages appear every few seconds, and I'm getting a lot of them (thousands) in the log files.

So it looks like memory usage, processor usage, etc are better, but the connection issues may still remain. It looks like I've leveled out at 538 connections. I can try to add a few different clients here in a bit and see if it goes up, or how long it takes to connect clients. Is there anything I can supply you that would be helpful?

@kradalby commented on GitHub (Apr 17, 2024):

There are two new, undocumented tuning options (please note they might change and/or be removed) which could help us figure out the holdup:

https://github.com/juanfont/headscale/blob/main/hscontrol/types/config.go#L235-L236

The first is the maximum amount of time headscale waits for updates before bundling them and sending them to the node.

The second is how many updates a node can have queued before it will block.

Can you experiment with these numbers?

I would not set the wait higher than 1s; lower values will flush to the client more often. I would expect a lower wait to mean lower memory but higher CPU.

For the batcher, you might see a tiny increase in memory, but I doubt it, as it's mostly just more items allowed on a channel. You could try quite a few different large numbers and report back: 100, 300, 500, even 1000.

It would help if you could try different combinations of these and make a table with your observations; it's hard for me to simulate environments like yours.

Thanks.
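For reference, a hedged sketch of where these two knobs live in the config file. The second key name appears verbatim later in this thread; the first (`batch_change_delay`) and both default values are assumptions based on the linked config.go region, and since the options are undocumented they may change between alphas.

```yaml
# Undocumented tuning knobs discussed above (names/defaults may change):
tuning:
  # Max time updates are buffered before being bundled and sent to a node.
  # Suggested to keep at or below 1s; lower values flush more often
  # (likely lower memory, higher CPU).
  batch_change_delay: 800ms   # assumed key name and default
  # How many pending updates a node's map session channel can hold before
  # senders block; try e.g. 100, 300, 500, 1000 and compare.
  node_mapsession_buffered_chan_size: 30
```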

@jwischka commented on GitHub (Apr 17, 2024):

Some updates -

As soon as I start to increase `node_mapsession_buffered_chan_size`, I immediately get a lot of crashes. Even a value of 35 results in reliable crashes. The error message is longer than my console log, but it ends with:

Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Rows).initContextClose.gowrap1()
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977
Apr 17 19:53:27 headscale headscale[3521]: runtime.goexit({})
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/runtime/asm_amd64.s:1695 +0x1
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*Rows).initContextClose in goroutine 4029
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977 +0x150
Apr 17 19:53:27 headscale headscale[3521]: goroutine 15485 [chan receive]:
Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Tx).awaitDone(0xc004717a00)
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2202 +0x2b
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*DB).beginDC in goroutine 4029
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:1915 +0x20d
Apr 17 19:53:27 headscale headscale[3521]: goroutine 15486 [runnable]:
Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Rows).initContextClose.gowrap1()
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977
Apr 17 19:53:27 headscale headscale[3521]: runtime.goexit({})
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/runtime/asm_amd64.s:1695 +0x1
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*Rows).initContextClose in goroutine 3521
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977 +0x150
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Failed with result 'exit-code'.
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Consumed 13.084s CPU time.

Setting the value back down to 30 seems to calm things down.

Interestingly, if I increase the postgres max_open_conns value I can get similar errors, and I can also generate HTTP-related errors:

Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlhttp/server.go:193 +0x5b
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).readNLocked(0xc002e94d20, 0x3)
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:115 +0xe2
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).decryptOneLocked(0xc002e94d20)
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:223 +0x1f4
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).Read(0xc002e94d20, {0xc00278b0f8, 0x18, 0x458401?})
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:253 +0x117
Apr 17 19:58:12 headscale headscale[6319]: io.ReadAtLeast({0x22811c0, 0xc002e94d20}, {0xc00278b0f8, 0x18, 0x18}, 0x18)
Apr 17 19:58:12 headscale headscale[6319]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/io/io.go:335 +0x90
Apr 17 19:58:12 headscale headscale[6319]: io.ReadFull(...)
Apr 17 19:58:12 headscale headscale[6319]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/io/io.go:354
Apr 17 19:58:12 headscale headscale[6319]: golang.org/x/net/http2.(*serverConn).readPreface.func1()
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/golang.org/x/net@v0.22.0/http2/server.go:1055 +0x87
Apr 17 19:58:12 headscale headscale[6319]: created by golang.org/x/net/http2.(*serverConn).readPreface in goroutine 49462
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/golang.org/x/net@v0.22.0/http2/server.go:1052 +0xae
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Failed with result 'exit-code'.
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Consumed 32.458s CPU time.

With the 0.23 alphas I've been leaving the node update check interval at 10s, but I bumped it to 30s and things seem to behave better (I had it at 35s on 0.22).

I'm not really having any CPU/memory issues at the moment, but the real problem is that when I try to connect with one of my "main" users that ought to be able to see all of the other nodes (via ACL policy), it's never able to join. It just hangs for minutes without joining.

Also worth noting, because it's probably relevant: ACLs seem to be broken on this instance (unless there was a change to ACL behavior that I missed), and all of my 500+ nodes can see all of the other 500+ nodes, which is... well, kind of bad. Probably bad enough that I need to revert. I'm not exactly sure what the issue is, because I have another instance that I also updated to alpha8 and the ACLs work fine there. Both instances use a similar ACL structure, so it's somewhat odd that it works in one place and not the other.

Thoughts? I saved the config file this time, so it's pretty easy to try new stuff at this point.
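To make these combinations easier to reproduce, here is a hedged sketch of where the knobs mentioned in this comment sit in the config, using the values that behaved best here. The placement of `max_open_conns` under `database.postgres` is an assumption; check the config-example.yaml for your alpha.

```yaml
# Settings experimented with in this comment (values that behaved best here):
node_update_check_interval: 30s   # had been left at 10s; 30s behaved better here

database:
  type: postgres
  postgres:
    # Raising this produced similar crashes, so it was left conservative.
    max_open_conns: 10            # placeholder; the thread doesn't state the value used

tuning:
  # Values above ~30 produced reliable crashes on this instance.
  node_mapsession_buffered_chan_size: 30
```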

@kradalby commented on GitHub (May 24, 2024):

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.
