Headscale stops accepting connections after ~500 nodes (likely 512) (0.23-alpha2) #599

Closed
opened 2025-12-29 02:21:01 +01:00 by adam · 24 comments

Originally created by @jwischka on GitHub (Dec 16, 2023).

Bug description

I have a large headscale instance (~550 nodes). 0.22.3 has extreme CPU usage (around 30 cores) but is able to handle all clients. 0.23-alpha2 has substantially better CPU performance, but stops accepting connections after about 500 nodes. The CLI shows "context deadline exceeded" for all queries (e.g. "headscale nodes/users/routes list") and new clients are unable to join.

CPU usage after connections stop is relatively modest (<50%), and connected clients appear to be able to access each other (e.g. ping/login) as expected.

Environment

Headscale: 0.23-alpha2
Tailscale: various versions, mostly 1.54+; mostly Linux, with some macOS and Windows.
OS: Headscale installed in privileged Proxmox LXC container (Ubuntu 20.04.6), reverse proxied behind nginx per official docs
Kernel: 6.2.16
Resources: 36 cores, 8GB RAM, 4GB swap (ram usage is quite small)
DB: Using postgres backend (sqlite would die after about 100 clients on 0.22.X with similar symptoms)
Config: ACLs are in use, based on users (a rough sketch follows the checklist below). Two users have access to all nodes; most nodes have access only to a relatively small set of other nodes via ACL. Specifically, there are 7 nodes that have access to everything, but on almost all other nodes "tailscale status" will return only 8 devices.

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container
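For illustration only: a minimal sketch of the user-based policy shape described above, with hypothetical user names. Headscale's documented policy format is Tailscale-style HuJSON; the YAML below is only meant to sketch the structure, not to be a drop-in policy file.

```yaml
# Hypothetical sketch of the described ACL shape (user names are invented):
acls:
  # Two "admin" users can reach every node.
  - action: accept
    src: ["admin-user-1", "admin-user-2"]
    dst: ["*:*"]
  # A typical restricted user only reaches the handful of nodes for its own site,
  # so "tailscale status" on those nodes shows only ~8 peers.
  - action: accept
    src: ["site-user"]
    dst: ["site-user:*"]
```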

To Reproduce

Difficult to reproduce, I suspect, but: have a large number of clients connect in.

Given that the behavior is a direct change from 0.22.3 -> 0.23-alpha2, I don't think the container or reverse proxy have anything to do with it. I can, with some effort, bypass the reverse proxy and do a direct port-forward from the firewall instead, but running on bare metal would be substantially more difficult.

Because I'm in a container, I can easily snapshot/test/revert possible fixes.

adam added the bug label 2025-12-29 02:21:01 +01:00
adam closed this issue 2025-12-29 02:21:01 +01:00

@jwischka commented on GitHub (Dec 19, 2023):

Update to this: I have two headscale instances behind my nginx proxy, and in splitting out the connection status, it looks like this is cutting off at 512 nodes. I'm not sure if that's helpful, but since it's a magic number of sorts, it might be.

@TotoTheDragon commented on GitHub (Feb 5, 2024):

@jwischka Have you made sure you are not hitting your file descriptor limit? If `ulimit -n` is still 1024 (awfully close to double the node count at which your issues start), try raising it with `ulimit -n 4096` and see if that fixes the issue.

@jwischka commented on GitHub (Feb 5, 2024):

@TotoTheDragon Negative, unfortunately. I just tried raising the ulimit to 16384 on alpha3; the same issue persists. Almost immediately I get:

root@headscale:/home/user# headscale nodes list
2024-02-05T21:00:18Z TRC DNS configuration loaded dns_config={"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Cannot get nodes: context deadline exceeded

@TotoTheDragon commented on GitHub (Feb 12, 2024):

@jwischka Would you be able to test this with the current version of main?

@jwischka commented on GitHub (Feb 12, 2024):

@TotoTheDragon Forgive my ignorance, but is there a snapshot build available? I don't have a build environment set up and won't be able to create one on short notice.

@kradalby commented on GitHub (Feb 15, 2024):

@jwischka I just released a new alpha4; please give that a go (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4) and report back.

@jwischka commented on GitHub (Feb 15, 2024):

@kradalby

After updating and fixing the database config section, I'm getting the following error:

FTL Migration failed: LastInsertId is not supported by this driver error="LastInsertId is not supported by this driver"

(postgres 12)
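For anyone else hitting "fixing the database config section": the 0.23 alphas move the old top-level `db_*` keys into a nested `database:` block. A rough sketch, assuming the layout in the project's config-example.yaml (placeholder values; exact key names may differ between alphas):

```yaml
# Sketch of the restructured database section (placeholder values):
database:
  type: postgres        # previously db_type
  postgres:
    host: 127.0.0.1     # previously db_host
    port: 5432          # previously db_port
    name: headscale     # previously db_name
    user: headscale     # previously db_user
    pass: secret        # previously db_pass
    ssl: false          # previously db_ssl
```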

@kradalby commented on GitHub (Feb 16, 2024):

Yep, on it. It was a regression in an upstream dependency, see #1755.

@kradalby commented on GitHub (Feb 19, 2024):

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

The postgres issues should now be resolved.

@jwischka commented on GitHub (Feb 21, 2024):

@kradalby I think this may be resolved. I had some issues updating the config file, but rebuilt it. Memory usage is way up, but overall processor usage is substantially lower than 0.22.2. It takes a while for a client to connect, but clients do appear to reliably connect even when there are a lot of them. I'll monitor and report back.

@kradalby commented on GitHub (Feb 21, 2024):

Thank you. Could you elaborate on how much "way up" is in terms of memory? I would expect it to be up, since we now keep more state in memory, but it would still be nice to compare some numbers.

@jwischka commented on GitHub (Feb 22, 2024):

@kradalby I ran for months at about ~700MB of the 4GB of memory in my container instance; it would vary between 500-900MB. After installing alpha5 I'm at 8GB, after bumping the available memory to 8GB. It's actually consuming so much memory I can't log in. Disk I/O also increased precipitously, probably (possibly) from swapping?

Initial usage doesn't appear to be that bad, but something blows up at some point. It goes from using about 200MB to about 2.4GB after 2-3 minutes. It jumps another 500MB or so a couple of minutes later. Periodically I get massive CPU spikes (2500%) with an accompanying 6GB or so usage. Every time this happens it seems to grow another 500MB or so.

As mentioned in another thread, I'm getting a ton of errors like the following in the logs:

Feb 22 00:53:04 headscale-server-name headscale[407]: 2024-02-22T00:53:04Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientXXXX mkey=mkey:c691d490de423e2daccdd980f217e827b1431788c388ed0baf2c1c0c40413637 origin=poll-nodeupdate-onlinestatus
Feb 22 00:53:05 headscale-server-name headscale[407]: 2024-02-22T00:53:05Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientYYYY mkey=mkey:ff1cf9ccc345d336d3c41b47f060bfeafb38ec8a19fe0f5f97243f17c01ea77a origin=poll-nodeupdate-peers-patch

Also a lot of:

Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa205016 node_key=[n+gDM] omitPeers=false readOnly=false stream=true
Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa215180 node_key=[vIPnF] omitPeers=false readOnly=false stream=true

I may have to revert if this continues, since this is a semi-production machine.

@jwischka commented on GitHub (Feb 22, 2024):

@kradalby Further update: it looks like once the memory runs out there are a lot of other connection issues. I go from ~500 nodes connected to between 100 and 250. Let me know if there's something else I can do to help debug this for you.

@kradalby commented on GitHub (Mar 4, 2024):

@jwischka I've started some experimental work in https://github.com/juanfont/headscale/pull/1791, which should both improve performance and add some tunables for high-traffic usage, but it isn't done yet, so I would not recommend trying it in a prod env. If you have a similar non-prod env, you can.

@jwischka commented on GitHub (Mar 4, 2024):

@kradalby Sounds good. Let me know when you think it's semi-ready and I can give it a go. I've got the ability to snapshot the container and roll back, but obviously I don't want to do that a ton if there's a chance of breaking things.

@ananthb commented on GitHub (Apr 10, 2024):

I'm looking to run a large cluster in about a year or so, and I can contribute dev time to this effort. Anything I can help with?

@kradalby commented on GitHub (Apr 17, 2024):

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@ananthb commented on GitHub (Apr 17, 2024):

I can't find release binaries for alpha6, so I'm running alpha7. I'm already seeing reduced memory usage a couple of hours in. Anything in particular I should look for?

I see new log lines talking about partial updates.

@kradalby commented on GitHub (Apr 17, 2024):

Sorry, 6 quickly got replaced with 7 because of an error. In principle I would say that "nodes stay connected over time" is the main goal, i.e. that none of them lose connection. The main change for performance and resource usage is a change in how updates are batched and sent to the clients.

@jwischka commented on GitHub (Apr 17, 2024):

@kradalby Is alpha 7 working with postgres? I'm getting errors.

Also, it's showing that Alpha 8 is out but not posted?

Thanks

@jwischka commented on GitHub (Apr 17, 2024):

@kradalby I got alpha 8 to work.

At least so far things look stable, but it looks like the issue (~500 nodes) still persists. When I try to join a client machine that requests the entire tailnet (i.e., a machine that should be able to connect to all 500 nodes), I get the following errors after things have settled down:

Apr 17 14:59:11 headscale headscale[21834]: 2024-04-17T14:59:11Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > a node sending a MapRequest with Noise protocol node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:11 headscale headscale[21834]: 2024-04-17T14:59:11Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > aquiring lock to check stream node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:14 headscale headscale[21834]: 2024-04-17T14:59:14Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > a node sending a MapRequest with Noise protocol node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:14 headscale headscale[21834]: 2024-04-17T14:59:14Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > aquiring lock to check stream node=x13s node.id=582 omitPeers=false readOnly=false stream=true

If I kill the process and the node is one of the first to try to join the tailnet, it joins quite quickly. You may understand these errors better than I do, but it looks to me like the client is timing out and requesting things again. These messages appear every few seconds, and I'm getting a lot of them (thousands) in the log files.

So it looks like memory usage, processor usage, etc are better, but the connection issues may still remain. It looks like I've leveled out at 538 connections. I can try to add a few different clients here in a bit and see if it goes up, or how long it takes to connect clients. Is there anything I can supply you that would be helpful?

@kradalby commented on GitHub (Apr 17, 2024):

There are two new, undocumented tuning options (please note they might change and/or be removed) which could help us figure out the holdup:

https://github.com/juanfont/headscale/blob/main/hscontrol/types/config.go#L235-L236

The first is the maximum amount of time headscale waits for updates before bundling them and sending them to the node.

The second is how many updates a node can have queued before it will block.

Can you experiment with these numbers?

I would not set the wait higher than 1s; lower values will flush to the client more often. I would expect a lower wait to mean lower memory but higher CPU.

For the batcher, you might see a tiny increase in memory, but I doubt it, as it's mostly just more items allowed on a channel. You could try quite a few different large numbers and report back: 100, 300, 500, even 1000.

It would help if you could try different combinations of these and make a table with your observations; it's hard for me to simulate environments like yours.

Thanks.
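For reference, a hedged sketch of where these two knobs live in the config file. The second key name appears verbatim later in this thread; the first (`batch_change_delay`) and both default values are assumptions based on the linked config.go region, and since the options are undocumented they may change between alphas.

```yaml
# Undocumented tuning knobs discussed above (names/defaults may change):
tuning:
  # Max time updates are buffered before being bundled and sent to a node.
  # Suggested to keep at or below 1s; lower values flush more often
  # (likely lower memory, higher CPU).
  batch_change_delay: 800ms   # assumed key name and default
  # How many pending updates a node's map session channel can hold before
  # senders block; try e.g. 100, 300, 500, 1000 and compare.
  node_mapsession_buffered_chan_size: 30
```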

@jwischka commented on GitHub (Apr 17, 2024):

Some updates -

As soon as I start to increase `node_mapsession_buffered_chan_size`, I immediately get a lot of crashes. Even a value of 35 results in reliable crashes. The error message is longer than my console log, but it ends with:

Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Rows).initContextClose.gowrap1()
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977
Apr 17 19:53:27 headscale headscale[3521]: runtime.goexit({})
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/runtime/asm_amd64.s:1695 +0x1
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*Rows).initContextClose in goroutine 4029
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977 +0x150
Apr 17 19:53:27 headscale headscale[3521]: goroutine 15485 [chan receive]:
Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Tx).awaitDone(0xc004717a00)
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2202 +0x2b
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*DB).beginDC in goroutine 4029
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:1915 +0x20d
Apr 17 19:53:27 headscale headscale[3521]: goroutine 15486 [runnable]:
Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Rows).initContextClose.gowrap1()
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977
Apr 17 19:53:27 headscale headscale[3521]: runtime.goexit({})
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/runtime/asm_amd64.s:1695 +0x1
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*Rows).initContextClose in goroutine 3521
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977 +0x150
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Failed with result 'exit-code'.
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Consumed 13.084s CPU time.

Setting the value back down to 30 seems to calm things down.

Interestingly, if I increase the postgres max_open_conns value I can get similar errors, and I can also generate HTTP-related errors:

Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlhttp/server.go:193 +0x5b
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).readNLocked(0xc002e94d20, 0x3)
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:115 +0xe2
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).decryptOneLocked(0xc002e94d20)
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:223 +0x1f4
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).Read(0xc002e94d20, {0xc00278b0f8, 0x18, 0x458401?})
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:253 +0x117
Apr 17 19:58:12 headscale headscale[6319]: io.ReadAtLeast({0x22811c0, 0xc002e94d20}, {0xc00278b0f8, 0x18, 0x18}, 0x18)
Apr 17 19:58:12 headscale headscale[6319]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/io/io.go:335 +0x90
Apr 17 19:58:12 headscale headscale[6319]: io.ReadFull(...)
Apr 17 19:58:12 headscale headscale[6319]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/io/io.go:354
Apr 17 19:58:12 headscale headscale[6319]: golang.org/x/net/http2.(*serverConn).readPreface.func1()
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/golang.org/x/net@v0.22.0/http2/server.go:1055 +0x87
Apr 17 19:58:12 headscale headscale[6319]: created by golang.org/x/net/http2.(*serverConn).readPreface in goroutine 49462
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/golang.org/x/net@v0.22.0/http2/server.go:1052 +0xae
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Failed with result 'exit-code'.
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Consumed 32.458s CPU time.

With the 0.23 alphas I've been leaving the node update check interval at 10s, but I bumped it to 30s and things seem to behave better (I had it at 35s on 0.22).

I'm not really having any CPU/memory issues at the moment, but the real problem is that when I try to connect with one of my "main" users that ought to be able to see all of the other nodes (via ACL policy), it's never able to join. It just hangs for minutes without joining.

Also worth noting, because it's probably relevant: ACLs seem to be broken on this instance (unless there was a change to ACL behavior that I missed), and all of my 500+ nodes can see all of the other 500+ nodes, which is... well, kind of bad. Probably bad enough that I need to revert. I'm not exactly sure what the issue is, because I have another instance that I also updated to alpha8 and the ACLs work fine there. Both instances use a similar ACL structure, so it's somewhat odd that it works in one place and not the other.

Thoughts? I saved the config file this time, so it's pretty easy to try new stuff at this point.
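To make these combinations easier to reproduce, here is a hedged sketch of where the knobs mentioned in this comment sit in the config, using the values that behaved best here. The placement of `max_open_conns` under `database.postgres` is an assumption; check the config-example.yaml for your alpha.

```yaml
# Settings experimented with in this comment (values that behaved best here):
node_update_check_interval: 30s   # had been left at 10s; 30s behaved better here

database:
  type: postgres
  postgres:
    # Raising this produced similar crashes, so it was left conservative.
    max_open_conns: 10            # placeholder; the thread doesn't state the value used

tuning:
  # Values above ~30 produced reliable crashes on this instance.
  node_mapsession_buffered_chan_size: 30
```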

@kradalby commented on GitHub (May 24, 2024):

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.
