[Bug] 'headscale' commands unusable under load #981

Closed
opened 2025-12-29 02:27:00 +01:00 by adam · 3 comments
Owner

Originally created by @arduino43 on GitHub (Mar 19, 2025).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I've noticed that Headscale becomes nearly unusable beyond roughly 300 clients. I now have 554 after switching to a more powerful system, and it is completely maxed out. All clients have identical hardware specs and run Debian.

Headscale server (dedicated)
CPU: AMD EPYC 7313
Memory: 128GB
Network: 5Gbps
Headscale version: v0.25.1

1.) Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 times out of 10. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few Mb per day pass to each client.

2.) After 300+ nodes, pinging becomes finicky and about half of the nodes stop responding. They respond again once systems are removed.

3.) I have one very simple ACL that allows admins access to all nodes, and only one node is an admin.

4.) The config is basic; the only addition is "node_update_check_interval: 90s" in an attempt to minimize load, but I'm not seeing much difference with it enabled or disabled.

I did see a few existing issues regarding CPU usage, but most were resolved by updates. I realize this is a large number of clients; however, with no traffic passing, I was expecting much lower load and fewer intermittent issues.

Expected Behavior

System runs without issue

Steps To Reproduce

1.) Add clients to the server; after 300+ clients the system stops functioning correctly.

Environment

- OS: Debian 12
- Headscale version: v0.25.1
- Tailscale version: 1.80.3

Runtime environment

  • Headscale is behind a (reverse) proxy: no
  • Headscale runs in a container: no

Debug information

Node

adam added the stale, bug, performance labels 2025-12-29 02:27:00 +01:00
adam closed this issue 2025-12-29 02:27:01 +01:00
Author
Owner

@kradalby commented on GitHub (Mar 20, 2025):

> After 300+ nodes, pinging becomes finicky and about half of the nodes stop responding. They respond again once systems are removed.

Headscale just isn't made for this; throwing more hardware at the problem only works up to a certain point.

After some discussions in Discord, I wrote up "Scaling / How many clients does Headscale support?" (https://headscale.net/development/about/faq/#scaling-how-many-clients-does-headscale-support).

But if you say 300 is the limit, then my example with 1000 might be too high.

> Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 times out of 10. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few Mb per day pass to each client.

I'll try to break this up:

Cannot get nodes: context deadline exceeded: the server is probably pretty busy, and the CLI is waiting on some lock, so the call takes longer than the gRPC timeout. It does not look like we expose an option to configure that timeout, but a longer one might give you an answer "eventually". PRs welcome to make it configurable.
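To make the failure mode concrete, here is a minimal Go sketch of the mechanism, not headscale's actual client code: `slowListNodes` and the 5-second deadline are placeholders standing in for a busy server and the CLI's gRPC deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowListNodes stands in for a server-side handler that cannot answer
// until the busy server gets past whatever it is doing (e.g. a held lock).
func slowListNodes(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // the server would answer eventually
		return nil
	case <-ctx.Done(): // the caller's deadline fires first
		return ctx.Err()
	}
}

func main() {
	// The CLI issues its RPC under a context with a deadline; gRPC surfaces
	// an expired deadline as DeadlineExceeded. The 5s value is illustrative,
	// not headscale's real timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := slowListNodes(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("Cannot get nodes:", err)
		// prints: Cannot get nodes: context deadline exceeded
	}
}
```

A longer deadline only moves the point at which the call gives up; it does not reduce the work the server has to finish before it can reply.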

The server is sitting avg 45% CPU usage with no traffic: traffic isn't really relevant here, since it should go node to node. The server might be spinning on some continuous small change that needs to be pushed to the clients. CPU usage isn't something you can directly map to the internal state of the app; it might be stuck on a lock or similar.

only a few Mb per day is passed to each client: Not that relevant since the traffic goes directly between the clients.
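As a rough, hedged illustration of why control-plane load tracks node count rather than data traffic: assuming each node change has to be pushed to every other connected node (real headscale batches and filters updates, so read this only as the shape of the growth), the update counts look like this:

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope only: assume every node change (endpoint update,
	// key rotation, online/offline flap) is pushed to every other connected
	// node. This ignores batching and filtering, so it is an upper bound on
	// the shape of the growth, not a measurement.
	for _, nodes := range []int{100, 300, 554} {
		perChange := nodes - 1           // map updates sent for a single change
		fullChurn := nodes * (nodes - 1) // updates if every node changes once
		fmt.Printf("%4d nodes: %4d updates per change, %7d for one full churn cycle\n",
			nodes, perChange, fullChurn)
	}
}
```

Going from 300 to 554 nodes almost doubles the per-change fan-out and more than triples the full-churn total, which is consistent with load climbing sharply even though the tunnels themselves carry almost no data.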

> node_update_check_interval

This option does not exist anymore.

> I have one very simple ACL that allows admins access to all nodes, and only one node is an admin.

There are no particular optimisations for ACLs, so it should not matter too much. Surprisingly, if we eventually start adding them, a simpler policy might even be worse for performance, but that is something we can only say in the future.

I would say this isn't so much a bug as "not a feature", at least not yet.

Author
Owner

@github-actions[bot] commented on GitHub (Jun 23, 2025):

This issue is stale because it has been open for 90 days with no activity.

Author
Owner

@github-actions[bot] commented on GitHub (Jun 30, 2025):

This issue was closed because it has been inactive for 14 days since being marked as stale.

Reference: starred/headscale#981