[Bug] 'headscale' commands unusable under load #981

Closed
opened 2025-12-29 02:27:00 +01:00 by adam · 3 comments
Owner

Originally created by @arduino43 on GitHub (Mar 19, 2025).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I've noticed that Headscale becomes nearly unusable beyond roughly 300 clients. I now have 554 after switching to a more powerful system, and it is completely maxed out. All clients have identical hardware specs and run Debian.

Headscale server (dedicated)
CPU: AMD EPYC 7313
Memory: 128GB
Network: 5Gbps
Headscale version: v0.25.1

1.) Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 times out of 10. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few Mb per day pass to each client.

2.) After 300+ nodes, pinging becomes finicky and about half of the nodes stop responding. They respond again once systems are removed.

3.) I have one very simple ACL that allows admins access to all nodes, and only one node is an admin.

4.) The config is basic; the only addition is "node_update_check_interval: 90s" in an attempt to minimize load, but I'm not seeing much difference with it enabled or disabled.

I did see a few existing issues regarding CPU usage, but most were resolved by updates. I realize this is a large number of clients; however, with no traffic passing, I was expecting much lower load and fewer intermittent issues.

Expected Behavior

System runs without issue

Steps To Reproduce

1.) Add clients to the server; after 300+ clients the system stops functioning correctly.

Environment

- OS: Debian 12
- Headscale version: v0.25.1
- Tailscale version: 1.80.3

Runtime environment

  • Headscale is behind a (reverse) proxy: no
  • Headscale runs in a container: no

Debug information

Node

adam added the stale, bug, performance labels 2025-12-29 02:27:00 +01:00
adam closed this issue 2025-12-29 02:27:01 +01:00
Author
Owner

@kradalby commented on GitHub (Mar 20, 2025):

> After 300+ nodes, pinging becomes finicky and about half of the nodes stop responding. They respond again once systems are removed.

Headscale just isn't made for this; throwing more hardware at the problem only works up to a certain point.

After some discussions in Discord, I wrote up "Scaling / How many clients does Headscale support?" (https://headscale.net/development/about/faq/#scaling-how-many-clients-does-headscale-support).

But if you say 300 is the limit, then my example with 1000 might be too high.

> Running the headscale CLI results in "Cannot get nodes: context deadline exceeded" 9 times out of 10. The server sits at an average of 45% CPU usage with no traffic; the connected clients are for management only, and only a few Mb per day pass to each client.

I'll try to break this up:

Cannot get nodes: context deadline exceeded: the server is probably pretty busy, and the CLI is waiting on some lock, so the call takes longer than the gRPC timeout. It does not look like we expose an option to configure that timeout, but a longer one might give you an answer "eventually". PRs welcome to make it configurable.
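To make the failure mode concrete, here is a minimal Go sketch of the mechanism, not headscale's actual client code: `slowListNodes` and the 5-second deadline are placeholders standing in for a busy server and the CLI's gRPC deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowListNodes stands in for a server-side handler that cannot answer
// until the busy server gets past whatever it is doing (e.g. a held lock).
func slowListNodes(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // the server would answer eventually
		return nil
	case <-ctx.Done(): // the caller's deadline fires first
		return ctx.Err()
	}
}

func main() {
	// The CLI issues its RPC under a context with a deadline; gRPC surfaces
	// an expired deadline as DeadlineExceeded. The 5s value is illustrative,
	// not headscale's real timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := slowListNodes(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("Cannot get nodes:", err)
		// prints: Cannot get nodes: context deadline exceeded
	}
}
```

A longer deadline only moves the point at which the call gives up; it does not reduce the work the server has to finish before it can reply.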

The server is sitting avg 45% CPU usage with no traffic: traffic isn't really relevant here, since it should go node to node. The server might be spinning on some continuous small change that needs to be pushed to the clients. CPU usage isn't something you can directly map to the internal state of the app; it might be stuck on a lock or similar.

only a few Mb per day is passed to each client: Not that relevant since the traffic goes directly between the clients.
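As a rough, hedged illustration of why control-plane load tracks node count rather than data traffic: assuming each node change has to be pushed to every other connected node (real headscale batches and filters updates, so read this only as the shape of the growth), the update counts look like this:

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope only: assume every node change (endpoint update,
	// key rotation, online/offline flap) is pushed to every other connected
	// node. This ignores batching and filtering, so it is an upper bound on
	// the shape of the growth, not a measurement.
	for _, nodes := range []int{100, 300, 554} {
		perChange := nodes - 1           // map updates sent for a single change
		fullChurn := nodes * (nodes - 1) // updates if every node changes once
		fmt.Printf("%4d nodes: %4d updates per change, %7d for one full churn cycle\n",
			nodes, perChange, fullChurn)
	}
}
```

Going from 300 to 554 nodes almost doubles the per-change fan-out and more than triples the full-churn total, which is consistent with load climbing sharply even though the tunnels themselves carry almost no data.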

> node_update_check_interval

This option does not exist anymore.

> I have one very simple ACL that allows admins access to all nodes, and only one node is an admin.

There are no particular optimisations for ACLs, so it should not matter too much. Surprisingly, if we eventually start adding them, a simpler policy might even be worse for performance, but that is something we can only say in the future.

I would say this isn't so much a bug as "not a feature", at least not yet.

Author
Owner

@github-actions[bot] commented on GitHub (Jun 23, 2025):

This issue is stale because it has been open for 90 days with no activity.

Author
Owner

@github-actions[bot] commented on GitHub (Jun 30, 2025):

This issue was closed because it has been inactive for 14 days since being marked as stale.

Reference: starred/headscale#981