Exponential CPU usage from allowed peer checks #381

Closed
opened 2025-12-29 01:27:57 +01:00 by adam · 9 comments

Originally created by @jblackwood-fes on GitHub (Nov 24, 2022).

Bug description

CPU usage grows exponentially as the number of peers grows, to the point where headscale cannot respond to updates fast enough for clients to remain connected.

This appears to be due to recalculating the allowed peers for every update, which is an O(n) operation for n peers; since an update goes out to every peer, the total work per change grows roughly quadratically with the size of the network. The allowed peer list should be static except when new peers are added, so recomputing the peer lists only when a new peer joins would be a huge performance win.

Enabling ACLs makes this worse because there is more work per peer to check whether it is allowed, but even the namespace-only checks eventually cause performance issues with thousands of peers.
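For illustration only, here is a minimal Go sketch of the pattern described above; the type and function names are hypothetical and are not headscale's actual code:

```
package peercalc

// Node is a hypothetical stand-in for a registered machine.
type Node struct {
	ID        uint64
	Namespace string
}

// peersForNode scans every node on each call, so it is O(n) per update.
// Because an update is pushed to every connected node, one change to the
// network costs O(n^2) in total, which matches the reported CPU growth.
func peersForNode(all []Node, self Node) []Node {
	peers := make([]Node, 0, len(all))
	for _, candidate := range all {
		if candidate.ID == self.ID {
			continue
		}
		// Namespace-only visibility check; evaluating ACL rules here
		// makes each iteration more expensive still.
		if candidate.Namespace == self.Namespace {
			peers = append(peers, candidate)
		}
	}
	return peers
}
```

Caching the result per node, as suggested above, would turn the steady-state cost of each update back into a lookup.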

To Reproduce

Create a network with 400-600 peers. The exact number at which the performance curve becomes a problem depends on the system specs, but on a 4-core server, 600 peers is usually enough to overwhelm the system.
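For anyone trying to reproduce this at scale, a rough sketch of one way to register that many peers is below. The user name, server URL, and hostname are placeholders, and the exact CLI flags differ between headscale and tailscale releases, so check the help output of the version you run:

```
# Create a user (older releases call these namespaces) and a reusable
# pre-auth key; flag names vary between headscale versions.
headscale users create loadtest
KEY=$(headscale preauthkeys create --user loadtest --reusable --expiration 24h)

# Run this from a few hundred clients (containers or VMs work well)
# to register them against the headscale server.
tailscale up --login-server https://headscale.example.com \
  --authkey "$KEY" --hostname "node-$RANDOM"
```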

Context info

adam added the bug label 2025-12-29 01:27:57 +01:00
adam closed this issue 2025-12-29 01:27:57 +01:00

@rjmalagon commented on GitHub (Nov 26, 2022):

Even with 200 peers, CPU usage exceeds a healthy quota.


@kradalby commented on GitHub (Nov 29, 2022):

While we are flattered that people use this for larger installations, our current scope is probably homelabs/small teams, and performance work will come after correctness. We will of course keep this issue around, but I think it is worth clarifying that this isn't really a "bug", as we have not attempted to make things efficient, just "correct".


@jblackwood-fes commented on GitHub (Dec 2, 2022):

I think keeping performance in mind helps to make sure the design can grow/scale.

I've done some testing, and caching peer lists until they need to change (when new peers are added, or when a list has not been loaded yet) can make a huge difference in performance. My code is a bit hackish, but I'm happy to share it with someone as a reference for a better fix.
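A rough Go sketch of that caching idea follows; the names are hypothetical rather than taken from headscale or from the branch mentioned above. The point is that the O(n) scan only happens when a node's list is missing or the node set has changed:

```
package peercache

import "sync"

// Node is a hypothetical stand-in for a registered machine.
type Node struct {
	ID        uint64
	Namespace string
}

// PeerCache memoises each node's allowed-peer list and only
// recomputes it after the node set has changed.
type PeerCache struct {
	mu    sync.Mutex
	nodes []Node
	lists map[uint64][]Node // node ID -> cached peer list
}

func NewPeerCache(nodes []Node) *PeerCache {
	return &PeerCache{nodes: nodes, lists: make(map[uint64][]Node)}
}

// PeersFor returns the cached list when present, otherwise computes
// and stores it. Steady-state updates become cheap lookups.
func (c *PeerCache) PeersFor(self Node) []Node {
	c.mu.Lock()
	defer c.mu.Unlock()
	if peers, ok := c.lists[self.ID]; ok {
		return peers
	}
	peers := make([]Node, 0, len(c.nodes))
	for _, candidate := range c.nodes {
		if candidate.ID != self.ID && candidate.Namespace == self.Namespace {
			peers = append(peers, candidate)
		}
	}
	c.lists[self.ID] = peers
	return peers
}

// AddNode registers a new node and drops every cached list, so the
// O(n) recomputation only happens when membership actually changes.
func (c *PeerCache) AddNode(n Node) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = append(c.nodes, n)
	c.lists = make(map[uint64][]Node)
}
```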


@magkopian commented on GitHub (Jan 22, 2023):

Just wanted to say that we are experiencing the same issue; we have around 150 devices on our Tailnet. While we were at around 130 devices, Headscale was barely consuming any CPU. Now I see it fluctuating between 60% and 80%. I recently updated to v0.18.0 but haven't noticed much of a change in that regard.

With that being said, this is on a VPS with a single CPU and 512 MB of RAM. So, we can definitely add more resources to it if needed. I just thought it would be a good idea to share my own experience.


@qzydustin commented on GitHub (Mar 31, 2023):

> Just wanted to say that we are experiencing the same issue; we have around 150 devices on our Tailnet. While we were at around 130 devices, Headscale was barely consuming any CPU. Now I see it fluctuating between 60% and 80%. I recently updated to v0.18.0 but haven't noticed much of a change in that regard.
>
> With that being said, this is on a VPS with a single CPU and 512 MB of RAM. So, we can definitely add more resources to it if needed. I just thought it would be a good idea to share my own experience.

This experience helps me a lot. Thank you for sharing.


@kradalby commented on GitHub (May 10, 2023):

This should be resolved in the next release.


@magkopian commented on GitHub (May 12, 2023):

> This should be resolved in the next release.

I'm not sure what is going on, but I just updated yesterday to 0.22.2 and the CPU usage actually jumped from around 40%, where it had been for weeks, to close to 100%. And it has been like that for over 16 hours so far.

Today, I updated to 0.22.3 in the hopes that the issue is fixed but unfortunately nothing changed. Any guidance on how to troubleshoot this? Also, would it be safe to downgrade back to 0.22.1?


@kradalby commented on GitHub (May 12, 2023):

Running 0.22.1 should not be a problem; there have not been any database migrations.

Could you capture a profile of the CPU usage and upload it? https://github.com/juanfont/headscale/blob/b01f1f1867136d9b2d7b1392776eb363b482c525/cmd/headscale/headscale.go#L15-L16
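Once a profile has been written, it can be inspected with the standard Go tooling; the path below assumes the profiling directory that comes up later in this thread:

```
# Print the functions consuming the most CPU time, or open an
# interactive web view of the profile.
go tool pprof -top /var/log/headscale/profiling/cpu.pprof
go tool pprof -http=:8080 /var/log/headscale/profiling/cpu.pprof
```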


@magkopian commented on GitHub (May 13, 2023):

I created a directory /var/log/headscale/profiling/ and added the following to the headscale.service file:

Environment="HEADSCALE_PROFILING_ENABLED=1"
Environment="HEADSCALE_PROFILING_PATH=/var/log/headscale/profiling"

However, when I tried restarting headscale I got the following error:

May 14 02:51:42 headscale headscale[54542]: 2023/05/14 02:51:42 profile: could not create cpu profile "/var/log/headscale/profiling/cpu.pprof": open /var/log/headscale/profiling/cpu.pprof: read-only file system

Have I misunderstood what you were asking me to do?
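The "read-only file system" error typically comes from systemd sandboxing (for example ProtectSystem=strict in the shipped unit) rather than from the actual mount. Assuming that is the cause here, a drop-in along these lines is one way to allow the writes; the path matches the directory used above:

```
# Created via `systemctl edit headscale`; restart the service afterwards.
[Service]
ReadWritePaths=/var/log/headscale/profiling
```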

Reference: starred/headscale#381