Exponential CPU usage from allowed peer checks #381

Closed
opened 2025-12-29 01:27:57 +01:00 by adam · 9 comments

Originally created by @jblackwood-fes on GitHub (Nov 24, 2022).

Bug description

CPU usage grows exponentially as the number of peers grows, to the point where headscale cannot respond to updates fast enough for clients to remain connected.

This appears to be due to recalculating the allowed peers for every update, which is an O(n) operation for n peers; since an update goes out to every peer, the total work per change grows roughly quadratically with the size of the network. The allowed peer list should be static except when new peers are added, so recomputing the peer lists only when a new peer joins would be a huge performance win.

Enabling ACLs makes this worse because there is more work per peer to check whether it is allowed, but even the namespace-only checks eventually cause performance issues with thousands of peers.
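For illustration only, here is a minimal Go sketch of the pattern described above; the type and function names are hypothetical and are not headscale's actual code:

```
package peercalc

// Node is a hypothetical stand-in for a registered machine.
type Node struct {
	ID        uint64
	Namespace string
}

// peersForNode scans every node on each call, so it is O(n) per update.
// Because an update is pushed to every connected node, one change to the
// network costs O(n^2) in total, which matches the reported CPU growth.
func peersForNode(all []Node, self Node) []Node {
	peers := make([]Node, 0, len(all))
	for _, candidate := range all {
		if candidate.ID == self.ID {
			continue
		}
		// Namespace-only visibility check; evaluating ACL rules here
		// makes each iteration more expensive still.
		if candidate.Namespace == self.Namespace {
			peers = append(peers, candidate)
		}
	}
	return peers
}
```

Caching the result per node, as suggested above, would turn the steady-state cost of each update back into a lookup.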

To Reproduce

Create a network with 400-600 peers. The exact number at which the performance curve becomes a problem depends on the system specs, but on a 4-core server, 600 peers is usually enough to overwhelm the system.
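For anyone trying to reproduce this at scale, a rough sketch of one way to register that many peers is below. The user name, server URL, and hostname are placeholders, and the exact CLI flags differ between headscale and tailscale releases, so check the help output of the version you run:

```
# Create a user (older releases call these namespaces) and a reusable
# pre-auth key; flag names vary between headscale versions.
headscale users create loadtest
KEY=$(headscale preauthkeys create --user loadtest --reusable --expiration 24h)

# Run this from a few hundred clients (containers or VMs work well)
# to register them against the headscale server.
tailscale up --login-server https://headscale.example.com \
  --authkey "$KEY" --hostname "node-$RANDOM"
```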

Context info

adam added the bug label 2025-12-29 01:27:57 +01:00
adam closed this issue 2025-12-29 01:27:57 +01:00

@rjmalagon commented on GitHub (Nov 26, 2022):

Even with 200 peers, CPU usage exceeds a healthy quota.


@kradalby commented on GitHub (Nov 29, 2022):

While we are flattered that people use this for larger installations, our current scope is probably homelabs/small teams, and performance work will come after correctness. We will of course keep this issue around, but I think it is worth clarifying that this isn't really a "bug", as we have not attempted to make things efficient, just "correct".


@jblackwood-fes commented on GitHub (Dec 2, 2022):

I think keeping performance in mind helps to make sure the design can grow/scale.

I've done some testing, and caching peer lists until they need to change (when new peers are added, or when a list has not been loaded yet) can make a huge difference in performance. My code is a bit hackish, but I'm happy to share it with someone as a reference for a better fix.
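A rough Go sketch of that caching idea follows; the names are hypothetical rather than taken from headscale or from the branch mentioned above. The point is that the O(n) scan only happens when a node's list is missing or the node set has changed:

```
package peercache

import "sync"

// Node is a hypothetical stand-in for a registered machine.
type Node struct {
	ID        uint64
	Namespace string
}

// PeerCache memoises each node's allowed-peer list and only
// recomputes it after the node set has changed.
type PeerCache struct {
	mu    sync.Mutex
	nodes []Node
	lists map[uint64][]Node // node ID -> cached peer list
}

func NewPeerCache(nodes []Node) *PeerCache {
	return &PeerCache{nodes: nodes, lists: make(map[uint64][]Node)}
}

// PeersFor returns the cached list when present, otherwise computes
// and stores it. Steady-state updates become cheap lookups.
func (c *PeerCache) PeersFor(self Node) []Node {
	c.mu.Lock()
	defer c.mu.Unlock()
	if peers, ok := c.lists[self.ID]; ok {
		return peers
	}
	peers := make([]Node, 0, len(c.nodes))
	for _, candidate := range c.nodes {
		if candidate.ID != self.ID && candidate.Namespace == self.Namespace {
			peers = append(peers, candidate)
		}
	}
	c.lists[self.ID] = peers
	return peers
}

// AddNode registers a new node and drops every cached list, so the
// O(n) recomputation only happens when membership actually changes.
func (c *PeerCache) AddNode(n Node) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = append(c.nodes, n)
	c.lists = make(map[uint64][]Node)
}
```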


@magkopian commented on GitHub (Jan 22, 2023):

Just wanted to say that we are experiencing the same issue; we have around 150 devices on our Tailnet. While we were at around 130 devices, Headscale was barely consuming any CPU. Now I see it fluctuating between 60% and 80%. I recently updated to v0.18.0 but haven't noticed much of a change in that regard.

With that being said, this is on a VPS with a single CPU and 512 MB of RAM. So, we can definitely add more resources to it if needed. I just thought it would be a good idea to share my own experience.


@qzydustin commented on GitHub (Mar 31, 2023):

> Just wanted to say that we are experiencing the same issue; we have around 150 devices on our Tailnet. While we were at around 130 devices, Headscale was barely consuming any CPU. Now I see it fluctuating between 60% and 80%. I recently updated to v0.18.0 but haven't noticed much of a change in that regard.
>
> With that being said, this is on a VPS with a single CPU and 512 MB of RAM. So, we can definitely add more resources to it if needed. I just thought it would be a good idea to share my own experience.

This experience helps me a lot. Thank you for sharing.


@kradalby commented on GitHub (May 10, 2023):

This should be resolved in the next release.


@magkopian commented on GitHub (May 12, 2023):

> This should be resolved in the next release.

I'm not sure what is going on, but I just updated yesterday to 0.22.2 and the CPU usage actually jumped from around 40%, where it had been for weeks, to close to 100%. And it has been like that for over 16 hours so far.

Today, I updated to 0.22.3 in the hopes that the issue is fixed but unfortunately nothing changed. Any guidance on how to troubleshoot this? Also, would it be safe to downgrade back to 0.22.1?


@kradalby commented on GitHub (May 12, 2023):

Running 0.22.1 should not be a problem; there have not been any database migrations.

Could you capture a profile of the CPU usage and upload it? https://github.com/juanfont/headscale/blob/b01f1f1867136d9b2d7b1392776eb363b482c525/cmd/headscale/headscale.go#L15-L16
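Once a profile has been written, it can be inspected with the standard Go tooling; the path below assumes the profiling directory that comes up later in this thread:

```
# Print the functions consuming the most CPU time, or open an
# interactive web view of the profile.
go tool pprof -top /var/log/headscale/profiling/cpu.pprof
go tool pprof -http=:8080 /var/log/headscale/profiling/cpu.pprof
```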


@magkopian commented on GitHub (May 13, 2023):

I created a directory /var/log/headscale/profiling/ and added the following to the headscale.service file:

Environment="HEADSCALE_PROFILING_ENABLED=1"
Environment="HEADSCALE_PROFILING_PATH=/var/log/headscale/profiling"

However, when I tried restarting headscale I got the following error:

May 14 02:51:42 headscale headscale[54542]: 2023/05/14 02:51:42 profile: could not create cpu profile "/var/log/headscale/profiling/cpu.pprof": open /var/log/headscale/profiling/cpu.pprof: read-only file system

Have I misunderstood what you were asking me to do?
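The "read-only file system" error typically comes from systemd sandboxing (for example ProtectSystem=strict in the shipped unit) rather than from the actual mount. Assuming that is the cause here, a drop-in along these lines is one way to allow the writes; the path matches the directory used above:

```
# Created via `systemctl edit headscale`; restart the service afterwards.
[Service]
ReadWritePaths=/var/log/headscale/profiling
```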

Reference: starred/headscale#381