0.22.1 uses way, way more memory than 0.21? #491

Closed
opened 2025-12-29 02:19:02 +01:00 by adam · 13 comments

Originally created by @linsomniac on GitHub (Apr 28, 2023).

I was running 0.21 on an instance with 2GB of RAM. I upgraded to 0.22.1 and it immediately thrashed itself to death. I upgraded the instance to 4GB and it still starts thrashing fairly quickly. At the moment AWS won't let me upgrade it to 16GB. I have ~100 nodes in my headscale.

Is this known and expected?

adam added the bug label 2025-12-29 02:19:02 +01:00
adam closed this issue 2025-12-29 02:19:02 +01:00

@loprima-l commented on GitHub (Apr 28, 2023):

As far as I know, there has only been CPU trouble on large installations, but that sounds "normal" since Headscale isn't optimized for large installations yet.

Are you sure the problem came from RAM?


@linsomniac commented on GitHub (Apr 28, 2023):

Yep, I'm sure the problem was RAM. I was getting OOM messages on the console.

I've regularly run into memory issues. I was originally running on a 1GB machine, but started having both CPU and RAM issues when I added ~100 nodes, so I upped it. I do have fairly high disc I/O; I had to reduce the update frequency, going from a 10s to a 30s interval I think.

vmstat during this shows free memory (free+buf+cache) going down to ~100MB, and after I kill headscale it goes back up to 3.6GB free. During that time it was doing super heavy "blocks in" and I/O wait CPU time was ~80%, so heavy disc activity, heavy reads, heavy memory use.

Reminder: I was running 0.21 in 2GB on this system, installed 0.22.1 and restarted headscale, and started getting OOMs. I doubled the RAM and was still getting OOMs. I switched back to 0.21, have now been running for several hours, and have 3GB free, 470M in buff/cache, and 396MB "used".

Seems to point to 0.22.1 having dramatically higher memory use.
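
To pin the memory on the headscale process itself (rather than on page cache), something like the following can run alongside vmstat. This is only a sketch; it assumes headscale runs under systemd as headscale.service, so adjust the unit name to your setup.

    # Confirm the OOM killer fired and watch headscale's resident memory.
    dmesg -T | grep -i 'out of memory'              # kernel OOM-kill messages
    journalctl -u headscale.service --since "-1h"   # headscale's own logs around the incident
    while true; do
        ps -o pid,rss,vsz,comm -C headscale         # RSS/VSZ (KB) of the headscale process
        systemctl show headscale.service -p MemoryCurrent
        sleep 10
    done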


@loprima-l commented on GitHub (Apr 28, 2023):

Have you successfully rolled your system back to the previous version? I think that's the better option for now.

I think your issue is related to another issue that we chose not to fix yet.
Fixing those performance issues means a lot to me, since big environments make it easier to find bugs, but it can't be our priority. I'm going to look into it as soon as possible.


@loprima-l commented on GitHub (Apr 28, 2023):

Also, can you tell us a bit more about your Headscale instance: why are you using Headscale, and who are your users? Is it a prod environment? Etc.

I'm interested to know what kinds of large infrastructure are using Headscale.


@loprima-l commented on GitHub (Apr 29, 2023):

Hi, I think you should give #1377 a try if you have a bunch of ACLs, because with 100+ machines you probably have a lot of them.
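
If you want to test the patch before a release, it can be fetched straight from the pull request ref and built locally. A sketch, assuming a Go toolchain is available:

    # Fetch and build the patch from PR #1377 (GitHub exposes pull/<id>/head refs).
    git clone https://github.com/juanfont/headscale.git
    cd headscale
    git fetch origin pull/1377/head:pr-1377
    git checkout pr-1377
    go build -o headscale ./cmd/headscale     # binary lands at ./headscale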


@linsomniac commented on GitHub (Apr 29, 2023):

Yes, I have successfully returned to 0.21; I just had to wait for the OOM killer to make the system responsive enough to get a window to stop headscale and revert.

Why am I using headscale? I couldn't get buy-in to purchase Tailscale.

Size of ACLs: I have 3 groups, 5 subnets, 22 ACL rules, my entire acls.yaml is ~170 lines.

"headscale node list | wc" is 115 lines.

My environment is dev, staging, and production, mostly virtual machines and some AWS EC2 instances, mostly Linux. I deployed tailscale to all the dev/stg instances and a handful of production instances (mostly administrative things, plus the firewalls as subnet routers). The users are primarily me and one of the other operations people; I'm still in proof-of-concept mode. The longer-term plan would be to bring on the ~8 developers and maybe a couple of QA people, maybe up to 10 more.
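
For anyone comparing against their own deployment, the sizing numbers above can be gathered with something like this. It's a sketch: the acls.yaml path is whatever your config points at, and the grep is only a rough rule count.

    headscale node list | wc -l                    # node table rows (includes header lines)
    wc -l /etc/headscale/acls.yaml                 # total policy lines
    grep -c 'action:' /etc/headscale/acls.yaml     # approximate number of ACL rules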


@linsomniac commented on GitHub (Apr 29, 2023):

I've switched my EC2 instance to a t3a.xlarge with 16GB of RAM, restarted headscale with 0.22.1, and watched free memory dip down to 2GB before gradually returning to 14GB. Here's a sampling of vmstat output during this run:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 15365232  50972 477732    0    0 23264   100 2989 3572  3  3 86  7  1
 3  0      0 14937296  51044 478832    0    0    80   380 6240 2122 30  4 63  1  2
 4  0      0 13252048  51164 479000    0    0     0    88 77691 1149 85 15  0  0  0
 4  0      0 10792780  51328 479028    0    0     0     0 66285 2253 87 12  0  0  0
 4  0      0 8397076  51516 479520    0    0     0     0 55892  977 92  8  0  0  0
 4  0      0 6178920  51604 479560    0    0     0    80 49133  925 92  8  0  0  0
 4  0      0 3857496  51776 479688    0    0     0    28 82437 1706 86 13  1  0  0
 4  0      0 2100900  52272 479644    0    0     0   696 2430 1715 97  2  0  1  0
 3  1      0 3500568  53408 480084    0    0     0   376 3453 1316 98  1  0  1  0
 5  0      0 4365412  53964 480512    0    0     0   192 4178 1425 97  3  0  0  0
 4  0      0 5963896  54880 480664    0    0     0   416 2551 1712 97  2  1  0  0
 4  0      0 6821424  55412 480696    0    0     0   392 3425 1373 98  2  0  1  0
 5  0      0 12377728  56956 480728    0    0     0  1944 7442 4131 83  3 11  3  0
 1  0      0 14483708  58000 480820    0    0     0   348 3359 1662 36  3 59  1  2
 0  0      0 14493176  58372 480868    0    0     0   180 1142 1550  3  1 95  0  0

Looks like it does that every time I restart it (I was wondering if it was one-time housekeeping).

Maybe it's some combination of 110-ish hosts and 20-ish ACLs? But something changed between 0.21, which I've been able to run successfully in 2GB of RAM, and 0.22.1, which requires ~14GB.
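
To capture that spike repeatably across restarts, something like this works as a rough harness. It's a sketch, assuming headscale runs as a systemd unit named headscale.service and that ten minutes is enough for the spike to play out.

    # Log memory behaviour across a headscale restart.
    vmstat -t 5 > vmstat-restart.log &    # -t adds a timestamp column
    VMSTAT_PID=$!
    sudo systemctl restart headscale.service
    sleep 600                             # let the post-restart spike settle
    kill "$VMSTAT_PID"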


@loprima-l commented on GitHub (Apr 30, 2023):

Thanks for your reply, have you tried the patch in #1377?


@linsomniac commented on GitHub (May 1, 2023):

I haven't; I'll probably give it a try this evening.


@linsomniac commented on GitHub (May 2, 2023):

It looks like #1377 has been merged into main, so I grabbed that and built it, and it does indeed seem to have solved the memory issue.


@loprima-l commented on GitHub (May 2, 2023):

Super! Is the performance better or worse than on 0.21?


@linsomniac commented on GitHub (May 2, 2023):

I only ran it a little bit, but performance seemed similar to 0.21. I really didn't do much testing of it. I had kind of a janky build, built against libraries in /nix, and I decided to go back to running 0.21 for the moment, until the next release comes out. I couldn't seem to get the build to work, or at least couldn't find the resulting binary; when I did "go build", it wasn't writing to ~/go/bin like I was expecting.
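
For what it's worth, "go build" never writes to ~/go/bin; it drops the binary in the current directory (or wherever -o points), and "go install" is the command that populates $GOBIN (default ~/go/bin). A minimal sketch, assuming a checkout of the headscale repo:

    cd headscale                              # checkout of github.com/juanfont/headscale
    go build -o headscale ./cmd/headscale     # binary: ./headscale
    go install ./cmd/headscale                # binary: ~/go/bin/headscale (or $GOBIN)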


@kradalby commented on GitHub (May 10, 2023):

We will release with #1377 in a bit; please test that and reopen if it is still an issue.
