Mirror of https://github.com/juanfont/headscale.git, synced 2026-01-11 20:00:28 +01:00
headscale server stopped answering after a day of uptime despite listening on all the ports #565
Closed
opened 2025-12-29 02:20:29 +01:00 by adam · 37 comments
Originally created by @axxonadmin on GitHub (Oct 11, 2023).
Went to a local shop and tried to connect to my remote headscale server (v0.23.0-alpha1) that I had gotten working several days ago.
The Tailscale client (1.48.2, Android) got stuck at "connecting",
so I logged out and tried to log in again; it also got stuck at this stage with no output or warning.
I got home and tried to connect from my desktop computer (1.51, Windows 11) with no success (it got stuck opening listen_addr and port in the browser to authenticate).
Then I connected to my server over SSH and ran 'headscale apikeys list' and 'headscale nodes list' with no success. After restarting the headscale server with 'service headscale stop' / 'service headscale start', everything started working just fine.
Is there anything I can do to help you investigate this issue if it happens again?
What I've found so far:
tcpdump on port 8080 (which is my 'listen_addr') showed something like this:
I tried to restart headscale with 'service headscale stop' and the logs went like this:
Before that, the logs were full of:
Oct 9 20:30:15 server headscale[482461]: 2023-10-09T20:30:15Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > Waiting for update on stream channel node=MikroTik node_key=removed_key_hash_data noise=true omitPeers=false readOnly=false stream=true
@rjmalagon commented on GitHub (Oct 28, 2023):
+1. After ~20 minutes headscale is stuck with timeouts while updating nodes.
@rjmalagon commented on GitHub (Oct 28, 2023):
After carefully checking my headscale VM's stats, I found that it is very RAM sensitive.
I serve ~200+ nodes with a 2 GB RAM / 4 CPU VM.
Sometimes headscale has a very brief RAM peak, but it usually stays below 1 GB of RAM usage.
I just added a little swapfile and it worked well. I am now experimenting with zram swap compression (it seems to work as well).
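For reference, adding a small swapfile on a typical Linux VM, as described above, is only a few commands; this is a generic sketch (the 2G size and the /swapfile path are illustrative, not taken from the comment):

    # Create and enable a 2G swapfile.
    fallocate -l 2G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    # Make it persistent across reboots.
    echo '/swapfile none swap sw 0 0' >> /etc/fstab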
@rjmalagon commented on GitHub (Oct 30, 2023):
I was wrong about this: headscale sometimes gets stuck even with enough memory and CPU.
Usually it gets stuck with multiple "waiting for update on stream channel" messages.
I set up a Caddy reverse proxy to get connection logs, and I see many 502s on the tailscale-control-protocol upgrade when headscale stops responding.
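For reference, a minimal Caddy site block that proxies headscale and records access logs might look like the sketch below; the hostname and upstream address are assumptions, and Caddy forwards the protocol upgrade on its own:

    headscale.example.com {
        # Access logging makes the 502s on the control-protocol upgrade visible.
        log
        reverse_proxy localhost:8080
    }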
@Nickiel12 commented on GitHub (Nov 29, 2023):
+1. Running on a NixOS server with 24 GB of RAM (so RAM isn't the issue). Headscale randomly (from what I can tell so far) gets stuck using 3% of the CPU, takes 5 minutes for systemctl to restart, and doesn't allow new connections.
@kradalby commented on GitHub (Nov 30, 2023):
I think this might be fixed with the tip of #1564, could you test?
@Nickiel12 commented on GitHub (Dec 1, 2023):
@kradalby I would like to test the change that you have made; however, I'm having an awful time trying to use the headscale flake. I've been using the nixpkgs version, and I don't quite have my head wrapped around how flakes and Nix all work. I saw that you use the headscale flake in your configuration, and was wondering how your configuration uses the flake version of headscale instead of the nixpkgs version? I'm assuming it has to do with the overlay, but I tried putting overlay = [ headscale.overlay ] in my flake.nix (after adding it as an input) and it did not change what version was installed. If I can figure out how to switch to the flake version, I will be happy to test this change.
I wasn't inheriting the overlay-ed pkgs to the server host. I'm switching to the patched version of the headscale server and will let you know how it works in a few days.
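For other NixOS users with the same question: the flake's package has to be applied through the host's nixpkgs.overlays option rather than a bare overlay attribute. A rough sketch, assuming the flake exposes the overlay output referenced in the comment above:

    {
      inputs = {
        nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
        headscale.url = "github:juanfont/headscale";
      };

      outputs = { self, nixpkgs, headscale, ... }: {
        nixosConfigurations.server = nixpkgs.lib.nixosSystem {
          system = "x86_64-linux";
          modules = [
            # Let the flake's headscale package shadow the one from nixpkgs.
            { nixpkgs.overlays = [ headscale.overlay ]; }
            ./configuration.nix
          ];
        };
      };
    }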
@Nickiel12 commented on GitHub (Dec 4, 2023):
I'm not sure if this is the right thread for this, but while watching systemctl status headscale during systemctl restart headscale, I noticed that it appears to have closed the DB (SQLite for my instance) before shutting down other threads properly, and it took almost two minutes for headscale to restart. I don't know if this is related, but if the SQLite database can disconnect before the rest of the application shuts down, it might be causing this issue.
@kradalby commented on GitHub (Dec 10, 2023):
0.23.0-alpha2 addresses a series of issues with node synchronisation, online status, and subnet routers; please test this release and report back if the issue still persists.
@Nickiel12 commented on GitHub (Dec 16, 2023):
@kradalby The issue of the headscale server not responding has gone away. I've noticed a weird issue with the app where I need to log out, change the server and save it, then log back in before I can connect. But I have not had to restart the headscale server at all since I switched to the alpha release.
@Nickiel12 commented on GitHub (Dec 16, 2023):
well, scratch that, now I keep getting prompted to re-register my phone when I try to connect to the server. But I'm not getting the same issue as before where I couldn't log on at all and the headscale service would hang.
@rjmalagon commented on GitHub (Dec 19, 2023):
Hi, the stability and responsiveness are much better. Alpha 1 would easily get stuck under some stress (300+ nodes) in just minutes. This new alpha is robust enough to handle the same workload for 20+ hours without issues. Thanks @kradalby for the progress and the follow-up on this issue.
I can share any info if needed to help with the development of v0.23.
@jwischka commented on GitHub (Dec 20, 2023):
I'm experiencing something similar on 0.23-alpha2 - I haven't had time to debug yet, but I've had to restart headscale twice on a moderately sized install (~30 nodes) in the last two days.
@Nickiel12 commented on GitHub (Dec 23, 2023):
I deleted my phone node, re-registered it, and I have not had any issues connecting in two days since!
@cfouche3005 commented on GitHub (Dec 27, 2023):
Same issue on 0.23-alpha2 after one hour; I will check whether the issue reappears.
@cfouche3005 commented on GitHub (Dec 27, 2023):
I can confirm the issue is still present: after one hour or so, there is no response from the gRPC or HTTP API, but headscale still works (I haven't tested whether I can register new nodes).
@cfouche3005 commented on GitHub (Dec 28, 2023):
I can also confirm that new or disconnected devices cannot join/reconnect to the tailnet.
@cfouche3005 commented on GitHub (Dec 29, 2023):
I have to add that this bug is very inconsistent: today it did not appear, but yesterday it was there.
@dustinblackman commented on GitHub (Feb 7, 2024):
I'm finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with Postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like https://github.com/juanfont/headscale/pull/1701 may solve it.
In the meantime, I've put together a small/lazy systemd service that acts as a healthcheck, rebooting Headscale if it locks up. Taking yesterday's data, this fired 9 times with 30 nodes in a for-fun containerized testing environment.
headscale-health
headscale-health.service
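The headscale-health script and unit referenced above are not reproduced in this mirror. The following is only an illustrative sketch of such a watchdog, not the original attachment; the 15-second timeout, paths, and the 5-minute timer interval are assumptions:

    #!/bin/sh
    # /usr/local/bin/headscale-health (hypothetical): if the CLI cannot list
    # nodes within 15 seconds, assume headscale is wedged and restart it.
    if ! timeout 15 headscale nodes list >/dev/null 2>&1; then
        systemctl restart headscale
    fi

    # headscale-health.service (illustrative)
    [Unit]
    Description=Restart headscale if the CLI stops answering

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/headscale-health

    # headscale-health.timer (illustrative): run the check every 5 minutes
    [Unit]
    Description=Periodic headscale healthcheck

    [Timer]
    OnCalendar=*:0/5
    Unit=headscale-health.service

    [Install]
    WantedBy=timers.target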
@kradalby commented on GitHub (Feb 7, 2024):
@TotoTheDragon Could you have a look at this? I won't have time to get around to it for a while.
@TotoTheDragon commented on GitHub (Feb 7, 2024):
@kradalby I have been looking into this issue for the past day and agree with dustin that #1701 is a good contender for solving this issue and #1656. This issue was present before the referenced commit.
I do not have the infrastructure to test with 30 or so nodes to recreate the issue, but we could make a build that includes the changes from #1701, and hopefully dustin is able to test it to see whether it makes any difference.
@dustinblackman commented on GitHub (Feb 7, 2024):
I'm down! If https://github.com/juanfont/headscale/pull/1701 is considered complete in its current state, I can ship it and see how it does.
@kradalby commented on GitHub (Feb 7, 2024):
I would say that it is complete, but complete as in the tip of main: not tested sufficiently to release as a version. But I interpret the fact that you are currently running main as meaning your risk appetite is fine with that.
@dustinblackman commented on GitHub (Feb 7, 2024):
More or less; it helps that the codebase is easy to read, so at least I know what I'm bringing in off main. I'll give this a shot either this week or next :)
@TotoTheDragon commented on GitHub (Feb 12, 2024):
@dustinblackman Would you be able to test with the current version of main?
@dustinblackman commented on GitHub (Feb 12, 2024):
@TotoTheDragon I've been running from 83769ba715 for the last four days. At first it looked like all was good, but looking at the logs I'm still seeing lockups, though fewer. I can try from the latest master later in the week.
I also have a set of scripts for a local cluster that I had written for https://github.com/juanfont/headscale/issues/1725. I can look to PR them if you think they'd be helpful in debugging this.
@TotoTheDragon commented on GitHub (Feb 13, 2024):
@dustinblackman Seeing as headscale nodes list gets a context exceeded error, maybe we can add a bunch of traces within the command and see where it gets stuck. This will help rule some stuff out.
@kradalby commented on GitHub (Feb 15, 2024):
Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4
@dustinblackman commented on GitHub (Feb 16, 2024):
@kradalby I ran this for about two hours and saw no reboots, but I experienced issues where some newly added ephemeral nodes were unable to communicate over the network (port 443 requests), even with tailscale ping showing a direct connection. I'm wondering if nodes are not always being notified when a new node joins the network.
I'm going to try again in a localized environment and see if I can repro it.
@kradalby commented on GitHub (Feb 17, 2024):
Thank you @dustinblackman, that's helpful; it does sound like there are some missing updates. There is a debug env flag you can turn on which will dump all the map responses sent. If you can repro, that would potentially be helpful info, but it produces a lot of data and might not be suitable if you have a lot of nodes.
You can play around with that by setting HEADSCALE_DEBUG_DUMP_MAPRESPONSE_PATH to somewhere on your system.
@kradalby commented on GitHub (Feb 19, 2024):
Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?
@dustinblackman commented on GitHub (Feb 21, 2024):
@kradalby No reboots again, but after 30 minutes I get several lines such as the following. Couldn't prove they were actually causing issues. I'll test further.
@dustinblackman commented on GitHub (Feb 23, 2024):
I'm unable to repro this in a local cluster. :(
@kradalby commented on GitHub (Apr 17, 2024):
Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?
@kradalby commented on GitHub (Apr 29, 2024):
I think the latest alpha should have improved this a lot, can someone experiencing this give it a try?
@dustinblackman commented on GitHub (May 7, 2024):
I'll look to give this a spin this week if I can slot it in :)
@dustinblackman commented on GitHub (May 10, 2024):
Been running this for a little over a day with no issues! Amazing work, thank you! I appreciate all the effort.
@ohdearaugustin commented on GitHub (May 18, 2024):
Will close this as fixed.