Headscale service restart timing out #566

Closed
opened 2025-12-29 02:20:29 +01:00 by adam · 8 comments
Owner

Originally created by @lgrn on GitHub (Oct 12, 2023).

Bug description

After running systemctl restart headscale:

headscale[7497]: 2023-10-12T09:05:33Z INF Received signal to stop, shutting down gracefully signal=terminated
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF Headscale stopped
systemd[1]: Stopping headscale coordination server for Tailscale...

These types of logs show up immediately afterward:

headscale[7497]: 2023-10-12T09:05:33Z ERR Failed to fetch machine from the database with node key: nodekey: (...)
(...)
headscale[7497]: 2023-10-12T09:05:36Z ERR error getting routes error="sql: database is closed"
headscale[7497]: 2023-10-12T09:05:36Z ERR Error listing users error="sql: database is closed"

This finally ends with:

Oct 12 09:07:03 (...) systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7497 (headscale) with signal SIGKILL.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7499 (headscale) with signal SIGKILL.
(...)
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Failed with result 'timeout'.
Oct 12 09:07:03 (...) systemd[1]: Stopped headscale coordination server for Tailscale.
(...)

So for some reason SIGKILL is finally required and the restart takes about 90 seconds.

Is this a known issue?

Environment

# headscale version
v0.22.3
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
# uname -a
Linux (...) 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

To Reproduce

N/A

Originally created by @lgrn on GitHub (Oct 12, 2023). ## Bug description After running `systemctl restart headscale`: ``` headscale[7497]: 2023-10-12T09:05:33Z INF Received signal to stop, shutting down gracefully signal=terminated headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...) headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...) headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...) headscale[7497]: 2023-10-12T09:05:33Z INF Headscale stopped systemd[1]: Stopping headscale coordination server for Tailscale... ``` These types of logs show up immediately afterward: ``` headscale[7497]: 2023-10-12T09:05:33Z ERR Failed to fetch machine from the database with node key: nodekey: (...) (...) headscale[7497]: 2023-10-12T09:05:36Z ERR error getting routes error="sql: database is closed" headscale[7497]: 2023-10-12T09:05:36Z ERR Error listing users error="sql: database is closed" ``` This finally ends with: ``` Oct 12 09:07:03 (...) systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing. Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7497 (headscale) with signal SIGKILL. Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7499 (headscale) with signal SIGKILL. (...) Oct 12 09:07:03 (...) systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL Oct 12 09:07:03 (...) systemd[1]: headscale.service: Failed with result 'timeout'. Oct 12 09:07:03 (...) systemd[1]: Stopped headscale coordination server for Tailscale. (...) ``` So for some reason SIGKILL is finally required and the restart takes about 90 seconds. Is this a known issue? ## Environment ``` # headscale version v0.22.3 ``` ``` # cat /etc/lsb-release DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS" ``` ``` # uname -a Linux (...) 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux ``` - [ ] Headscale is behind a (reverse) proxy - [ ] Headscale runs in a container ## To Reproduce N/A
adam added the bug label 2025-12-29 02:20:29 +01:00
adam closed this issue 2025-12-29 02:20:29 +01:00
Author
Owner

@harvey-git commented on GitHub (Oct 12, 2023):

Judging from the error log, the main program cannot fetch machine information from the database.
because headscale uses the sqlite3 database.
This error message means three possibilities: 1. The database is broken; 2. There are other processes locking the headscale database;3. database access The permissions have changed. If headscale can start normally after restarting the server, you can exclude 1 and 3. Then it should be that when headscale is restarted, its process cannot be closed and exited normally.
Is your server hardware configuration very low?

@harvey-git commented on GitHub (Oct 12, 2023): Judging from the error log, the main program cannot fetch machine information from the database. because headscale uses the sqlite3 database. This error message means three possibilities: 1. The database is broken; 2. There are other processes locking the headscale database;3. database access The permissions have changed. If headscale can start normally after restarting the server, you can exclude 1 and 3. Then it should be that when headscale is restarted, its process cannot be closed and exited normally. Is your server hardware configuration very low?
Author
Owner

@lgrn commented on GitHub (Oct 12, 2023):

Hi @harvey-git ,

Headscale restarts just fine, and it seems to work. My interpretation of the logs is that the headscale process for some reason hangs around after receiving SIGTERM from systemd and attempts to access the database, even though the database is closed (presumably because it's closed as a part of shutting down headscale?) -- until systemd loses patience and kills it.

I'm not sure if it's related but I've also noticed that operations like deleting a route (headscale route delete -r N) also takes a pretty long time, maybe 5-10 seconds, but that feels more like database performance while this issue has to do with the process not actually terminating and attempting to access the (closed) database.

Regarding the specs, this is a VM running in OpenStack with 2 cores, 4G of memory and nvme backed storage.

In terms of a workaround, I'm not sure about the impact (i.e. whether it's safe to kill it sooner) but restarts can be done faster by setting TimeoutStopSec:

systemctl edit headscale

Add:

[Service]
TimeoutStopSec=5

Save and do systemctl daemon-reload.

Having to do SIGKILL at all doesn't feel great but I'm not sure whether it matters doing it on a 5s or 90s timeout.

Tangentially I also backed up the .sqlite database, opened it up and ran:

sqlite> VACUUM;
sqlite> PRAGMA integrity_check;
ok

This seems to have made commands like headscale users list a lot faster, but I can't say for sure. It's kind of strange why it would, because there's not a lot of data there (about 60K).

@lgrn commented on GitHub (Oct 12, 2023): Hi @harvey-git , Headscale restarts just fine, and it seems to work. My interpretation of the logs is that the headscale process for some reason hangs around after receiving `SIGTERM` from systemd and attempts to access the database, even though the database is closed (presumably because it's closed as a part of shutting down headscale?) -- until systemd loses patience and kills it. I'm not sure if it's related but I've also noticed that operations like deleting a route (`headscale route delete -r N`) also takes a pretty long time, maybe 5-10 seconds, but that feels more like database performance while this issue has to do with the process not actually terminating and attempting to access the (closed) database. Regarding the specs, this is a VM running in OpenStack with 2 cores, 4G of memory and nvme backed storage. In terms of a workaround, I'm not sure about the impact (i.e. whether it's safe to kill it sooner) but restarts can be done faster by setting `TimeoutStopSec`: ``` systemctl edit headscale ``` Add: ``` [Service] TimeoutStopSec=5 ``` Save and do `systemctl daemon-reload`. Having to do `SIGKILL` at all doesn't feel great but I'm not sure whether it matters doing it on a 5s or 90s timeout. Tangentially I also backed up the `.sqlite` database, opened it up and ran: ``` sqlite> VACUUM; sqlite> PRAGMA integrity_check; ok ``` This seems to have made commands like `headscale users list` a lot faster, but I can't say for sure. It's kind of strange why it would, because there's not a lot of data there (about 60K).
Author
Owner

@bryanjhv commented on GitHub (Oct 12, 2023):

Same as #1461 I think, should be fixed by #1480 in v0.23.0-alpha1.

@bryanjhv commented on GitHub (Oct 12, 2023): Same as #1461 I think, should be fixed by #1480 in v0.23.0-alpha1.
Author
Owner

@lgrn commented on GitHub (Oct 16, 2023):

I can confirm that headscale.service seems to restart instantly on v0.23.0-alpha1 -- thanks!

@lgrn commented on GitHub (Oct 16, 2023): I can confirm that headscale.service seems to restart instantly on v0.23.0-alpha1 -- thanks!
Author
Owner

@joshuataylor commented on GitHub (Dec 3, 2023):

Wanted to chip in to comment that a smaller timeout in this case is fine is acceptable. As noted systemd waits 120s for shutdown and other parts as well, this can be changed on a global level or by package (do it by package for a server, especially if you have other things that SHOULD NOT be killed after 5 seconds :)).

The KDE Wiki also recommends this (for desktops, not servers -- you probably do NOT want this globally on a server).

@joshuataylor commented on GitHub (Dec 3, 2023): Wanted to chip in to comment that a smaller timeout in this case is fine is acceptable. As noted systemd waits 120s for shutdown and other parts as well, this can be changed on a global level or by package (do it by package for a server, especially if you have other things that SHOULD NOT be killed after 5 seconds :)). The [KDE Wiki](https://community.kde.org/Distributions/Packaging_Recommendations#Systemd_configuration) also recommends this (for desktops, not servers -- you probably do NOT want this globally on a server).
Author
Owner

@MulverineX commented on GitHub (Jul 22, 2024):

I'm still experiencing this issue in the 0.23 alpha

@MulverineX commented on GitHub (Jul 22, 2024): I'm still experiencing this issue in the 0.23 alpha
Author
Owner

@lgrn commented on GitHub (Jul 23, 2024):

I'm still experiencing this issue in the 0.23 alpha

It looks like the beta was released today, try that first.

@lgrn commented on GitHub (Jul 23, 2024): > I'm still experiencing this issue in the 0.23 alpha It looks like the beta was released today, try that first.
Author
Owner

@SuperSandro2000 commented on GitHub (Aug 9, 2024):

I am on 0.23.0-beta1 and also still experience this issue.

@SuperSandro2000 commented on GitHub (Aug 9, 2024): I am on 0.23.0-beta1 and also still experience this issue.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/headscale#566