Headscale service restart timing out #566

New Issue

adam · 2025-12-29T02:20:29+01:00

adam commented

2025-12-29 02:20:29 +01:00

Originally created by @lgrn on GitHub (Oct 12, 2023).

Bug description

After running systemctl restart headscale:

headscale[7497]: 2023-10-12T09:05:33Z INF Received signal to stop, shutting down gracefully signal=terminated
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF Headscale stopped
systemd[1]: Stopping headscale coordination server for Tailscale...

These types of logs show up immediately afterward:

headscale[7497]: 2023-10-12T09:05:33Z ERR Failed to fetch machine from the database with node key: nodekey: (...)
(...)
headscale[7497]: 2023-10-12T09:05:36Z ERR error getting routes error="sql: database is closed"
headscale[7497]: 2023-10-12T09:05:36Z ERR Error listing users error="sql: database is closed"

This finally ends with:

Oct 12 09:07:03 (...) systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7497 (headscale) with signal SIGKILL.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7499 (headscale) with signal SIGKILL.
(...)
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Failed with result 'timeout'.
Oct 12 09:07:03 (...) systemd[1]: Stopped headscale coordination server for Tailscale.
(...)

So for some reason SIGKILL is finally required and the restart takes about 90 seconds.

Is this a known issue?

Environment

# headscale version
v0.22.3

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

# uname -a
Linux (...) 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Headscale is behind a (reverse) proxy
Headscale runs in a container

To Reproduce

N/A

Originally created by @lgrn on GitHub (Oct 12, 2023). ## Bug description After running `systemctl restart headscale`: ``` headscale[7497]: 2023-10-12T09:05:33Z INF Received signal to stop, shutting down gracefully signal=terminated headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...) headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...) headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...) headscale[7497]: 2023-10-12T09:05:33Z INF Headscale stopped systemd[1]: Stopping headscale coordination server for Tailscale... ``` These types of logs show up immediately afterward: ``` headscale[7497]: 2023-10-12T09:05:33Z ERR Failed to fetch machine from the database with node key: nodekey: (...) (...) headscale[7497]: 2023-10-12T09:05:36Z ERR error getting routes error="sql: database is closed" headscale[7497]: 2023-10-12T09:05:36Z ERR Error listing users error="sql: database is closed" ``` This finally ends with: ``` Oct 12 09:07:03 (...) systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing. Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7497 (headscale) with signal SIGKILL. Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7499 (headscale) with signal SIGKILL. (...) Oct 12 09:07:03 (...) systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL Oct 12 09:07:03 (...) systemd[1]: headscale.service: Failed with result 'timeout'. Oct 12 09:07:03 (...) systemd[1]: Stopped headscale coordination server for Tailscale. (...) ``` So for some reason SIGKILL is finally required and the restart takes about 90 seconds. Is this a known issue? ## Environment ``` # headscale version v0.22.3 ``` ``` # cat /etc/lsb-release DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS" ``` ``` # uname -a Linux (...) 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux ``` - [ ] Headscale is behind a (reverse) proxy - [ ] Headscale runs in a container ## To Reproduce N/A

adam added the bug label 2025-12-29 02:20:29 +01:00

adam closed this issue

2025-12-29 02:20:29 +01:00

adam commented

2025-12-29 02:20:30 +01:00

@harvey-git commented on GitHub (Oct 12, 2023):

Judging from the error log, the main program cannot fetch machine information from the database.
because headscale uses the sqlite3 database.
This error message means three possibilities: 1. The database is broken; 2. There are other processes locking the headscale database;3. database access The permissions have changed. If headscale can start normally after restarting the server, you can exclude 1 and 3. Then it should be that when headscale is restarted, its process cannot be closed and exited normally.
Is your server hardware configuration very low?

@harvey-git commented on GitHub (Oct 12, 2023): Judging from the error log, the main program cannot fetch machine information from the database. because headscale uses the sqlite3 database. This error message means three possibilities: 1. The database is broken; 2. There are other processes locking the headscale database;3. database access The permissions have changed. If headscale can start normally after restarting the server, you can exclude 1 and 3. Then it should be that when headscale is restarted, its process cannot be closed and exited normally. Is your server hardware configuration very low?

adam commented

2025-12-29 02:20:30 +01:00

@lgrn commented on GitHub (Oct 12, 2023):

Hi @harvey-git ,

Headscale restarts just fine, and it seems to work. My interpretation of the logs is that the headscale process for some reason hangs around after receiving SIGTERM from systemd and attempts to access the database, even though the database is closed (presumably because it's closed as a part of shutting down headscale?) -- until systemd loses patience and kills it.

I'm not sure if it's related but I've also noticed that operations like deleting a route (headscale route delete -r N) also takes a pretty long time, maybe 5-10 seconds, but that feels more like database performance while this issue has to do with the process not actually terminating and attempting to access the (closed) database.

Regarding the specs, this is a VM running in OpenStack with 2 cores, 4G of memory and nvme backed storage.

In terms of a workaround, I'm not sure about the impact (i.e. whether it's safe to kill it sooner) but restarts can be done faster by setting TimeoutStopSec:

systemctl edit headscale

Add:

[Service]
TimeoutStopSec=5

Save and do systemctl daemon-reload.

Having to do SIGKILL at all doesn't feel great but I'm not sure whether it matters doing it on a 5s or 90s timeout.

Tangentially I also backed up the .sqlite database, opened it up and ran:

sqlite> VACUUM;
sqlite> PRAGMA integrity_check;
ok

This seems to have made commands like headscale users list a lot faster, but I can't say for sure. It's kind of strange why it would, because there's not a lot of data there (about 60K).

@lgrn commented on GitHub (Oct 12, 2023): Hi @harvey-git , Headscale restarts just fine, and it seems to work. My interpretation of the logs is that the headscale process for some reason hangs around after receiving `SIGTERM` from systemd and attempts to access the database, even though the database is closed (presumably because it's closed as a part of shutting down headscale?) -- until systemd loses patience and kills it. I'm not sure if it's related but I've also noticed that operations like deleting a route (`headscale route delete -r N`) also takes a pretty long time, maybe 5-10 seconds, but that feels more like database performance while this issue has to do with the process not actually terminating and attempting to access the (closed) database. Regarding the specs, this is a VM running in OpenStack with 2 cores, 4G of memory and nvme backed storage. In terms of a workaround, I'm not sure about the impact (i.e. whether it's safe to kill it sooner) but restarts can be done faster by setting `TimeoutStopSec`: ``` systemctl edit headscale ``` Add: ``` [Service] TimeoutStopSec=5 ``` Save and do `systemctl daemon-reload`. Having to do `SIGKILL` at all doesn't feel great but I'm not sure whether it matters doing it on a 5s or 90s timeout. Tangentially I also backed up the `.sqlite` database, opened it up and ran: ``` sqlite> VACUUM; sqlite> PRAGMA integrity_check; ok ``` This seems to have made commands like `headscale users list` a lot faster, but I can't say for sure. It's kind of strange why it would, because there's not a lot of data there (about 60K).

adam commented

2025-12-29 02:20:30 +01:00

@bryanjhv commented on GitHub (Oct 12, 2023):

Same as #1461 I think, should be fixed by #1480 in v0.23.0-alpha1.

@bryanjhv commented on GitHub (Oct 12, 2023): Same as #1461 I think, should be fixed by #1480 in v0.23.0-alpha1.

adam commented

2025-12-29 02:20:30 +01:00

@lgrn commented on GitHub (Oct 16, 2023):

I can confirm that headscale.service seems to restart instantly on v0.23.0-alpha1 -- thanks!

@lgrn commented on GitHub (Oct 16, 2023): I can confirm that headscale.service seems to restart instantly on v0.23.0-alpha1 -- thanks!

adam commented

2025-12-29 02:20:31 +01:00

@joshuataylor commented on GitHub (Dec 3, 2023):

Wanted to chip in to comment that a smaller timeout in this case is fine is acceptable. As noted systemd waits 120s for shutdown and other parts as well, this can be changed on a global level or by package (do it by package for a server, especially if you have other things that SHOULD NOT be killed after 5 seconds :)).

The KDE Wiki also recommends this (for desktops, not servers -- you probably do NOT want this globally on a server).

@joshuataylor commented on GitHub (Dec 3, 2023): Wanted to chip in to comment that a smaller timeout in this case is fine is acceptable. As noted systemd waits 120s for shutdown and other parts as well, this can be changed on a global level or by package (do it by package for a server, especially if you have other things that SHOULD NOT be killed after 5 seconds :)). The [KDE Wiki](https://community.kde.org/Distributions/Packaging_Recommendations#Systemd_configuration) also recommends this (for desktops, not servers -- you probably do NOT want this globally on a server).

adam commented

2025-12-29 02:20:31 +01:00

@MulverineX commented on GitHub (Jul 22, 2024):

I'm still experiencing this issue in the 0.23 alpha

@MulverineX commented on GitHub (Jul 22, 2024): I'm still experiencing this issue in the 0.23 alpha

adam commented

2025-12-29 02:20:31 +01:00

@lgrn commented on GitHub (Jul 23, 2024):

I'm still experiencing this issue in the 0.23 alpha

It looks like the beta was released today, try that first.

@lgrn commented on GitHub (Jul 23, 2024): > I'm still experiencing this issue in the 0.23 alpha It looks like the beta was released today, try that first.

adam commented

2025-12-29 02:20:32 +01:00

@SuperSandro2000 commented on GitHub (Aug 9, 2024):

I am on 0.23.0-beta1 and also still experience this issue.

@SuperSandro2000 commented on GitHub (Aug 9, 2024): I am on 0.23.0-beta1 and also still experience this issue.

adam referenced this issue

2025-12-29 02:30:15 +01:00

[PR #566] [MERGED] Do not setLastStateChangeToNow every expireEphemeralNodesWorker call #1485

Sign in to join this conversation.

Branches Tags

main

gh-pages

update_flake_lock_action

kradalby/3038-reg-panic

kradalby/release-v0.27.2

dependabot/go_modules/golang.org/x/crypto-0.45.0

dependabot/go_modules/github.com/opencontainers/runc-1.3.3

copilot/investigate-headscale-issue-2788

copilot/investigate-visibility-issue-2788

copilot/investigate-issue-2833

copilot/debug-issue-2846

copilot/fix-issue-2847

dependabot/go_modules/github.com/go-viper/mapstructure/v2-2.4.0

dependabot/go_modules/github.com/docker/docker-28.3.3incompatible

kradalby/cli-experiement3

doc/0.26.1

doc/0.25.1

doc/0.25.0

doc/0.24.3

doc/0.24.2

doc/0.24.1

doc/0.24.0

kradalby/build-docker-on-pr

topic/docu-versioning

topic/docker-kos

juanfont/fix-crash-node-id

juanfont/better-disclaimer

update-contributors

topic/prettier

revert-1893-add-test-stage-to-docs

add-test-stage-to-docs

remove-node-check-interval

fix-empty-prefix

fix-ephemeral-reusable

bug_report-debuginfo

autogroups

logs-to-stderr

revert-1414-topic/fix_unix_socket

rename-machine-node

port-embedded-derp-tests-v2

port-derp-tests

duplicate-word-linter

update-tailscale-1.36

warn-against-apache

ko-fi-link

more-acl-tests

fix-typo-standalone

parallel-nolint

tparallel-fix

rerouting

ssh-changelog-docs

oidc-cleanup

web-auth-flow-tests

kradalby-gh-runner

fix-proto-lint

remove-funding-links

go-1.19

enable-1.30-in-tests

0.16.x

cosmetic-changes-integration

tmp-fix-integration-docker

fix-integration-docker

configurable-update-interval

show-nodes-online

hs2021

acl-syntax-fixes

ts2021-implementation

fix-spurious-updates

unstable-integration-tests

mandatory-stun

embedded-derp

prtemplate-fix

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: starred/headscale#566