Mirror of https://github.com/juanfont/headscale.git, synced 2026-01-11 20:00:28 +01:00
headscale server stopped answering after a day of uptime despite listening on all the ports #565
Closed
opened 2025-12-29 02:20:29 +01:00 by adam · 37 comments
Originally created by @axxonadmin on GitHub (Oct 11, 2023).
Went to a local shop and tried to connect to my remote headscale server (v0.23.0-alpha1) that I had gotten working several days ago.
The Tailscale client (1.48.2, Android) got stuck at "connecting",
so I logged out and tried to log in again; it also got stuck at this stage with no output or warning.
I got home and tried to connect from my desktop computer (1.51, Windows 11) with no success (it got stuck opening listen_addr and port in the browser to authenticate).
Then I connected to my server over SSH and ran 'headscale apikeys list' and 'headscale nodes list' with no success. After restarting the headscale server with 'service headscale stop' / 'service headscale start', everything started working just fine.
Is there anything I can do to help you investigate this issue if it happens again?
What I've found so far:
tcpdump on port 8080 (which is my 'listen_addr') showed something like this:
I tried to restart headscale with 'service headscale stop' and the logs went like this:
Before that, the logs were full of:
Oct 9 20:30:15 server headscale[482461]: 2023-10-09T20:30:15Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > Waiting for update on stream channel node=MikroTik node_key=removed_key_hash_data noise=true omitPeers=false readOnly=false stream=true
@rjmalagon commented on GitHub (Oct 28, 2023):
+1. After ~20 minutes headscale is stuck with timeouts while updating nodes.
@rjmalagon commented on GitHub (Oct 28, 2023):
After carefully checking my headscale VM's stats, I found that it is very RAM sensitive.
I serve ~200+ nodes with a 2 GB RAM / 4 CPU VM.
Sometimes headscale has a very brief RAM peak, but it usually stays below 1 GB of RAM usage.
I just added a little swapfile and it worked well. I am now experimenting with zram swap compression (it seems to work as well).
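For reference, adding a small swapfile on a typical Linux VM, as described above, is only a few commands; this is a generic sketch (the 2G size and the /swapfile path are illustrative, not taken from the comment):

    # Create and enable a 2G swapfile.
    fallocate -l 2G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    # Make it persistent across reboots.
    echo '/swapfile none swap sw 0 0' >> /etc/fstab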
@rjmalagon commented on GitHub (Oct 30, 2023):
I was wrong about this: headscale sometimes gets stuck even with enough memory and CPU.
Usually it gets stuck with multiple "waiting for update on stream channel" messages.
I set up a Caddy reverse proxy to get connection logs, and I see many 502s on the tailscale-control-protocol upgrade when headscale stops responding.
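For reference, a minimal Caddy site block that proxies headscale and records access logs might look like the sketch below; the hostname and upstream address are assumptions, and Caddy forwards the protocol upgrade on its own:

    headscale.example.com {
        # Access logging makes the 502s on the control-protocol upgrade visible.
        log
        reverse_proxy localhost:8080
    }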
@Nickiel12 commented on GitHub (Nov 29, 2023):
+1. Running on a NixOS server with 24 GB of RAM (so RAM isn't the issue). Headscale randomly (from what I can tell so far) gets stuck using 3% of the CPU, takes 5 minutes for systemctl to restart, and doesn't allow new connections.
@kradalby commented on GitHub (Nov 30, 2023):
I think this might be fixed with the tip of #1564, could you test?
@Nickiel12 commented on GitHub (Dec 1, 2023):
@kradalby I would like to test the change that you have made; however, I'm having an awful time trying to use the headscale flake. I've been using the nixpkgs version, and I don't quite have my head wrapped around how flakes and Nix all work. I saw that you use the headscale flake in your configuration, and was wondering how your configuration uses the flake version of headscale instead of the nixpkgs version? I'm assuming it has to do with the overlay, but I tried putting overlay = [ headscale.overlay ] in my flake.nix (after adding it as an input) and it did not change what version was installed. If I can figure out how to switch to the flake version, I will be happy to test this change.
I wasn't inheriting the overlay-ed pkgs to the server host. I'm switching to the patched version of the headscale server and will let you know how it works in a few days.
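For other NixOS users with the same question: the flake's package has to be applied through the host's nixpkgs.overlays option rather than a bare overlay attribute. A rough sketch, assuming the flake exposes the overlay output referenced in the comment above:

    {
      inputs = {
        nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
        headscale.url = "github:juanfont/headscale";
      };

      outputs = { self, nixpkgs, headscale, ... }: {
        nixosConfigurations.server = nixpkgs.lib.nixosSystem {
          system = "x86_64-linux";
          modules = [
            # Let the flake's headscale package shadow the one from nixpkgs.
            { nixpkgs.overlays = [ headscale.overlay ]; }
            ./configuration.nix
          ];
        };
      };
    }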
@Nickiel12 commented on GitHub (Dec 4, 2023):
I'm not sure if this is the right thread for this, but while watching systemctl status headscale during systemctl restart headscale, I noticed that it appears to have closed the DB (SQLite for my instance) before shutting down other threads properly, and it took almost two minutes for headscale to restart. I don't know if this is related, but if the SQLite database can disconnect before the rest of the application shuts down, it might be causing this issue.
@kradalby commented on GitHub (Dec 10, 2023):
0.23.0-alpha2 addresses a series of issues with node synchronisation, online status, and subnet routers; please test this release and report back if the issue still persists.
@Nickiel12 commented on GitHub (Dec 16, 2023):
@kradalby The issue of the headscale server not responding has gone away. I've noticed a weird issue with the app where I need to log out, change the server and save it, then log back in before I can connect. But I have not had to restart the headscale server at all since I switched to the alpha release.
@Nickiel12 commented on GitHub (Dec 16, 2023):
well, scratch that, now I keep getting prompted to re-register my phone when I try to connect to the server. But I'm not getting the same issue as before where I couldn't log on at all and the headscale service would hang.
@rjmalagon commented on GitHub (Dec 19, 2023):
Hi, the stability and responsiveness are much better. Alpha 1 would easily get stuck under some stress (300+ nodes) in just minutes. This new alpha is robust enough to handle the same workload for 20+ hours without issues. Thanks @kradalby for the progress and the follow-up on this issue.
I can share any info if needed to help with the development of v0.23.
@jwischka commented on GitHub (Dec 20, 2023):
I'm experiencing something similar on 0.23-alpha2 - I haven't had time to debug yet, but I've had to restart headscale twice on a moderately sized install (~30 nodes) in the last two days.
@Nickiel12 commented on GitHub (Dec 23, 2023):
I deleted my phone node, re-registered it, and I have not had any issues connecting in two days since!
@cfouche3005 commented on GitHub (Dec 27, 2023):
Same issue on 0.23-alpha2 after one hour; I will check whether the issue reappears.
@cfouche3005 commented on GitHub (Dec 27, 2023):
I can confirm the issue is still present: after one hour or so, there is no response from the gRPC or HTTP API, but headscale still works (I haven't tested whether I can register new nodes).
@cfouche3005 commented on GitHub (Dec 28, 2023):
I can also confirm that new or disconnected devices cannot join/reconnect to the tailnet.
@cfouche3005 commented on GitHub (Dec 29, 2023):
I have to add that this bug is very inconsistent: today it did not appear, but yesterday it was there.
@dustinblackman commented on GitHub (Feb 7, 2024):
I'm finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with Postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like https://github.com/juanfont/headscale/pull/1701 may solve it.
In the meantime, I've put together a small/lazy systemd service that acts as a healthcheck, rebooting Headscale if it locks up. Taking yesterday's data, this fired 9 times with 30 nodes in a for-fun containerized testing environment.
headscale-health
headscale-health.service
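The headscale-health script and unit referenced above are not reproduced in this mirror. The following is only an illustrative sketch of such a watchdog, not the original attachment; the 15-second timeout, paths, and the 5-minute timer interval are assumptions:

    #!/bin/sh
    # /usr/local/bin/headscale-health (hypothetical): if the CLI cannot list
    # nodes within 15 seconds, assume headscale is wedged and restart it.
    if ! timeout 15 headscale nodes list >/dev/null 2>&1; then
        systemctl restart headscale
    fi

    # headscale-health.service (illustrative)
    [Unit]
    Description=Restart headscale if the CLI stops answering

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/headscale-health

    # headscale-health.timer (illustrative): run the check every 5 minutes
    [Unit]
    Description=Periodic headscale healthcheck

    [Timer]
    OnCalendar=*:0/5
    Unit=headscale-health.service

    [Install]
    WantedBy=timers.target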
@kradalby commented on GitHub (Feb 7, 2024):
@TotoTheDragon Could you have a look at this? I won't have time to get around to it for a while.
@TotoTheDragon commented on GitHub (Feb 7, 2024):
@kradalby I have been looking into this issue for the past day and agree with dustin that #1701 is a good contender for solving this issue and #1656. This issue was present before the referenced commit.
I do not have the infrastructure to test with 30 or so nodes to recreate the issue, but we could make a build that includes the changes from #1701, and hopefully dustin is able to test it to see whether it makes any difference.
@dustinblackman commented on GitHub (Feb 7, 2024):
I'm down! If https://github.com/juanfont/headscale/pull/1701 is considered complete in its current state, I can ship it and see how it does.
@kradalby commented on GitHub (Feb 7, 2024):
I would say that it is complete, but complete as in the tip of main: not tested sufficiently to release as a version. But I interpret the fact that you are currently running main as meaning your risk appetite is fine with that.
@dustinblackman commented on GitHub (Feb 7, 2024):
More or less; it helps that the codebase is easy to read, so at least I know what I'm bringing in off main. I'll give this a shot either this week or next :)
@TotoTheDragon commented on GitHub (Feb 12, 2024):
@dustinblackman Would you be able to test with the current version of main?
@dustinblackman commented on GitHub (Feb 12, 2024):
@TotoTheDragon I've been running from 83769ba715 for the last four days. At first it looked like all was good, but looking at the logs I'm still seeing lockups, though fewer. I can try from the latest master later in the week.
I also have a set of scripts for a local cluster that I had written for https://github.com/juanfont/headscale/issues/1725. I can look to PR them if you think they'd be helpful in debugging this.
@TotoTheDragon commented on GitHub (Feb 13, 2024):
@dustinblackman Seeing as headscale nodes list gets a context exceeded error, maybe we can add a bunch of traces within the command and see where it gets stuck. This will help rule some stuff out.
@kradalby commented on GitHub (Feb 15, 2024):
Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4
@dustinblackman commented on GitHub (Feb 16, 2024):
@kradalby I ran this for about two hours and saw no reboots, but I experienced issues where some newly added ephemeral nodes were unable to communicate over the network (port 443 requests), even with tailscale ping showing a direct connection. I'm wondering if nodes are not always being notified when a new node joins the network.
I'm going to try again in a localized environment and see if I can repro it.
@kradalby commented on GitHub (Feb 17, 2024):
Thank you @dustinblackman, that's helpful; it does sound like there are some missing updates. There is a debug env flag you can turn on which will dump all the map responses sent. If you can repro, that would potentially be helpful info, but it produces a lot of data and might not be suitable if you have a lot of nodes.
You can play around with that by setting HEADSCALE_DEBUG_DUMP_MAPRESPONSE_PATH to somewhere on your system.
@kradalby commented on GitHub (Feb 19, 2024):
Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?
@dustinblackman commented on GitHub (Feb 21, 2024):
@kradalby No reboots again, but after 30 minutes I get several lines such as the following. Couldn't prove they were actually causing issues. I'll test further.
@dustinblackman commented on GitHub (Feb 23, 2024):
I'm unable to repro this in a local cluster. :(
@kradalby commented on GitHub (Apr 17, 2024):
Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?
@kradalby commented on GitHub (Apr 29, 2024):
I think the latest alpha should have improved this a lot, can someone experiencing this give it a try?
@dustinblackman commented on GitHub (May 7, 2024):
I'll look to give this a spin this week if I can slot it in :)
@dustinblackman commented on GitHub (May 10, 2024):
Been running this for a little over a day with no issues! Amazing work, thank you! I appreciate all the effort.
@ohdearaugustin commented on GitHub (May 18, 2024):
Will close this as fixed.