[Bug] online status doesn't change if connection is interrupted #790
Open · opened 2025-12-29 02:24:02 +01:00 by adam · 18 comments
Originally created by @moserpjm on GitHub (Sep 12, 2024).
Current Behavior
If a client's connection is interrupted (cable pulled, WiFi disconnected), headscale never changes its status to offline.
The status only changes to offline when I restart headscale or caddy (which terminates all connections).
I found the bug while working on an OPNsense plugin for Tailscale. It has CARP support, which brings Tailscale down on the slave node. That should trigger a failover of the routes, but it didn't: during the failover the internet connection of the slave firewall is interrupted for a few seconds, and my CARP hook executed tailscale down in exactly that window. As a result, both firewalls show up as online and the routes do not fail over.
Expected Behavior
The node should go offline after some time.
Steps To Reproduce
Connect a device and interrupt its internet connection.
Environment
Runtime environment
Anything else?
I can provide further logs and dumps if this problem does not appear in another setup.
@kradalby commented on GitHub (Sep 12, 2024):
Can you please confirm this issue without the reverse proxy?
@moserpjm commented on GitHub (Sep 13, 2024):
Yes I can but it's going to take some time to build a lab setup. BTW I checked the behaviour of the reverse proxy. It doesn't reuse connections.
@moserpjm commented on GitHub (Sep 13, 2024):
OK. I did a quick and dirty setup. No fancy stuff. Ubuntu 24.04, deb package, no firewall, no proxy, no OIDC.
Connected my Android phone and disabled WiFi and 5G.
Those are the relevant lines in the log:
It set the node offline after 18 minutes. I'm not sure I waited that long on my fancy setup; going to test it. Should this take so long? For subnet router HA this is a very long time.
I think it should be pretty easy for you to replicate this setup and attach a debugger. ;)
If you need a server I can quickly do some Ansible magic.
@moserpjm commented on GitHub (Sep 13, 2024):
OK. Same behaviour on my fancy system. It took 16 minutes.
@kradalby commented on GitHub (Sep 13, 2024):
Thanks, I'll set up an integration test to reproduce this too. I had a few minutes yesterday and think I managed to see this with an RPi. I suspect there is something wrong with how the keepalive is sent, which should trigger offline status by failing to send.
@kradalby commented on GitHub (Sep 13, 2024):
I have researched this issue, and I am starting to suspect that it has always been broken. I am seeing around 16 minutes in my integration tests and it seems to be out of Go's control.
So I found this blog post that describes this behaviour, but from the client side.
I am unsure if this can be solved without implementing some other way to discover whether the client is still online. I have an idea, but it requires quite some re-engineering and I think it will have to come in a later version.
Could you please try to see if this behaviour is present in v0.22.3 in the lab you set up?
@kradalby commented on GitHub (Sep 13, 2024):
I suspect what we need to do is figure out how we can use something like PingRequest (https://github.com/tailscale/tailscale/blob/main/tailcfg/tailcfg.go#L1663) to check if a node is there.
I'm going to try to confirm if this is new behaviour, and if it is not, I will say this is out of scope for 0.23.0, and move it to next.
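To make the PingRequest idea above concrete, here is a rough sketch of such a liveness check. Every type and name below is hypothetical and none of it is headscale or tailscale API; it only illustrates the probe-with-deadline pattern a control server would need:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// probeNode sketches the liveness check discussed above: ask the client to
// call back (for example by fetching a one-off URL handed to it in a ping
// request) and mark the node offline if nothing arrives before the deadline.
// sendPing and callbacks are hypothetical stand-ins for whatever transport
// the control server would actually use.
func probeNode(ctx context.Context, nodeID uint64,
	sendPing func(nodeID uint64) error, callbacks <-chan uint64) (online bool) {

	if err := sendPing(nodeID); err != nil {
		return false // could not even hand the ping to the node's stream
	}
	deadline := time.NewTimer(30 * time.Second)
	defer deadline.Stop()
	for {
		select {
		case id := <-callbacks:
			if id == nodeID {
				return true // client proved it is still there
			}
		case <-deadline.C:
			return false // no callback in time: treat the node as offline
		case <-ctx.Done():
			return false
		}
	}
}

func main() {
	// Toy demo: a "client" that answers the ping after one second.
	callbacks := make(chan uint64, 1)
	send := func(id uint64) error {
		go func() { time.Sleep(time.Second); callbacks <- id }()
		return nil
	}
	fmt.Println("online:", probeNode(context.Background(), 42, send, callbacks))
}
```

The point is that the control server gets either a positive signal (the callback) or a definite timeout, instead of relying on a TCP write that never fails.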
@moserpjm commented on GitHub (Sep 13, 2024):
I know this behaviour very well. Serial port libs also retry for ages. The solution is what JDBC connection pools have done for ages: if there's no traffic for x seconds, send a keepalive; if the max idle time is reached, kill the connection and hope that the underlying library and OS really close it.
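For illustration, a minimal Go sketch of that pattern; the names and durations are made up for this example and it is not headscale code:

```go
package main

import (
	"io"
	"net"
	"time"
)

// superviseConn sketches the pattern described above (the connection-pool
// trick): send a keepalive whenever the connection has been quiet for a
// while, and force-close it once maxIdle passes without any traffic from
// the peer, instead of trusting the OS to notice that the peer is gone.
func superviseConn(conn net.Conn, lastSeen func() time.Time, keepaliveEvery, maxIdle time.Duration) {
	ticker := time.NewTicker(keepaliveEvery)
	defer ticker.Stop()
	for range ticker.C {
		if time.Since(lastSeen()) > maxIdle {
			// Don't wait for a write error (it may never come); close the
			// connection and hope the library and OS really tear it down.
			conn.Close()
			return
		}
		// Best-effort keepalive: a nil error only means the bytes reached
		// the kernel's send buffer, not that the peer is still alive.
		conn.SetWriteDeadline(time.Now().Add(10 * time.Second))
		conn.Write([]byte("{}"))
	}
}

func main() {
	// Toy demo with an in-memory pipe and short durations; for the scenario
	// in this thread think keepaliveEvery = 1 minute, maxIdle = 3 minutes.
	client, server := net.Pipe()
	defer client.Close()
	go io.Copy(io.Discard, client) // a "peer" that never sends anything back
	start := time.Now()            // pretend the last read happened at start
	superviseConn(server, func() time.Time { return start }, time.Second, 3*time.Second)
	// superviseConn returns here after about four ticks, having closed server.
}
```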
@kradalby commented on GitHub (Sep 13, 2024):
The thing is that we send keepalives roughly every minute, but in Go, flushing these messages to a gone connection does not produce any errors.
So we would need a keepalive variant that calls back, which we do not have, and that will require some effort. Since this is likely existing behaviour (I'm trying to verify, but I have no lab yet, so please help), I will not hold up this release and will try to work it into a future one.
@kradalby commented on GitHub (Sep 13, 2024):
I've confirmed that this issue occurs in 0.22.3, so I will push this to next. We should definitely solve it, but it requires more thought than something we should add just before an upcoming release.
@Zeashh commented on GitHub (Sep 16, 2024):
I've noticed this issue when networks change, for instance on a phone when I switch from cellular data to WiFi and vice versa. After changing networks I have to reconnect.
@erueda1 commented on GitHub (Apr 14, 2025):
Still happening in the latest version. Any plans to incorporate a solution in the next release?
@kradalby commented on GitHub (Apr 14, 2025):
Yes. This isn't directly a bug in our software; it is more of a common networking issue.
Solving it is a lot more involved than I first anticipated. The Tailscale client implements a mechanism for this where the control server pings clients to check if they are still there (really only for HA nodes).
The involved bit is that we need to implement the control-to-node (c2n) interface, through which the control server can ask the client things. We have not done that, and it is a lot more work. The ping requests are easy once c2n is done, but c2n isn't started.
@vinifrancosilva commented on GitHub (Sep 13, 2025):
Any news on this issue? The software is incredible! It's more cosmetic than a bug, but it would be nice to have a reliable indication of the status of the nodes.
@kradalby commented on GitHub (Sep 13, 2025):
It's not cosmetic; it's a state we actually cannot determine in headscale, so we need to implement a lot more to get it working. Currently this work is planned for the attached milestone.
@MarcelWaldvogel commented on GitHub (Oct 15, 2025):
At the risk of stating the obvious, has anyone tried getting TCP to close those connections faster than 16 minutes, if it is hard to detect at the application layer?
The cardinal approach would be to use TCP keepalives. On Linux they can be set either system-wide (tcp_keepalive_* in procfs) or on a per-connection basis (TCP_KEEP* with setsockopt). Setting both the idle and interval values to 15 seconds and the count to 3 should get the connection closed after 60 seconds if the remote side does not answer.
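For reference, Go 1.23+ exposes exactly those three per-connection knobs, so something like the following could be tried on the listener side. The address and the bare http.Serve wiring are just placeholders for this sketch, not headscale's actual setup:

```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	// 15s idle before the first probe, 15s between probes, 3 unanswered
	// probes: the kernel resets the connection after roughly 60 seconds of
	// silence, matching the values suggested above.
	lc := net.ListenConfig{
		KeepAliveConfig: net.KeepAliveConfig{
			Enable:   true,
			Idle:     15 * time.Second,
			Interval: 15 * time.Second,
			Count:    3,
		},
	}
	// ":8080" is only a placeholder address for this sketch.
	ln, err := lc.Listen(context.Background(), "tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	// Every connection accepted from ln inherits the keepalive settings.
	log.Fatal(http.Serve(ln, http.NotFoundHandler()))
}
```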
Alternatively, setting tcp_retries2 to something lower than 15 (e.g., 7, see calculations here) should also get the thing done, but will impact everything else on the system, with likely unwanted results.
@kradalby commented on GitHub (Oct 16, 2025):
I'm sure you can tweak those things to achieve it, but implementing c2n seems more reasonable, as it is built for this and lets us use other neat features, rather than speculatively modifying TCP sockets.
@moserpjm commented on GitHub (Oct 16, 2025):
We "solved" the problem half a year ago by setting net.ipv4.tcp_retries2 = 6 on our two caddy reverse proxies, which proxy headscale and around 20 other services. Stale Tailscale connections now die after roughly two minutes, so when OPNsense does a CARP failover, Tailscale starts working again after at most three minutes. We haven't seen any side effects on the other services (S3 storage, Zitadel, Portainer, Rancher, Gitea...).
I think for the stable networks we have nowadays, 15 is much too high. Red Hat suggests setting it to 3 for HA solutions that have to detect connection loss quickly.
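For anyone wondering where the ~15-16 minutes (and the much shorter timeouts for lower values) come from, here is a back-of-the-envelope calculation in Go. It assumes the retransmission timeout starts at the 200 ms minimum, doubles per attempt and is capped at 120 s; real connections derive the timeout from the measured RTT, so actual values are somewhat longer:

```go
package main

import (
	"fmt"
	"time"
)

// worstCase estimates how long Linux keeps retransmitting before it gives
// up on an established connection, for a given net.ipv4.tcp_retries2.
// Assumptions: RTO starts at TCP_RTO_MIN (200ms), doubles per attempt and
// is capped at TCP_RTO_MAX (120s); this is a lower-bound approximation.
func worstCase(retries2 int) time.Duration {
	const (
		rtoMin = 200 * time.Millisecond
		rtoMax = 120 * time.Second
	)
	total := time.Duration(0)
	rto := rtoMin
	for i := 0; i <= retries2; i++ { // retries2 retransmissions plus the final wait
		total += rto
		rto *= 2
		if rto > rtoMax {
			rto = rtoMax
		}
	}
	return total
}

func main() {
	for _, r := range []int{15, 7, 6, 3} {
		fmt.Printf("tcp_retries2=%2d -> ~%s\n", r, worstCase(r))
	}
	// tcp_retries2=15 -> ~15m24.6s  (matches the ~16 minutes seen above)
	// tcp_retries2= 6 -> ~25.4s     (a lower bound; with real RTTs it is longer)
}
```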