Frequent "offline" status causing subnet router re-election and connection disruptions #528

Closed
opened 2025-12-29 02:19:33 +01:00 by adam · 11 comments

Originally created by @vsychov on GitHub (Jun 30, 2023).

Hello,

I have noticed a recurring issue where headscale logs that a machine has gone "offline", even though the machine is actually online and has no issues with its internet connection. Since I am using tailscale as a subnet router, this triggers a re-election of the "primary route" whenever the affected machine was serving as the primary route, leading to connection disruptions.

It appears that the problem lies in how a machine is marked "offline", using the `last_seen` field in the database. A machine is marked offline when its `last_seen` timestamp is more than 60 seconds old (`keepAliveInterval`). Therefore, even a slight delay of just an extra second can push the machine offline and trigger the election of a new subnet router.

It looks like the `last_seen` field is updated in `keepAliveTicker` and a few other places, and in my setup this happens only every 40-60 seconds, which is not frequent enough.

From what I can see, this problem could be solved by updating the `last_seen` field in the `updateCheckerTicker` (which by default fires every 10 seconds - `NodeUpdateCheckInterval`), simply by adding:

```go
machine.LastSeen = &now
```

right after:
https://github.com/juanfont/headscale/blob/fe75b716201a2d31bd8fe2531100e93ff7bfb4f1/hscontrol/poll.go#L561
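
For illustration, here is a minimal sketch of how the two tickers interact (the loop shape and the `persist` callback are assumptions for illustration, not actual poll.go code):

```go
package sketch

import "time"

type Machine struct {
	LastSeen *time.Time
}

// pollLoop sketches the two tickers discussed above: the keep-alive
// ticker only fires every 40-60 seconds, while the update checker
// fires every 10 seconds (NodeUpdateCheckInterval). Touching LastSeen
// on the faster ticker keeps the node from drifting past the
// 60-second offline threshold between keep-alives.
func pollLoop(machine *Machine, persist func(*Machine)) {
	keepAliveTicker := time.NewTicker(60 * time.Second)
	updateCheckerTicker := time.NewTicker(10 * time.Second)
	defer keepAliveTicker.Stop()
	defer updateCheckerTicker.Stop()

	for {
		select {
		case <-keepAliveTicker.C:
			now := time.Now()
			machine.LastSeen = &now // existing behavior
			persist(machine)
		case <-updateCheckerTicker.C:
			now := time.Now()
			machine.LastSeen = &now // proposed addition
			persist(machine)
		}
	}
}
```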

I hope this suggestion is helpful, and I look forward to any feedback.

Thank you

adam added the bug and well described ❤️ labels 2025-12-29 02:19:33 +01:00
adam closed this issue 2025-12-29 02:19:33 +01:00

@kradalby commented on GitHub (Jul 7, 2023):

This might be fixed, or we might at least have the basis to fix it, once #1492 lands; it starts looking at the Online field and sends updates in a different way. It might not have been directly addressed, but it should be easier to fix.

@github-actions[bot] commented on GitHub (Dec 24, 2023):

This issue is stale because it has been open for 90 days with no activity.

@github-actions[bot] commented on GitHub (Dec 31, 2023):

This issue was closed because it has been inactive for 14 days since being marked as stale.

@andreyrd commented on GitHub (Jan 17, 2024):

This is still an active issue in the latest stable version.
Is this fixed in the latest alpha, and is the latest alpha ready for use in a prod-like environment?

@kradalby commented on GitHub (Jan 19, 2024):

@andreyrd we follow common software release practices, and alpha software is not recommended for use in production. We need help testing it, so we release it under an alpha/beta label to signal that you need to be cautious when using it.

I believe the issue has been solved, but we need people who encounter the problem to test it. If you have the opportunity, that would be great.

@kradalby commented on GitHub (Feb 19, 2024):

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

@eNdiD commented on GitHub (Feb 27, 2024):

@kradalby with the latest v0.23.0-alpha5 there is some odd behavior. I constantly see my Android clients go offline while they continue to work fine with the tailnet, but headscale seems to stop sending updates to them. To bring them back online, they need to send some updates themselves, e.g. by moving to a different network, or by my manually restarting the Tailscale connection on them.

I have seen the very same thing on my Raspberry Pi, but only once, and I'm not sure what the cause was. Other Linux clients stay online without issue.

Update: Going offline is not instant; the Android nodes stay online for some time, sometimes hours. More interestingly, "offline" nodes may have a fairly fresh `last seen` value, e.g. one minute ago.

Update 2: I believe it can be reproduced by switching networks, with the following scenario:

  1. Activate Tailscale on Android while being on the home Wi-Fi. Node stays online
  2. Turn off Wi-Fi, forcing the phone to switch to the mobile connection. Node stays online
  3. Turn on Wi-Fi. Node goes offline, last seen value continues to update

@fortitudepub commented on GitHub (Mar 19, 2024):

I also found this issue with the 0.23 alpha versions. After some investigation, I think it may be caused by the existing connection to the controller being reset (e.g. by switching the router/Wi-Fi, which may change the outside NAT address) while a new connection is established quickly. In that case, in poll.go the old connection's defer action may execute after the new connection has been added, because the online status is now a map indexed by node key.
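
A guard along these lines would make the stale defer harmless (a hypothetical sketch; none of these names come from headscale, they only illustrate the map-indexed online state):

```go
package sketch

import "sync"

type NodeKey string

type onlineTracker struct {
	mu     sync.Mutex
	conn   map[NodeKey]uint64 // node key -> ID of the connection that owns the entry
	nextID uint64
}

// connect registers a new long-poll connection for a node and returns
// the connection's ID; the node is considered online while an entry
// for it exists in the map.
func (t *onlineTracker) connect(k NodeKey) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextID++
	t.conn[k] = t.nextID
	return t.nextID
}

// disconnect is what the old connection's defer would run. Without the
// ID comparison, a defer firing *after* a quick reconnect would delete
// the new connection's entry and leave the node marked offline even
// though it is connected; with it, the stale defer is a no-op.
func (t *onlineTracker) disconnect(k NodeKey, id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.conn[k] == id {
		delete(t.conn, k)
	}
}
```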

@kradalby commented on GitHub (Apr 17, 2024):

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@vsychov commented on GitHub (Apr 18, 2024):

Thanks @kradalby, I'll run tests today or tomorrow.

@kradalby commented on GitHub (May 24, 2024):

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.

Reference: starred/headscale#528