Frequent "offline" status causing subnet router re-election and connection disruptions #528

Closed
opened 2025-12-29 02:19:33 +01:00 by adam · 11 comments

Originally created by @vsychov on GitHub (Jun 30, 2023).

Hello,

I have noticed a recurring issue where headscale logs that a machine has gone "offline", even though the machine is actually online and has no issues with its internet connection. Since I am using tailscale as a subnet router, this triggers a re-election of the "primary route" whenever the affected machine was serving as the primary route, leading to connection disruptions.

It appears that the problem lies in how a machine is marked "offline", using the `last_seen` field in the database. A machine is marked offline when its `last_seen` timestamp is more than 60 seconds old (`keepAliveInterval`). Therefore, even a slight delay of just an extra second can push the machine offline and trigger the election of a new subnet router.

It looks like the `last_seen` field is updated in `keepAliveTicker` and a few other places, and in my setup this happens only every 40-60 seconds, which is not frequent enough.

From what I can see, this problem could be solved by updating the `last_seen` field in the `updateCheckerTicker` (which by default fires every 10 seconds - `NodeUpdateCheckInterval`), simply by adding:

```go
machine.LastSeen = &now
```

right after:
https://github.com/juanfont/headscale/blob/fe75b716201a2d31bd8fe2531100e93ff7bfb4f1/hscontrol/poll.go#L561
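
For illustration, here is a minimal sketch of how the two tickers interact (the loop shape and the `persist` callback are assumptions for illustration, not actual poll.go code):

```go
package sketch

import "time"

type Machine struct {
	LastSeen *time.Time
}

// pollLoop sketches the two tickers discussed above: the keep-alive
// ticker only fires every 40-60 seconds, while the update checker
// fires every 10 seconds (NodeUpdateCheckInterval). Touching LastSeen
// on the faster ticker keeps the node from drifting past the
// 60-second offline threshold between keep-alives.
func pollLoop(machine *Machine, persist func(*Machine)) {
	keepAliveTicker := time.NewTicker(60 * time.Second)
	updateCheckerTicker := time.NewTicker(10 * time.Second)
	defer keepAliveTicker.Stop()
	defer updateCheckerTicker.Stop()

	for {
		select {
		case <-keepAliveTicker.C:
			now := time.Now()
			machine.LastSeen = &now // existing behavior
			persist(machine)
		case <-updateCheckerTicker.C:
			now := time.Now()
			machine.LastSeen = &now // proposed addition
			persist(machine)
		}
	}
}
```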

I hope this suggestion is helpful, and I look forward to any feedback.

Thank you

adam added the bug and well described ❤️ labels 2025-12-29 02:19:33 +01:00
adam closed this issue 2025-12-29 02:19:33 +01:00

@kradalby commented on GitHub (Jul 7, 2023):

This might be fixed, or we might at least have the basis to fix it, once #1492 lands; it starts looking at the Online field and sends updates in a different way. It might not have been directly addressed, but it should be easier to fix.

@github-actions[bot] commented on GitHub (Dec 24, 2023):

This issue is stale because it has been open for 90 days with no activity.

@github-actions[bot] commented on GitHub (Dec 31, 2023):

This issue was closed because it has been inactive for 14 days since being marked as stale.

@andreyrd commented on GitHub (Jan 17, 2024):

This is still an active issue in the latest stable version.
Is this fixed in the latest alpha, and is the latest alpha ready for use in a prod-like environment?

@kradalby commented on GitHub (Jan 19, 2024):

@andreyrd we follow common software release practices, and alpha software is not recommended for use in production. We need help testing it, so we release it under an alpha/beta label to signal that you need to be cautious when using it.

I believe the issue has been solved, but we need people who encounter the problem to test it. If you have the opportunity, that would be great.

@kradalby commented on GitHub (Feb 19, 2024):

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

@eNdiD commented on GitHub (Feb 27, 2024):

@kradalby with the latest v0.23.0-alpha5 there is some odd behavior. I constantly see my Android clients go offline while they continue to work fine with the tailnet, but headscale seems to stop sending updates to them. To bring them back online, they need to send some updates themselves, e.g. by moving to a different network, or by my manually restarting the Tailscale connection on them.

I have seen the very same thing on my Raspberry Pi, but only once, and I'm not sure what the cause was. Other Linux clients stay online without issue.

Update: Going offline is not instant; the Android nodes stay online for some time, sometimes hours. More interestingly, "offline" nodes may have a fairly fresh `last seen` value, e.g. one minute ago.

Update 2: I believe it can be reproduced by switching networks, with the following scenario:

  1. Activate Tailscale on Android while being on the home Wi-Fi. Node stays online
  2. Turn off Wi-Fi, forcing the phone to switch to the mobile connection. Node stays online
  3. Turn on Wi-Fi. Node goes offline, last seen value continues to update

@fortitudepub commented on GitHub (Mar 19, 2024):

I also found this issue with the 0.23 alpha versions. After some investigation, I think it may be caused by the existing connection to the controller being reset (e.g. by switching the router/Wi-Fi, which may change the outside NAT address) while a new connection is established quickly. In that case, in poll.go the old connection's defer action may execute after the new connection has been added, because the online status is now a map indexed by node key.
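
A guard along these lines would make the stale defer harmless (a hypothetical sketch; none of these names come from headscale, they only illustrate the map-indexed online state):

```go
package sketch

import "sync"

type NodeKey string

type onlineTracker struct {
	mu     sync.Mutex
	conn   map[NodeKey]uint64 // node key -> ID of the connection that owns the entry
	nextID uint64
}

// connect registers a new long-poll connection for a node and returns
// the connection's ID; the node is considered online while an entry
// for it exists in the map.
func (t *onlineTracker) connect(k NodeKey) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextID++
	t.conn[k] = t.nextID
	return t.nextID
}

// disconnect is what the old connection's defer would run. Without the
// ID comparison, a defer firing *after* a quick reconnect would delete
// the new connection's entry and leave the node marked offline even
// though it is connected; with it, the stale defer is a no-op.
func (t *onlineTracker) disconnect(k NodeKey, id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.conn[k] == id {
		delete(t.conn, k)
	}
}
```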

@kradalby commented on GitHub (Apr 17, 2024):

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@vsychov commented on GitHub (Apr 18, 2024):

Thanks @kradalby, I'll run tests today or tomorrow.

@kradalby commented on GitHub (May 24, 2024):

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.

Reference: starred/headscale#528