Reduce failover time for subnet routers in HA setup #502

Closed
opened 2025-12-29 02:19:10 +01:00 by adam · 10 comments

Originally created by @vsychov on GitHub (May 9, 2023).

Feature request

I have been testing the subnet failover feature of the HA router as described in the Tailscale documentation: https://tailscale.com/kb/1115/subnet-failover/. I noticed that when there are two routers in the subnet advertising the same routes, and one of the routers goes down, it takes approximately 1 minute and ~10-15 seconds for traffic to start flowing through the backup router. As far as I can tell, 60 seconds of this delay is due to the keepAliveInterval, which is hardcoded to 60 seconds.

https://github.com/juanfont/headscale/blob/bab4e14828e36f3bf86f3d2a8ae55b84b996a672/protocol_common_poll.go#L14

I propose that this parameter be made configurable, and that we consider reducing the default value to 5 or 10 seconds to minimize failover time. What are your thoughts on this suggestion?

I can make a PR if you agree to move it to the config.

adam added the enhancement label 2025-12-29 02:19:10 +01:00
adam closed this issue 2025-12-29 02:19:10 +01:00

@juanfont commented on GitHub (May 10, 2023):

@vsychov sounds reasonable.

@kradalby and I are in a refactoring hackathon today, which includes a major restructuring of the repo.

I will do a PR to make keepAliveInterval configurable once we finish the code moves :)


@vsychov commented on GitHub (May 10, 2023):

@juanfont, I'm not sure how good of an idea this is, but it might work as well. When a connection with a client is lost, here:

https://github.com/juanfont/headscale/blob/9478c288f62b428348f57e8525126baef9955525/protocol_common_poll.go#L573-L581

We can check if we have other online nodes that announce the same route as the node with the broken connection, mark the current node as offline if other nodes are available, and switch the routes to an online node. This might lead to false positives in case of a short-term connection loss, but considering that we check the availability of other nodes, at least one node will be available.

This will help speed up the failover switch.
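A rough sketch of the idea, to make it concrete. The `Node` type, the `advertises` helper, and `failoverTargets` are hypothetical simplifications for illustration and do not reflect headscale's actual data model or route handling.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a subnet router.
type Node struct {
	Name   string
	Online bool
	Routes []string // advertised prefixes, e.g. "10.0.0.0/24"
}

// advertises reports whether n announces the given route.
func advertises(n Node, route string) bool {
	for _, r := range n.Routes {
		if r == route {
			return true
		}
	}
	return false
}

// failoverTargets maps each route of the disconnected node to the first
// other online node announcing the same route. Routes with no online
// alternative are omitted, so the caller can leave them on the original node.
func failoverTargets(lost Node, all []Node) map[string]string {
	targets := make(map[string]string)
	for _, route := range lost.Routes {
		for _, cand := range all {
			if cand.Name == lost.Name || !cand.Online {
				continue
			}
			if advertises(cand, route) {
				targets[route] = cand.Name
				break
			}
		}
	}
	return targets
}

func main() {
	a := Node{Name: "router-a", Online: false, Routes: []string{"10.0.0.0/24", "10.1.0.0/24"}}
	b := Node{Name: "router-b", Online: true, Routes: []string{"10.0.0.0/24"}}

	// Only 10.0.0.0/24 has an online alternative, so only it is switched.
	fmt.Println(failoverTargets(a, []Node{a, b})) // prints map[10.0.0.0/24:router-b]
}
```

Because a route is only moved when another online node announces it, a short-lived disconnect of the primary would at worst shift traffic to a working backup.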


@RaheelJameel commented on GitHub (Oct 31, 2023):

Any progress or update on this feature request?


@kradalby commented on GitHub (Oct 31, 2023):

I believe this will be addressed and improved when 0.23.0 lands, as part of #1564


@github-actions[bot] commented on GitHub (Jan 30, 2024):

This issue is stale because it has been open for 90 days with no activity.


@kradalby commented on GitHub (Feb 15, 2024):

Could you check this in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4 ?


@kradalby commented on GitHub (Feb 19, 2024):

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?


@kradalby commented on GitHub (Apr 17, 2024):

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?


@vsychov commented on GitHub (Apr 18, 2024):

Thanks @kradalby, I'll run tests today or tomorrow.


@kradalby commented on GitHub (May 24, 2024):

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue. Let me know if not and we will reopen it.
