[Feature] multiple replicas of headscale instances #1067

Open
opened 2025-12-29 02:28:04 +01:00 by adam · 15 comments

Originally created by @thebigbone on GitHub (Jul 18, 2025).

Use case

Currently, there is no option for running Headscale in a highly available way. If the single server goes down, the whole tailnet is unreachable.

Description

Adding an option to use multiple Headscale servers to distribute the load, as well as making sure every server syncs the config, so the tailnet is highly available. Are there any plans for such a feature?

Contribution

  • I can write the design doc for this feature
  • I can contribute this feature

How can it be implemented?

No response

adam added the enhancement label 2025-12-29 02:28:04 +01:00

@tiberiuv commented on GitHub (Jul 18, 2025):

I was able to run an HA setup of Headscale previously. It was in a Kubernetes environment, but it should be possible to replicate outside of it as well (a sketch follows below):

  • use an external Postgres DB
  • put a load balancer with support for sticky sessions in front of Headscale

That said, I don't think the control plane going down should affect the data path immediately, unless some endpoints change while it's unavailable.

There was also an earlier issue about the same topic: https://github.com/juanfont/headscale/issues/100
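For illustration, here is a minimal sketch of that setup outside Kubernetes. All hostnames, addresses, and credentials below are placeholders, and the config keys should be checked against the sample config shipped with your Headscale version. First, point Headscale at the external Postgres instance in config.yaml:

    # config.yaml fragment: use an external Postgres DB instead of the
    # local SQLite file (all values below are placeholders).
    database:
      type: postgres
      postgres:
        host: db.example.com
        port: 5432
        name: headscale
        user: headscale
        pass: change-me

Then put a TCP load balancer with source-IP stickiness in front of the replicas, for example along these lines with HAProxy:

    # HAProxy sketch: sticky (source-IP) TCP balancing across two
    # hypothetical Headscale replicas; addresses are placeholders.
    frontend headscale_in
        mode tcp
        bind *:443
        default_backend headscale

    backend headscale
        mode tcp
        balance roundrobin
        stick-table type ip size 100k expire 30m
        stick on src
        server hs1 10.0.0.11:443 check
        server hs2 10.0.0.12:443 check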

@x1arch commented on GitHub (Jul 19, 2025):

The tailnet continues working if the control server is down: you can't connect new nodes, but nodes that are already connected keep working.

You can use an external DB and one of the following:

  • Round-robin DNS - assign a few IPs to your domain; not sure, but it may work (free; needs a domain)
  • DynDNS (for home or SMB) - the IP for the domain is changed via an API request (cheap or free; needs a custom script, sketched below; there is a time lag while DNS caches expire, but for a dynamic record you can set a TTL of 60, so the maximum control-plane downtime will be 60 seconds)
  • Virtual IP (for SMB/business) - the IP can be assigned dynamically between a few nodes (expensive; you pay roughly $5-$15 per month for the IP, and the bandwidth will be limited, though I believe more than enough for Headscale if you don't use it as a relay)
  • External gateway (like Cloudflare) - it can route all requests to your server in many ways, and it has Cloudflared, a tunnel to your backends, if needed (free; needs a domain)

Anyway, it would be great to have an embedded HA mechanism. I believe it is not rocket science: just provide the client with a list of backends (this change needs to go into the client) and try to connect to them one by one.
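A rough sketch of the kind of custom script the dynDNS option needs, run from cron on a watchdog host. The health URL, update endpoint, and credentials are placeholders, the update API is provider-specific, and it assumes the primary answers on Headscale's /health endpoint:

    #!/usr/bin/env bash
    # Hypothetical dynDNS failover: repoint a low-TTL DNS record at the
    # backup server when the primary stops answering.
    PRIMARY_HEALTH="https://headscale-a.example.com/health"
    BACKUP_IP="203.0.113.20"
    UPDATE_URL="https://dyndns.example.com/nic/update"   # provider-specific

    if ! curl -fsS --max-time 5 "$PRIMARY_HEALTH" >/dev/null; then
        # With a TTL of 60 on the record, clients should reach the
        # backup within about a minute of the primary going down.
        curl -fsS -u "user:pass" \
            "$UPDATE_URL?hostname=headscale.example.com&myip=$BACKUP_IP"
    fi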

@lucasfcnunes commented on GitHub (Jul 19, 2025):

@tiberiuv @x1arch do you guys use headplane (https://github.com/tale/headplane) with multiple replicas (>= 2)?

@x1arch commented on GitHub (Jul 19, 2025):

Nope, but I don't see any problem with any of these variants, because at any given moment only one Headscale control server will be serving.

P.S. Even if there is a problem with running two control servers at the same time, you can fix it with a script that checks your first server and starts the second one only if the first is down (sketch below).
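A minimal sketch of such a cold-standby loop, run on the second server. Hostnames are placeholders, and it assumes Headscale runs as a systemd service and exposes /health on the primary:

    #!/usr/bin/env bash
    # Start the local headscale only while the primary is unreachable,
    # so that only one control server is ever active at a time.
    PRIMARY_HEALTH="https://headscale-a.example.com/health"
    while true; do
        if curl -fsS --max-time 5 "$PRIMARY_HEALTH" >/dev/null; then
            systemctl is-active --quiet headscale && systemctl stop headscale
        else
            systemctl is-active --quiet headscale || systemctl start headscale
        fi
        sleep 15
    done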

@thebigbone commented on GitHub (Jul 20, 2025):

> The tailnet continues working if the control server is down: you can't connect new nodes, but nodes that are already connected keep working. […] Anyway, it would be great to have an embedded HA mechanism.

Does the whole state reside in the DB? Are there any other files, apart from the main config.yaml, that need to be accounted for?

@thebigbone commented on GitHub (Jul 20, 2025):

> Nope, but I don't see any problem with any of these variants, because at any given moment only one Headscale control server will be serving. […]

I personally lose the connection to the tailnet when the Headscale server goes down. I don't know how your peers stay connected and are able to communicate. It's a direct connection, no DERP.

@x1arch commented on GitHub (Jul 20, 2025):

> I personally lose the connection to the tailnet when the Headscale server goes down. I don't know how your peers stay connected and are able to communicate. It's a direct connection, no DERP.

That's really weird, because the nodes have direct connections to each other. When my Headscale goes down, direct connections keep working; not all of them, but most. I was sure the problem was in DERP.

Right now I have stopped my Headscale server, and my uptime monitor keeps pinging all hosts without any problems; the phone does too. Headscale is only the management server.

@thebigbone commented on GitHub (Jul 20, 2025):

> That's really weird, because the nodes have direct connections to each other. When my Headscale goes down, direct connections keep working; not all of them, but most. […]

OK, you might be right. Some peers stay online, while only a few go offline, even though all of them are connected directly.

@kradalby commented on GitHub (Jul 23, 2025):

> Adding an option to use multiple Headscale servers to distribute the load, as well as making sure every server syncs the config, so the tailnet is highly available. Are there any plans for such a feature?

The short answer is no. But I will break it down a little more and add some reasoning.

> making sure every server syncs

This quickly complicates the server, potentially by an order of magnitude, making simple bugs hard to debug and harder bugs even harder. It is feasible for some kinds of systems, particularly if you have a lot of developers, but as I see it, the net gain for a project like this just does not make sense.

In addition, it significantly complicates your runtime setup: you now need to ensure you have multiple replicas and databases, there might be split-brain issues, and recovery becomes harder.

What I think people should focus on in their strategy is:

The Tailscale client has a lot of redundancy built in. As mentioned above, if the client has an up-to-date map, everything generally continues to work as long as not too many nodes move at the same time. Technically, even if the nodes on one side move, it could still work, as one node can reach the other to establish the connection.
This has been built in since the beginning, and I believe Tailscale itself runs quite a simple setup where this is a key part of the strategy.
That said, there might be differences between Headscale and Tailscale here, where we do not implement everything correctly so some of these things don't work, and of course we should continue to improve that.

To the last point, this means that a 5-10 minute, or even an hour-long, outage should not be noticed much as long as you're not "changing the shape of your network", i.e. too much movement, new nodes, etc.
To this I will say that instead of an HA setup, you can more easily focus on a much simpler "how quickly can I recover or replace my server?" setup.

A minimum setup for this should allow you to recover your Headscale from a backup in minutes (a sketch follows the list):

  • DNS or a virtual IP, pointed to a new VM
  • Restore the SQLite database and config from backup
  • Install Headscale and start it up
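As a sketch, that whole recovery could be one short script on the replacement VM. The paths are the packaged defaults and the backup location is an assumption; adjust both to your install:

    #!/usr/bin/env bash
    # Restore Headscale from a backup onto a fresh VM, then start it.
    set -euo pipefail
    systemctl stop headscale 2>/dev/null || true
    cp /backup/headscale/config.yaml /etc/headscale/config.yaml
    cp /backup/headscale/db.sqlite /var/lib/headscale/db.sqlite
    systemctl enable --now headscale
    # Finally, repoint DNS or the virtual IP at this VM.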

@blinkinglight commented on GitHub (Aug 31, 2025):

You can back up SQLite easily and restore it on start with https://litestream.io/reference/, or with something like https://github.com/reneleonhardt/harmonylite (to replicate SQLite via nats.io).
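For example, a minimal litestream.yml along these lines replicates the database continuously; the bucket name and paths are placeholders:

    # litestream.yml sketch: continuously replicate the SQLite DB to S3.
    dbs:
      - path: /var/lib/headscale/db.sqlite
        replicas:
          - url: s3://my-headscale-backups/db

On boot, something like "litestream restore -if-replica-exists /var/lib/headscale/db.sqlite" brings the database back before Headscale starts.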

@dzervas commented on GitHub (Aug 31, 2025):

There are also dqlite and rqlite. I think rqlite does not need code changes and can be used as a drop-in replacement.

@gawsoftpl commented on GitHub (Sep 8, 2025):

> To this I will say that instead of an HA setup, you can more easily focus on a much simpler "how quickly can I recover or replace my server?" setup. […]

I spent a while over the weekend looking for a solution to the high-availability problem. I wrote up how to set up automatic replication of SQLite and automatic failover to a replica node, and posted it all on my blog:
https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

@anthonyrisinger commented on GitHub (Oct 11, 2025):

@kradalby I'm a bit confused by the comments here and by the general decision to support Postgres in a way that's overtly precarious (maintenance mode). I've been prototyping for a good-sized project over the last few weeks, one that would almost certainly lead to direct or indirect support, and Headscale is the obvious place/base to start from; early patches would probably be scalability-related. But it becomes a significantly tougher sell, to both my own good conscience and the wider engineering org, when I have to explain that only SQLite is supported (a great DB, but quite challenging in stateless/ephemeral environments) and that there's little interest in HA either.

I guess what I'm trying to say is that this approach might be leaving money/time/skills/contributors/??? on the table. I 100% understand and respect the desire to Keep It Simple; alas, if I say out loud in a tech review that "the Headscale control plane could be down for minutes plural, and this is an acceptable outcome because there is little appetite for fixing this upstream", Headscale is likely to get (reasonably?) dismissed, and Nice Things never happen.

Since I still feel it's probably the best base for me to start from, what would it take to solidify your thinking around the handful of important things with respect to improving core scalability? In my mind, the two most important pieces are full support for externalized database state and the ability to run multiple copies without much fuss.

I also understand there are likely to be other issues around rebuilding the "world map", and I'd hoped to cross that bridge when the time comes; my fear is that I won't be allowed to cross said bridge because there will be no interest in what I have to offer.

@almereyda commented on GitHub (Oct 15, 2025):

What about reconsidering the distributed SQLite flavours proposed earlier by @dzervas?

The dqlite fork cowsql (https://github.com/cowsql/cowsql/) is happily used in production, e.g. by the Go-based Incus project (https://github.com/lxc/incus/).

@ksemele-public commented on GitHub (Nov 6, 2025):

I will also note that this is an important feature for me. Regardless of other considerations, it turns out that Headscale, in its current form, is a tool for hobbies, not for production use, which is sad. (Or you have to rely on homemade solutions, or accept the risks.)

I recall Kafka, Postgres, and even Kubernetes itself, which all solve the high-availability problem successfully. Maybe it's worth looking for some elegant and simple approaches in the open-source community?

I even studied how to use leases for my experiments with operators in k8s, and it doesn't look like rocket science... for a k8s installation, of course. But without external database support in Headscale itself, it's unlikely even a good "quick and dirty" solution can be implemented.
