[Feature] multiple replicas of headscale instances #1067

Open
opened 2025-12-29 02:28:04 +01:00 by adam · 15 comments

Originally created by @thebigbone on GitHub (Jul 18, 2025).

Use case

Currently, there is no option for running Headscale in a highly available way. If the single server goes down, the whole tailnet is unreachable.

Description

Adding an option to use multiple Headscale servers to distribute the load, as well as making sure every server syncs the config, so the tailnet is highly available. Are there any plans for such a feature?

Contribution

  • I can write the design doc for this feature
  • I can contribute this feature

How can it be implemented?

No response

adam added the enhancement label 2025-12-29 02:28:04 +01:00

@tiberiuv commented on GitHub (Jul 18, 2025):

I was able to run an HA setup of Headscale previously. It was in a Kubernetes environment, but it should be possible to replicate outside of it as well (a sketch follows below):

  • use an external Postgres DB
  • put a load balancer with support for sticky sessions in front of Headscale

That said, I don't think the control plane going down should affect the data path immediately, unless some endpoints change while it's unavailable.

There was also an earlier issue about the same topic: https://github.com/juanfont/headscale/issues/100
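For illustration, here is a minimal sketch of that setup outside Kubernetes. All hostnames, addresses, and credentials below are placeholders, and the config keys should be checked against the sample config shipped with your Headscale version. First, point Headscale at the external Postgres instance in config.yaml:

    # config.yaml fragment: use an external Postgres DB instead of the
    # local SQLite file (all values below are placeholders).
    database:
      type: postgres
      postgres:
        host: db.example.com
        port: 5432
        name: headscale
        user: headscale
        pass: change-me

Then put a TCP load balancer with source-IP stickiness in front of the replicas, for example along these lines with HAProxy:

    # HAProxy sketch: sticky (source-IP) TCP balancing across two
    # hypothetical Headscale replicas; addresses are placeholders.
    frontend headscale_in
        mode tcp
        bind *:443
        default_backend headscale

    backend headscale
        mode tcp
        balance roundrobin
        stick-table type ip size 100k expire 30m
        stick on src
        server hs1 10.0.0.11:443 check
        server hs2 10.0.0.12:443 check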

@x1arch commented on GitHub (Jul 19, 2025):

The tailnet continues working if the control server is down: you can't connect new nodes, but nodes that are already connected keep working.

You can use an external DB and one of the following:

  • Round-robin DNS - assign a few IPs to your domain; not sure, but it may work (free; needs a domain)
  • DynDNS (for home or SMB) - the IP for the domain is changed via an API request (cheap or free; needs a custom script, sketched below; there is a time lag while DNS caches expire, but for a dynamic record you can set a TTL of 60, so the maximum control-plane downtime will be 60 seconds)
  • Virtual IP (for SMB/business) - the IP can be assigned dynamically between a few nodes (expensive; you pay roughly $5-$15 per month for the IP, and the bandwidth will be limited, though I believe more than enough for Headscale if you don't use it as a relay)
  • External gateway (like Cloudflare) - it can route all requests to your server in many ways, and it has Cloudflared, a tunnel to your backends, if needed (free; needs a domain)

Anyway, it would be great to have an embedded HA mechanism. I believe it is not rocket science: just provide the client with a list of backends (this change needs to go into the client) and try to connect to them one by one.
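A rough sketch of the kind of custom script the dynDNS option needs, run from cron on a watchdog host. The health URL, update endpoint, and credentials are placeholders, the update API is provider-specific, and it assumes the primary answers on Headscale's /health endpoint:

    #!/usr/bin/env bash
    # Hypothetical dynDNS failover: repoint a low-TTL DNS record at the
    # backup server when the primary stops answering.
    PRIMARY_HEALTH="https://headscale-a.example.com/health"
    BACKUP_IP="203.0.113.20"
    UPDATE_URL="https://dyndns.example.com/nic/update"   # provider-specific

    if ! curl -fsS --max-time 5 "$PRIMARY_HEALTH" >/dev/null; then
        # With a TTL of 60 on the record, clients should reach the
        # backup within about a minute of the primary going down.
        curl -fsS -u "user:pass" \
            "$UPDATE_URL?hostname=headscale.example.com&myip=$BACKUP_IP"
    fi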

@lucasfcnunes commented on GitHub (Jul 19, 2025):

@tiberiuv @x1arch do you guys use headplane (https://github.com/tale/headplane) with multiple replicas (>= 2)?

@x1arch commented on GitHub (Jul 19, 2025):

Nope, but I don't see any problem with any of these variants, because at any given moment only one Headscale control server will be serving.

P.S. Even if there is a problem with running two control servers at the same time, you can fix it with a script that checks your first server and starts the second one only if the first is down (sketch below).
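A minimal sketch of such a cold-standby loop, run on the second server. Hostnames are placeholders, and it assumes Headscale runs as a systemd service and exposes /health on the primary:

    #!/usr/bin/env bash
    # Start the local headscale only while the primary is unreachable,
    # so that only one control server is ever active at a time.
    PRIMARY_HEALTH="https://headscale-a.example.com/health"
    while true; do
        if curl -fsS --max-time 5 "$PRIMARY_HEALTH" >/dev/null; then
            systemctl is-active --quiet headscale && systemctl stop headscale
        else
            systemctl is-active --quiet headscale || systemctl start headscale
        fi
        sleep 15
    done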

@thebigbone commented on GitHub (Jul 20, 2025):

> The tailnet continues working if the control server is down: you can't connect new nodes, but nodes that are already connected keep working. […] Anyway, it would be great to have an embedded HA mechanism.

Does the whole state reside in the DB? Are there any other files, apart from the main config.yaml, that need to be accounted for?

@thebigbone commented on GitHub (Jul 20, 2025):

> Nope, but I don't see any problem with any of these variants, because at any given moment only one Headscale control server will be serving. […]

I personally lose the connection to the tailnet when the Headscale server goes down. I don't know how your peers stay connected and are able to communicate. It's a direct connection, no DERP.

@x1arch commented on GitHub (Jul 20, 2025):

> I personally lose the connection to the tailnet when the Headscale server goes down. I don't know how your peers stay connected and are able to communicate. It's a direct connection, no DERP.

That's really weird, because the nodes have direct connections to each other. When my Headscale goes down, direct connections keep working; not all of them, but most. I was sure the problem was in DERP.

Right now I have stopped my Headscale server, and my uptime monitor keeps pinging all hosts without any problems; the phone does too. Headscale is only the management server.

@thebigbone commented on GitHub (Jul 20, 2025):

> That's really weird, because the nodes have direct connections to each other. When my Headscale goes down, direct connections keep working; not all of them, but most. […]

OK, you might be right. Some peers stay online, while only a few go offline, even though all of them are connected directly.

@kradalby commented on GitHub (Jul 23, 2025):

> Adding an option to use multiple Headscale servers to distribute the load, as well as making sure every server syncs the config, so the tailnet is highly available. Are there any plans for such a feature?

The short answer is no. But I will break it down a little more and add some reasoning.

> making sure every server syncs

This quickly complicates the server, potentially by an order of magnitude, making simple bugs hard to debug and harder bugs even harder. It is feasible for some kinds of systems, particularly if you have a lot of developers, but as I see it, the net gain for a project like this just does not make sense.

In addition, it significantly complicates your runtime setup: you now need to ensure you have multiple replicas and databases, there might be split-brain issues, and recovery becomes harder.

What I think people should focus on in their strategy is:

The Tailscale client has a lot of redundancy built in. As mentioned above, if the client has an up-to-date map, everything generally continues to work as long as not too many nodes move at the same time. Technically, even if the nodes on one side move, it could still work, as one node can reach the other to establish the connection.
This has been built in since the beginning, and I believe Tailscale itself runs quite a simple setup where this is a key part of the strategy.
That said, there might be differences between Headscale and Tailscale here, where we do not implement everything correctly so some of these things don't work, and of course we should continue to improve that.

To the last point, this means that a 5-10 minute, or even an hour-long, outage should not be noticed much as long as you're not "changing the shape of your network", i.e. too much movement, new nodes, etc.
To this I will say that instead of an HA setup, you can more easily focus on a much simpler "how quickly can I recover or replace my server?" setup.

A minimum setup for this should allow you to recover your Headscale from a backup in minutes (a sketch follows the list):

  • DNS or a virtual IP, pointed to a new VM
  • Restore the SQLite database and config from backup
  • Install Headscale and start it up
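As a sketch, that whole recovery could be one short script on the replacement VM. The paths are the packaged defaults and the backup location is an assumption; adjust both to your install:

    #!/usr/bin/env bash
    # Restore Headscale from a backup onto a fresh VM, then start it.
    set -euo pipefail
    systemctl stop headscale 2>/dev/null || true
    cp /backup/headscale/config.yaml /etc/headscale/config.yaml
    cp /backup/headscale/db.sqlite /var/lib/headscale/db.sqlite
    systemctl enable --now headscale
    # Finally, repoint DNS or the virtual IP at this VM.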

@blinkinglight commented on GitHub (Aug 31, 2025):

You can back up SQLite easily and restore it on start with https://litestream.io/reference/, or with something like https://github.com/reneleonhardt/harmonylite (to replicate SQLite via nats.io).
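For example, a minimal litestream.yml along these lines replicates the database continuously; the bucket name and paths are placeholders:

    # litestream.yml sketch: continuously replicate the SQLite DB to S3.
    dbs:
      - path: /var/lib/headscale/db.sqlite
        replicas:
          - url: s3://my-headscale-backups/db

On boot, something like "litestream restore -if-replica-exists /var/lib/headscale/db.sqlite" brings the database back before Headscale starts.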

@dzervas commented on GitHub (Aug 31, 2025):

There are also dqlite and rqlite. I think rqlite does not need code changes and can be used as a drop-in replacement.

@gawsoftpl commented on GitHub (Sep 8, 2025):

> To this I will say that instead of an HA setup, you can more easily focus on a much simpler "how quickly can I recover or replace my server?" setup. […]

I spent a while over the weekend looking for a solution to the high-availability problem. I wrote up how to set up automatic replication of SQLite and automatic failover to a replica node, and posted it all on my blog:
https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

@anthonyrisinger commented on GitHub (Oct 11, 2025):

@kradalby I'm a bit confused by the comments here and by the general decision to support Postgres in a way that's overtly precarious (maintenance mode). I've been prototyping for a good-sized project over the last few weeks, one that would almost certainly lead to direct or indirect support, and Headscale is the obvious place/base to start from; early patches would probably be scalability-related. But it becomes a significantly tougher sell, to both my own good conscience and the wider engineering org, when I have to explain that only SQLite is supported (a great DB, but quite challenging in stateless/ephemeral environments) and that there's little interest in HA either.

I guess what I'm trying to say is that this approach might be leaving money/time/skills/contributors/??? on the table. I 100% understand and respect the desire to Keep It Simple; alas, if I say out loud in a tech review that "the Headscale control plane could be down for minutes plural, and this is an acceptable outcome because there is little appetite for fixing this upstream", Headscale is likely to get (reasonably?) dismissed, and Nice Things never happen.

Since I still feel it's probably the best base for me to start from, what would it take to solidify your thinking around the handful of important things with respect to improving core scalability? In my mind, the two most important pieces are full support for externalized database state and the ability to run multiple copies without much fuss.

I also understand there are likely to be other issues around rebuilding the "world map", and I'd hoped to cross that bridge when the time comes; my fear is that I won't be allowed to cross said bridge because there will be no interest in what I have to offer.

@almereyda commented on GitHub (Oct 15, 2025):

What about reconsidering the distributed SQLite flavours proposed earlier by @dzervas?

The dqlite fork cowsql (https://github.com/cowsql/cowsql/) is happily used in production, e.g. by the Go-based Incus project (https://github.com/lxc/incus/).

@ksemele-public commented on GitHub (Nov 6, 2025):

I will also note that this is an important feature for me. Regardless of other considerations, it turns out that Headscale, in its current form, is a tool for hobbies, not for production use, which is sad. (Or you have to rely on homemade solutions, or accept the risks.)

I recall Kafka, Postgres, and even Kubernetes itself, which all solve the high-availability problem successfully. Maybe it's worth looking for some elegant and simple approaches in the open-source community?

I even studied how to use leases for my experiments with operators in k8s, and it doesn't look like rocket science... for a k8s installation, of course. But without external database support in Headscale itself, it's unlikely even a good "quick and dirty" solution can be implemented.
