Multi server support #266

Closed
opened 2025-12-29 01:25:19 +01:00 by adam · 22 comments

Originally created by @nabeelshaikh7 on GitHub (May 27, 2022).

Feature request
Multiple server support

A way to have multiple server nodes so that we can have a mesh and the user can connect to the nearest server.

I have multiple servers in different regions, so it would be better if I could have one node per region so that the latency would be low and the load would also be low.

Thanks!!

adam added the enhancement, out of scope labels 2025-12-29 01:25:19 +01:00
adam closed this issue 2025-12-29 01:25:19 +01:00

@enoperm commented on GitHub (May 31, 2022):

I have not tried it, but I suppose if you have a central database, sharing it across control servers may be possible. The downside is that doing so bypasses any application-level locks, which may introduce race conditions around machine registration/IP address allocation.

But then again, since the control server should only provide information about a network, and not forward any traffic by itself, I am not sure it is worth optimizing them for latency. If my interpretation is correct, "latency" in this sense would be "how quickly would existing clients observe a new node joining the network".
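
A minimal sketch of the race described above, assuming two headscale instances naively share one database and allocate addresses by reading the current maximum and then inserting; the table and column names are hypothetical, not headscale's actual schema:

```python
# Hypothetical illustration only; schema and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machines (machine_key TEXT, last_octet INTEGER)")

def allocate_ip_unsafe(machine_key: str) -> str:
    # 1. Read the highest octet handed out so far.
    row = conn.execute("SELECT MAX(last_octet) FROM machines").fetchone()
    next_octet = (row[0] or 1) + 1
    # 2. Race window: a second control server can run step 1 right now and
    #    compute the same next_octet, because nothing holds a lock here.
    conn.execute(
        "INSERT INTO machines (machine_key, last_octet) VALUES (?, ?)",
        (machine_key, next_octet),
    )
    conn.commit()
    return f"100.64.0.{next_octet}"

print(allocate_ip_unsafe("nodekey:example"))  # -> 100.64.0.2
```

A single control server avoids this through its own application-level locks, which is exactly what sharing the database bypasses.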

@kradalby commented on GitHub (Jun 12, 2022):

In principle, the amount of traffic going between the control server (headscale) and the Tailscale clients is not really affected by latency (unless it's so bad that things time out, but then I suspect you have other issues).

For DERP relays, lower latency could make sense, and you [can host those separately](https://tailscale.com/kb/1118/custom-derp-servers/) from headscale.

@kradalby commented on GitHub (Jun 12, 2022):

There are scenarios where multiple servers, and allowing them to connect to each other, would make sense:

  • Multiple headscale servers, allowing two companies/owners to share nodes between them
  • Redundancy
@enoperm commented on GitHub (Jun 12, 2022):

  • What consequences would nodes being shared across control servers have on ACLs and security in general? I think it can be done safely as long as the servers can keep track of where each tag/rule originates, but it sure sounds easier to screw up than a single central ACL.
  • As for redundancy, I think moving all state (including any locks around DB insertions) to the database server (how does gin handle transactions?) would allow the setup mentioned above, and it should be simpler than anything that would depend on control servers communicating directly.
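
A minimal sketch of what "moving the locks to the database server" could look like, assuming a shared PostgreSQL instance reached through psycopg2 and the same hypothetical table as in the earlier sketch; a transaction-scoped advisory lock serializes allocation across any number of control servers without them talking to each other:

```python
# Hypothetical sketch; table/column names are made up, not headscale's schema.
import psycopg2

ALLOC_LOCK_ID = 42  # arbitrary application-chosen advisory-lock key

def allocate_ip(dsn: str, machine_key: str) -> str:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            # Held until this transaction commits; any other control server
            # reaching this line simply blocks instead of racing.
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (ALLOC_LOCK_ID,))
            cur.execute("SELECT COALESCE(MAX(last_octet), 1) FROM machines")
            next_octet = cur.fetchone()[0] + 1
            cur.execute(
                "INSERT INTO machines (machine_key, last_octet) VALUES (%s, %s)",
                (machine_key, next_octet),
            )
        return f"100.64.0.{next_octet}"
    finally:
        conn.close()
```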

@enoperm commented on GitHub (Jun 12, 2022):

Though even if the database could be shared, one also needs to ensure the ACLs and the DERP map remain consistent across control servers.

@enoperm commented on GitHub (Jun 12, 2022):

This may sound crazy, but how about removing hard dependencies on exact config files and datastores/schemas, and letting users write their own behaviour in some glue/scripting language? As long as APIs are provided to them, they can decide what ACLs exist (for updates, just ask their script again); they'll know whether the rules they wish to give out are handcrafted, come from a config file or a database, or are generated on the fly. Same for nodes: instead of hardwiring the address allocation/node listing logic, call into their `machine_register` or `machine_enumerate` functions - this way they can share nodes, set up their own machine registration logic (this would allow any authentication machinery to be used, including external OIDC/SAML/Basic Auth/mTLS/Kerberos solutions, without the control server needing to care), and share users in any manner they wish.

The upside is that the control server becomes a lot simpler and a lot more flexible. The downside is that scripting one's own DB access and the like is easier to screw up than relying on something shipped with the control server, and now the control server really needs to get the admin-facing API right. I think the former can be balanced out to a high degree by providing high-quality samples and docs, but it is still more work for the user.
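
A minimal sketch of what such user-supplied hooks might look like, assuming the control server called into a glue script exposing the `machine_register` / `machine_enumerate` functions named above; the inventory URL, field names, and return shapes are all hypothetical and not part of headscale:

```python
# Hypothetical user-supplied glue script. The control server would call these
# hooks instead of hardwiring its own registration/enumeration logic.
import json
import urllib.request

INVENTORY_URL = "https://inventory.example.internal/machines"  # made-up backend

def machine_register(machine_key: str, hostname: str) -> dict:
    """Decide whether a node may join and which address/user it gets."""
    body = json.dumps({"key": machine_key, "hostname": hostname}).encode()
    req = urllib.request.Request(INVENTORY_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        record = json.load(resp)
    # e.g. {"allowed": true, "ip": "100.64.1.7", "user": "ops"}
    return record

def machine_enumerate() -> list[dict]:
    """Return every node the control server should advertise to peers."""
    with urllib.request.urlopen(INVENTORY_URL) as resp:
        return json.load(resp)
```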

@ciroiriarte commented on GitHub (Mar 24, 2023):

I definitely see the need for multisite/multiregion deployments for redundancy purposes (like Nebula lighthouses).

One site dying shouldn't take down communication for the rest of them.

@0n1cOn3 commented on GitHub (Apr 13, 2023):

I would welcome it as well. Our admin stack is struggling with outages and the headscale VM also keeps crashing. High availability would be extremely desirable. We could have one headscale server in Canada and a second one in Switzerland, and if one of them went down, for whatever reason, we could still continue to work. So far our master admin always has to fix the whole thing and work through a proxy that is not secured; we would like to prevent that. While headscale is down, clients that have restarted can't connect and work grinds to a halt.

@juanfont commented on GitHub (Apr 14, 2023):

Hi @0n1cOn3 :)

The main objective of Headscale is to provide a correct implementation of the Tailscale protocol & control server - for hobbyists and self-hosters. We might work on supporting HA setups in the future, but that's not a short-term goal.

For those kinds of requests I would recommend the official Tailscale.com SaaS + Tailnet Lock.

Or send us a PR :) PRs are always welcome!

@0n1cOn3 commented on GitHub (Apr 14, 2023):

Hi @juanfont

Thanks for your answer.
Yes, we (n64.cc) do self-hosting and don't want to rely on other people's "computers".

Maybe I'm just asking too much 😂 Unfortunately I'm not able to program, otherwise I would very much like to implement this somehow and make a PR. But as a hobby system/cloud administrator, I'm mostly left to rely on those who can program.

@kradalby commented on GitHub (May 10, 2023):

While we appreciate the suggestion, it is out of scope for this project and not something we will work on for now.

@ciroiriarte commented on GitHub (May 10, 2023):

![image](https://github.com/juanfont/headscale/assets/1750260/821a938f-f148-49f0-80b7-c6bc0bd3adff)

@0n1cOn3 commented on GitHub (May 10, 2023):

Thanks for the answer @kradalby

Too bad, because HA for Tailscale would certainly be a groundbreaking possibility.
Our community unfortunately has the problem that the main server running Tailscale randomly goes down again and again, which is how this idea arose.
I would welcome it if this idea were perhaps implemented at a later time.

Thank you very much.

@0n1cOn3 commented on GitHub (May 10, 2023):

Maybe there would be the possibility, if Tailscale is down, for a standby client to temporarily take over the authentication task, at least for the clients that are already logged in.
However, I see some challenges in addressing this.

@gucki commented on GitHub (Jun 26, 2023):

@0n1cOn3 What about using two VMs in different availability zones/datacenters, a floating/virtual IP for headscale, and a local Postgres master/slave setup? Use keepalived to control the failover.

@rallisf1 commented on GitHub (Aug 26, 2023):

> @0n1cOn3 What about using two VMs in different availability zones/datacenters, a floating/virtual IP for headscale, and a local Postgres master/slave setup? Use keepalived to control the failover.

Floating IPs only work within the same network. Anyway, since clients use an FQDN to connect, all you really need is:

  1. a health-check endpoint for headscale
  2. a cron job to sync the database and configuration over to your slave server
  3. the headscale domain hosted with a DNS provider that offers API access
  4. a script on both headscale servers to monitor the health of the other and change the A record of your headscale domain accordingly
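
A minimal sketch of items 1 and 4 combined, assuming the primary exposes a health endpoint (recent headscale versions serve one at `/health`; adjust if yours differs) and a generic DNS provider API; the URLs, record path, and standby address below are hypothetical placeholders, and a real deployment would also need authentication against the DNS API:

```python
# Hypothetical failover watcher run on the standby headscale server.
# It polls the primary's health endpoint and, after a few consecutive
# failures, points the headscale A record at this standby instead.
import json
import time
import urllib.error
import urllib.request

PRIMARY_HEALTH_URL = "https://headscale-a.example.com/health"
DNS_API_URL = "https://dns.example.com/api/zones/example.com/records/headscale"
STANDBY_IP = "203.0.113.20"
FAILURES_BEFORE_FAILOVER = 3

def primary_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def point_dns_at_standby() -> None:
    # Placeholder DNS provider call: set the A record to the standby's address.
    body = json.dumps({"type": "A", "content": STANDBY_IP, "ttl": 60}).encode()
    req = urllib.request.Request(DNS_API_URL, data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

failures = 0
while True:
    failures = 0 if primary_is_healthy() else failures + 1
    if failures >= FAILURES_BEFORE_FAILOVER:
        point_dns_at_standby()
        break
    time.sleep(30)
```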

@krlc commented on GitHub (Oct 23, 2023):

Has anyone implemented what @rallisf1 has outlined? What's your experience? Any pitfalls?

@0n1cOn3 commented on GitHub (Oct 23, 2023):

> Has anyone implemented what @rallisf1 has outlined? What's your experience? Any pitfalls?

Not yet. I have to speak with our main admin to test that.

@Capelinha commented on GitHub (Nov 21, 2023):

@0n1cOn3 Were you able to try this idea?

@0n1cOn3 commented on GitHub (Nov 26, 2023):

> @0n1cOn3 Were you able to try this idea?

Not yet. Our main admin has to perform this setup.

@ser commented on GitHub (Oct 22, 2025):

https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/

@unixfox commented on GitHub (Oct 22, 2025):

I wouldn't use LiteFS; it's in a "pause" state (https://community.fly.io/t/litefs-discontinued/23682) in favor of Litestream: https://fly.io/blog/litestream-revamped/

But the idea is good.

Though I have personally implemented the same idea with Patroni instead: PostgreSQL + Consul.
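
For reference, a minimal sketch of how a failover script or load balancer could locate the current primary in such a Patroni setup; Patroni's REST API (default port 8008) answers 200 on `/leader` only on the leader node, and the addresses below are placeholders:

```python
# Hypothetical check: find the node whose local PostgreSQL is currently the
# Patroni primary, so traffic (or the headscale DSN) can be pointed at it.
import urllib.error
import urllib.request

PATRONI_NODES = ["10.0.0.11", "10.0.0.12"]  # placeholder addresses

def current_leader() -> str | None:
    for host in PATRONI_NODES:
        try:
            url = f"http://{host}:8008/leader"
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return host
        except (urllib.error.URLError, OSError):
            continue  # non-leaders answer 503, unreachable nodes time out
    return None

if __name__ == "__main__":
    print("Patroni leader:", current_leader() or "none reachable")
```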
