Mirror of https://github.com/juanfont/headscale.git (synced 2026-01-12 04:10:32 +01:00)
[Feature] multiple replicas of headscale instances #1067
Open · opened 2025-12-29 02:28:04 +01:00 by adam · 15 comments
Originally created by @thebigbone on GitHub (Jul 18, 2025).
Use case
Currently, there is no option for running Headscale in a highly available way. If the single server goes down, the whole tailnet is unreachable.
Description
Adding an option to run multiple Headscale servers would distribute the load and, with every server syncing its configuration, keep the tailnet highly available. Are there any plans for such a feature?
Contribution
How can it be implemented?
No response
@tiberiuv commented on GitHub (Jul 18, 2025):
I was able to run an HA setup of Headscale previously. It was in a Kubernetes environment, but it should be possible to replicate it outside of one as well.
That said, I don't think the control plane going down should affect the data path immediately, unless some endpoints change while it is unavailable.
There was also an earlier issue about the same problem: https://github.com/juanfont/headscale/issues/100
@x1arch commented on GitHub (Jul 19, 2025):
The tailnet continues to work if the control server is down: you can't connect new nodes, but nodes that are already connected keep working.
You can use external DB and
Anyway, it would be great to have a built-in HA mechanism. I believe it is not rocket science: just give the client a list of backends and have it try to connect to them one by one (this change would also need to be applied to the client).
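As a rough illustration of that backend-list idea (no such client option exists today), here is a minimal Go sketch that probes a list of control server URLs in order and returns the first one answering its health endpoint. The URLs and timeout are made up, and it assumes an endpoint like Headscale's `/health`:

```go
// Hypothetical sketch of the "list of backends" idea: probe each control
// server URL in order and return the first one whose health endpoint answers.
// URLs and timeout are illustrative assumptions.
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

func firstHealthy(controlURLs []string) (string, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	for _, u := range controlURLs {
		resp, err := client.Get(u + "/health")
		if err != nil {
			continue // unreachable, try the next backend
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return u, nil
		}
	}
	return "", errors.New("no healthy control server found")
}

func main() {
	url, err := firstHealthy([]string{
		"https://headscale-1.example.com",
		"https://headscale-2.example.com",
	})
	if err != nil {
		fmt.Println("no healthy control server:", err)
		return
	}
	fmt.Println("would use control server:", url)
}
```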
@lucasfcnunes commented on GitHub (Jul 19, 2025):
@tiberiuv @x1arch do you guys use Headplane with multiple replicas (>= 2)?
@x1arch commented on GitHub (Jul 19, 2025):
Nope, but I don't see a problem with any of these variants, because at any given moment only one Headscale control server will be active.
PS: even if running two control servers at the same time turns out to be a problem, you can work around it with a script that checks your first server and starts the second one only if the first is down.
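A rough Go sketch of that watchdog script, assuming the standby host can reach the primary's `/health` endpoint; the URL, thresholds, and plain `headscale serve` invocation are assumptions, and a real setup would also need fencing so two servers never run against the same state at once:

```go
// Rough sketch of the failover script described above: poll the primary's
// health endpoint and start the local standby headscale only after the
// primary has been unreachable for several consecutive checks.
// URL, binary invocation, and thresholds are assumptions; this ignores
// split-brain protection entirely.
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

func primaryHealthy(url string) bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(url + "/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	const primary = "https://headscale-primary.example.com"
	failures := 0
	for {
		if primaryHealthy(primary) {
			failures = 0
		} else {
			failures++
			log.Printf("primary unhealthy (%d consecutive failures)", failures)
		}
		if failures >= 3 {
			log.Println("starting standby headscale")
			cmd := exec.Command("headscale", "serve")
			if err := cmd.Run(); err != nil {
				log.Printf("standby exited: %v", err)
			}
			failures = 0 // re-check the primary after the standby stops
		}
		time.Sleep(10 * time.Second)
	}
}
```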
@thebigbone commented on GitHub (Jul 20, 2025):
Does the whole state reside in the DB? Are there any other files, apart from the main config.yaml, that need to be accounted for?
@thebigbone commented on GitHub (Jul 20, 2025):
I personally lose the connection to the tailnet when the Headscale server goes down. I don't know how your peers stay connected and keep communicating. It's a direct connection, no DERP.
@x1arch commented on GitHub (Jul 20, 2025):
It's really weird, because the nodes have direct connections to each other. When my Headscale goes down, the direct connections keep working, not all of them, but most; I was sure the problem was in DERP.
Right now I have stopped my Headscale server and the uptime monitor keeps pinging all hosts without any problems, the phone too. Headscale is only the management server.
@thebigbone commented on GitHub (Jul 20, 2025):
OK, you might be right. Some peers stay online, while very few go offline, even though all of them are connected directly.
@kradalby commented on GitHub (Jul 23, 2025):
The short answer is no. But I will break it down a little more and add some reasoning.
This quickly complicates the server, potentially by an order of magnitude, making simple bugs hard to debug and hard bugs even harder. It is feasible for some kinds of systems, particularly if you have a lot of developers, but as I see it, the net gain for a project like this just does not make sense.
In addition, it significantly complicates your runtime setup: you now need to run multiple replicas and databases, there might be split-brain issues, and recovery becomes harder.
What I think people should focus on in their strategy is:
The Tailscale client has a lot of redundancy built in. As mentioned above, if the client has an up-to-date map, everything generally continues to work as long as not too many nodes move at the same time. Technically, even if the nodes on one side move, it could still work, as one node can reach the other to establish the connection.
This has been built in since the beginning, and I believe Tailscale itself runs quite a simple setup where this is a key part of the strategy.
That said, there might be differences between Headscale and Tailscale here, where we do not implement everything correctly so not all of these things work, and of course we should continue to improve that.
On the last point, this means that a 5-10 minute, or even an hour-long, outage should not be noticed much as long as you are not "changing the shape of your network", i.e. too much movement, new nodes, etc.
Given that, instead of an HA setup, you can more easily focus on a much simpler question: "how quickly can I recover or replace my server?"
A minimum setup for this should allow you to recover your Headscale instance from a backup in minutes:
@blinkinglight commented on GitHub (Aug 31, 2025):
You can back up SQLite easily and restore on start with https://litestream.io/reference/, or with something like https://github.com/reneleonhardt/harmonylite (to replicate SQLite via nats.io).
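Whichever tool is used, a plain consistent snapshot can also be taken with SQLite's online `.backup` command. A minimal Go sketch that shells out to the sqlite3 CLI, with the default database path and the backup directory as assumptions:

```go
// Minimal snapshot sketch: use the sqlite3 CLI's online ".backup" command to
// take a consistent copy of the headscale database while the server runs.
// The database path and backup directory are assumptions.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	src := "/var/lib/headscale/db.sqlite"
	dst := fmt.Sprintf("/backups/headscale-%s.sqlite", time.Now().Format("20060102-150405"))

	// Equivalent to: sqlite3 /var/lib/headscale/db.sqlite ".backup '<dst>'"
	cmd := exec.Command("sqlite3", src, fmt.Sprintf(".backup '%s'", dst))
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("backup failed: %v: %s", err, out)
	}
	log.Printf("wrote backup to %s", dst)
}
```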
@dzervas commented on GitHub (Aug 31, 2025):
There are also dqlite and rqlite. I think rqlite does not need code changes and can be used as a drop-in replacement.
@gawsoftpl commented on GitHub (Sep 8, 2025):
I spent a while over the weekend looking for a solution to the high-availability problem.
I wrote up how to set up automatic replication of SQLite and automatic failover to a replica node.
I posted it all on my blog:
https://gawsoft.com/blog/headscale-litefs-consul-replication-failover/
@anthonyrisinger commented on GitHub (Oct 11, 2025):
@kradalby I'm a bit confused by the comments here and by the general decision to support Postgres in an overtly precarious way (maintenance mode). I've been prototyping for a good-sized project over the last few weeks, one that would almost certainly lead to direct or indirect support; Headscale is the obvious place/base to start from, and early patches would probably be scalability-related. But it becomes a significantly tougher sell, both to my own conscience and to the wider engineering org, when I have to explain that only SQLite is supported (a great DB, but quite challenging in stateless/ephemeral environments) and that there is little interest in HA.
I guess what I'm trying to say is that this approach might be leaving money/time/skills/contributors/??? on the table. I 100% understand and respect the desire to Keep It Simple; alas, if I say out loud in a tech review that "the Headscale control plane could be down for minutes, plural, and this is an acceptable outcome because there is little appetite for fixing it upstream", Headscale is likely to get (reasonably?) dismissed, and Nice Things never happen.
Since I still feel it is probably the best base for me to start from, what would it take to solidify your thinking around the handful of important things with respect to improving core scalability? In my mind, the two most important pieces are full support for externalized database state and the ability to run multiple copies without much fuss.
I also understand there are likely to be other issues around rebuilding the "world map", and I had hoped to cross that bridge when the time comes; my fear is that I won't be allowed to cross said bridge because there will be no interest in what I have to offer.
@almereyda commented on GitHub (Oct 15, 2025):
What about reconsidering the distributed SQLite flavours proposed earlier by @dzervas?
The dqlite fork cowsql is happily in use for production use cases, e.g. by the Go-based Incus project.
@ksemele-public commented on GitHub (Nov 6, 2025):
I will also note that this is an important feature for me. Regardless of other conditions, it turns out that Headscale, in its current form, is a tool for hobby use rather than for production, which is sad. (Or you have to use homemade solutions or accept the risks.)
I'm reminded of Kafka, Postgres, and even Kubernetes itself, which successfully solve high availability. Maybe it's worth looking for some elegant and simple approaches in the open-source community?
I have even studied how to use leases for my experiments with operators in Kubernetes, and it doesn't look like rocket science... for a Kubernetes installation, of course. But without external database support from Headscale itself, it's unlikely that a good "quick and dirty" solution can be implemented.
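For the Kubernetes lease idea, an active/passive wrapper could use client-go's Lease-based leader election so that only one replica runs `headscale serve` at a time. A hedged sketch, with the namespace, lease name, and timings as assumptions; it only addresses "one active control server", not database replication:

```go
// Sketch of Lease-based leader election with client-go: every replica runs
// this wrapper, but only the current leader starts headscale. Namespace,
// lease name, and timings are assumptions.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "headscale-leader", Namespace: "headscale"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader runs the control server.
				cmd := exec.CommandContext(ctx, "headscale", "serve")
				cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
				if err := cmd.Run(); err != nil {
					log.Printf("headscale exited: %v", err)
				}
			},
			OnStoppedLeading: func() {
				// Lost the lease: exit so the pod restarts as a follower.
				log.Println("lost leadership, exiting")
				os.Exit(1)
			},
		},
	})
}
```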