Steps to make headscale production ready #549

adam · 2025-12-29T02:19:51+01:00

adam commented

2025-12-29 02:19:51 +01:00

Originally created by @rallisf1 on GitHub (Aug 26, 2023).

Why

I know this is a hobbyist implementation of the Tailscale server, but you have already implemented much of the original functionality and are also taking steps to improve the codebase (e.g. #1473). I'm not sure if the goal is to be 1:1 compatible with Tailscale; frankly, you can't be (e.g. Funnels), but that doesn't mean headscale can't become a serious open source competitor.

Description

So IMHO these are the things needed to make headscale production ready:

1. Get rid of sqlite. Sure, it is easy, but it is not scalable. I'd go with mongodb (see 5.)
2. Get rid of JSON/YAML for ACL/Config and store the rules in the database (p.s. you'd still need some initial configuration; just use the db after that)
3. Add more API endpoints for groups, acl, routes, healthcheck, dns, reload etc.
4. Embed a GUI. I currently use https://github.com/gurucomputing/headscale-ui, which is nice and functional. With a simple backend for Auth it should serve just fine.
5. High availability: By using an easily clustered db (not to mention Atlas is really cheap for a load like this) and having a healthcheck endpoint, all someone needs to create a network of headscale servers is a load balancer.

adam added the enhancement label 2025-12-29 02:19:51 +01:00

adam closed this issue

2025-12-29 02:19:51 +01:00

adam commented

2025-12-29 02:19:51 +01:00

@juanfont commented on GitHub (Aug 27, 2023):

Hi,

Thanks for opening this.

1. We support sqlite and PostgreSQL. Any current bottleneck headscale might have is in our code, definitely not in those world-class DBs. Not sure how Mongo would help :/
2. Keeping those outside the DB is very much on purpose, as they can be put in a version control system like git and have their changes tracked, similar to what you would do in Tailscale's SaaS. Some kind of GitOps flow can also be done this way.
3. The CLI talks to the server via a gRPC API. All those endpoints are also available in a REST API. If something can't be done via the CLI it can of course be implemented.
4. Those are external projects that we do not have the bandwidth to support. They are linked here: https://headscale.net/web-ui/
5. Currently the issue is on the headscale side, as only one server can keep track of the connected clients. That might change in the future, but we are not there yet. Having a clustered DB would not help for the time being.

adam commented

2025-12-29 02:19:52 +01:00

@rallisf1 commented on GitHub (Aug 28, 2023):

> 1. We support sqlite and PostgreSQL. Any current bottleneck headscale might have is in our code, definitely not in those world-class DB. Not sure how Mongo would help :/

The data types used by the project resemble a document tree rather than relational data, thus noSQL makes more sense, or even a key:value db like Redis.

> 2. Keeping those outside the DB is very much on purpose - as they can be put in a version control system like git, and its changes being tracked - similar to what you would do in Tailscale's SaaS. Some kind of GitOps flow can also be done in this way.

True, still a large corporate network can make the ACL file grow quite large, plus race conditions can occur when multiple people edit the same configuration file. I'd rather keep both: a database backend and a JSON editor for the front-end (API).

> 4. Those are external projects that we do not have the bandwidth to support. They are linked here https://headscale.net/web-ui/

Ok, still some better integration wouldn't hurt. I suppose I can help with that once refactoring is over.

> 5. Currently the issue is on headscale side, as only one server can keep track of the connected clients. That might change in the future, but we are not there yet. Having a clustered DB would not help for the time being.

True, as long as you store those in private memory it won't work. Maybe this can be solved with redis?
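To make the point about per-process state concrete, here is a minimal sketch of why a load balancer alone does not give high availability when each server tracks connected clients in its own memory. The class names `CoordinationServer` and `RoundRobinBalancer` are hypothetical and purely illustrative; nothing here is taken from headscale's actual code.

```python
from itertools import cycle


class CoordinationServer:
    """Hypothetical stand-in for one control-server process."""

    def __init__(self, name):
        self.name = name
        # Connected clients live only in this process's memory.
        self.connected = set()

    def connect(self, node):
        self.connected.add(node)

    def is_online(self, node):
        return node in self.connected


class RoundRobinBalancer:
    """Hands out servers in turn, like a simple load balancer."""

    def __init__(self, servers):
        self._next = cycle(servers)

    def route(self):
        return next(self._next)


a = CoordinationServer("a")
b = CoordinationServer("b")
lb = RoundRobinBalancer([a, b])

lb.route().connect("laptop")            # first request lands on server a
seen = lb.route().is_online("laptop")   # second request lands on server b

print(seen)  # False: server b never saw the connection
```

A shared database would not change this result on its own, because the connection state in this sketch never reaches the database in the first place.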

adam commented

2025-12-29 02:19:52 +01:00

@linsomniac commented on GitHub (Sep 1, 2023):

Just to clarify your position on the database, originally you were saying, if I read it correctly, that sqlite wasn't performant enough, but then in your second reply you are saying it feels like a better fit. So just to clarify: changing databases isn't a "needed for production ready" requirement?

Personally, my experiences with Mongo have basically never been good over the long term.

WRT ACLs in the database: I agree with juanfont that having them in a GitOps workflow for version control is better, and it is exactly what we do. GitOps resolves the "race conditions when multiple people edit the same configuration file", because git has world-class conflict resolution -- that's its bread and butter.

I think the primary issue with ACLs and being "production ready" is testing new ACLs. I tried to resolve that with the "configtest", but there is some issue with the database now that prevents configtest from running. I maybe have a 75% success rate with adding ACLs without taking down my entire tailnet due to ACL errors or tags that are in the ACL but not in the tailnet because of a missing node.

I also wouldn't classify bundling a UI as a production requirement: sure it makes it more turnkey, but installing one project versus two doesn't block deployment to production.
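As an illustration of the kind of check a GitOps pipeline could run before an ACL change is merged, the sketch below flags `group:` and `tag:` names that rules use but the policy never declares. It assumes a plain-JSON policy with `groups`, `tagOwners` and `acls` keys in the Tailscale-style layout; the function name `undefined_references` and the sample policy are hypothetical. It is not headscale's `configtest`, and it cannot catch the "tag exists in the ACL but no node carries it" case, which needs live server state.

```python
import json


def undefined_references(policy):
    """Return group:/tag: names used in ACL rules but never declared."""
    groups = set(policy.get("groups", {}))
    tags = set(policy.get("tagOwners", {}))
    missing = []
    for rule in policy.get("acls", []):
        for entry in rule.get("src", []) + rule.get("dst", []):
            # dst entries carry a port suffix, e.g. "tag:web:443".
            parts = entry.split(":")
            if parts[0] in ("group", "tag") and len(parts) >= 2:
                name = parts[0] + ":" + parts[1]
                declared = groups if parts[0] == "group" else tags
                if name not in declared and name not in missing:
                    missing.append(name)
    return missing


# Hypothetical policy: the second rule references undeclared names.
policy = json.loads("""
{
  "groups": {"group:admin": ["alice"]},
  "tagOwners": {"tag:web": ["group:admin"]},
  "acls": [
    {"action": "accept", "src": ["group:admin"], "dst": ["tag:web:443"]},
    {"action": "accept", "src": ["group:dev"], "dst": ["tag:db:5432"]}
  ]
}
""")

print(undefined_references(policy))  # ['group:dev', 'tag:db']
```

Run as a pre-merge step, a non-empty result would fail the change before it reaches the server. Real policy files may contain comments or trailing commas, which `json.loads` rejects, so a production check would need a more tolerant parser.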

adam commented

2025-12-29 02:19:52 +01:00

@rallisf1 commented on GitHub (Sep 1, 2023):

> Just to clarify your position on the database, originally you were saying, if I read it correctly, that sqlite wasn't performant enough, but then in your second reply you are saying it feels like a better fit.

You didn't read correctly. I said noSQL or key:value storage is a better fit. Scalable means more than just performant. Mongo is just a recommendation because of how handy (and cheap) Atlas is for such small databases.

> Personally, my experiences with Mongo have basically never been good over the long term.

Mine too, until I stopped trying to treat it like a relational database.

> WRT ACLs in the database: I agree with juanfont that having them in a GitOps workflow for version control is better and is exactly what we do.

I believe that's not the case when you are adding new users daily, let alone have the users register themselves.

> I think the primary issue with ACLs and being "production ready" are primarily testing new ACLs.

Yup, I'm still getting familiar with it.

> I also wouldn't classify bundling a UI as a production requirement: sure it makes it more turnkey, but installing one project versus two doesn't block deployment to production.

I can't argue with that; I already use it that way. Turnkey solutions are more attractive though.

adam commented

2025-12-29 02:19:52 +01:00

@linsomniac commented on GitHub (Sep 1, 2023):

Ok, so can you clarify what makes the current options of Postgres and SQLite not production ready? I've run Postgres databases that handled services for URLs on Super Bowl adverts, so I think it scales... Fly.io's entire business is built on SQLite scaling...

adam commented

2025-12-29 02:19:52 +01:00

@evenh commented on GitHub (Sep 1, 2023):

It's worth noting that Tailscale itself is using a variant of sqlite: https://tailscale.com/blog/database-for-2022/

adam commented

2025-12-29 02:19:52 +01:00

@rallisf1 commented on GitHub (Sep 1, 2023):

@linsomniac frankly, when I started this thread I was not aware that headscale used anything other than SQLite. SQLite by itself is not scalable. Are there sync tools or SQLite clones out there that support clustering? Sure. Is that a viable option? I don't know; that's why we're having this discussion. By the way, PostgreSQL is great and it is my go-to relational database, but its clustering is tricky, at least for me. I still use MySQL Galera when I need a self-hosted, high-availability relational database.

@evenh it's still SQLite, they just added a replication service (litestream) to create replica(s). The downside of this setup is that there is only 1 master database, you can't perform any load balancing like when clustering. Do they, or headscale, need database clustering? I'm not sure, perhaps not.
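The single-primary arrangement described here can be sketched with Python's standard `sqlite3` module. This uses the stdlib online backup API as a stand-in for a replication tool; it is not Litestream and makes no claim about how Tailscale or headscale operate. The `nodes` table is hypothetical.

```python
import sqlite3

# The primary is the only database that accepts writes.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, hostname TEXT)")
primary.execute("INSERT INTO nodes (hostname) VALUES ('laptop')")
primary.commit()

# The replica is a copy of the primary taken at one point in time.
replica = sqlite3.connect(":memory:")
primary.backup(replica)

# A later write on the primary does not reach the replica by itself.
primary.execute("INSERT INTO nodes (hostname) VALUES ('phone')")
primary.commit()

count_sql = "SELECT COUNT(*) FROM nodes"
primary_count = primary.execute(count_sql).fetchone()[0]
replica_count = replica.execute(count_sql).fetchone()[0]
print(primary_count, replica_count)  # 2 1

# Copying again brings the replica up to date.
primary.backup(replica)
replica_count_after = replica.execute(count_sql).fetchone()[0]
print(replica_count_after)  # 2
```

The replica is useful for backups and possibly for reads, but every write still has to go through the one primary, which is the limitation being pointed out.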

adam commented

2025-12-29 02:19:53 +01:00

@kradalby commented on GitHub (Sep 1, 2023):

> SQLite by itself is not scalable.

This sounds like a very old, uneducated and undocumented statement.

> This gives you near real time backups

Tailscale uses SQLite in production; as stated in the blog post, Litestream is used for backups.

> So IMHO these are the things needed to make headscale production ready

I think the term "production ready" is a bit arbitrary and vague, and I appreciate it is an opinion. My interpretation of what you describe is "turn-key" and not "production ready".

The configuration has been deliberately kept as is for the "config as code" reasons mentioned by others, and I do not see this changing. My work experience indicates that this is a more desirable style of config than keeping it in the database.

I agree that we could expose more of the API, we would be happy to get proposals and PRs to do that.

As @juanfont said, there are some good web UIs. We list them but can't support them, and bundling them would mean supporting them.

Clustering, sharding, and HA are way out of scope for this project, and they are not a requirement for "production".

This last part is my personal take on what I am working for this project to be:

From a business perspective:
It is not a way for people to have "free" Tailscale. It is for the use cases where people _cannot_ use Tailscale: because of policy, on systems that are not connected to the internet, or for red/blue teams that need more control.

If you work for a business that leverages Headscale or Tailscale, you should contribute/pay/donate to either.

For everyone else, hobbyists and self-hosters, it is a fun project.

This means that the goal is for it to be as stable and scalable as it needs to be.

adam commented

2025-12-29 02:19:53 +01:00

@rallisf1 commented on GitHub (Sep 1, 2023):

It is what it is then.

Thank you all for your time.

adam commented

2025-12-29 02:19:53 +01:00

@linsomniac commented on GitHub (Sep 3, 2023):

@rallisf1 FYI: another option for Postgres clustering is CockroachDB. It is wire compatible with Postgres, but it may or may not support the queries that headscale issues. I ran into issues using it with Postfix because Cockroach only supports UTF-8 and Postfix used to require a different encoding; that was the problem I hit when I tried to convert my Galera cluster to Cockroach.

adam commented

2025-12-29 02:19:53 +01:00

@kedare commented on GitHub (Sep 16, 2023):

I confirm CockroachDB would likely be great (and you could still use sqlite / PostgreSQL with nearly the same code).
I used MongoDB in the past and it was terrible in my experience: memory leaks and not-so-great performance (though mostly due to the MongoDB client implementation back then). I switched to CockroachDB and it was indeed much better. I also find MongoDB's query syntax a pain to work with compared to SQL.

adam referenced this issue

2025-12-29 02:30:12 +01:00

[PR #549] [CLOSED] docs(README): update contributors #1474

Reference: starred/headscale#549