Headscale fails to activate clients with postgresql backend #314

Closed
opened 2025-12-29 01:26:37 +01:00 by adam · 25 comments

Originally created by @ishanjain28 on GitHub (Aug 24, 2022).

Bug description

Tailscale clients authenticate successfully with headscale when headscale is configured to use Postgres, but then get stuck in a loop and keep refreshing keys.

More specifically,

In the case of SQLite:

  1. The client sends a 556-byte payload, and headscale answers it.
  2. Then, the client sends a ~629-byte payload containing the peerapi4 and peerapi6 services, and headscale responds.
  3. Then, the client sends a 1000+ byte payload with read_only=false containing its endpoints, and after this everything works!

In the case of Postgres:

  1. The client sends a 556-byte payload, and headscale returns the exact same answer as with SQLite.
  2. The client sends a 629-byte payload (read_only=true); headscale responds, but KeyExpiry is set to 0001-01-01 05:53:28+05:53:28 (see the note after this list).
  3. The client again sends a 629-byte payload, and this keeps happening in a loop.
    The client never sends a payload with read_only=false.
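
For reference: 0001-01-01 is Go's zero `time.Time`, and the +05:53:28 wall-clock value is what that zero time looks like when rendered in a timezone whose historical LMT offset is +05:53:28. Below is a minimal sketch of that behaviour, not headscale's code; Asia/Kolkata is used purely as an example of a zone with that LMT offset.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var expiry time.Time         // Go's zero value: 0001-01-01 00:00:00 UTC
	fmt.Println(expiry.IsZero()) // true

	// Asia/Kolkata's pre-standard (LMT) offset in tzdata is +05:53:28, and
	// year-1 dates fall in the LMT era, so the same instant gains that offset.
	loc, err := time.LoadLocation("Asia/Kolkata")
	if err != nil {
		panic(err)
	}
	shifted := expiry.In(loc)
	fmt.Println(shifted)          // roughly "0001-01-01 05:53:28 +0553 LMT"
	fmt.Println(shifted.IsZero()) // still true: IsZero ignores the location

	// Instant-based comparison is unaffected, but struct equality (and any
	// comparison of formatted strings) is location-sensitive.
	fmt.Println(shifted.Equal(time.Time{})) // true (same instant)
	fmt.Println(shifted == time.Time{})     // false
}
```

Instant-based checks like Equal and IsZero ignore the location, but formatted strings and struct equality do not, which is one way a zero expiry can end up being handled differently depending on how it round-trips through the database.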

To Reproduce

  1. Install PostgreSQL 14
  2. Install the latest build of headscale (from git HEAD or the latest release)
  3. Configure it to use PostgreSQL and run it

Try to register any tailscale client.

Context info

  • Version of headscale used: GIT_HEAD
  • Version of tailscale client: 1.29.72
  • Kernel version: Linux emerald 5.17.6-arch1-1 #1 SMP PREEMPT Tue, 10 May 2022 23:00:39 +0000 x86_64 GNU/Linux
adam added the bug label 2025-12-29 01:26:37 +01:00
adam closed this issue 2025-12-29 01:26:37 +01:00

@kradalby commented on GitHub (Sep 8, 2022):

Note for when this is tackled:

  • We should first implement integration tests that show this, then fix it to prevent regressions.
  • We need to get rid of all timestamps that are not initialised/set (no more nil or 0001-01-01); a possible shape of this is sketched below.
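
A hedged sketch of what the second point could look like in practice, assuming expiries are normalised before they are persisted; the helper name and fallback policy are hypothetical and not headscale's API:

```go
package main

import (
	"fmt"
	"time"
)

// normalizeExpiry is a hypothetical helper: it always returns a concrete,
// UTC-normalised expiry so that no nil or zero time.Time reaches the database.
func normalizeExpiry(expiry *time.Time, fallback time.Duration) time.Time {
	if expiry == nil || expiry.IsZero() {
		return time.Now().Add(fallback).UTC()
	}
	return expiry.UTC()
}

func main() {
	fmt.Println(normalizeExpiry(nil, 90*24*time.Hour)) // never 0001-01-01
}
```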

@QZAiXH commented on GitHub (Apr 3, 2023):

I seem to have encountered this problem: the client cannot join headscale with an authkey, and `tailscale status` keeps showing Logged out.


@QZAiXH commented on GitHub (Apr 3, 2023):

Can you guys tell me how to solve this problem? Maybe I can change the code accordingly. @kradalby @juanfont


@Yxnt commented on GitHub (May 29, 2023):

> Can you guys tell me how to solve this problem? Maybe I can change the code accordingly. @kradalby @juanfont

@QZAiXH
Maybe I can solve this problem, but I need more time to test this case.
#765 has already fixed the situation where the command line gets stuck when using `tailscale up` and `tailscale login`, but there are still more cases that need to be tested.


@github-actions[bot] commented on GitHub (Nov 26, 2023):

This issue is stale because it has been open for 180 days with no activity.


@github-actions[bot] commented on GitHub (Dec 10, 2023):

This issue was closed because it has been inactive for 14 days since being marked as stale.


@weikinhuang commented on GitHub (Dec 13, 2023):

I'm also encountering this issue. Client status:

tailscale status
Logged out.

Even though I'm connected and have an IP. I'm using a reusable key (for a router device).


@TnZzZHlp commented on GitHub (Dec 14, 2023):

I also encountered this problem.
Headscale version: v0.23.0-alpha2
I use nginx as a reverse proxy; I don't know whether the problem is caused by nginx.


@weikinhuang commented on GitHub (Dec 14, 2023):

Like the reporter, it works fine with sqlite. I was trying to move to postgres for HA, but encountered this issue and went back to sqlite.


@2kvasnikov commented on GitHub (Dec 21, 2023):

Same issue with an OpenWrt router as a Tailscale client. I can register the client with SQLite.


@alexfornuto commented on GitHub (Apr 26, 2024):

Hate to leave a "bump" comment, but FYI this issue occurs on Postgres 15 as well.


@m1cr0man commented on GitHub (May 8, 2024):

Can confirm this happens on Postgres 16 also. I have [collected logs from headscale](https://gist.github.com/m1cr0man/be22bc2868ef7dfe6e62bc12f9760530) whilst trying to join with a reusable preauth key (log level debug). Also in the gist are the client status JSON output and the server node info JSON output from after the registration. I redacted/obfuscated some data.

Client-side interaction looks like this:

$ sudo tailscale up --reset --login-server https://headscale.example.com --timeout 20s --authkey xyz123
timeout waiting for Tailscale service to enter a Running state; check health with "tailscale status"
$ sudo tailscale status
Logged out.
Log in at: https://headscale.example.com/register/nodekey:abc123

Can confirm that switching to SQLite resolves the issue. Perhaps it is a collation issue, wherein some comparison is returning different results depending on the engine? Happy to do more testing.


@sjansen1 commented on GitHub (Jun 10, 2024):

I always wondered why authkeys were not working on my Headscale installation, until I found this issue. For now, I authenticate with OpenID, move these nodes to a fake user that cannot log in, and set the expiration date in the database to a high value to avoid expiration. For me, that's the only way to avoid expiration on server/subnet gateways.


@Cubea01 commented on GitHub (Jul 16, 2024):

This is still an issue, two years later.


@kradalby commented on GitHub (Aug 30, 2024):

As mentioned in #2087, it takes a lot more effort than initially anticipated to support multiple database engines, and Postgres is not really a priority, as the benefits for scaling something like headscale are marginal.

That does not mean we will never attempt to resolve it; it just means that we have a bunch of other things that we consider more important for bringing headscale forward.

Often people create issues or argue that you need Postgres to scale headscale beyond X nodes. While it is currently true that you might be able to run 10-20% more nodes with the current code, the main bottlenecks are in the headscale code and not dependent on the database.
I imagine that this will improve over time as we free up more time to work on making things more efficient, and if it comes to optimising around a database, we will do it for SQLite.

While we will work hard not to break or regress PostgreSQL, I would consider our support for it "best effort", and if you're looking to run headscale in a more serious manner, I would choose SQLite.


@alexfornuto commented on GitHub (Aug 30, 2024):

@kradalby Everything you said makes a lot of sense, and I do not intend to argue any of your points, only to add another perspective.

My interest in using a db other than sqlite is not for "scaling" as much as fault tolerance. Consider a setup where HS is running in a single VM. If that VM is destroyed, the tailnet will suffer while it's recreated. With a version-controlled policy file and good monitoring coupled with a CI pipeline, this issue can be resolved in minutes.

Compare that scenario to a setup with two VMs running headscale, a primary and a backup, both connected to a managed database with its own redundancy. If the primary is destroyed, the IP address can be swapped to the backup VM, which is just one of several options for redirecting traffic in less time.


@kradalby commented on GitHub (Aug 30, 2024):

@alexfornuto That's fair; I understand that people have different solutions for recovery and HA, and solutions they are more familiar with.

That said, for all the things mentioned, SQLite has excellent backup/streaming/cold-copy solutions like [litestream](https://litestream.io), which I use with my headscale(s).

We are not removing it, but we are likely not investing in it. I think a sensible way to look at the "investing" or optimisation part is: if we find a change that will benefit SQLite's performance, we will implement it and sacrifice Postgres performance, rather than implement two solutions.

As a side note, we have also started to see an increase in special cases for migrating both databases, which is also eating into our dev time.


@m1cr0man commented on GitHub (Aug 30, 2024):

I agree with the sentiments and statements here but would like to highlight that using postgresql is not an option at all due to this issue. I personally tried to use psql because I already had it set up, but using sqlite instead was fine. The problem is that the documentation states that postgresql should work, and so I did waste a good amount of time trying to figure out why it did not work before giving up on it. At the very least, it may be worth amending the documentation to state that psql support is best effort and not as well tested as sqlite.


@kradalby commented on GitHub (Aug 30, 2024):

I updated the config with some notes in https://github.com/juanfont/headscale/pull/2091, but I agree, that is fair.

I will try to assess this issue next week and evaluate whether the work will result in a fix or in documentation of known limitations. I know people out there are running Postgres, so it is strange that not everyone runs into this; maybe they don't use preauthkeys.


@mpoindexter commented on GitHub (Aug 30, 2024):

I'll chime in as another Postgres user: I understand that there are great options to back up and manage SQLite, but for anyone running on the major cloud providers (AWS, GCP, Azure, etc.), they all have a managed DB solution that speaks the PostgreSQL protocol, so it's dramatically easier to set up a database with proper backups for anything that can use it. Deploying headscale as a stateless container, with external state in a DB, is a really easy way to manage it, and it would be a shame to lose that. I understand that it's an increased maintenance burden, and I'm happy to help with testing and fixes for Postgres if it helps alleviate the problem a little.


@alexfornuto commented on GitHub (Aug 30, 2024):

> for anyone running on the major cloud providers (AWS, GCP, Azure, etc) they all have a managed DB solution that speaks postgresql protocol, so it's dramatically easier to set up a database with proper backup, etc. with anything that can use that.

This is my situation exactly. I'm already using managed databases for other self-hosted services. Litestream does look like a viable solution for sqlite, but it's also an additional burden in terms of having to learn, deploy, and maintain another system just for use by Headscale. Ultimately this disqualifies Headscale as a viable alternative for use in the network I administer professionally.

I completely understand that this is not a priority for the HS dev team, and it's not my intention to argue that point. I just want to make sure my POV is properly articulated.


@mpoindexter commented on GitHub (Aug 31, 2024):

PR #2093 should fix this issue. @kradalby feel free to mention me on any other postgresql related issues if you'd like me to take a look.


@alexfornuto commented on GitHub (Sep 3, 2024):

Many thanks @mpoindexter!

EDIT: P.S. Is there any chance of the fix being backported to the current stable release?


@mpoindexter commented on GitHub (Sep 3, 2024):

@alexfornuto I would doubt it makes sense to backport, but I think just ensuring your headscale process runs using the UTC timezone should function as a workaround.
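
An illustration of the idea behind that workaround, as a minimal sketch rather than headscale code: if the process's local timezone is UTC, zero timestamps never pick up a local offset when they are formatted or stored. The deployment-level form is simply starting the service with TZ=UTC in its environment.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// In-process equivalent of running the service with TZ=UTC: force the
	// Go runtime's local zone to UTC before any times are formatted or stored.
	time.Local = time.UTC

	var zero time.Time
	fmt.Println(zero.In(time.Local)) // 0001-01-01 00:00:00 +0000 UTC, no local offset
}
```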


@alexfornuto commented on GitHub (Sep 3, 2024):

@mpoindexter I will likely do that if we decide to proceed with headscale in that environment.

FWIW, it would make sense to me to backport it since a production deployment requiring a more reliable db backend than sqlite would also likely require a stable release (my environment does), and v0.23 is still in beta.
