[Bug] Automatic database migration from 0.23.0 to 0.24.0 does not work with postgres #908

Closed
opened 2025-12-29 02:25:46 +01:00 by adam · 13 comments

Originally created by @sysvinit on GitHub (Jan 17, 2025).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I tried to update my headscale instance, which uses postgres as the database, from 0.23.0 to 0.24.0 using the Debian packages provided as part of the releases on Github. However, after installing the new package, headscale failed to start due to problems with the database migration, with the following message in the logs:

```
Jan 17 16:48:47 hostname headscale[130595]: 2025-01-17T16:48:47Z INF Opening database database=postgres path="host=/run/postgresql dbname=headscale user=headscale"
Jan 17 16:48:47 hostname headscale[130595]: 2025-01-17T16:48:47Z FTL Migration failed: ERROR: constraint "uni_users_name" of relation "users" does not exist (SQLSTATE 42704) error="ERROR: constraint \"uni_users_name\" of relation \"users\" does not exist (SQLSTATE 42704)"
```

I was also left unable to blindly downgrade to 0.23.0, as the new version had already executed a migration successfully before encountering the error, leaving my database in an inconsistent state that would not have been supported by the old version.
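
For anyone assessing how far the 0.24.0 migration got before it failed, a minimal diagnostic sketch, assuming the gormigrate-managed migrations table and id column referenced later in this thread:

```sql
-- List the migration IDs that have already been recorded, to judge how far
-- the upgrade got before failing (table and column names are assumptions).
SELECT id FROM migrations ORDER BY id;
```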

Expected Behavior

Headscale executes the database migration without errors, and then proceeds to function normally.

Steps To Reproduce

  1. Install headscale 0.23.0 using postgres as the database.
  2. Update the installed version of headscale to 0.24.0, and attempt to start the headscale process again.

Environment

- OS: Debian 12
- Headscale version: 0.23.0
- Tailscale version: N/A

Runtime environment

  • Headscale is behind a (reverse) proxy

Anything else?

I manually inspected the migrations table in the database and compared it with the code in 0.24.0, which indicates that this was a problem with migration 202407191627. This is an automatic migration executed by gorm to update the schema of the users table; I'm not sure whether there have been changes in gorm, but I'm not familiar with the library, so I didn't investigate further.
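
As a hedged companion check, the unique constraints that actually exist on the users table can be listed and compared with the uni_users_name name the migration expects:

```sql
-- Show the unique constraints currently defined on the users table; the name
-- found here can be compared with the one the failing migration refers to.
SELECT conname, pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE conrelid = 'users'::regclass
  AND contype = 'u';
```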

As I mentioned above, my database was left in an inconsistent state which prevented me from downgrading (in fairness, I should have backed up my database before performing the upgrade...). However, my database did have a uniqueness constraint on users.name, as implied by the log message above, but the constraint was called users_name_key rather than uni_users_name.

I used the following SQL to rename the constraint on the users table, and with this change the migration that had been causing problems executed correctly:

```sql
begin;
alter table users add constraint uni_users_name unique (name);
alter table users drop constraint users_name_key;
commit;
```

My headscale instance now appears to be working correctly, though I haven't tried adding new users or nodes to the Tailnet yet.

adam added the bug label 2025-12-29 02:25:46 +01:00
adam closed this issue 2025-12-29 02:25:46 +01:00

@kradalby commented on GitHub (Jan 19, 2025):

This is unfortunate. I will not have time to look into this until next week, but in the meantime, if someone has a Postgres backup that can reproduce this issue, I would appreciate getting it.

As often mentioned, we don't have the time to do extensive testing for Postgres, as we have so many other things to fix.

A personal rant: I personally have less and less appreciation for Gorm, and if this turns out to be some sort of auto-magic footgun, that doesn't help.
I want to get rid of Gorm, but that means getting rid of Postgres and supporting only one database, which we are not doing. So it will likely continue like this: we try to test things, but if no one tests the betas with the setups we can't test ourselves, then this will continue to happen. In this case I suspect the beta testers didn't run Postgres, so we didn't discover it.


@sysvinit commented on GitHub (Jan 19, 2025):

> This is unfortunate. I will not have time to look into this until next week, but in the meantime, if someone has a Postgres backup that can reproduce this issue, I would appreciate getting it.

I do have pg_dumpall backups of the entire postgres server on that machine as part of the daily system backups (useful for restoring after a storage failure, less so for restoring individual tables in the heat of the moment). I could get you a copy of the headscale DB from my server from the night before I attempted the upgrade, though I'll need to find some free time next week to spin up a test instance of postgres which I can load the backup into first in order to extract the headscale parts (and censor things like IP addresses). What would be the best way to send you the DB dump?


@kradalby commented on GitHub (Jan 19, 2025):

Great, the email in my GitHub profile would be sufficient.


@sysvinit commented on GitHub (Jan 23, 2025):

Given you've found a fix already, do you still need a copy of my headscale database from my backups?


@kradalby commented on GitHub (Jan 23, 2025):

No, thank you, I got one from another user and wrote a test based on that.


@panteparak commented on GitHub (Feb 1, 2025):

Hi @kradalby

Just a heads up: somehow the migration in the fix PR has no effect and produces the same error.
I had to copy out the SQL statement (https://github.com/juanfont/headscale/blob/v0.24.2/hscontrol/db/db.go#L486-L506) as-is and run it manually on the DB to fix it.

Upgrading from 0.22.3 -> 0.24.2 with Postgres.

If you want to investigate further, I could send you my SQL dump from v0.22.3.


@kradalby commented on GitHub (Feb 1, 2025):

Yes please. It worked from 0.23, but maybe the step from 0.22 was different; 0.23 is when we introduced migrations, so it would not surprise me. Email is in my profile.


@panteparak commented on GitHub (Feb 11, 2025):

@kradalby
Sorry for the late reply. Any tips on sanitising the db dump? From what I can see, I should strip out IPs, nodekeys, and discokeys. Is there anything else that should be done, or is there a script to clean this all up?
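
One possible starting point, sketched under the assumption that the sensitive values live in columns such as nodes.ipv4, nodes.ipv6, nodes.node_key and nodes.disco_key (these column names are guesses and must be checked against the real schema), is to blank them out on a disposable copy of the database before dumping it:

```sql
-- Run on a disposable copy of the database, never the live instance.
-- Column names are assumptions; verify them against the actual schema first.
BEGIN;
UPDATE nodes
SET ipv4      = NULL,
    ipv6      = NULL,
    node_key  = 'scrubbed',
    disco_key = 'scrubbed';
COMMIT;
```

Since the failure above is a schema-level problem rather than a data problem, scrubbing the values this way should not affect whether the bug reproduces.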


@panteparak commented on GitHub (Feb 17, 2025):

@kradalby I've sent my Postgres DB dump to your email. Let me know if you need any other info.


@haatveit commented on GitHub (Feb 25, 2025):

I've (finally) upgraded from 0.21.0 by starting up every release along the way in case it was needed, but haven't been able to get releases past 0.24.2 working.

On 0.23.0 I got this on first startup after migrations:

```console
2025-02-25T21:50:20Z FTL home/runner/work/headscale/headscale/cmd/headscale/cli/serve.go:29 > Headscale ran into an error and had to shut down. error="failed to load ACL policy: loading nodes from database to validate policy: ERROR: cached plan must not change result type (SQLSTATE 0A000)"
```

then it worked the next time I started the service.

From 0.24.0 onward I get this:

```console
2025-02-25T21:51:43Z FTL Migration failed: ERROR: constraint "uni_users_name" of relation "users" does not exist (SQLSTATE 42704) error="ERROR: constraint \"uni_users_name\" of relation \"users\" does not exist (SQLSTATE 42704)"
```

If I manually add the constraint via psql, I can run 0.24.0 through 0.24.2:

```sql
alter table users add constraint uni_users_name unique (name);
```

But if I try to upgrade to 0.24.3 or newer, I get this error on first startup:

```console
2025-02-25T22:11:47Z FTL Migration failed: automigrating types.Node: ERROR: constraint "uni_users_name" of relation "users" does not exist (SQLSTATE 42704) error="automigrating types.Node: ERROR: constraint \"uni_users_name\" of relation \"users\" does not exist (SQLSTATE 42704)"
```

and it removes the uni_users_name constraint again. I've looked at the diff between 0.24.2 and 0.24.3 (https://github.com/juanfont/headscale/compare/v0.24.2...v0.24.3) but don't really get why that is happening.


@kradalby commented on GitHub (Feb 26, 2025):

I haven’t had a close look, but starting every release would be counterproductive. Since there are migration fixes in the fix releases, you should jump to the latest release or at least the latest fix release.

We have not had a database that old in a bit, so if going from 0.21 straight to 0.24.3 or 0.25.1 doesn’t work, I would appreciate a scrubbed copy I can write a test against.


@nicka101 commented on GitHub (Feb 28, 2025):

For me, simply spamming the ALTER TABLE in Postgres manually to add the uni_users_name constraint back while headscale is starting allows the migrations to succeed and headscale to start on 0.25.1.
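
The repeated ALTER TABLE can be guarded so it is safe to re-run and does not error once the constraint already exists; a sketch, assuming the table and constraint names from the errors above:

```sql
-- Re-add the unique constraint only when it is missing, so the statement
-- can be executed repeatedly while headscale is starting up.
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1
        FROM pg_constraint
        WHERE conrelid = 'users'::regclass
          AND conname  = 'uni_users_name'
    ) THEN
        ALTER TABLE users ADD CONSTRAINT uni_users_name UNIQUE (name);
    END IF;
END $$;
```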


@bazaah commented on GitHub (Mar 22, 2025):

The PreAuthKey and/or Node automigrates here: https://github.com/juanfont/headscale/compare/v0.24.2...v0.24.3#diff-826f8967b23df1209318fba82f94d8ed3eb12136b6cb2d070ddb76d9a059c336R590-R597 are somehow responsible for the uni_users_name issue. They seemingly race to create the problem, as I see both of them in the logs:

```console
FTL Migration failed: automigrating types.Node: ERROR: constraint "uni_users_name" of relation "users" does not exist (SQLSTATE 42704)
                      ^^^^^^^^^^^^^^^^^^^^^^^^
```

(or)

```console
FTL Migration failed: automigrating types.PreAuthKey: ERROR: constraint "uni_users_name" of relation "users" does not exist (SQLSTATE 42704)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

I eventually just needed to INSERT INTO migrations (id) VALUES (202501311657); to get a functional instance again. Is there anything I should run by hand to replicate the expected schema changes from https://github.com/juanfont/headscale/pull/2396?

My migration path was: 0.22.3 -> 0.23.0 -> 0.24.0 -> 0.24.2 -> 0.24.3, in case it matters.
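
As a side note, the manual INSERT above can be written so it is safe to re-run; this only marks the migration as applied and does not perform the schema changes from PR #2396 itself:

```sql
-- Record the migration ID by hand only if it is not already present.
-- The ID value is taken from the comment above; no schema changes are applied.
INSERT INTO migrations (id)
SELECT '202501311657'
WHERE NOT EXISTS (SELECT 1 FROM migrations WHERE id = '202501311657');
```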
