[Bug] Upgrade from version 0.22.3 to 0.23-beta2 failed database migration #773

Closed
opened 2025-12-29 02:23:50 +01:00 by adam · 10 comments
Owner

Originally created by @mpoindexter on GitHub (Aug 23, 2024).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Upgrading from 0.22.3 to 0.23-beta2 failed on database migration with this error:

{"level":"fatal","error":"ERROR: insert or update on table \"routes\" violates foreign key constraint \"fk_nodes_routes\" (SQLSTATE 23503)","time":1724359938,"message":"Migration failed: ERROR: insert or update on table \"routes\" violates foreign key constraint \"fk_nodes_routes\" (SQLSTATE 23503)"}

Expected Behavior

Migration completes successfully

Steps To Reproduce

Have an existing database where there are rows in the routes table that reference nodes that no longer exist in the nodes table. Not sure how it happened, but seems like I had a number of these, seems like older versions of the code could cause this to happen.

Running delete from routes where node_id not in (select id from nodes); and then attempting to upgrade again solved the issue, but ideally this would be done as part of the migration scripts.

Environment

- OS: any
- Headscale version: 0.23-beta2
- Tailscale version: any

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Anything else?

No response

Originally created by @mpoindexter on GitHub (Aug 23, 2024). ### Is this a support request? - [X] This is not a support request ### Is there an existing issue for this? - [X] I have searched the existing issues ### Current Behavior Upgrading from 0.22.3 to 0.23-beta2 failed on database migration with this error: ``` {"level":"fatal","error":"ERROR: insert or update on table \"routes\" violates foreign key constraint \"fk_nodes_routes\" (SQLSTATE 23503)","time":1724359938,"message":"Migration failed: ERROR: insert or update on table \"routes\" violates foreign key constraint \"fk_nodes_routes\" (SQLSTATE 23503)"} ``` ### Expected Behavior Migration completes successfully ### Steps To Reproduce Have an existing database where there are rows in the `routes` table that reference nodes that no longer exist in the `nodes` table. Not sure how it happened, but seems like I had a number of these, seems like older versions of the code could cause this to happen. Running `delete from routes where node_id not in (select id from nodes);` and then attempting to upgrade again solved the issue, but ideally this would be done as part of the migration scripts. ### Environment ```markdown - OS: any - Headscale version: 0.23-beta2 - Tailscale version: any ``` ### Runtime environment - [ ] Headscale is behind a (reverse) proxy - [ ] Headscale runs in a container ### Anything else? _No response_
adam added the bug label 2025-12-29 02:23:50 +01:00
adam closed this issue 2025-12-29 02:23:50 +01:00
Author
Owner

@kradalby commented on GitHub (Aug 23, 2024):

I think I have fixed this in #2076, do you have the ability to test it?

It is some tech debts from the early days of non structured migrations that keep biting us.

@kradalby commented on GitHub (Aug 23, 2024): I think I have fixed this in #2076, do you have the ability to test it? It is some tech debts from the early days of non structured migrations that keep biting us.
Author
Owner

@mpoindexter commented on GitHub (Aug 23, 2024):

In some sense the change in #2076 works in that migration does not fail. It does mean that there is now a difference in the structure of the routes table between a fresh database and my upgraded one - specifically, the foreign key between routes and node does not get created. This doesn't seem ideal, IMO it is best to have the migration scripts remove any invalid data vs continuing to have data that doesn't conform to the intended schema still around and creating problems.

@mpoindexter commented on GitHub (Aug 23, 2024): In some sense the change in #2076 works in that migration does not fail. It does mean that there is now a difference in the structure of the `routes` table between a fresh database and my upgraded one - specifically, the foreign key between `routes` and `node` does not get created. This doesn't seem ideal, IMO it is best to have the migration scripts remove any invalid data vs continuing to have data that doesn't conform to the intended schema still around and creating problems.
Author
Owner

@mpoindexter commented on GitHub (Aug 23, 2024):

I tried as an alternate fix the following code vs only running automigrate if the table does not exist:

+                                       if tx.Migrator().HasTable(&types.Route{}) && tx.Migrator().HasTable(&types.Node{}) {
+                                               err := tx.Exec("delete from routes where node_id not in (select id from nodes)").Error
+                                               if err != nil {
+                                                       return err
+                                               }
+                                       }
+
                                        err = tx.AutoMigrate(&types.Route{})
                                        if err != nil {
                                                return err

and it solved this problem, and did not delete my existing routes (other than the ones that were linked to nodes that did not exist), and I ended up with the correct foreign keys created.

@mpoindexter commented on GitHub (Aug 23, 2024): I tried as an alternate fix the following code vs only running automigrate if the table does not exist: ``` + if tx.Migrator().HasTable(&types.Route{}) && tx.Migrator().HasTable(&types.Node{}) { + err := tx.Exec("delete from routes where node_id not in (select id from nodes)").Error + if err != nil { + return err + } + } + err = tx.AutoMigrate(&types.Route{}) if err != nil { return err ``` and it solved this problem, and did not delete my existing routes (other than the ones that were linked to nodes that did not exist), and I ended up with the correct foreign keys created.
Author
Owner

@kradalby commented on GitHub (Aug 24, 2024):

apologise, when I posted my reply to this PR I thought I was posting on #2063, so indeed, this has not been fixed, my apologies. Too late on a friday...

Thank you for investigating, I'll take this and amend the PR.

@kradalby commented on GitHub (Aug 24, 2024): apologise, when I posted my reply to this PR I thought I was posting on #2063, so indeed, this has not been fixed, my apologies. Too late on a friday... Thank you for investigating, I'll take this and amend the PR.
Author
Owner

@kradalby commented on GitHub (Aug 24, 2024):

This fix breaks #2063, so I will have to investigate. The whole initial migration is incredibly fragile and horrible, but untangling it is next to impossible considering we dont have databases to verify with.

Do you have an unupgraded 0.22 database you could email me that I can test with?

@kradalby commented on GitHub (Aug 24, 2024): This fix breaks #2063, so I will have to investigate. The whole initial migration is incredibly fragile and horrible, but untangling it is next to impossible considering we dont have databases to verify with. Do you have an unupgraded 0.22 database you could email me that I can test with?
Author
Owner

@mpoindexter commented on GitHub (Aug 26, 2024):

Yes, I have a development one that I have a backup of. I'll update it to remove any sensitive information and send it over today. FWIW, I'm having trouble seeing how the AutoMigrate command would drop rows as described in #2063 - (https://github.com/go-gorm/gorm/blob/v1.25.10/migrator/migrator.go#L121) doesn't seem to have any logic that would drop rows

@mpoindexter commented on GitHub (Aug 26, 2024): Yes, I have a development one that I have a backup of. I'll update it to remove any sensitive information and send it over today. FWIW, I'm having trouble seeing how the AutoMigrate command would drop rows as described in #2063 - (https://github.com/go-gorm/gorm/blob/v1.25.10/migrator/migrator.go#L121) doesn't seem to have any logic that would drop rows
Author
Owner

@kradalby commented on GitHub (Aug 27, 2024):

Could you by any chance create the dev one as sqlite? it would make it infinitely easier to make it part of a unit test. And it is our recommended database.

I also do not understand why its dropped, but I'm gonna pay back the missing techdebt by adding tests for migrations, if not only to make sure we dont regress.

@kradalby commented on GitHub (Aug 27, 2024): Could you by any chance create the dev one as sqlite? it would make it infinitely easier to make it part of a unit test. And it is our recommended database. I also do not understand why its dropped, but I'm gonna pay back the missing techdebt by adding tests for migrations, if not only to make sure we dont regress.
Author
Owner

@kradalby commented on GitHub (Aug 27, 2024):

I've converted it to SQLite and made this integration test:
ac045c3 (#2076)

I am not sure if it ended up correctly, but at least now I got two failing tests. I'll try to continue as I have time, but please feel free to have a look.

@kradalby commented on GitHub (Aug 27, 2024): I've converted it to SQLite and made this integration test: [`ac045c3` (#2076)](https://github.com/juanfont/headscale/pull/2076/commits/ac045c350c7ac93d8296efdb8c9c2de189447700) I am not sure if it ended up correctly, but at least now I got two failing tests. I'll try to continue as I have time, but please feel free to have a look.
Author
Owner

@kradalby commented on GitHub (Aug 27, 2024):

specifically, the foreign key between routes and node does not get created. This doesn't seem ideal, IMO it is best to have the migration scripts remove any invalid data vs continuing to have data that doesn't conform to the intended schema still around and creating problems.

@mpoindexter I ended up with a combination, now the migration will use your code to clean up, and then only run the automigrate conditionally.

The foreign keys will not be "created" retroactively, and I am not sure if that is a common thing for the db engine to sort out. For now, I wont spent more time on this as its more of a inconvenience and the code does not rely on fk or cascading deletes so its a minor issue.

@kradalby commented on GitHub (Aug 27, 2024): > specifically, the foreign key between routes and node does not get created. This doesn't seem ideal, IMO it is best to have the migration scripts remove any invalid data vs continuing to have data that doesn't conform to the intended schema still around and creating problems. @mpoindexter I ended up with a combination, now the migration will use your code to clean up, and then only run the automigrate conditionally. The foreign keys will not be "created" retroactively, and I am not sure if that is a common thing for the db engine to sort out. For now, I wont spent more time on this as its more of a inconvenience and the code does not rely on fk or cascading deletes so its a minor issue.
Author
Owner

@mpoindexter commented on GitHub (Aug 28, 2024):

@kradalby I think there is still a problem with how it has been done. I spent a while tracing down why the routes table gets cleared on migration and found the following:

The implication of this, is that with the change done in #2076 the routes rows are not dropped for existing databases, since the automigration is not done that actually creates the constraints with the ON DELETE CASCADE. For new databases, the constraint will still be created, and the next time you drop a column from the nodes table, or add a constraint to that table, the routes table will get cleared. So there's a big trap lurking for you in the future for databases that were not migrated from 0.22.

There is a way to deal with this in sqlite: foreign key processing can be turned off temporarily using a pragma and then turned back on. The GORM migrator does this for alter column (https://github.com/glebarez/sqlite/blob/master/migrator.go#L79), but not for drop column/create constraint/drop constraint. I am not sure why that is and it seems like a bug.

sqlite provides guidance on how to perform these migrations here: https://www.sqlite.org/lang_altertable.html - see section 7. Based on that, I would propose that the right way to solve the problem is to:

  1. Disable foreign key processing prior to running the migrations when using sqlite
  2. Run the migrations as they used to be (i.e. not conditioning the AutoMigrate call on whether the table already exists)
  3. Use PRAGMA foreign_key_check to make sure all foreign key constraints are followed properly post-migration
  4. Turn foreign key processing back on

I have tested this locally, and both test cases you added pass with my changes. Let me know if you would like me to submit a PR for this. I think following this approach will make life a lot easier in the future since you won't have to worry about two different upgrade paths (from a DB that has the constraints created on routes and from a DB that doesn't have the constraints created on routes)

@mpoindexter commented on GitHub (Aug 28, 2024): @kradalby I think there is still a problem with how it has been done. I spent a while tracing down why the `routes` table gets cleared on migration and found the following: - A number of schema migrations in sqlite are implemented by GORM using the pattern: 1. Create temp table with target schema 2. Copy all rows from existing table to temp table 3. Drop existing table 4. Rename temp table to match original table name See https://github.com/glebarez/sqlite/blob/master/migrator.go#L376 Specifically, alter column, drop column, create constraint, drop constraint are implemented this way. - When step 3 of the above pattern occurs, any `ON DELETE CASCADE` relations to the table are dropped. In this case, the relation from `nodes` to `routes` cascades like this (https://github.com/juanfont/headscale/blob/main/hscontrol/types/node.go#L128) The implication of this, is that with the change done in #2076 the routes rows are not dropped for _existing_ databases, since the automigration is not done that actually creates the constraints with the ON DELETE CASCADE. For _new_ databases, the constraint will still be created, and the next time you drop a column from the `nodes` table, or add a constraint to that table, the routes table will get cleared. So there's a big trap lurking for you in the future for databases that were not migrated from 0.22. There is a way to deal with this in sqlite: foreign key processing can be turned off temporarily using a pragma and then turned back on. The GORM migrator does this for alter column (https://github.com/glebarez/sqlite/blob/master/migrator.go#L79), but not for drop column/create constraint/drop constraint. I am not sure why that is and it seems like a bug. sqlite provides guidance on how to perform these migrations here: https://www.sqlite.org/lang_altertable.html - see section 7. Based on that, I would propose that the right way to solve the problem is to: 1. Disable foreign key processing prior to running the migrations when using sqlite 2. Run the migrations as they used to be (i.e. not conditioning the AutoMigrate call on whether the table already exists) 3. Use [PRAGMA foreign_key_check](https://www.sqlite.org/pragma.html#pragma_foreign_key_check) to make sure all foreign key constraints are followed properly post-migration 4. Turn foreign key processing back on I have tested this locally, and both test cases you added pass with my changes. Let me know if you would like me to submit a PR for this. I think following this approach will make life a lot easier in the future since you won't have to worry about two different upgrade paths (from a DB that has the constraints created on `routes` and from a DB that doesn't have the constraints created on `routes`)
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/headscale#773