[Bug] node randomly gets removed due to a panic #1019
Closed · opened 2025-12-29 02:27:33 +01:00 by adam · 23 comments
Originally created by @jemershaw on GitHub (May 15, 2025).
Is this a support request?
Is there an existing issue for this?
Current Behavior
Randomly, the connected client gets removed from headscale. This usually happens when it receives some map update or similar, and then the client starts throwing 404 errors because the node was removed from headscale.
Expected Behavior
I'd expect a device to always exist in headscale unless it was logged out on the client or deleted via headscale. It isn't an ephemeral node, so it shouldn't be GC'd.
Steps To Reproduce
I can't actually recreate it since it happens on random updates
Environment
Runtime environment
Debug information
I actually don't have the logs atm but I do have the panic.
And on the client I noticed this:
@jemershaw commented on GitHub (May 15, 2025):
For more information this started happening in version 0.23.0.
I did have the randomize_client_port: true setting - I'm trying to disable this to see if I still run into the issue or not.
@jemershaw commented on GitHub (May 15, 2025):
Okay I fixed the nil pointer, which might not be the best fix because now I'm getting a node with duplicate names but no panic.
@kradalby commented on GitHub (May 16, 2025):
Duplicate nodes sound strange, and I am surprised no one else has run into this if it's been present since 0.23.0.
What else is special about the node?
Does it happen to multiple nodes?
How is it logged in?
Does this happen randomly, or during login?
@lotheac commented on GitHub (Jun 12, 2025):
I am seeing a similar panic every now and then, but for me it's happening for nodes added via ephemeral keys only -- based on the logs, this happens after the inactivity timer expires after the previous connection. I'm not totally sure if this is the same problem, but at least the panic stacktrace matches so I am commenting here.
The strange thing here is that while the tailscaled logs on the node do indicate it received an error from headscale around 1749532343 (2025-06-10 05:12:23 UTC), it seems that it operated normally afterward, performing reconfigs etc., up until headscale deleted it a little over 24 hours later. Here are the corresponding tailscaled logs:
This is on headscale v0.26.1 & tailscale v1.84.0.
@jemershaw commented on GitHub (Jun 21, 2025):
Okay, I figured out the issue. It seems like if you add an ephemeral node and it gets added as nodeId=10, and then you remove it, it will get scheduled for deletion. Then if you join a new node that gets the same nodeId, it will be removed when the GC runs.
I did try this with a fresh build and it seems like it always gives new nodeIDs. The DB that I'm using was created from version 0.22.3, so I'm not sure what is triggering it. I copied the database over, and even after deleting everything using sqlite3 it still had issues. I'm not sure what would trigger this?
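As a rough illustration of the mechanism (a made-up two-column table, not headscale's actual schema), this is how sqlite3 behaves with and without AUTOINCREMENT on the primary key:

```sql
-- Without AUTOINCREMENT, SQLite picks max(rowid)+1 for the next row,
-- so the id of a deleted trailing row can be handed out again.
CREATE TABLE nodes_plain (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO nodes_plain (name) VALUES ('a'), ('b');   -- ids 1 and 2
DELETE FROM nodes_plain WHERE id = 2;                 -- ephemeral node scheduled for GC
INSERT INTO nodes_plain (name) VALUES ('c');          -- new node gets id 2 again

-- With AUTOINCREMENT, sqlite_sequence keeps the high-water mark,
-- so ids of deleted rows are never reused.
CREATE TABLE nodes_auto (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
INSERT INTO nodes_auto (name) VALUES ('a'), ('b');    -- ids 1 and 2
DELETE FROM nodes_auto WHERE id = 2;
INSERT INTO nodes_auto (name) VALUES ('c');           -- new node gets id 3
```

In the first case, if the delayed deletion later fires against id 2, it hits the newly joined node instead of the ephemeral one.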
@jemershaw commented on GitHub (Jun 21, 2025):
Okay I think this might be the problem:
Existing DB
Fresh DB
@jemershaw commented on GitHub (Jun 21, 2025):
I ran this on an existing db that had issues, and after running it, things are good. I'll try to figure out how to add this as a db migration or something.
@lotheac commented on GitHub (Jun 23, 2025):
@jemershaw good find! however, on my side the nodes table already looks like it's supposed to and I'm still seeing nodes removed.
@jemershaw commented on GitHub (Jun 23, 2025):
@lotheac you can always try my branch https://github.com/jemershaw/headscale which does address the nil pointer but since I couldn't reproduce I wasn't sure it was the best fix. That's when I discovered the issue with the deleted nodes. If I hit the panic again I'll dig further.
@jemershaw commented on GitHub (Jun 25, 2025):
It turns out that gorm changed the default behavior for creating the primary key in the version used between 0.22.3 and 0.23.0, and headscale doesn't check whether the schema matches. To fix this, and hopefully handle any future upgrades + migrations, I had to do a full database rebuild using temporary tables and copying the data over, since you can't just alter the id column to autoincrement. If it does become a db migration, the ids would need to be mapped for a few other tables to handle the inserts.
This is the fix if someone wants to run it manually, but it might make sense for @kradalby or someone else who maintains this to suggest how to best handle it. I can add the below as a full migration so that everyone is on the happy path going forward.
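As a rough sketch of that temporary-table rebuild (placeholder column names, not the actual headscale schema):

```sql
-- Sketch only: rebuild the nodes table so the primary key uses AUTOINCREMENT,
-- copying the existing rows across with their current ids.
BEGIN TRANSACTION;

CREATE TABLE nodes_new (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    machine_key TEXT,
    node_key    TEXT,
    given_name  TEXT
);

-- Explicit-id inserts into an AUTOINCREMENT table also advance
-- sqlite_sequence, so future ids continue above the copied maximum.
INSERT INTO nodes_new (id, machine_key, node_key, given_name)
    SELECT id, machine_key, node_key, given_name FROM nodes;

DROP TABLE nodes;
ALTER TABLE nodes_new RENAME TO nodes;

COMMIT;
```

Tables that reference node ids would need the same kind of copy, which is where the id mapping mentioned above comes in.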
Another idea would be to have an integration test that builds headscale at version 0.22.3, runs the full upgrade path, and compares the database schema to make sure it matches. If it doesn't, then the underlying ORM changed functionality and could potentially break assumptions about what the db structure looks like.
@kradalby commented on GitHub (Jun 25, 2025):
Yes, it does not match. I really dislike GORM and it has caused me so much headache. @nblock discovered that the schema does not end up the same.
I have a draft pr (https://github.com/juanfont/headscale/pull/2617) for:
However, I have not had time to finish it, and I'm not sure how to test it yet.
It will not do any changes to Postgres.
@kradalby commented on GitHub (Jul 2, 2025):
I'm working on this over in #2617; more databases to add to the tests would be greatly appreciated. Schema only helps, but with data it would be even better.
I think we (@nblock) have a script for helping us randomise the data.
find my email in my profile
@lotheac commented on GitHub (Jul 2, 2025):
@kradalby I am confused. Your recent comments read to me like they are AI-generated, and so do your recent pull requests.
for example:
this is nonsensical. what data? why does it need to be randomised? why is your email relevant?
@kradalby commented on GitHub (Jul 2, 2025):
Sorry, let me try again. I use some AI, but not to write these, not sure if it is a compliment or not :P.
The state of the migrations and databases in headscale is a massive mess. It has not really been addressed, and we didn't really realise how bad it was until we opened the can of worms.
There are a lot of databases out there which have not been properly migrated, and it is causing all sorts of weird behaviour.
@nblock discovered some time ago that one of his old databases had a different schema than a newly created one, which explains a lot as the migration was very broken.
I started the effort in #2617 to clean this up. Essentially we will introduce a file describing how the schema should end up, so we can validate every time we migrate, so this does not happen again.
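As a sketch, that validation could be as simple as diffing the live schema, read from sqlite_master, against the expected definition:

```sql
-- Dump every table's CREATE statement so it can be compared against the
-- expected schema on each migration.
SELECT name, sql
FROM sqlite_master
WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
ORDER BY name;
```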
As part of this, I am trying to write a migration which resets everything to how it should be, but I can only fix breakage that I know about. So the more potentially broken databases I have access to, the better the coverage of the testing will be, and hopefully we will be done with this.
Since this issue seems related to broken database tables, having a copy of your database would be very valuable.
The randomisation is relevant if you send us an database and are concerned about the data: we can change the data but keep the "bad behaviour". Same for the email, that is the place to send it.
If you have your database from before and after the fix you mentioned, that would be meaningful.
@lotheac commented on GitHub (Jul 2, 2025):
Okay, I think I see what you might have meant. It is not easy to decipher meaning from your messages. Sorry for calling you a bot.
When you say "send us an database" you mean that you wanted to be sent the data that exists in a previously created database to help you test and verify the refactoring in #2617.
When you say "find my email in my profile" you meant "find whatever script my colleague posted somewhere (maybe?), use it to anonymize your data before sending it to me, and then send it to me over email".
@kradalby commented on GitHub (Jul 3, 2025):
Yes, ideally I just want databases with real data in them that don't work as expected, so we can fix it. If you are concerned about the data, it would be good if you try to change things up to hide PII and so on. We have some tools, but they don't cover all cases.
@lotheac commented on GitHub (Jul 3, 2025):
alright, I think I can anonymize and send you the DB contents the next time I see the issue with disappearing ephemeral nodes. I don't have a reliable repro so it doesn't happen often; the previous occasion was on June 20th.
@kradalby commented on GitHub (Jul 3, 2025):
Great thank you.
@jemershaw, if you have a backup of your database before you applied the fixes mentioned, that would be really valuable, your 0.22.3 database.
@jemershaw commented on GitHub (Jul 9, 2025):
@kradalby I can give you a backup if needed, but all you need to do is create a fresh db using version 0.22.3 of headscale; upgrades will then keep the existing structure.
Happy to provide instructions or a db backup if you can't reproduce.
@kradalby commented on GitHub (Jul 10, 2025):
Yes please
@lotheac commented on GitHub (Aug 26, 2025):
I saw this issue happen again today. I've sent the details and anonymized db dump to @kradalby's email.
@kradalby commented on GitHub (Sep 10, 2025):
I suspect this is the same as https://github.com/juanfont/headscale/issues/2698 and https://github.com/juanfont/headscale/issues/2697, which should be fixed in https://github.com/juanfont/headscale/pull/2670
@lotheac commented on GitHub (Sep 17, 2025):
I updated to a local build at commit 3f6657ae last week, and have not seen this issue since, despite numerous scale-ups/downs of ephemeral nodes, so I would say your fix worked. Thanks!