[Bug] node randomly gets removed due to a panic #1019

Closed
opened 2025-12-29 02:27:33 +01:00 by adam · 23 comments

Originally created by @jemershaw on GitHub (May 15, 2025).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Randomly, a connected client will get removed from headscale. This usually happens when it receives some map update, and then the client starts throwing 404 errors because the node was removed from headscale.

Expected Behavior

I'd expect a device to always exist in headscale unless it was logged out on the client or deleted via headscale. It isn't an ephemeral node, so it shouldn't be GC'd.

Steps To Reproduce

I can't actually recreate it, since it happens on random updates.

Environment

- OS: Ubuntu 20.04
- Headscale version: 0.26.0
- Tailscale version: 1.72.1

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Debug information

I actually don't have the logs atm but I do have the panic.

2025-05-14T23:39:41Z TRC ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:612 > received stream update: StatePeerChanged  node=7e65d254fe node.id=147 omitPeers=false readOnly=false stream=true
2025-05-14T23:39:41Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:622 > Could not get machine from db error="record not found" node=7e65d254fe node.id=147 omitPeers=false readOnly=false stream=true
2025/05/14 23:39:41 http2: panic serving <public_ip>:51788: runtime error: invalid memory address or nil pointer dereference
goroutine 6113 [running]:
golang.org/x/net/http2.(*serverConn).runHandler.func1()
        /home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2426 +0x13e
panic({0x1f530a0?, 0x3ce1af0?})
        /nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/runtime/panic.go:792 +0x132
github.com/juanfont/headscale/hscontrol.(*mapSession).serveLongPoll.func1()
        /home/runner/work/headscale/headscale/hscontrol/poll.go:203 +0xaa
github.com/juanfont/headscale/hscontrol.(*mapSession).serveLongPoll(0xc000673080)
        /home/runner/work/headscale/headscale/hscontrol/poll.go:289 +0x16a5
github.com/juanfont/headscale/hscontrol.(*noiseServer).NoisePollNetMapHandler(0xc00095e090, {0x2af59d0, 0xc000a0e2e0}, 0xc00024f180)
        /home/runner/work/headscale/headscale/hscontrol/noise.go:229 +0x258
net/http.HandlerFunc.ServeHTTP(0xc0009eb170?, {0x2af59d0?, 0xc000a0e2e0?}, 0x7fe6d4809108?)
        /nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/net/http/server.go:2294 +0x29
github.com/juanfont/headscale/hscontrol.prometheusMiddleware.func1({0x2af59d0, 0xc000a0e2e0}, 0xc00024f180)
        /home/runner/work/headscale/headscale/hscontrol/metrics.go:82 +0x18e
net/http.HandlerFunc.ServeHTTP(0xc00024f040?, {0x2af59d0?, 0xc000a0e2e0?}, 0x134?)
        /nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/net/http/server.go:2294 +0x29
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000934900, {0x2af59d0, 0xc000a0e2e0}, 0xc00024e000)
        /home/runner/go/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212 +0x1e2
golang.org/x/net/http2.(*serverConn).runHandler(0x44d2f2?, 0x0?, 0x0?, 0xc0006e65f0?)
        /home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2433 +0xf5
created by golang.org/x/net/http2.(*serverConn).scheduleHandler in goroutine 6096
        /home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2367 +0x21d

And on the client I noticed this:

2025/05/14 22:54:53 Switching ipn state Starting -> Running (WantRunning=true, nm=true)
2025/05/14 22:54:53 health(warnable=warming-up): ok
2025/05/14 22:54:53 derphttp.Client.Connect: connecting to derp-13 (den)
2025/05/14 22:54:53 magicsock: endpoints changed: <public_ip>:49444 (stun), 10.253.4.198:49444 (local)
2025/05/14 22:54:53 control: NetInfo: NetInfo{varies=false hairpin= ipv6=false ipv6os=true udp=true icmpv4=false derp=#13 portmap= link=\"\" firewallmode=\"ipt-default\"}
2025/05/14 22:54:54 magicsock: derp-13 connected; connGen=1
2025/05/14 22:54:54 health(warnable=no-derp-connection): ok
2025/05/14 22:54:54 health(warnable=no-derp-home): ok
2025/05/14 22:54:54 [RATELIMIT] format(\"health(warnable=%s): ok\")
2025/05/14 22:54:54 wgengine: Reconfig: configuring userspace WireGuard config (with 1/2 peers)
2025/05/14 22:54:54 magicsock: disco: node [+sf/K] d:cd1b3a0e7830e0c1 now using 172.18.0.2:35944 mtu=1360 tx=db539cac8e86
2025/05/14 23:29:47 magicsock: disco: node [+sf/K] d:cd1b3a0e7830e0c1 now using 10.253.0.1:35944 mtu=1360 tx=832ae4688c32
2025/05/14 23:32:20 magicsock: disco: node [+sf/K] d:cd1b3a0e7830e0c1 now using 172.18.0.2:35944 mtu=1360 tx=c16c99e7e7ac
2025/05/14 23:39:41 Received error: PollNetMap: stream error: stream ID 5; INTERNAL_ERROR; received from peer
2025/05/14 23:39:41 Received error: PollNetMap: initial fetch failed 404: node not found
adam added the bug label 2025-12-29 02:27:33 +01:00
adam closed this issue 2025-12-29 02:27:33 +01:00

@jemershaw commented on GitHub (May 15, 2025):

For more information: this started happening in version 0.23.0.

I did have the `randomize_client_port: true` setting; I'm trying to disable it to see if I still run into the issue or not.


@jemershaw commented on GitHub (May 15, 2025):

Okay, I fixed the nil pointer, which might not be the best fix, because now I'm getting nodes with duplicate names but no panic.


@kradalby commented on GitHub (May 16, 2025):

Duplicate nodes sounds strange, and I am surprised no one else has run into this if it's been present since 0.23.0.

What else is special about the node?
Does it happen to multiple nodes?
How is it logged in?
Does this happen randomly, or during login?


@lotheac commented on GitHub (Jun 12, 2025):

I am seeing a similar panic every now and then, but for me it's happening for nodes added via ephemeral keys only -- based on the logs, this happens after the inactivity timer expires after the previous connection. I'm not totally sure if this is the same problem, but at least the panic stacktrace matches so I am commenting here.

{"level":"info","caller":"/home/runner/work/headscale/headscale/hscontrol/poll.go:602","readOnly":false,"omitPeers":false,"stream":true,"node.id":284,"node":"cax31-6aba12f7f55523c3","time":1749532343,"message":"node has disconnected, mapSession: 0x400052c900, chan: 0x4000314380"}
...
{"level":"error","caller":"/home/runner/work/headscale/headscale/hscontrol/poll.go:622","readOnly":false,"omitPeers":false,"stream":true,"node.id":284,"node":"cax31-6aba12f7f55523c3","error":"record not found","time":1749618757,"message":"Could not get machine from db"}
2025/06/11 05:12:37 http2: panic serving 10.128.11.73:60504: runtime error: invalid memory address or nil pointer dereference
goroutine 1312072 [running]:
golang.org/x/net/http2.(*serverConn).runHandler.func1()
        /home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2426 +0x138
panic({0x191df20?, 0x3666000?})
        /nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/runtime/panic.go:792 +0x124
github.com/juanfont/headscale/hscontrol.(*mapSession).serveLongPoll.func1()
        /home/runner/work/headscale/headscale/hscontrol/poll.go:203 +0x80
github.com/juanfont/headscale/hscontrol.(*mapSession).serveLongPoll(0x400157b380)
        /home/runner/work/headscale/headscale/hscontrol/poll.go:289 +0x12c0
github.com/juanfont/headscale/hscontrol.(*noiseServer).NoisePollNetMapHandler(0x40018a2b40, {0x24d1e10, 0x400008a680}, 0x40002a4500)
        /home/runner/work/headscale/headscale/hscontrol/noise.go:229 +0x1cc
net/http.HandlerFunc.ServeHTTP(0x1400000030?, {0x24d1e10?, 0x400008a680?}, 0x140?)
        /nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/net/http/server.go:2294 +0x38
github.com/juanfont/headscale/hscontrol.prometheusMiddleware.func1({0x24d1e10, 0x400008a680}, 0x40002a4500)
        /home/runner/work/headscale/headscale/hscontrol/metrics.go:82 +0x204
net/http.HandlerFunc.ServeHTTP(0x40002a4000?, {0x24d1e10?, 0x400008a680?}, 0x10?)
        /nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/net/http/server.go:2294 +0x38
github.com/gorilla/mux.(*Router).ServeHTTP(0x40002e7500, {0x24d1e10, 0x400008a680}, 0x4000666dc0)
        /home/runner/go/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212 +0x194
golang.org/x/net/http2.(*serverConn).runHandler(0x40018fe2d0?, 0x24d4c30?, 0x40018fe780?, 0x40005c5df0?)
        /home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2433 +0xe8
created by golang.org/x/net/http2.(*serverConn).scheduleHandler in goroutine 1311866
        /home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2367 +0x208
{"level":"error","code":404,"time":1749618757,"message":"user msg: node not found"}
{"level":"error","code":404,"time":1749618757,"message":"user msg: node not found"}

The strange thing here is that while the tailscaled logs on the node do indicate it received an error from headscale around 1749532343 (2025-06-10 05:12:23 UTC), it seems that it operated normally afterward, performing reconfigs etc., up until headscale deleted it a little over 24 hours later. Here are the corresponding tailscaled logs:

2025/06/10 04:49:29 magicsock: 1 active derp conns: derp-28=cr358h29m0s,wr258h5m0s
2025/06/10 04:49:29 post-rebind ping of DERP region 28 okay
2025/06/10 05:08:13 monitor: RTM_NEWROUTE: src=10.128.11.22/0, dst=10.128.6.0/24, gw=10.128.11.22, outif=5, table=254
2025/06/10 05:08:27 wgengine: Reconfig: configuring userspace WireGuard config (with 0/104 peers)
2025/06/10 05:08:27 wgengine: Reconfig: configuring router
2025/06/10 05:08:27 wgengine: Reconfig: user dialer
2025/06/10 05:08:27 tsdial: bart table size: 107
2025/06/10 05:08:27 wgengine: Reconfig: configuring DNS
2025/06/10 05:08:27 dns: Set: {DefaultResolvers:[] Routes:{} SearchDomains:[] Hosts:121}
2025/06/10 05:08:27 dns: Resolvercfg: {Routes:{} Hosts:121 LocalDomains:[]}
2025/06/10 05:08:27 dns: OScfg: {}
2025/06/10 05:12:23 Received error: PollNetMap: unexpected EOF
2025/06/10 05:12:25 [RATELIMIT] format("control: [v\x00JSON]%d%s")
2025/06/10 05:13:18 [RATELIMIT] format("control: [v\x00JSON]%d%s") (1 dropped)
2025/06/10 05:18:09 wgengine: Reconfig: [Wt8cc] changed from "discokey:f29557b5719bd96125caef730dba65f0700c83fdf5f11486d6a7ce0e081d9b6e" to "discokey:c5885d8ddeeaf44323b3ce28544ba44bee1141ebebcb3a0298e21b602508465e"
2025/06/10 06:46:02 wgengine: Reconfig: [OCUoj] changed from "discokey:5b19175771c5914cec229f1f5180802393fd674a8ba3dadfccb028acb5f3f74a" to "discokey:55fe7379ea9ecbcbbfe98339697f18d210d525b54fb578600000f8775c4dee1f"
2025/06/10 07:48:36 wgengine: Reconfig: [OGNfu] changed from "discokey:fd6cf2bd8b37cc1b2893107d1d835d4e6eb34c20840fdf09cfc259faea84ca41" to "discokey:77af73e5f9ce453418b800b23fd06badf06833a81ff25ad15167c4a1a044dd0f"
2025/06/10 07:54:22 wgengine: Reconfig: [c8WIz] changed from "discokey:3b3000932465f7be2baac1e8cce4bb1864367eedd77c1f8e568c5d8842ae6e55" to "discokey:ab6a981b359e0cb1ee18bf7406145b08ac0d64d45723d297b6db5de32b200e6c"
2025/06/10 08:00:04 wgengine: Reconfig: [Vg2rz] changed from "discokey:5404f37314fce0c9f91c6dde34fe42aa3b1448d9b08a9c2d467031b555bfc379" to "discokey:6ca3e586f53e52108bf3ef6f8d45fc829228f59c6515e6543c09dbea17d0fb63"
2025/06/10 08:09:41 wgengine: Reconfig: [O1Cad] changed from "discokey:368beb0999c5f8492686a79a39aac5f5678ad066e3f310f4255733744ea19d64" to "discokey:7f14c4350a391eb577c0ea3683b08d3b49215d98ab18ec239dd75fa3d9af756e"
2025/06/10 08:34:17 wgengine: Reconfig: [WjTrf] changed from "discokey:478ca31883819fbb901e1ebb76e4e494b3dbc957b9f641db96bfa77806260d3e" to "discokey:bd10a7783d6fcdd60f6c07f1c2f78ecf35c0ac2443d345d88b0b82add1f7a70c"
2025/06/10 08:58:14 wgengine: Reconfig: [CBO9q] changed from "discokey:d51b4140afab9dbd7bbbb1b477bfb0eb8d90e6f6ae2fc2b768edce0a5c52707b" to "discokey:04db48512c882916605a05d7afedaa34e5effcd6fbc11b03d2efce70ad76e216"
2025/06/10 09:09:53 wgengine: Reconfig: [P7uKb] changed from "discokey:68294292c2f9dd4f3241b68acb355becd42f4a73ab68a2d217e280a2b06bb82c" to "discokey:470064dc88290074d6c962cbefc800989508dd65e999829fc8285d89837c2313"
2025/06/10 11:54:31 wgengine: Reconfig: [MUC2T] changed from "discokey:1c2e2cb22f28fd33590e71bc5b58f72a2b28f4a7c390f4e6fd4d98504c31100e" to "discokey:488e775af430ec34712707e71af8331069325550d4e31c78c7d463b3f28e1d7b"
2025/06/10 12:01:45 wgengine: Reconfig: [JZ6HW] changed from "discokey:15add537794dd64b70ad88017092d6b5c828775cb37e0a308fdb065524bc5f0a" to "discokey:a37e3816d02833a7b1ffbf42af2c93296e61de1b2edcf59f9060d3cfa68aa95d"
2025/06/10 14:42:17 wgengine: Reconfig: [c8WIz] changed from "discokey:ab6a981b359e0cb1ee18bf7406145b08ac0d64d45723d297b6db5de32b200e6c" to "discokey:3276e6b921c062ac61819f640e8c59288c472aa635e0f49d3b752892ed9e1c66"
2025/06/10 15:03:05 wgengine: Reconfig: [tO+bK] changed from "discokey:bd9a8fc4712dc40b1cbfb81e346299cafe1050adcfeebd3328c0b81e96de0868" to "discokey:2c5c2420c99fc7db654ef9deafd7923894d19df8c447a3e9448395d50da8461a"
2025/06/10 15:03:07 [RATELIMIT] format("control: [v\x00JSON]%d%s")
2025/06/10 15:07:22 wgengine: Reconfig: [tO+bK] changed from "discokey:2c5c2420c99fc7db654ef9deafd7923894d19df8c447a3e9448395d50da8461a" to "discokey:c3a8399eb6a02c8735802caf79ce3948c7dc40afd54f547e246977ec62c6ac19"
2025/06/10 15:29:00 wgengine: Reconfig: [JZ6HW] changed from "discokey:a37e3816d02833a7b1ffbf42af2c93296e61de1b2edcf59f9060d3cfa68aa95d" to "discokey:80325ccfb9b3b5708fed511f6184a1a1d8ccd393236abbec721d22d9bda20458"
2025/06/11 04:23:57 wgengine: Reconfig: [JZ6HW] changed from "discokey:80325ccfb9b3b5708fed511f6184a1a1d8ccd393236abbec721d22d9bda20458" to "discokey:edda4f33ccae57221e21dd6c87214706532f65d56122c0e3d161f71566889170"
2025/06/11 05:12:37 Received error: PollNetMap: stream error: stream ID 1; INTERNAL_ERROR; received from peer
2025/06/11 05:12:37 Received error: PollNetMap: initial fetch failed 404: node not found
2025/06/11 05:12:37 Received error: PollNetMap: initial fetch failed 404: node not found

This is on headscale v0.26.1 & tailscale v1.84.0.


@jemershaw commented on GitHub (Jun 21, 2025):

Okay, I figured out the issue. It seems that if you add an ephemeral node and it gets added as nodeId=10, and then you remove it, it gets scheduled for deletion. Then, if you join a new node that gets the same nodeId, it will be removed when GC runs.

I did try this with a fresh build and it seems like it always gives new nodeIDs. The DB that I'm using was created from version 0.22.3, so I'm not sure if there is anything that triggered it. I copied the database over, and even deleted everything using sqlite3, and it still had issues. I'm not sure what would trigger this?

HEADSCALE_URL=     # your headscale URL
EPHEMERAL_KEY=     # an ephemeral pre-auth key
REGULAR_KEY=       # a regular pre-auth key; the original `$(heads...` command was truncated here
docker rm --force tailscale-test-removal;

docker run -it -d --rm \
            --name tailscale-test-removal \
            --hostname tailscale-test-removal \
            --privileged \
            --entrypoint=/bin/sh \
            tailscale/tailscale:v1.72.1;

# Start tailscaled
docker exec tailscale-test-removal /bin/sh -c "tailscaled -no-logs-no-support &"

# Connecting onboarding..
docker exec tailscale-test-removal /bin/sh -c "tailscale up --accept-routes --login-server $HEADSCALE_URL --accept-dns=false --auth-key $EPHEMERAL_KEY"

docker exec tailscale-test-removal tailscale status --json | jq .Self.Online
sleep 5
docker exec tailscale-test-removal tailscale status --json | jq .Self.Online

# Kill tailscaled
docker exec tailscale-test-removal /bin/sh -c "tailscale logout"
sleep 2
docker exec tailscale-test-removal tailscale status --json | jq .Self.Online
docker exec tailscale-test-removal pkill -9 tailscaled
docker exec tailscale-test-removal rm -rf /var/lib/tailscale

sleep 15

# Start tailscaled
docker exec tailscale-test-removal /bin/sh -c "tailscaled -no-logs-no-support &"

# Connecting tpm
docker exec tailscale-test-removal /bin/sh -c "tailscale up --accept-routes --login-server $HEADSCALE_URL --accept-dns=false --auth-key $REGULAR_KEY"

sleep 5
docker exec tailscale-test-removal tailscale status --json | jq .Self.Online

# Eventually the node will be removed once GC runs (I have `ephemeral_node_inactivity_timeout: 5m` in the config)
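
For context on why id reuse matters here: this matches SQLite's documented rowid behaviour. A plain INTEGER PRIMARY KEY usually hands out max(rowid)+1, so deleting the highest row lets the next insert take the same id, while AUTOINCREMENT records a high-water mark in sqlite_sequence and never reuses ids. A minimal sqlite3 session demonstrating the difference (table and column names are illustrative, not headscale's schema):

-- Without AUTOINCREMENT: ids are reused once the highest row is deleted.
CREATE TABLE plain (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO plain (name) VALUES ('ephemeral-node');   -- gets id 1
DELETE FROM plain WHERE id = 1;                       -- node logged out / GC'd
INSERT INTO plain (name) VALUES ('new-node');         -- reuses id 1

-- With AUTOINCREMENT: ids come from sqlite_sequence and never recur.
CREATE TABLE autoinc (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
INSERT INTO autoinc (name) VALUES ('ephemeral-node'); -- gets id 1
DELETE FROM autoinc WHERE id = 1;
INSERT INTO autoinc (name) VALUES ('new-node');       -- gets id 2

SELECT 'plain' AS tbl, id, name FROM plain
UNION ALL
SELECT 'autoinc', id, name FROM autoinc;
-- plain|1|new-node
-- autoinc|2|new-node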

@jemershaw commented on GitHub (Jun 21, 2025):

Okay I think this might be the problem:

Existing DB

sqlite> SELECT sql from sqlite_master where name="nodes";
CREATE TABLE "nodes"  (`id` integer,`machine_key` text,`node_key` text,`disco_key` text,`hostname` text,`given_name` varchar(63),`user_id` integer,`register_method` text,`forced_tags` text,`auth_key_id` integer,`last_seen` datetime,`expiry` datetime,`host_info` text,`endpoints` text,`created_at` datetime,`updated_at` datetime,`deleted_at` datetime,`ipv4` text,`ipv6` text, `approved_routes` text,PRIMARY KEY (`id`),CONSTRAINT `fk_nodes_user` FOREIGN KEY (`user_id`) REFERENCES `users`(`id`) ON DELETE CASCADE,CONSTRAINT `fk_nodes_auth_key` FOREIGN KEY (`auth_key_id`) REFERENCES `pre_auth_keys`(`id`))

Fresh DB

sqlite> SELECT sql from sqlite_master where name="nodes";
sql
CREATE TABLE "nodes"  (`id` integer PRIMARY KEY AUTOINCREMENT,`machine_key` text,`node_key` text,`disco_key` text,`endpoints` text,`host_info` text,`ipv4` text,`ipv6` text,`hostname` text,`given_name` varchar(63),`user_id` integer,`register_method` text,`forced_tags` text,`auth_key_id` integer,`expiry` datetime,`last_seen` datetime,`approved_routes` text,`created_at` datetime,`updated_at` datetime,`deleted_at` datetime,CONSTRAINT `fk_nodes_user` FOREIGN KEY (`user_id`) REFERENCES `users`(`id`) ON DELETE CASCADE,CONSTRAINT `fk_nodes_auth_key` FOREIGN KEY (`auth_key_id`) REFERENCES `pre_auth_keys`(`id`))

@jemershaw commented on GitHub (Jun 21, 2025):

I ran this on an existing db that had issues, and after running it the db is good. I'll try to figure out how to add this as a db migration or something.

-- 1. Create new table
CREATE TABLE "nodes_new"  (`id` integer PRIMARY KEY AUTOINCREMENT,`machine_key`
text,`node_key` text,`disco_key` text,`endpoints` text,`host_info` text,`ipv4`
text,`ipv6` text,`hostname` text,`given_name` varchar(63),`user_id`
integer,`register_method` text,`forced_tags` text,`auth_key_id` integer,`expiry`
datetime,`last_seen` datetime,`approved_routes` text,`created_at`
datetime,`updated_at` datetime,`deleted_at` datetime,CONSTRAINT `fk_nodes_user`
FOREIGN KEY (`user_id`) REFERENCES `users`(`id`) ON DELETE CASCADE,CONSTRAINT
`fk_nodes_auth_key` FOREIGN KEY (`auth_key_id`) REFERENCES
`pre_auth_keys`(`id`));


-- 2. Copy data
INSERT INTO nodes_new (machine_key, node_key, disco_key, endpoints,
host_info, ipv4, ipv6, hostname, given_name, user_id, register_method,
forced_tags, auth_key_id, expiry, last_seen, approved_routes, created_at,
updated_at, deleted_at) SELECT machine_key, node_key, disco_key, endpoints,
host_info, ipv4, ipv6, hostname, given_name, user_id, register_method,
forced_tags, auth_key_id, expiry, last_seen, approved_routes, created_at,
updated_at, deleted_at FROM nodes;

-- 3. Drop old table
DROP table nodes;

-- 4. rename new table
ALTER table nodes_new RENAME TO nodes;
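
As a sanity check after the rebuild: once a table is created with AUTOINCREMENT and has received at least one insert, SQLite tracks its id high-water mark in sqlite_sequence, so this should return a row for nodes (assuming the table is non-empty):

SELECT name, seq FROM sqlite_sequence WHERE name = 'nodes';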

@lotheac commented on GitHub (Jun 23, 2025):

@jemershaw good find! however, on my side the nodes table already looks like it's supposed to and I'm still seeing nodes removed.

sqlite> .schema nodes
CREATE TABLE IF NOT EXISTS "nodes"  (`id` integer PRIMARY KEY AUTOINCREMENT,`machine_key` text,`node_key` text,`disco_key` text,`endpoints` text,`host_info` text,`ipv4` text,`ipv6` text,`hostname` text,`given_name` varchar(63),`user_id` integer,`register_method` text,`forced_tags` text,`auth_key_id` integer,`last_seen` datetime,`expiry` datetime,`created_at` datetime,`updated_at` datetime,`deleted_at` datetime, `approved_routes` text,CONSTRAINT `fk_nodes_user` FOREIGN KEY (`user_id`) REFERENCES `users`(`id`) ON DELETE CASCADE,CONSTRAINT `fk_nodes_auth_key` FOREIGN KEY (`auth_key_id`) REFERENCES `pre_auth_keys`(`id`) ON DELETE SET NULL);

@jemershaw commented on GitHub (Jun 23, 2025):

@lotheac you can always try my branch https://github.com/jemershaw/headscale, which does address the nil pointer, but since I couldn't reproduce it I wasn't sure it was the best fix. That's when I discovered the issue with the deleted nodes. If I hit the panic again I'll dig further.


@jemershaw commented on GitHub (Jun 25, 2025):

It turns out that gorm changed the default behavior for creating the primary key in the version used between 0.22.3 and 0.23.0, and headscale doesn't check that the schema matches. To fix this and hopefully make future upgrades + migrations safe, I had to do a full database rebuild using temporary tables and copy the data over, since you can't just alter the id column to AUTOINCREMENT in place. If it does become a db migration, there would need to be id-mapping tables for a few other tables to handle the inserts.

This is the fix if someone wants to run it manually, but it might make sense for @kradalby or someone else who maintains this to provide suggestions on how best to handle it. I can add the below as a full migration, so then everyone would be on the happy path going forward.

Another idea would be an integration test that builds headscale at version 0.22.3, runs the full upgrade path, and compares the database schema to make sure it matches. If it doesn't, the underlying ORM changed functionality and could potentially break assumptions about what the db structure looks like.

-- Start transaction
BEGIN TRANSACTION;

-- Step 1: Create new tables from existing tables

-- Disable fk checks so we can migrate data
PRAGMA foreign_keys = OFF;

-- create temporary user id map and new users tables
CREATE TABLE user_id_map (
    old_id INTEGER,
    new_id INTEGER
);
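-- (The rebuilt users_new table re-issues ids via AUTOINCREMENT, so user ids
-- can change; this temporary map lets pre_auth_keys and nodes re-point their
-- user_id foreign keys to the new ids.)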

CREATE TABLE `users_new` (`id` integer PRIMARY KEY AUTOINCREMENT,`created_at` datetime,`updated_at` datetime,`deleted_at` datetime,`name` text,`display_name` text,`email` text,`provider_identifier` text,`provider` text,`profile_pic_url` text);

-- load users
INSERT INTO users_new (created_at, updated_at, deleted_at, name, display_name, email, provider_identifier, provider, profile_pic_url) 
  SELECT created_at, updated_at, deleted_at, name, display_name, email, provider_identifier, provider, profile_pic_url FROM users;

-- create user_id mapping
INSERT INTO user_id_map (old_id, new_id)
SELECT u_old.id, u_new.id
FROM users u_old
JOIN users_new u_new ON u_old.name = u_new.name;

-- create new_pre_auth_keys table
CREATE TABLE `pre_auth_keys_new` (`id` integer PRIMARY KEY AUTOINCREMENT,`key` text,`user_id` integer,`reusable` numeric,`ephemeral` numeric DEFAULT false,`used` numeric DEFAULT false,`tags` text,`created_at` datetime,`expiration` datetime,CONSTRAINT `fk_pre_auth_keys_user` FOREIGN KEY (`user_id`) REFERENCES `users_new`(`id`) ON DELETE SET NULL);

INSERT INTO pre_auth_keys_new (key, user_id, reusable, ephemeral, used, tags, created_at, expiration) 
  SELECT key, m.new_id as user_id, reusable, ephemeral, used, tags, created_at, expiration FROM pre_auth_keys p
  JOIN user_id_map m ON p.user_id = old_id;


-- create new_nodes table
CREATE TABLE IF NOT EXISTS "nodes_new"  (`id` integer PRIMARY KEY AUTOINCREMENT,`machine_key` text,`node_key` text,`disco_key` text,`endpoints` text,`host_info` text,`ipv4` text,`ipv6` text,`hostname` text,`given_name` varchar(63),`user_id` integer,`register_method` text,`forced_tags` text,`auth_key_id` integer,`expiry` datetime,`last_seen` datetime,`approved_routes` text,`created_at` datetime,`updated_at` datetime,`deleted_at` datetime,CONSTRAINT `fk_nodes_user` FOREIGN KEY (`user_id`) REFERENCES `users_new`(`id`) ON DELETE CASCADE,CONSTRAINT `fk_nodes_auth_key` FOREIGN KEY (`auth_key_id`) REFERENCES `pre_auth_keys_new`(`id`));

INSERT INTO nodes_new (machine_key, node_key, disco_key, endpoints, host_info, ipv4, ipv6, hostname, given_name, user_id, register_method, forced_tags, auth_key_id, expiry, last_seen, approved_routes, created_at, updated_at, deleted_at)
SELECT machine_key, node_key, disco_key, endpoints, host_info, ipv4, ipv6, hostname, given_name, m.new_id as user_id, register_method, forced_tags, auth_key_id, expiry, last_seen, approved_routes, created_at, updated_at, deleted_at FROM nodes n
JOIN user_id_map m ON n.user_id = old_id;

-- create new api_keys table
CREATE TABLE `api_keys_new` (`id` integer PRIMARY KEY AUTOINCREMENT,`prefix` text,`hash` blob,`created_at` datetime,`expiration` datetime,`last_seen` datetime);
INSERT INTO api_keys_new (prefix, hash, created_at, expiration, last_seen)
SELECT prefix, hash, created_at, expiration, last_seen FROM api_keys;


-- drop temporary map tables
DROP table user_id_map;

-- Step 2: Table promotion - drop existing tables and rename new tables

-- promote users_new table
DROP TABLE users;
ALTER table users_new RENAME TO users;
CREATE INDEX `idx_users_deleted_at` ON `users`(`deleted_at`);
CREATE UNIQUE INDEX idx_provider_identifier ON users (provider_identifier) WHERE provider_identifier IS NOT NULL;
CREATE UNIQUE INDEX idx_name_provider_identifier ON users (name,provider_identifier);
CREATE UNIQUE INDEX idx_name_no_provider_identifier ON users (name) WHERE provider_identifier IS NULL;

-- promote pre_auth_keys table
DROP TABLE pre_auth_keys;
ALTER table pre_auth_keys_new RENAME TO pre_auth_keys;

-- promote api_keys table
DROP TABLE api_keys;
ALTER table api_keys_new RENAME TO api_keys;
CREATE UNIQUE INDEX `idx_api_keys_prefix` ON `api_keys`(`prefix`);

-- promote nodes table
DROP TABLE nodes;
ALTER table nodes_new RENAME TO nodes;

-- Check/enable fk
PRAGMA foreign_key_check;
PRAGMA foreign_keys = ON;

COMMIT;
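
If running this by hand, it is safest to do so against a copy of the database first. A quick post-migration sanity check (assuming the standard table names) is to confirm the rebuilt tables show up in sqlite_sequence (each appears once it has received at least one row) and that row counts match the old database:

SELECT name, seq FROM sqlite_sequence;
SELECT (SELECT count(*) FROM users)         AS users,
       (SELECT count(*) FROM pre_auth_keys) AS pre_auth_keys,
       (SELECT count(*) FROM nodes)         AS nodes,
       (SELECT count(*) FROM api_keys)      AS api_keys;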

@kradalby commented on GitHub (Jun 25, 2025):

Yes, it does not match. I really dislike GORM and it has caused me so much headache. @nblock discovered that the schema does not end up the same.

I have a draft pr (https://github.com/juanfont/headscale/pull/2617) for:

  • Adding a reference schema so it doesn't diverge
  • Rebuilding the whole schema so it matches

However, I have not had time to finish it, and I'm not sure how to test it yet.

It will not make any changes to Postgres.


@kradalby commented on GitHub (Jul 2, 2025):

I'm working on this over in #2617; more databases to add to the tests would be greatly appreciated. Schema alone helps, but schema with data would be even better.

I think we (@nblock) have a script for helping us randomise the data.

find my email in my profile


@lotheac commented on GitHub (Jul 2, 2025):

@kradalby I am confused. Your recent comments read to me like they are AI-generated, and so do your recent pull requests.

for example:

I think we (@nblock) have a script for helping us randomise the data.
find my email in my profile

this is nonsensical. what data? why does it need to be randomised? why is your email relevant?


@kradalby commented on GitHub (Jul 2, 2025):

Sorry, let me try again. I use some AI, but not to write these; not sure if that is a compliment or not :P.

The state of the migrations and databases in headscale is a massive mess. It has not really been addressed, and we didn't really realise how bad it was until we opened the can of worms.

There are a lot of databases out there which have not been properly migrated, and it is causing all sorts of weird behaviour.

@nblock discovered some time ago that one of his old databases had a different schema than a newly created one, which explains a lot, as the migration was very broken.

I started the effort in #2617 to clean this up. Essentially, we will introduce a file describing how the schema should end up, so we can validate it every time we migrate, so this does not happen again.

As part of this, I am trying to write a migration which resets everything to how it should be, but I can only fix breakage I know the shape of. So the more potentially broken databases I have access to, the better the test coverage will be, and hopefully we will be done with this.

Since this issue seems related to broken database tables, having a copy of your database would be very valuable.

The randomisation is relevant if you send us a database and are concerned about the data: we can change the data but keep the "bad behaviour". Same for the email: it is the place to send the database.
If you have your database from before and after the fix you mentioned, that would be meaningful.


@lotheac commented on GitHub (Jul 2, 2025):

Okay, I think I see what you might have meant. It is not easy to decipher meaning from your messages. Sorry for calling you a bot.

When you say "send us an database" you mean that you wanted to be sent the data that exists in a previously created database to help you test and verify the refactoring in #2617.

When you say "find my email in my profile" you meant "find whatever script my colleague posted somewhere (maybe?), use it to anonymize your data before sending it to me, and then send it to me over email".


@kradalby commented on GitHub (Jul 3, 2025):

Yes, ideally I just want databases with real data in them that don't work as expected, so we can fix it. If you are concerned about the data, it would be good if you try to change things up to hide PII and so on. We have some tools, but they don't cover all cases.


@lotheac commented on GitHub (Jul 3, 2025):

Yes, ideally I just want databases with real data in them that don't work as expected, so we can fix it. If you are concerned about the data, it would be good if you try to change things up to hide PII and so on. We have some tools, but they don't cover all cases.

alright, I think I can anonymize and send you the DB contents the next time I see the issue with disappearing ephemeral nodes. I don't have a reliable repro so it doesn't happen often; the previous occasion was on June 20th.


@kradalby commented on GitHub (Jul 3, 2025):

Great, thank you.

@jemershaw, if you have a backup of your database before you applied the fixes mentioned, that would be really valuable, your 0.22.3 database.


@jemershaw commented on GitHub (Jul 9, 2025):

@kradalby I can give you a backup if needed, but all you need to do is create a fresh db using version 0.22.3 of headscale; subsequent upgrades will keep the existing structure.

Happy to provide instructions or a db backup if you can't reproduce.


@kradalby commented on GitHub (Jul 10, 2025):

Yes please


@lotheac commented on GitHub (Aug 26, 2025):

I saw this issue happen again today. I've sent the details and anonymized db dump to @kradalby's email.


@kradalby commented on GitHub (Sep 10, 2025):

I suspect this is the same as https://github.com/juanfont/headscale/issues/2698 and https://github.com/juanfont/headscale/issues/2697, which should be fixed in https://github.com/juanfont/headscale/pull/2670


@lotheac commented on GitHub (Sep 17, 2025):

I updated to a local build at commit 3f6657ae last week, and have not seen this issue since despite numerous scale-ups/downs of ephemeral nodes, so I would say your fix worked. Thanks!
