[Bug] Node vanishes from the nodelist and is unreachable #1102

Closed
opened 2025-12-29 02:28:17 +01:00 by adam · 2 comments

Originally created by @peterforeman on GitHub (Sep 25, 2025).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have experienced this multiple times: a node suddenly disappears from the list and becomes unreachable; it is nowhere to be found anymore. This is not a support request but a (vague) bug report: it is hard to reproduce, but it has happened more than once.

In this case, the node was connected before:

2025-09-25T13:36:58Z INF home/runner/work/headscale/headscale/hscontrol/poll.go:602 > node has connected, mapSession: 0xc0000f3c80, chan: 0xc00030dea0 node=c2k01 node.id=176 omitPeers=false readOnly=false stream=true
2025-09-25T13:36:58Z INF home/runner/work/headscale/headscale/hscontrol/poll.go:602 > node has disconnected, mapSession: 0xc000e46a80, chan: 0xc0002c0af0 node=c2k01 node.id=176 omitPeers=false readOnly=false stream=true

The node was connected using a pre-auth key, valid for 9999 days, non-reusable, non-ephemeral. Authenticating the first time was no problem; everything went fine, the node showed up and had connectivity.
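
For reference, a key with those properties would typically be created and inspected with something like the commands below; the user is a placeholder and the exact flags and duration syntax can differ between headscale versions. The list output should also show whether a key is reusable or ephemeral:

headscale preauthkeys create --user <user> --expiration 9999d
headscale preauthkeys list --user <user>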

When I tried to connect to the node a while later, it could not be found (no DNS) anymore. I checked headplane, but the node was gone. I also checked the CLI (headscale nodes list), and it had disappeared there as well.
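
One way to double-check whether the record is really gone (rather than just hidden by the CLI or UI) is to query the database directly. A sketch assuming the default SQLite backend and path; table name and path may differ per version and setup:

sqlite3 /var/lib/headscale/db.sqlite "SELECT * FROM nodes WHERE id = 176;"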

The (docker) log shows this:

2025-09-25T14:07:00Z ERR home/runner/work/headscale/headscale/hscontrol/poll.go:622 > Could not get machine from db error="record not found" node=c2k01 node.id=176 omitPeers=false readOnly=false stream=true
2025/09/25 14:07:00 http2: panic serving 172.18.0.2:39996: runtime error: invalid memory address or nil pointer dereference
goroutine 116383254 [running]:
golang.org/x/net/http2.(*serverConn).runHandler.func1()
	/home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2426 +0x13e
panic({0x1f530a0?, 0x3ce1af0?})
	/nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/runtime/panic.go:792 +0x132
github.com/juanfont/headscale/hscontrol.(*mapSession).serveLongPoll.func1()
	/home/runner/work/headscale/headscale/hscontrol/poll.go:203 +0xaa
github.com/juanfont/headscale/hscontrol.(*mapSession).serveLongPoll(0xc0000f3c80)
	/home/runner/work/headscale/headscale/hscontrol/poll.go:289 +0x16a5
github.com/juanfont/headscale/hscontrol.(*noiseServer).NoisePollNetMapHandler(0xc00089a240, {0x2af5a70, 0xc001353398}, 0xc0000ae8c0)
	/home/runner/work/headscale/headscale/hscontrol/noise.go:229 +0x258
net/http.HandlerFunc.ServeHTTP(0xc001460600?, {0x2af5a70?, 0xc001353398?}, 0x7fdf086a05c0?)
	/nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/net/http/server.go:2294 +0x29
github.com/juanfont/headscale/hscontrol.prometheusMiddleware.func1({0x2af5a70, 0xc001353398}, 0xc0000ae8c0)
	/home/runner/work/headscale/headscale/hscontrol/metrics.go:82 +0x18e
net/http.HandlerFunc.ServeHTTP(0xc0000ae780?, {0x2af5a70?, 0xc001353398?}, 0x10?)
	/nix/store/rv9g1p18w52vip6652svdgy138wgx7dj-go-1.24.2/share/go/src/net/http/server.go:2294 +0x29
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0011af740, {0x2af5a70, 0xc001353398}, 0xc0000ae640)
	/home/runner/go/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212 +0x1e2
golang.org/x/net/http2.(*serverConn).runHandler(0x44d2f2?, 0xc0009d3fd0?, 0xacb685?, 0xc001011f00?)
	/home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2433 +0xf5
created by golang.org/x/net/http2.(*serverConn).scheduleHandler in goroutine 116383225
	/home/runner/go/pkg/mod/golang.org/x/net@v0.39.0/http2/server.go:2367 +0x21d
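
For illustration only (this is not headscale's actual code), the trace is consistent with a common Go failure pattern: the lookup returns "record not found", the handler returns early, but a deferred cleanup still dereferences the nil result. A minimal, runnable sketch of that pattern:

// Illustrative sketch, not headscale's code.
package main

import (
	"errors"
	"fmt"
)

type Node struct {
	ID       uint64
	Hostname string
}

// getNode stands in for the database lookup; the row has been deleted.
func getNode(id uint64) (*Node, error) {
	return nil, errors.New("record not found")
}

func serveLongPoll(id uint64) {
	node, err := getNode(id)

	defer func() {
		// Runs on the early-return path below while node is still nil.
		fmt.Println("closing session for", node.Hostname) // nil pointer dereference
	}()

	if err != nil {
		fmt.Println("Could not get machine from db:", err)
		return
	}
}

func main() {
	serveLongPoll(176)
}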

The log is also full of these lines:

2025-09-25T14:07:00Z ERR user msg: node not found code=404
2025-09-25T14:07:01Z ERR user msg: node not found code=404
2025-09-25T14:07:01Z ERR user msg: node not found code=404
2025-09-25T14:07:01Z ERR user msg: node not found code=404

The log still shows these lines (node.id=176 is the node that disappeared):

2025-09-25T17:10:29Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=176
2025-09-25T17:10:42Z ERR user msg: node not found code=404
2025-09-25T17:10:56Z ERR user msg: node not found code=404
2025-09-25T17:10:58Z ERR user msg: node not found code=404
2025-09-25T17:11:01Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=176

The device itself is still up and running, but it is no longer reachable through Tailscale. 'tailscale status' on the node shows:

$ sudo tailscale status
100.64.0.92     c2k01                c2k-sds      linux   offline
100.64.0.26     xxxxx xxx@ macOS   offline
100.64.0.93     xxxx   xxxx@   macOS   idle, tx 929472 rx 316764
100.64.0.105    tailscale-subnet-router-xxxx  xxxx     linux   -

# Health check:
#     - Linux DNS config not ideal. /etc/resolv.conf overwritten. See https://tailscale.com/s/dns-fight
#     - Unable to connect to the Tailscale coordination server to synchronize the state of your tailnet. Peer reachability might degrade over time.

After rebooting the node, tailscale shows 'Logged out' and 'Log in at <headscale URL>'.

Expected Behavior

I would not expect a node to ever just disappear; only an ephemeral key should cause that, and that was not the case here. I would also expect more logging to make investigating this easier.

Steps To Reproduce

I cannot reproduce it myself, but it has happened a couple of times already with different nodes that suddenly show "Logged out". At first I thought it was my own fault, but I am very sure this was not an ephemeral key and I am very sure I did not delete the node myself.

Environment

- OS: Debian 13.1 (linux/amd64)
- Docker version: 28.4.0
- Headscale version: 0.26.1
- Tailscale version: 1.88.1 (1.88.1-t032962f4b-gc5ad3b22f)
- Number of nodes: ~ 90

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container (a minimal sketch of such a setup follows below)
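
A minimal sketch of that kind of containerized setup, using the documented defaults; the image tag, volume paths, and published port are assumptions rather than the exact deployment, and the reverse proxy would sit in front of the published port:

docker run -d --name headscale \
  -v /etc/headscale:/etc/headscale \
  -v /var/lib/headscale:/var/lib/headscale \
  -p 8080:8080 \
  headscale/headscale:0.26.1 serve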

Debug information

These logs were generated -after- rebooting, when tailscale already showed the node as logged out. The output below is therefore, in my view, to be expected:

tailscale debug netmap = "null"

tailscale status --json:

{
  "Version": "1.88.1-t032962f4b-gc5ad3b22f",
  "TUN": true,
  "BackendState": "NeedsLogin",
  "HaveNodeKey": true,
  "AuthURL": "https://xxxx/register/5s65jGlWzVxiAcf6tCC-xxxx",
  "TailscaleIPs": null,
  "Self": {
    "ID": "",
    "PublicKey": "nodekey:0000000000000000000000000000000000000000000000000000000000000000",
    "HostName": "c2k01",
    "DNSName": "",
    "OS": "linux",
    "UserID": 0,
    "TailscaleIPs": null,
    "Addrs": [],
    "CurAddr": "",
    "Relay": "",
    "PeerRelay": "",
    "RxBytes": 0,
    "TxBytes": 0,
    "Created": "0001-01-01T00:00:00Z",
    "LastWrite": "0001-01-01T00:00:00Z",
    "LastSeen": "0001-01-01T00:00:00Z",
    "LastHandshake": "0001-01-01T00:00:00Z",
    "Online": false,
    "ExitNode": false,
    "ExitNodeOption": false,
    "Active": false,
    "PeerAPIURL": null,
    "TaildropTarget": 0,
    "NoFileSharingReason": "",
    "InNetworkMap": false,
    "InMagicSock": false,
    "InEngine": false
  },
  "Health": [
    "You are logged out. The last login error was: fetch control key: Get \"https://xxxxx/key?v=125\": dial tcp xxxxx:443: connect: network is unreachable"
  ],
  "MagicDNSSuffix": "",
  "CurrentTailnet": null,
  "CertDomains": null,
  "Peer": null,
  "User": null,
  "ClientVersion": null
}

Unfortunately, I did not capture the log on the node before rebooting.

After the reboot, I re-ran the tailscale command:

tailscale up --accept-routes --login-server https://xxxx --auth-key f51c7c973c2cf06714306c14ba9fd062e05d9b78c20xxxx --reset
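
If the original pre-auth key is non-reusable (or turns out to be ephemeral after all), re-running tailscale up with the same --auth-key will not bring the node back. Interactive re-registration on the headscale side looks roughly like this; the user and key are placeholders and the exact syntax depends on the headscale version:

tailscale up --accept-routes --login-server https://xxxx --reset
headscale nodes register --user <user> --key <key-from-register-url>
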
adam added the bug label 2025-12-29 02:28:17 +01:00
adam closed this issue 2025-12-29 02:28:17 +01:00

@ademariag commented on GitHub (Nov 7, 2025):

@peterforeman why was this issue closed? did you find a solution? or was it a configuration error on your side?

@peterforeman commented on GitHub (Nov 7, 2025):

> @peterforeman why was this issue closed? did you find a solution? or was it a configuration error on your side?

It was indeed a configuration error on my side. While I was pretty sure it wasn't an ephemeral key, after checking for the Nth time it turned out that it actually was.
