[Bug] Node Connection Issues (~600 nodes) in v0.23.0-alpha12 #721

Closed
opened 2025-12-29 02:22:50 +01:00 by adam · 8 comments

Originally created by @nadongjun on GitHub (Jun 4, 2024).

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

To verify whether issue #1656 persists in v0.23.0-alpha12, a connection test was conducted. When attempting to connect 600 Tailscale nodes to a v0.23.0-alpha12 Headscale server, the following error occurs frequently and some nodes go offline after connecting. There was no CPU or memory overload.

  • Error log: ERR update not sent, context cancelled error="context deadline exceeded" node.id=xxxx

Expected Behavior

All 600 tailscale nodes should connect successfully to the headscale server and operate stably without error logs.

Steps To Reproduce

  1. Prepare seven AWS EC2 instances (type: t2.medium).
  2. Deploy the headscale server in a container on one instance.
  3. Deploy 100 tailscale containers on each of the remaining six instances (600 in total).
  4. Connect each tailscale container to the headscale server (see the sketch after this list).
  5. Check error logs and connection status.
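A minimal sketch of steps 3–4 on one client instance, assuming Docker, the official tailscale/tailscale image, and a Headscale pre-auth key generated beforehand (the server URL, key, and container names below are placeholders, not values from the reporter's setup):

```bash
# Placeholders: replace with your Headscale URL and a pre-auth key
# created with `headscale preauthkeys create --user <user> --reusable`.
HEADSCALE_URL="https://headscale.example.com"
AUTH_KEY="<preauthkey>"

# Start 100 Tailscale containers, each registering against the Headscale server.
for i in $(seq 1 100); do
  docker run -d --name "ts-node-${i}" \
    -e TS_AUTHKEY="${AUTH_KEY}" \
    -e TS_EXTRA_ARGS="--login-server=${HEADSCALE_URL}" \
    tailscale/tailscale:v1.66.4
done

# On the Headscale side, registration can then be checked with:
# headscale nodes list
```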

Environment

- OS: Linux/Unix, Amazon Linux
- Headscale version: v0.23.0-alpha12
- Tailscale version: v1.66.4

Runtime environment

  • Headscale runs in a container

Anything else?

headscale_log_2024-06-03.txt: https://github.com/user-attachments/files/15545793/headscale_log_2024-06-03.txt
headscale_node_list.txt: https://github.com/user-attachments/files/15545799/headscale_node_list.txt

Attached are the container logs from the tested Headscale server and the node list from the attempt to connect approximately 600 nodes.

Based on these logs, it appears that issue #1656 persists in v0.23.0-alpha12.

adam added the bug label 2025-12-29 02:22:50 +01:00
adam closed this issue 2025-12-29 02:22:50 +01:00

@kradalby commented on GitHub (Jun 5, 2024):

did you verify that there was a problem with the connections between nodes, or are you saying that you do not expect any errors?


@nadongjun commented on GitHub (Jun 6, 2024):

> did you verify that there was a problem with the connections between nodes, or are you saying that you do not expect any errors?

I verified that there are two issues in the latest version:

(1) When 600 users join a single Headscale server, the error "ERR update not sent, context cancelled..." occurs in Headscale.

(2) Some of the joined 600 users are in an offline status when checked with headscale node list.

There are no issues with connections between users who are in an online status.


@kradalby commented on GitHub (Jun 6, 2024):

t2.medium sounds a bit optimistic; it's unclear if it's too small for Headscale or for the test clients.

The error mentioned would mean one or more of:

  • The node has gone away and it's not taking the update
  • The node is reconnecting and the update is being sent to the "closed" version
  • The node did not accept the message fast enough

The problem here might be either that the Headscale machine does not have enough resources to maintain all of the connections, or that the VMs running hundreds of clients do not have enough resources to run them all.

The machine used in #1656 is significantly larger; it's probably a bit overspecced for the new alpha.
Have you tried the same with 0.22.3 (latest stable)? It is a lot less efficient, so it might struggle more on a t2.medium.
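For reference, a quick way to check both hypotheses during such a run (a rough sketch; it assumes the Docker CLI is available on the client instances and that the node list output includes the online/offline column the reporter mentions):

```bash
# On each client instance: one-shot snapshot of per-container CPU/memory usage.
docker stats --no-stream

# On the Headscale host: count nodes currently reported as offline.
headscale nodes list | grep -ci offline
```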


@jwischka commented on GitHub (Jun 6, 2024):

Another important question is whether you are running SQLite or Postgres. If SQLite, try enabling WAL or switching to Postgres. It sounds like it could be a concurrency issue.
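For reference, one way to enable WAL on an existing SQLite database is via the sqlite3 CLI while Headscale is stopped (a minimal sketch; the path below assumes the default db_path and may differ in your config):

```bash
# Stop headscale first so the database file is not in use, then:
sqlite3 /var/lib/headscale/db.sqlite 'PRAGMA journal_mode=WAL;'
# The journal mode is stored in the database file, so the setting persists;
# the command prints "wal" on success.
```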


@nadongjun commented on GitHub (Jun 10, 2024):

I am currently using SQLite (without the WAL option). I will rerun the same tests on a higher-performance instance using Postgres.


@kradalby commented on GitHub (Jun 10, 2024):

Please try with WAL first.


@kradalby commented on GitHub (Jun 20, 2024):

WAL on by default for SQLite is coming in #1985.

I will close this issue as it is more of a performance/scaling thing than a bug. We have a couple of hidden tuning options, which together with WAL might be good content for a "performance" or "scaling" guide in the future.


@dustinblackman commented on GitHub (Aug 12, 2024):

Using Postgres, I'm experiencing the same issue here on alpha 12 in a network of ~30 nodes, with a handful of ephemeral nodes coming and going throughout the day. I've seen both regular users on laptops and machines in the cloud able to connect to Headscale, but then unable to reach any other node in the network. Headscale outputs the same errors as stated at the beginning of the issue, though while digging through the new map session logic I'm unsure if the error and the issue are related. If I were to guess, something is hanging in https://github.com/juanfont/headscale/blob/8571513e3c6d601deb10d2cca0a7f837dc466770/hscontrol/poll.go#L271

I had the problem with a laptop connecting to a remote machine, so I ran `tailscale down && tailscale up` on the remote machine, and that fixed the problem. I'm betting there is an issue with connection recovery in the notifier, either to the node or the database. I'll dig through the logs later in the evening.
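A small sketch of the check and workaround described above, run on an affected node (the peer hostname is a placeholder; assumes the standard Tailscale CLI):

```bash
# Check peer state and reachability from the affected node.
tailscale status
tailscale ping some-peer-hostname

# If peers are unreachable, forcing a reconnect recovered connectivity in this case:
tailscale down && tailscale up
```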

Reference: starred/headscale#721