Mirror of https://github.com/juanfont/headscale.git (synced 2026-01-11 20:00:28 +01:00)
[Bug] Node Connection Issues(~600 nodes) in v0.23.0-alpha12 #721
Closed · opened 2025-12-29 02:22:50 +01:00 by adam · 8 comments
Originally created by @nadongjun on GitHub (Jun 4, 2024).
Current Behavior
To verify whether issue #1656 persists in v0.23.0-alpha12, a connection test was conducted. When attempting to connect 600 Tailscale nodes to headscale v0.23.0-alpha12, the following error occurs frequently, and some nodes go offline after connecting. There was no CPU or memory overload.
Expected Behavior
All 600 tailscale nodes should connect successfully to the headscale server and operate stably without error logs.
Anything else?
headscale_log_2024-06-03.txt
headscale_node_list.txt
Attached are the container logs of the tested headscale and the node list when attempting to connect approximately 600 nodes.
Based on these logs, it appears that issue #1656 persists in v0.23.0-alpha12.
@kradalby commented on GitHub (Jun 5, 2024):
did you verify that there was a problem with the connections between nodes, or are you saying that you do not expect any errors?
@nadongjun commented on GitHub (Jun 6, 2024):
I verified that there are two issues in the latest version:
(1) When 600 users join a single Headscale server, the error "ERR update not sent, context cancelled..." occurs in Headscale.
(2) Some of the joined 600 users are in an offline status when checked with headscale node list.
There are no issues with connections between users who are in an online status.
@kradalby commented on GitHub (Jun 6, 2024):
A t2.medium sounds a bit optimistic; it's unclear if it's too small for headscale, or for the test clients.
The error mentioned would mean one or more of the following: either the Headscale machine does not have enough resources to maintain all of the connections, or the VMs running hundreds of clients do not have enough resources to run them all.
The machine used in #1656 is significantly larger; it's probably a bit overspecced with the new alpha.
Have you tried the same with 0.22.3 (the latest stable)? It is a lot less efficient, so it might struggle more on a t2.medium.
@jwischka commented on GitHub (Jun 6, 2024):
Another important question is whether you are running SQLite or Postgres. If SQLite, try enabling WAL, or switch to Postgres. It sounds like it could be a concurrency issue.
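[Editor's note] WAL here refers to SQLite's write-ahead-log journal mode, which allows readers and a writer to proceed concurrently; at this point it is not a headscale config option, so it has to be set on the database file directly. A minimal way to do that (the database path varies by installation and is only an example):

```sql
-- Run once against headscale's SQLite database file; the setting is
-- persistent because journal_mode is recorded in the database itself.
PRAGMA journal_mode=WAL;
```

For instance, `sqlite3 /var/lib/headscale/db.sqlite 'PRAGMA journal_mode=WAL;'` should print `wal` back if the change took effect (adjust the path to your configured db_path).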
@nadongjun commented on GitHub (Jun 10, 2024):
I am currently using SQLite (without the WAL option). I will rerun the same tests on a higher-performance instance using Postgres.
@kradalby commented on GitHub (Jun 10, 2024):
Please try with WAL first.
@kradalby commented on GitHub (Jun 20, 2024):
WAL on by default for SQLite is coming in #1985.
I will close this issue as it is more of a performance/scaling thing than a bug. We have a couple of hidden tuning options, which together with WAL might be good content for a "performance" or "scaling" guide in the future.
@dustinblackman commented on GitHub (Aug 12, 2024):
Using Postgres, I'm experiencing the same issue with alpha 12 in a network of ~30 nodes, with a handful of ephemeral nodes coming in and out through the day. I've seen both regular users on laptops and machines in the cloud be able to connect to Headscale, but then not be able to reach any other node in the network. Headscale outputs the same errors as stated at the beginning of the issue, though while digging through the new map session logic I'm unsure if the error and the issue are related. If I were to guess, something is hanging in
8571513e3c/hscontrol/poll.go (L271)
I had the problem with a laptop connecting to a remote machine, so I ran tailscale down && tailscale up on the remote machine, and that fixed the problem. I'm betting there is an issue with connection recovery in the notifier, either to the node or the database. I'll dig through the logs later in the evening.