Integration tests timeout/deadlock #69

Closed
opened 2025-12-29 01:21:34 +01:00 by adam · 3 comments

Originally created by @kradalby on GitHub (Oct 26, 2021).

When running on GitHub Actions, the integration tests seem to enter a deadlock (or something similar) that ends with them timing out and failing against our 30-minute time constraint.

This needs to be investigated.

adam closed this issue 2025-12-29 01:21:34 +01:00

@enoperm commented on GitHub (Nov 28, 2021):

I can confirm the same happening on my machine *most*, but not all, of the time. Even if I do not modify anything, I sometimes need to run the tests at least three times before I get a successful run, whereas at other times I can get them to run to completion the same number of times consecutively. From what I can see, the two steps that may hang are either `tailscale join` or file sending. I do not believe the latter has much to do with the control server, so it *might* be an issue in the tailscale client. As for the join issue, I do not know - I have not delved that deep into the problem yet.

@enoperm commented on GitHub (Dec 31, 2021):

I have had some free time to poke around, and found a lifetime issue in `poll.go`, which I (hopefully) avoided by slightly refactoring the functions in question in [this commit](https://github.com/enoperm/headscale/commit/77ea9fec05117c8d81e1e2c52aa3490d460a03a7). I have only catered for the lifetimes with a shallow knowledge of why the function ended up looking the way it does, so I am not entirely confident that it is the best possible solution, but I have been able to run the integration tests to successful completion about a dozen times in a row since implementing it, so things seem to work that way.

The original issue was that, sometimes, channels were closed before the goroutines that use them for *sending* messages had terminated, resulting in a panic from sending on a closed channel. This crashed the headscale server in the integration test environment, causing the client(s) that had already been communicating with it to hang on the `join` operation indefinitely, in turn causing the entire test suite to time out.
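
For illustration, a minimal self-contained Go sketch of that failure mode (hypothetical names, not headscale's actual code):

```go
package main

import "time"

// Sketch of the race described above: a sender goroutine is still pushing
// messages when the channel's owner closes the channel, which panics with
// "send on closed channel" and takes the whole process down.
func main() {
	updates := make(chan string)

	go func() {
		for {
			updates <- "keep-alive" // panics once updates is closed
			time.Sleep(10 * time.Millisecond)
		}
	}()

	<-updates               // consume one message
	close(updates)          // owner closes while the sender is still running
	time.Sleep(time.Second) // give the sender time to hit the closed channel
}
```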

In my current attempt at a solution, the related channels are now scoped to the caller function, with all the spawned routines now receiving a context that is terminated before the channels are.
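
A minimal sketch of that shutdown ordering, again with hypothetical names rather than the actual `poll.go` functions: cancel the context, wait for every sender to return, and only then close the channel the caller owns.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	updates := make(chan string, 8) // channel scoped to the caller

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				return // stop sending before the channel can be closed
			case updates <- "update":
			}
		}
	}()

	for i := 0; i < 3; i++ {
		fmt.Println(<-updates)
	}

	cancel()       // 1. tell the sender to stop
	wg.Wait()      // 2. wait until nothing can send on updates anymore
	close(updates) // 3. only now is closing safe
}
```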

On a related note, one of the (potentially?) more significant changes I made was to replace sending to the unbuffered channels from a new goroutine (which, from my understanding, effectively creates an "infinitely buffered" channel) with an actually buffered, limited-size channel that still provides a degree of asynchronicity. My current understanding is that if any of these sends hang, things are not working properly for that client anyway. I have not looked at how long-lived the encompassing request is, but I figured we might as well limit memory usage instead of spawning a potentially unbounded number of goroutines during a long-lived request, as I have noted references to keep-alives and update notifications that suggest a potentially long lifetime.
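
A rough sketch of the two send patterns being contrasted, with illustrative types and names only:

```go
package main

import "fmt"

// update stands in for the messages (keep-alives, map updates) pushed to a
// polling client; the type and function names here are illustrative only.
type update struct{ payload string }

// Pattern being replaced: every send gets its own goroutine so the caller
// never blocks, which behaves like an unbounded buffer made of parked
// goroutines if the receiver stops draining.
func notifyUnbounded(ch chan<- update, u update) {
	go func() { ch <- u }()
}

// Replacement: a fixed-size buffered channel still decouples sender and
// receiver but bounds memory; if the client stops draining, the send blocks
// instead of piling up goroutines for a connection that is already broken.
func notifyBounded(ch chan<- update, u update) {
	ch <- u
}

func main() {
	bounded := make(chan update, 16)
	notifyBounded(bounded, update{payload: "keep-alive"})
	fmt.Println(<-bounded)

	unbounded := make(chan update) // unbuffered; each send parks in its own goroutine
	notifyUnbounded(unbounded, update{payload: "keep-alive"})
	fmt.Println(<-unbounded)
}
```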

@kradalby commented on GitHub (Mar 20, 2022):

I think this issue is dated and no longer represents the problems with the integration tests. They are better now after recent work, but still a bit flaky. If we see fit, let's open a new, updated issue.
