Integration tests timeout/deadlock #69

Closed
opened 2025-12-29 01:21:34 +01:00 by adam · 3 comments

Originally created by @kradalby on GitHub (Oct 26, 2021).

When running on GitHub Actions, the integration tests seem to enter a deadlock (or something similar) that ends with them timing out and failing against our 30-minute time constraint.

This needs to be investigated.

adam closed this issue 2025-12-29 01:21:34 +01:00

@enoperm commented on GitHub (Nov 28, 2021):

I can confirm the same happening on my machine *most*, but not all, of the time. Even if I do not modify anything, I sometimes need to run the tests at least three times before I get a successful run, whereas at other times I can get them to run to completion the same number of times consecutively. From what I can see, the two steps that may hang are either `tailscale join` or file sending. I do not believe the latter has much to do with the control server, so it *might* be an issue in the tailscale client. As for the join issue, I do not know - I have not delved that deep into the problem yet.

@enoperm commented on GitHub (Dec 31, 2021):

I have had some free time to poke around, and found a lifetime issue in `poll.go`, which I (hopefully) avoided by slightly refactoring the functions in question in [this commit](https://github.com/enoperm/headscale/commit/77ea9fec05117c8d81e1e2c52aa3490d460a03a7). I have only catered for the lifetimes with a shallow knowledge of why the function ended up looking the way it does, so I am not entirely confident that it is the best possible solution, but I have been able to run the integration tests to successful completion about a dozen times in a row since implementing it, so things seem to work that way.

The original issue was that, sometimes, channels were closed before the goroutines that use them for *sending* messages had terminated, resulting in a panic from sending on a closed channel. This crashed the headscale server in the integration test environment, causing the client(s) that had already been communicating with it to hang on the `join` operation indefinitely, in turn causing the entire test suite to time out.
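
For illustration, a minimal self-contained Go sketch of that failure mode (hypothetical names, not headscale's actual code):

```go
package main

import "time"

// Sketch of the race described above: a sender goroutine is still pushing
// messages when the channel's owner closes the channel, which panics with
// "send on closed channel" and takes the whole process down.
func main() {
	updates := make(chan string)

	go func() {
		for {
			updates <- "keep-alive" // panics once updates is closed
			time.Sleep(10 * time.Millisecond)
		}
	}()

	<-updates               // consume one message
	close(updates)          // owner closes while the sender is still running
	time.Sleep(time.Second) // give the sender time to hit the closed channel
}
```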

In my current attempt at a solution, the related channels are now scoped to the caller function, with all the spawned routines now receiving a context that is terminated before the channels are.
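
A minimal sketch of that shutdown ordering, again with hypothetical names rather than the actual `poll.go` functions: cancel the context, wait for every sender to return, and only then close the channel the caller owns.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	updates := make(chan string, 8) // channel scoped to the caller

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				return // stop sending before the channel can be closed
			case updates <- "update":
			}
		}
	}()

	for i := 0; i < 3; i++ {
		fmt.Println(<-updates)
	}

	cancel()       // 1. tell the sender to stop
	wg.Wait()      // 2. wait until nothing can send on updates anymore
	close(updates) // 3. only now is closing safe
}
```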

On a related note, one of the (potentially?) more significant changes I made was to replace sending to the unbuffered channels from a new goroutine (which, from my understanding, effectively creates an "infinitely buffered" channel) with an actually buffered, limited-size channel that still provides a degree of asynchronicity. My current understanding is that if any of these sends hang, things are not working properly for that client anyway. I have not looked at how long-lived the encompassing request is, but I figured we might as well limit memory usage instead of spawning a potentially unbounded number of goroutines during a long-lived request, as I have noted references to keep-alives and update notifications that suggest a potentially long lifetime.
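
A rough sketch of the two send patterns being contrasted, with illustrative types and names only:

```go
package main

import "fmt"

// update stands in for the messages (keep-alives, map updates) pushed to a
// polling client; the type and function names here are illustrative only.
type update struct{ payload string }

// Pattern being replaced: every send gets its own goroutine so the caller
// never blocks, which behaves like an unbounded buffer made of parked
// goroutines if the receiver stops draining.
func notifyUnbounded(ch chan<- update, u update) {
	go func() { ch <- u }()
}

// Replacement: a fixed-size buffered channel still decouples sender and
// receiver but bounds memory; if the client stops draining, the send blocks
// instead of piling up goroutines for a connection that is already broken.
func notifyBounded(ch chan<- update, u update) {
	ch <- u
}

func main() {
	bounded := make(chan update, 16)
	notifyBounded(bounded, update{payload: "keep-alive"})
	fmt.Println(<-bounded)

	unbounded := make(chan update) // unbuffered; each send parks in its own goroutine
	notifyUnbounded(unbounded, update{payload: "keep-alive"})
	fmt.Println(<-unbounded)
}
```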

@kradalby commented on GitHub (Mar 20, 2022):

I think this issue is dated and no longer represents the problems with the integration tests. They are better now after recent work, but still a bit flaky. If we see fit, let's open a new, updated issue.
