[PR #2538] [MERGED] Fix goroutine leak in EphemeralGC on node cancel #2714

Closed
opened 2025-12-29 03:22:27 +01:00 by adam · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/juanfont/headscale/pull/2538
Author: @qoke
Created: 4/17/2025
Status: Merged
Merged: 4/23/2025
Merged by: @kradalby

Base: mainHead: fix-gc-goroutine-leak


📝 Commits (6)

  • 6f751f0 Fix goroutine leak in EphemeralGC on node cancel
  • 291cf8a Deal with timer firing whilst the GC is shutting down. Fix typos.
  • 94f1dd3 Add waitgroups to tests, as requested by aibot
  • fa68f2f Add constants
  • 772058d Add comments
  • b50e79b Typo

📊 Changes

2 files changed (+432 additions, -6 deletions)

View changed files

hscontrol/db/ephemeral_garbage_collector_test.go (+389 -0)
📝 hscontrol/db/node.go (+43 -6)

📄 Description

There is a goroutine leak in EphemeralGarbageCollector where each call to Schedule() creates a goroutine, but those goroutines are not properly terminated when a node is canceled or when the garbage collector is shut down i.e. leaving dangling goroutines.

Wrote a test to confirm:

=== RUN TestEphemeralGarbageCollectorGoRoutineLeak
ephemeral_garbage_collector_test.go:16: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:68: Final number of goroutines: 102
ephemeral_garbage_collector_test.go:71:
Error Trace: github.com/juanfont/headscale/hscontrol/db/ephemeral_garbage_collector_test.go:71
Error: "102" is not less than or equal to "7"
Test: TestEphemeralGarbageCollectorGoRoutineLeak
Messages: There are significantly more goroutines after GC usage, which suggests a leak
--- FAIL: TestEphemeralGarbageCollectorGoRoutineLeak (0.30s)
FAIL
FAIL github.com/juanfont/headscale/hscontrol/db 0.317s
FAIL

The fix ensures a proper cleanup of dangling goroutines and stopping all timers. Also added a safety check in Schedule() to prevent scheduling after Close() - There don't appear to be any current codepaths that call in this order currently, but I have added a hygiene check for this to ensure if someone does this in future, that it is handled correctly.

After this fix, also added a bunch of additional tests to ensure cancel and reschedule are not caught up in the same problem (which they are not).

=== RUN TestEphemeralGarbageCollectorGoRoutineLeak
ephemeral_garbage_collector_test.go:21: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:71: Final number of goroutines: 2
--- PASS: TestEphemeralGarbageCollectorGoRoutineLeak (0.50s)
=== RUN TestEphemeralGarbageCollectorReschedule
--- PASS: TestEphemeralGarbageCollectorReschedule (0.10s)
=== RUN TestEphemeralGarbageCollectorCancelAndReschedule
--- PASS: TestEphemeralGarbageCollectorCancelAndReschedule (0.20s)
=== RUN TestEphemeralGarbageCollectorCloseBeforeTimerFires
--- PASS: TestEphemeralGarbageCollectorCloseBeforeTimerFires (0.10s)
=== RUN TestEphemeralGarbageCollectorScheduleAfterClose
ephemeral_garbage_collector_test.go:205: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:248: Final number of goroutines: 2
--- PASS: TestEphemeralGarbageCollectorScheduleAfterClose (0.20s)
=== RUN TestEphemeralGarbageCollectorConcurrentScheduleAndClose
ephemeral_garbage_collector_test.go:260: Initial number of goroutines: 2
ephemeral_garbage_collector_test.go:331: Final number of goroutines: 2
--- PASS: TestEphemeralGarbageCollectorConcurrentScheduleAndClose (0.35s)
=== RUN TestEphemeralGarbageCollectorOrder
--- PASS: TestEphemeralGarbageCollectorOrder (7.00s)
=== RUN TestEphemeralGarbageCollectorLoads
--- PASS: TestEphemeralGarbageCollectorLoads (10.01s)
PASS
ok github.com/juanfont/headscale/hscontrol/db 18.487s

Given the amount of time these tests take to run, you might not want to run these by default.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

## 📋 Pull Request Information **Original PR:** https://github.com/juanfont/headscale/pull/2538 **Author:** [@qoke](https://github.com/qoke) **Created:** 4/17/2025 **Status:** ✅ Merged **Merged:** 4/23/2025 **Merged by:** [@kradalby](https://github.com/kradalby) **Base:** `main` ← **Head:** `fix-gc-goroutine-leak` --- ### 📝 Commits (6) - [`6f751f0`](https://github.com/juanfont/headscale/commit/6f751f02267218b29277e42f607e1365009f4b9f) Fix goroutine leak in EphemeralGC on node cancel - [`291cf8a`](https://github.com/juanfont/headscale/commit/291cf8acc59b888dfb12df7e9e6b2226eb7d765d) Deal with timer firing whilst the GC is shutting down. Fix typos. - [`94f1dd3`](https://github.com/juanfont/headscale/commit/94f1dd3e124688efbc45390ee5f93cba6826d1fc) Add waitgroups to tests, as requested by aibot - [`fa68f2f`](https://github.com/juanfont/headscale/commit/fa68f2fbe75c25b6da8c4ba7d5a50e4128c9e244) Add constants - [`772058d`](https://github.com/juanfont/headscale/commit/772058d49c7d45c5fadf42b016b120e97da730ef) Add comments - [`b50e79b`](https://github.com/juanfont/headscale/commit/b50e79b9c289ce2e9987309d37b9a2a842bfddd9) Typo ### 📊 Changes **2 files changed** (+432 additions, -6 deletions) <details> <summary>View changed files</summary> ➕ `hscontrol/db/ephemeral_garbage_collector_test.go` (+389 -0) 📝 `hscontrol/db/node.go` (+43 -6) </details> ### 📄 Description There is a goroutine leak in EphemeralGarbageCollector where each call to Schedule() creates a goroutine, but those goroutines are not properly terminated when a node is canceled or when the garbage collector is shut down i.e. leaving dangling goroutines. Wrote a test to confirm: === RUN TestEphemeralGarbageCollectorGoRoutineLeak ephemeral_garbage_collector_test.go:16: Initial number of goroutines: 2 ephemeral_garbage_collector_test.go:68: Final number of goroutines: 102 ephemeral_garbage_collector_test.go:71: Error Trace: github.com/juanfont/headscale/hscontrol/db/ephemeral_garbage_collector_test.go:71 Error: "102" is not less than or equal to "7" Test: TestEphemeralGarbageCollectorGoRoutineLeak Messages: There are significantly more goroutines after GC usage, which suggests a leak --- FAIL: TestEphemeralGarbageCollectorGoRoutineLeak (0.30s) FAIL FAIL github.com/juanfont/headscale/hscontrol/db 0.317s FAIL The fix ensures a proper cleanup of dangling goroutines and stopping all timers. Also added a safety check in Schedule() to prevent scheduling after Close() - There don't appear to be any current codepaths that call in this order currently, but I have added a hygiene check for this to ensure if someone does this in future, that it is handled correctly. After this fix, also added a bunch of additional tests to ensure cancel and reschedule are not caught up in the same problem (which they are not). === RUN TestEphemeralGarbageCollectorGoRoutineLeak ephemeral_garbage_collector_test.go:21: Initial number of goroutines: 2 ephemeral_garbage_collector_test.go:71: Final number of goroutines: 2 --- PASS: TestEphemeralGarbageCollectorGoRoutineLeak (0.50s) === RUN TestEphemeralGarbageCollectorReschedule --- PASS: TestEphemeralGarbageCollectorReschedule (0.10s) === RUN TestEphemeralGarbageCollectorCancelAndReschedule --- PASS: TestEphemeralGarbageCollectorCancelAndReschedule (0.20s) === RUN TestEphemeralGarbageCollectorCloseBeforeTimerFires --- PASS: TestEphemeralGarbageCollectorCloseBeforeTimerFires (0.10s) === RUN TestEphemeralGarbageCollectorScheduleAfterClose ephemeral_garbage_collector_test.go:205: Initial number of goroutines: 2 ephemeral_garbage_collector_test.go:248: Final number of goroutines: 2 --- PASS: TestEphemeralGarbageCollectorScheduleAfterClose (0.20s) === RUN TestEphemeralGarbageCollectorConcurrentScheduleAndClose ephemeral_garbage_collector_test.go:260: Initial number of goroutines: 2 ephemeral_garbage_collector_test.go:331: Final number of goroutines: 2 --- PASS: TestEphemeralGarbageCollectorConcurrentScheduleAndClose (0.35s) === RUN TestEphemeralGarbageCollectorOrder --- PASS: TestEphemeralGarbageCollectorOrder (7.00s) === RUN TestEphemeralGarbageCollectorLoads --- PASS: TestEphemeralGarbageCollectorLoads (10.01s) PASS ok github.com/juanfont/headscale/hscontrol/db 18.487s Given the amount of time these tests take to run, you might not want to run these by default. --- <sub>🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.</sub>
adam added the pull-request label 2025-12-29 03:22:27 +01:00
adam closed this issue 2025-12-29 03:22:27 +01:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/headscale#2714