headscale

mirror of https://github.com/juanfont/headscale.git synced 2026-04-18 23:10:10 +02:00

Author	SHA1	Message	Date
Kristoffer Dalby	82c7efccf8	mapper/batcher: serialize per-node work to prevent out-of-order delivery processBatchedChanges queued each pending change for a node as a separate work item. Since multiple workers pull from the same channel, two changes for the same node could be processed concurrently by different workers. This caused two problems: 1. MapResponses delivered out of order — a later change could finish generating before an earlier one, so the client sees stale state. 2. updateSentPeers and computePeerDiff race against each other — updateSentPeers does Clear() + Store() which is not atomic relative to a concurrent Range() in computePeerDiff. Bundle all pending changes for a node into a single work item so one worker processes them sequentially. Add a per-node workMu that serializes processing across consecutive batch ticks, preventing a second worker from starting tick N+1 while tick N is still in progress. Fixes #3140	2026-03-19 07:05:58 +01:00
Kristoffer Dalby	87b8507ac9	mapper/batcher: replace connected map with per-node disconnectedAt The Batcher's connected field (xsync.Map[types.NodeID, time.Time]) encoded three states via pointer semantics: - nil value: node is connected - non-nil time: node disconnected at that timestamp - key missing: node was never seen This was error-prone (nil meaning 'connected' inverts Go idioms), redundant with b.nodes + hasActiveConnections(), and required keeping two parallel maps in sync. It also contained a bug in RemoveNode where new(time.Now()) was used instead of &now, producing a zero time. Replace the separate connected map with a disconnectedAt field on multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly on the object that already manages the node's connections. Changes: - Add disconnectedAt field and helpers (markConnected, markDisconnected, isConnected, offlineDuration) to multiChannelNodeConn - Remove the connected field from Batcher - Simplify IsConnected from two map lookups to one - Simplify ConnectedMap and Debug from two-map iteration to one - Rewrite cleanupOfflineNodes to scan b.nodes directly - Remove the markDisconnectedIfNoConns helper - Update all tests and benchmarks Fixes #3141	2026-03-16 02:22:56 -07:00
Kristoffer Dalby	60317064fd	mapper/batcher: serialize per-node work to prevent out-of-order delivery processBatchedChanges queued each pending change for a node as a separate work item. Since multiple workers pull from the same channel, two changes for the same node could be processed concurrently by different workers. This caused two problems: 1. MapResponses delivered out of order — a later change could finish generating before an earlier one, so the client sees stale state. 2. updateSentPeers and computePeerDiff race against each other — updateSentPeers does Clear() + Store() which is not atomic relative to a concurrent Range() in computePeerDiff. Bundle all pending changes for a node into a single work item so one worker processes them sequentially. Add a per-node workMu that serializes processing across consecutive batch ticks, preventing a second worker from starting tick N+1 while tick N is still in progress. Fixes #3140	2026-03-16 02:22:46 -07:00
Kristoffer Dalby	2058343ad6	mapper: remove Batcher interface, rename to Batcher struct Remove the Batcher interface since there is only one implementation. Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into batcher.go. Drop type assertions in debug.go now that mapBatcher is a concrete *mapper.Batcher pointer.	2026-03-14 02:52:28 -07:00
Kristoffer Dalby	3ebe4d99c1	mapper/batcher: reduce lock contention with two-phase send Rewrite multiChannelNodeConn.send() to use a two-phase approach: 1. RLock: snapshot connections slice (cheap pointer copy) 2. Unlock: send to all connections (50ms timeouts happen here) 3. Lock: remove failed connections by pointer identity Previously, send() held the write lock for the entire duration of sending to all connections. With N stale connections each timing out at 50ms, this blocked addConnection/removeConnection for N * 50ms. The two-phase approach holds the lock only for O(N) pointer operations, not for N * 50ms I/O waits.	2026-03-14 02:52:28 -07:00
Kristoffer Dalby	da33795e79	mapper/batcher: fix race conditions in cleanup and lookups Replace the two-phase Load-check-Delete in cleanupOfflineNodes with xsync.Map.Compute() for atomic check-and-delete. This prevents the TOCTOU race where a node reconnects between the hasActiveConnections check and the Delete call. Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites to prevent nil pointer panics from concurrent cleanup races.	2026-03-14 02:52:28 -07:00
Kristoffer Dalby	57070680a5	mapper/batcher: restructure internals for correctness Move per-node pending changes from a shared xsync.Map on the batcher into multiChannelNodeConn, protected by a dedicated mutex. The new appendPending/drainPending methods provide atomic append and drain operations, eliminating data races in addToBatch and processBatchedChanges. Add sync.Once to multiChannelNodeConn.close() to make it idempotent, preventing panics from concurrent close calls on the same channel. Add started atomic.Bool to guard Start() against being called multiple times, preventing orphaned goroutines. Add comprehensive concurrency tests validating these changes.	2026-03-14 02:52:28 -07:00

7 Commits