Commit Graph

22 Commits

Author SHA1 Message Date
Kristoffer Dalby
87b8507ac9 mapper/batcher: replace connected map with per-node disconnectedAt
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:

  - nil value:    node is connected
  - non-nil time: node disconnected at that timestamp
  - key missing:  node was never seen

This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Time) was used instead of &now, storing a zero time.

Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.

Changes:
  - Add disconnectedAt field and helpers (markConnected, markDisconnected,
    isConnected, offlineDuration) to multiChannelNodeConn
  - Remove the connected field from Batcher
  - Simplify IsConnected from two map lookups to one
  - Simplify ConnectedMap and Debug from two-map iteration to one
  - Rewrite cleanupOfflineNodes to scan b.nodes directly
  - Remove the markDisconnectedIfNoConns helper
  - Update all tests and benchmarks
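
A minimal sketch of the new per-node state, assuming the helper
names from this message (struct internals are reduced to the new
field):

    package mapper

    import (
        "sync/atomic"
        "time"
    )

    type multiChannelNodeConn struct {
        // nil while the node has active connections; holds the
        // disconnect timestamp once the last connection closes.
        disconnectedAt atomic.Pointer[time.Time]
    }

    func (mc *multiChannelNodeConn) markConnected() {
        mc.disconnectedAt.Store(nil)
    }

    func (mc *multiChannelNodeConn) markDisconnected() {
        now := time.Now()
        mc.disconnectedAt.Store(&now) // &now, not a zero time
    }

    func (mc *multiChannelNodeConn) isConnected() bool {
        return mc.disconnectedAt.Load() == nil
    }

    // offlineDuration reports how long the node has been offline,
    // or zero if it is still connected.
    func (mc *multiChannelNodeConn) offlineDuration() time.Duration {
        ts := mc.disconnectedAt.Load()
        if ts == nil {
            return 0
        }
        return time.Since(*ts)
    }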

Fixes #3141
2026-03-16 02:22:56 -07:00
Kristoffer Dalby
60317064fd mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.
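
A rough sketch of the shape of the fix; nodeConn, workItem, and
process are illustrative stand-ins for the real types:

    package mapper

    import "sync"

    type change struct{} // stand-in for the real change type

    type nodeConn struct {
        // workMu serializes per-node processing across batch ticks:
        // the worker picking up tick N+1 blocks until tick N is done.
        workMu  sync.Mutex
        pending []change
    }

    // workItem bundles every pending change for one node so exactly
    // one worker processes them, in submission order.
    type workItem struct {
        conn    *nodeConn
        changes []change
    }

    func process(item workItem, generate func([]change)) {
        item.conn.workMu.Lock()
        defer item.conn.workMu.Unlock()
        // With workMu held, MapResponse generation and sent-peers
        // bookkeeping cannot interleave with another worker handling
        // the same node.
        generate(item.changes)
    }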

Fixes #3140
2026-03-16 02:22:46 -07:00
Kristoffer Dalby
86e279869e mapper/batcher: minor production code cleanup
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs (sketched after this list).

L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.

L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.

L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.
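
A possible shape for the counter-based generator; the ID format is
illustrative:

    package mapper

    import (
        "fmt"
        "sync/atomic"
    )

    var connIDCounter atomic.Uint64 // process-local, monotonic

    // generateConnectionID returns a short, sortable ID. No crypto
    // needed: the IDs only distinguish connections in this process.
    func generateConnectionID() string {
        return fmt.Sprintf("conn-%d", connIDCounter.Add(1))
    }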

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
7881f65358 mapper: extract node connection types to node_conn.go
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.

Pure move — no logic changes.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
50e8b21471 mapper/batcher: fix pointer retention, done-channel init, and connected-map races
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated (sketched after this list).

M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.

M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.

M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.
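
A sketch of the M7 fix; removeAt is a hypothetical stand-in for the
removal logic in removeConnectionAtIndexLocked and send():

    package mapper

    type connectionEntry struct{ /* fields elided */ }

    // removeAt removes the entry at index i. Shifting left and
    // nil-ing the vacated tail slot lets the GC collect the removed
    // entry even though the backing array stays live.
    func removeAt(conns []*connectionEntry, i int) []*connectionEntry {
        copy(conns[i:], conns[i+1:])
        conns[len(conns)-1] = nil // drop the dangling pointer
        return conns[:len(conns)-1]
    }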

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57a38b5678 mapper/batcher: reduce hot-path log verbosity
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.

These methods run on every MapResponse delivery, so the savings
compound quickly under load.
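
Roughly the before/after shape, assuming zerolog; the message and
field names are illustrative:

    package mapper

    import (
        "fmt"

        "github.com/rs/zerolog/log"
    )

    func logBroadcast(ch chan struct{}) {
        // Before: per-delivery cost on the hot path.
        log.Debug().
            Caller().                            // runtime caller lookup
            Str("chan", fmt.Sprintf("%p", ch)).  // allocates every call
            Msg("broadcast summary")

        // After: cheap Trace-level summary; Warn remains for failures.
        log.Trace().Msg("broadcast summary")
    }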

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
051a38a4c4 mapper/batcher: track worker goroutines and stop ticker on Close
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its underlying runtime timer.

Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.

Document that a Batcher must not be reused after Close().
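
A reduced sketch of the lifecycle described above; worker bodies and
the remaining Batcher fields are elided:

    package mapper

    import (
        "sync"
        "time"
    )

    type Batcher struct {
        done   chan struct{}
        ticker *time.Ticker
        wg     sync.WaitGroup
    }

    func (b *Batcher) Start() {
        b.wg.Add(1)
        go func() {
            defer b.wg.Done()
            for {
                select {
                case <-b.ticker.C:
                    // spawn workers, each wrapped in b.wg.Add(1) /
                    // defer b.wg.Done() ...
                case <-b.done:
                    return
                }
            }
        }()
    }

    // Close signals shutdown, stops the ticker, and waits for every
    // tracked goroutine before connection teardown proceeds. The
    // Batcher must not be reused afterwards.
    func (b *Batcher) Close() {
        close(b.done)
        b.ticker.Stop()
        b.wg.Wait()
    }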
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3276bda0c0 mapper/batcher: replace time.After with NewTimer to avoid timer leak
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.

Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
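
A sketch of the pattern on the send fast path; the signature and the
placement of the 50 ms constant are illustrative:

    package mapper

    import "time"

    const sendTimeout = 50 * time.Millisecond

    func send(ch chan<- []byte, data []byte) bool {
        timer := time.NewTimer(sendTimeout)
        defer timer.Stop()
        select {
        case ch <- data:
            // Fast path: the deferred Stop removes the timer from
            // the runtime heap immediately instead of after 50 ms.
            return true
        case <-timer.C:
            return false // receiver too slow; caller treats as failure
        }
    }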
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2058343ad6 mapper: remove Batcher interface, rename to Batcher struct
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.

Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57070680a5 mapper/batcher: restructure internals for correctness
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.

Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.

Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.

Add comprehensive concurrency tests validating these changes.
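
A minimal sketch of the serialized pending queue and the idempotent
close; struct internals beyond these fields are assumed:

    package mapper

    import "sync"

    type change struct{} // stand-in for the real change type

    type multiChannelNodeConn struct {
        mu        sync.Mutex // dedicated mutex for pending changes
        pending   []change
        closeOnce sync.Once
        ch        chan []byte
    }

    // appendPending atomically queues a change for the next tick.
    func (mc *multiChannelNodeConn) appendPending(c change) {
        mc.mu.Lock()
        defer mc.mu.Unlock()
        mc.pending = append(mc.pending, c)
    }

    // drainPending atomically takes everything queued so far,
    // leaving an empty queue for the next tick.
    func (mc *multiChannelNodeConn) drainPending() []change {
        mc.mu.Lock()
        defer mc.mu.Unlock()
        out := mc.pending
        mc.pending = nil
        return out
    }

    // close is idempotent: concurrent callers cannot double-close ch.
    func (mc *multiChannelNodeConn) close() {
        mc.closeOnce.Do(func() { close(mc.ch) })
    }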
2026-03-14 02:52:28 -07:00
DM
4aca9d6568 poll: stop stale map sessions through an explicit teardown hook
When stale-send cleanup prunes a connection from the batcher, the old serveLongPoll session needs an explicit stop signal. Pass a stop hook into AddNode and trigger it when that connection is removed, so the session exits through its normal cancel path instead of relying on channel closure from the batcher side.
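
A minimal sketch of the hook plumbing; stopFunc, conn, and
removeConnection are illustrative names:

    package mapper

    type stopFunc func()

    // conn holds the stop hook handed to AddNode by serveLongPoll.
    type conn struct {
        stop stopFunc
    }

    // removeConnection runs when stale-send cleanup prunes this
    // connection; firing the hook lets the long-poll session exit
    // through its normal cancel path.
    func (c *conn) removeConnection() {
        if c.stop != nil {
            c.stop()
        }
    }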
2026-03-12 01:27:34 -07:00
Kristoffer Dalby
ce580f8245 all: fix golangci-lint issues (#3064) 2026-02-06 21:45:32 +01:00
Kristoffer Dalby
53cdeff129 hscontrol/mapper: use sub-loggers and zf constants
Add sub-logger patterns to worker(), AddNode(), RemoveNode() and
multiChannelNodeConn to eliminate repeated field calls. Use zf.*
constants for consistent field naming.

Changes in batcher_lockfree.go:
- Add wlog sub-logger in worker() with worker.id context
- Add log field to multiChannelNodeConn struct
- Initialize mc.log with node.id in newMultiChannelNodeConn()
- Add nlog sub-loggers in AddNode() and RemoveNode()
- Update all connection methods to use mc.log

Changes in batcher.go:
- Use zf.NodeID and zf.Reason in handleNodeChange()
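
The sub-logger pattern in miniature, assuming zerolog; the literal
"node.id" key stands in for the zf constant:

    package mapper

    import (
        "github.com/rs/zerolog"
        "github.com/rs/zerolog/log"
    )

    // newNodeLogger builds the logger once; every call site then
    // reuses the preset field instead of repeating .Uint64(...).
    func newNodeLogger(nodeID uint64) zerolog.Logger {
        return log.With().Uint64("node.id", nodeID).Logger()
    }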
2026-02-06 07:40:29 +01:00
Kristoffer Dalby
3b4b9a4436 hscontrol: fix tag updates not propagating to node self view
When SetNodeTags changed a node's tags, the node's self view wasn't
updated. The bug manifested as: the first SetNodeTags call updates
the server but the client's self view doesn't update until a second
call with the same tag.

Root cause: Three issues combined to prevent self-updates:

1. SetNodeTags returned PolicyChange which doesn't set OriginNode,
   so the mapper's self-update check failed.

2. The Change.Merge function didn't preserve OriginNode, so when
   changes were batched together, OriginNode was lost.

3. generateMapResponse checked OriginNode only in buildFromChange(),
   but PolicyChange uses RequiresRuntimePeerComputation which
   bypasses that code path entirely and calls policyChangeResponse()
   instead.

The fix addresses all three:
- state.go: Set OriginNode on the returned change
- change.go: Preserve OriginNode (and TargetNode) during merge
  (sketched after this list)
- batcher.go: Pass isSelfUpdate to policyChangeResponse so the
  origin node gets both self info AND packet filters
- mapper.go: Add includeSelf parameter to policyChangeResponse
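
A sketch of the merge fix; Change is reduced to the two fields in
question and the zero-value check is illustrative:

    package change

    type NodeID uint64

    type Change struct {
        OriginNode NodeID
        TargetNode NodeID
    }

    // Merge combines b into a. Before the fix, batching two changes
    // silently dropped OriginNode, so the mapper's self-update check
    // could never match.
    func Merge(a, b Change) Change {
        out := a
        if out.OriginNode == 0 {
            out.OriginNode = b.OriginNode
        }
        if out.TargetNode == 0 {
            out.TargetNode = b.TargetNode
        }
        return out
    }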

Fixes #2978
2026-01-20 10:13:47 +01:00
Kristoffer Dalby
82d4275c3b mapper: correct some variable names missed from change
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
2025-12-17 13:19:26 +01:00
Kristoffer Dalby
5767ca5085 change: smarter change notifications
This commit replaces the ChangeSet with a simpler bool-based
change model that the map builder can use directly to build the
appropriate map response for the change that has occurred.
Previously, we fell back to sending full maps for a lot of
changes, as that was considered "the safe" thing to do to ensure
no updates were missed.

This was problematic because a node that already has a list of
peers only does a full replacement of the peers if the incoming
list is non-empty, meaning that it was not possible to remove all
nodes (if, for example, the policy changed).

Now we keep track of the last-seen nodes so we can send explicit
remove IDs, and we are much smarter about sending smaller,
partial maps when needed.
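
A sketch of how explicit removals fall out of tracking last-sent
peers; all names are illustrative:

    package mapper

    type NodeID uint64

    // removedPeers diffs the last-sent peer set against the current
    // peer list; the result can be sent as explicit remove IDs even
    // when the new list is empty.
    func removedPeers(lastSent map[NodeID]struct{}, current []NodeID) []NodeID {
        cur := make(map[NodeID]struct{}, len(current))
        for _, id := range current {
            cur[id] = struct{}{}
        }
        var removed []NodeID
        for id := range lastSent {
            if _, ok := cur[id]; !ok {
                removed = append(removed, id)
            }
        }
        return removed
    }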

Fixes #2389

Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
2025-12-16 10:12:36 +01:00
Kristoffer Dalby
7fb0f9a501 batcher: send endpoint and derp only updates. (#2856) 2025-11-13 20:38:49 +01:00
Kristoffer Dalby
ed3a9c8d6d mapper: send change instead of full update (#2775) 2025-09-17 14:23:21 +02:00
Kristoffer Dalby
233dffc186 lint and leftover
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
9d236571f4 state/nodestore: in memory representation of nodes
Initial work on a nodestore, which stores all of the nodes and
their relations in memory, with peer relationships precalculated.

It is a copy-on-write structure, replacing the "snapshot" when a
change to the structure occurs. It is optimised for reads: writes
are not fast, but they are grouped into batches to do less of the
expensive peer calculation when many changes arrive in rapid
succession.

Writes block until committed, while reads are never blocked.
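
A minimal sketch of the copy-on-write shape; the snapshot fields and
write batching are simplified:

    package state

    import (
        "sync"
        "sync/atomic"
    )

    type Node struct{ ID uint64 }

    // Snapshot is immutable once published.
    type Snapshot struct {
        nodesByID map[uint64]Node
        peersOf   map[uint64][]uint64 // precalculated peer relations
    }

    type NodeStore struct {
        mu   sync.Mutex // serializes writers; readers never take it
        snap atomic.Pointer[Snapshot]
    }

    // Read returns the current snapshot and never blocks.
    func (s *NodeStore) Read() *Snapshot { return s.snap.Load() }

    // Write builds a fresh snapshot from the old one and publishes
    // it atomically; the caller blocks until the change is committed.
    func (s *NodeStore) Write(apply func(old *Snapshot) *Snapshot) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.snap.Store(apply(s.snap.Load()))
    }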

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
8e25f7f9dd bunch of qol (#2748) 2025-08-27 17:09:13 +02:00
Kristoffer Dalby
a058bf3cd3 mapper: produce map before poll (#2628) 2025-07-28 11:15:53 +02:00