When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.
Fix with a monotonic per-node generation counter:
- State.Connect() increments the counter and returns the current
generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
rejects the call if a newer generation exists, making stale
disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
owns the batcher slot (reconnect happened), the old session skips
the grace period entirely.
Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.
Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and served
as dead code. The functionality they covered is exercised by the
active test suite.
Updates #2545
- Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
  camelCase. Remove highLoad* and extremeLoad* constants that were
  only referenced by disabled (X-prefixed) tests.
- Fix misleading assert message that said "1337" while checking
  for region ID 999.
- Remove emoji from test log output to avoid encoding issues
  in CI environments.
Updates #2545
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.
Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.
Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.
Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.
Add comprehensive concurrency tests validating these changes.
Add comprehensive unit tests for the LockFreeBatcher covering
AddNode/RemoveNode lifecycle, addToBatch routing (broadcast, targeted,
full update), processBatchedChanges deduplication, cleanup of offline
nodes, close/shutdown behavior, IsConnected state tracking, and
connected map consistency.
Add benchmarks for connection entry send, multi-channel send and
broadcast, peer diff computation, sentPeers updates, addToBatch at
various scales (10/100/1000 nodes), processBatchedChanges, broadcast
delivery, IsConnected lookups, connected map enumeration, connection
churn, and concurrent send+churn scenarios.
Widen setupBatcherWithTestData to accept testing.TB so benchmarks can
reuse the same database-backed test setup as unit tests.
When stale-send cleanup prunes a connection from the batcher, the old
serveLongPoll session needs an explicit stop signal. Pass a stop hook
into AddNode and trigger it when that connection is removed, so the
session exits through its normal cancel path instead of relying on
channel closure from the batcher side.
A connection can already be removed from multiChannelNodeConn by the
stale-send cleanup path before serveLongPoll reaches its deferred
RemoveNode call. In that case RemoveNode used to return early on
"channel not found" and never updated the node's connected state.
Drop that early return so RemoveNode still checks whether any active
connections remain and marks the node disconnected when the last one
is gone.
Generalise the registration pipeline into an auth pipeline supporting
both node registrations and SSH check auth requests.
Rename RegistrationID to AuthID, unexport AuthRequest fields, and
introduce AuthVerdict to unify the auth finish API.
Add the urlParam generic helper for extracting typed URL parameters
from chi routes, used by the new auth request handler.
Updates #1850
Refactor the RequestTags migration (202601121700-migrate-hostinfo-request-tags)
to use PolicyManager.NodeCanHaveTag() instead of reimplementing tag validation.
Changes:
- NewHeadscaleDatabase now accepts *types.Config to allow migrations
access to policy configuration
- Add loadPolicyBytes helper to load policy from file or DB based on config
- Add standalone GetPolicy(tx *gorm.DB) for use during migrations
- Replace custom tag validation logic with PolicyManager
Benefits:
- Full HuJSON parsing support (not just JSON)
- Proper group expansion via PolicyManager
- Support for nested tags and autogroups
- Works with both file and database policy modes
- Single source of truth for tag validation
Co-Authored-By: Shourya Gautam <shouryamgautam@gmail.com>
This commit replaces the ChangeSet with a simpler bool-based change
model that the map builder can use directly to build the appropriate
map response for the change that has occurred. Previously, we fell
back to sending full maps for a lot of changes, as that was considered
"the safe" thing to do to ensure no updates were missed.
This was slightly problematic because a node that already has a list
of peers will only do a full replacement of its peers if the new list
is non-empty, meaning it was not possible to remove all nodes (if,
for example, the policy changed).
Now we keep track of the last-seen nodes, so we can send removed IDs,
and we are much smarter about sending smaller, partial maps when
needed.
Fixes #2389
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
When the node notifier was replaced with the batcher, we removed the
notifier's shutdown but forgot to add the batcher's, so node
connections were never stopped and waited forever.
Fixes #2751
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Initial work on a nodestore, which stores all of the nodes and their
relations in memory, with peer relationships precalculated.
It is a copy-on-write structure, replacing the "snapshot" when a
change to the structure occurs. It is optimised for reads, and while
batched writes are not fast, they are grouped together to do less of
the expensive peer calculation when many changes arrive rapidly.
Writes block until committed, while reads are never blocked.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Before this patch, we would send a message to each "node stream"
indicating that there is an update that needs to be turned into a
map response and sent to a node.
Producing the map response is a costly affair, which means that while
a node was producing one, it might start blocking and create full
queues from the poller all the way up to where updates were sent.
This could cause updates to time out and be dropped: a bad node going
away, or one spending too much time processing, would cause all the
other nodes to not get any updates.
In addition, it contributed to uncontrolled parallel processing by
potentially doing too many expensive operations at the same time:
each node stream is essentially a channel, meaning that if you have 30
nodes, we will try to process 30 map requests at the same time. If you
have 8 CPU cores, that will saturate all the cores immediately and
cause a lot of wasted switching between the processing.
Now all the maps are processed by workers in the mapper, and the
number of workers is controllable. The recommendation is a bit fewer
workers than the number of CPU cores, allowing us to process maps as
fast as we can and then send them to the poll.
When the poll receives the map, it is only responsible for taking it
and sending it to the node.
This might not directly improve Headscale's raw performance, but it
will likely make performance a lot more consistent. And I would argue
the design is a lot easier to reason about.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>