Rename all 594 test data files from .json to .hujson and add
descriptive header comments to each file documenting what policy
rules are under test and what outcome is expected.
Update test loaders in all 5 _test.go files to parse HuJSON via
hujson.Parse/Standardize/Pack before json.Unmarshal.
Add cross-dependency warning to via_compat_test.go documenting
that GRANT-V29/V30/V31/V36 are shared with TestGrantsCompat.
Add .gitignore exemption for testdata HuJSON files.
Remove unused error variables (ErrGrantViaNotSupported, ErrGrantEmptySources, ErrGrantEmptyDestinations, ErrGrantViaOnlyTag) and the stale TODO for via implementation. Update compat test skip reasons to reflect that user:*@passkey wildcard is a known unsupported feature, not a pending implementation.
Updates #2180
Remove TestGrantViaExitNodeSteering and TestGrantViaMixedSteering.
Exit node traffic forwarding through via grants cannot be validated
with curl/traceroute in Docker containers because Tailscale exit nodes
strip locally-connected subnets from their forwarding filter.
The correctness of via exit steering is validated by:
- Golden MapResponse comparison (TestViaGrantMapCompat with GRANT-V31
and GRANT-V36) comparing full netmap output against Tailscale SaaS
- Filter rule compatibility (TestGrantsCompat with GRANT-V14 through
GRANT-V36) comparing per-node PacketFilter rules against Tailscale SaaS
- TestGrantViaSubnetSteering (kept) validates via subnet steering with
actual curl/traceroute through Docker, which works for subnet routes
Updates #2180
Use per-node compilation path for via grants in BuildPeerMap and MatchersForNode to ensure via-granted nodes appear in peer maps. Fix ViaRoutesForPeer golden test route inference to correctly resolve via grant effects.
Updates #2180
Add golden test data for via exit route steering and fix via exit grant compilation to match Tailscale SaaS behavior. Includes MapResponse golden tests for via grant route steering verification.
Updates #2180
Add NetworkSpec struct with optional Subnet field to ScenarioSpec.Networks.
When Subnet is set, the Docker network is created with that specific CIDR
instead of Docker's auto-assigned RFC1918 range.
Fix all exit node integration tests to use curl + traceroute. Tailscale
exit nodes strip locally-connected subnets from their forwarding filter
(shrinkDefaultRoute + localInterfaceRoutes), so exit nodes cannot
forward to IPs on their Docker network via the default route alone.
This is by design: exit nodes provide internet access, not LAN access.
To also get LAN access, the subnet must be explicitly advertised as a
route — matching real-world Tailscale deployment requirements.
- TestSubnetRouterMultiNetworkExitNode: advertise usernet1 subnet
alongside exit route, upgraded from ping to curl + traceroute
- TestGrantViaExitNodeSteering: usernet1 subnet in via grants and
auto-approvers alongside autogroup:internet
- TestGrantViaMixedSteering: externet subnet in auto-approvers and
route advertisement for exit traffic
Updates #2180
Add three tests that verify control plane behavior for grant policies:
- TestGrantViaSubnetFilterRules: verifies the router's PacketFilter
contains destination rules for via-steered subnets. Without per-node
filter compilation for via grants, these rules were missing and the
router would drop forwarded traffic.
- TestGrantViaExitNodeFilterRules: same verification for exit nodes
with via grants steering autogroup:internet traffic.
- TestGrantIPv6OnlyPrefixACL: verifies that address-based aliases
(Prefix, Host) resolve to exactly the literal prefix and do not
expand to include the matching node's other IP addresses. An
IPv6-only host definition produces only IPv6 filter rules.
Updates #2180
Via grants compile filter rules that depend on the node's route state
(SubnetRoutes, ExitRoutes). Without per-node compilation, these rules
were only included in the global filter path which explicitly skips via
grants (compileFilterRules skips grants with non-empty Via fields).
Add a needsPerNodeFilter flag that is true when the policy uses either
autogroup:self or via grants. filterForNodeLocked now uses this flag
instead of usesAutogroupSelf alone, ensuring via grant rules are
compiled per-node through compileFilterRulesForNode/compileViaGrant.
The filter cache also needs to account for route-dependent compilation:
- nodesHavePolicyAffectingChanges now treats route changes as
policy-affecting when needsPerNodeFilter is true, so SetNodes
triggers updateLocked and clears caches through the normal flow.
- invalidateGlobalPolicyCache now clears compiledFilterRulesMap
(the unreduced per-node cache) alongside filterRulesMap when
needsPerNodeFilter is true and routes changed.
Updates #2180
Action.UnmarshalJSON produces the format
'action="unknown-action" is not supported: invalid ACL action',
not the reversed format the test expected.
Updates #2180
Add servertest grant policy control plane tests covering basic grants, via grants, and cap grants. Fix ReduceRoutes in State to apply route reduction to non-via routes first, then append via-included routes, preventing via grant routes from being incorrectly filtered.
Updates #2180
Test via route computation for viewer-peer pairs: self-steering returns
empty, viewer not in source returns empty, peer without advertised
destination returns empty, peer with/without via tag populates
Include/Exclude respectively, mixed prefix and autogroup:internet
destinations, and exit route steering.
7 subtests covering all code paths in ViaRoutesForPeer.
Updates #2180
Test companionCapGrantRules, sourcesHaveWildcard, sourcesHaveDangerAll,
srcIPsWithRoutes, the FilterAllowAll fix for grant-only policies,
compileViaGrant, compileGrantWithAutogroupSelf grant paths, and
destinationsToNetPortRange autogroup:internet skipping.
51 subtests across 8 test functions covering all grant-specific code
paths in filter.go that previously had no test coverage.
Updates #2180
Add NodeAttrsTaildriveShare and NodeAttrsTaildriveAccess to the node capability map, enabling Taildrive file sharing when granted via policy. Add integration test verifying the full cap/drive grant lifecycle.
Updates #2180
Add ConnectToNetwork to the TailscaleClient interface for multi-network test scenarios and implement peer relay ping support. Use these to test that cap/relay grants correctly enable peer-to-peer relay connections between tagged nodes.
Updates #2180
compileFilterRules checked only pol.ACLs == nil to decide whether
to return FilterAllowAll (permit-any). Policies that use only Grants
(no ACLs) had nil ACLs, so the function short-circuited before
compiling any CapGrant rules. This meant cap/relay, cap/drive, and
any other App-based grant capabilities were silently ignored.
Check both ACLs and Grants are empty before returning FilterAllowAll.
Updates #2180
Add integration tests validating that via grants correctly steer
routes to designated nodes per client group:
- TestGrantViaSubnetSteering: two routers advertise the same
subnet, via grants steer each client group to a specific router.
Verifies per-client route visibility, curl reachability, and
traceroute path.
- TestGrantViaExitNodeSteering: two exit nodes, via grants steer
each client group to a designated exit node. Verifies exit
routes are withdrawn from non-designated nodes and the client
rejects setting a non-designated exit node.
- TestGrantViaMixedSteering: cross-steering where subnet routes
and exit routes go to different servers per client group.
Verifies subnet traffic uses the subnet-designated server while
exit traffic uses the exit-designated server.
Also add autogroupp helper for constructing AutoGroup aliases in
grant policy configurations.
Updates #2180
Via grants steer routes to specific nodes per viewer. Until now,
all clients saw the same routes for each peer because route
assembly was viewer-independent. This implements per-viewer route
visibility so that via-designated peers serve routes only to
matching viewers, while non-designated peers have those routes
withdrawn.
Add ViaRouteResult type (Include/Exclude prefix lists) and
ViaRoutesForPeer to the PolicyManager interface. The v2
implementation iterates via grants, resolves sources against the
viewer, matches destinations against the peer's advertised routes
(both subnet and exit), and categorizes prefixes by whether the
peer has the via tag.
Add RoutesForPeer to State which composes global primary election,
via Include/Exclude filtering, exit routes, and ACL reduction.
When no via grants exist, it falls back to existing behavior.
Update the mapper to call RoutesForPeer per-peer instead of using
a single route function for all peers. The route function now
returns all routes (subnet + exit), and TailNode filters exit
routes out of the PrimaryRoutes field for HA tracking.
Updates #2180
compileViaGrant only handled *Prefix destinations, skipping
*AutoGroup entirely. This meant via grants with
dst=[autogroup:internet] produced no filter rules even when the
node was an exit node with approved exit routes.
Switch the destination loop from a type assertion to a type switch
that handles both *Prefix (subnet routes) and *AutoGroup (exit
routes via autogroup:internet). Also check ExitRoutes() in
addition to SubnetRoutes() so the function doesn't bail early
when a node only has exit routes.
Updates #2180
Add autogroup:danger-all as a valid source alias that matches ALL IP
addresses including non-Tailscale addresses. When used as a source,
it resolves to 0.0.0.0/0 + ::/0 internally but produces SrcIPs: ["*"]
in filter rules. When used as a destination, it is rejected with an
error matching Tailscale SaaS behavior.
Key changes:
- Add AutoGroupDangerAll constant and validation
- Add sourcesHaveDangerAll() helper and hasDangerAll parameter to
srcIPsWithRoutes() across all compilation paths
- Add ErrAutogroupDangerAllDst for destination rejection
- Remove 3 AUTOGROUP_DANGER_ALL skip entries (K6, K7, K8)
Updates #2180
Implement comprehensive grant validation including: accept empty sources/destinations (they produce no rules), validate grant ip/app field requirements, capability name format, autogroup constraints, via tag existence, and default route CIDR restrictions.
Updates #2180
Compile grants with "via" field into FilterRules that are placed only
on nodes matching the via tag and actually advertising the destination
subnets. Key behavior:
- Filter rules go exclusively to via-nodes with matching approved routes
- Destination subnets not advertised by the via node are silently dropped
- App-only via grants (no ip field) produce no packet filter rules
- Via grants are skipped in the global compileFilterRules since they
are node-specific
Reduces grant compat test skips from 41 to 30 (11 newly passing).
Updates #2180
Compile grant app fields into CapGrant FilterRules matching Tailscale
SaaS behavior. Key changes:
- Generate CapGrant rules in compileFilterRules and
compileGrantWithAutogroupSelf, with node-specific /32 and /128
Dsts for autogroup:self grants
- Add reversed companion rules for drive→drive-sharer and
relay→relay-target capabilities, ordered by original cap name
- Narrow broad CapGrant Dsts to node-specific prefixes in
ReduceFilterRules via new reduceCapGrantRule helper
- Skip merging CapGrant rules in mergeFilterRules to preserve
per-capability structure
- Remove ip+app mutual exclusivity validation (Tailscale accepts both)
- Add semantic JSON comparison for RawMessage types and netip.Prefix
comparators in test infrastructure
Reduces grant compat test skips from 99 to 41 (58 newly passing).
Updates #2180
When an ACL source list contains a wildcard (*) alongside explicit
sources (tags, groups, hosts, etc.), Tailscale preserves the individual
IPs from non-wildcard sources in SrcIPs alongside the merged wildcard
CGNAT ranges. Previously, headscale's IPSetBuilder would merge all
sources into a single set, absorbing the explicit IPs into the wildcard
range.
Track non-wildcard resolved addresses separately during source
resolution, then append their individual IP strings to the output
when a wildcard is also present. This fixes the remaining 5 ACL
compat test failures (K01 and M06 subtests).
Updates #2180
When an ACL has non-autogroup destinations (groups, users, tags, hosts)
alongside autogroup:self, emit non-self grants before self grants to
match Tailscale's filter rule ordering. ACLs with only autogroup
destinations (self + member) preserve the policy-defined order.
This fixes ACL-A17, ACL-SF07, and ACL-SF11 compat test failures.
Updates #2180
Add exit route check in ReduceFilterRules to prevent exit nodes from receiving packet filter rules for destinations that only overlap via exit routes. Remove resolved SUBNET_ROUTE_FILTER_RULES grant skip entries and update error message formatting for grant validation.
Updates #2180
Per Tailscale documentation, the wildcard (*) source includes "any
approved subnets" — the actually-advertised-and-approved routes from
nodes, not the autoApprover policy prefixes.
Change Asterix.resolve() to return just the base CGNAT+ULA set, and
add approved subnet routes as separate SrcIPs entries in the filter
compilation path. This preserves individual route prefixes that would
otherwise be merged by IPSet (e.g., 10.0.0.0/8 absorbing 10.33.0.0/16).
Also swap rule ordering in compileGrantWithAutogroupSelf() to emit
non-self destination rules before autogroup:self rules, matching the
Tailscale FilterRule wire format ordering.
Remove the unused AutoApproverPolicy.prefixes() method.
Updates #2180
Add routable_ips and approved_routes fields to the node topology
definitions in all golden test files. These represent the subnet
routes actually advertised by nodes on the Tailscale SaaS network
during data capture:
Routes topology (92 files, 6 router nodes):
big-router: 10.0.0.0/8
subnet-router: 10.33.0.0/16
ha-router1: 192.168.1.0/24
ha-router2: 192.168.1.0/24
multi-router: 172.16.0.0/24
exit-node: 0.0.0.0/0, ::/0
ACL topology (199 files, 1 router node):
subnet-router: 10.33.0.0/16
Grants topology (203 files, 1 router node):
subnet-router: 10.33.0.0/16
The route assignments were deduced from the golden data by analyzing
which router nodes receive FilterRules for which destination CIDRs
across all test files, and cross-referenced with the MTS setup
script (setup_grant_nodes.sh).
Updates #2180
Use ip.String() instead of netip.PrefixFrom(ip, ip.BitLen()).String()
when building DstPorts for autogroup:self destinations. This produces
bare IPs like "100.90.199.68" instead of CIDR notation like
"100.90.199.68/32", matching the Tailscale FilterRule wire format.
Updates #2180
AppendToIPSet now adds both IPv4 and IPv6 addresses for nodes, matching Tailscale's FilterRule wire format where identity-based aliases (tags, users, groups, autogroups) resolve to both address families. Update ReduceFilterRules test expectations to include IPv6 entries.
Updates #2180
Replace 8,286 lines of inline Go test expectations with 92 JSON golden files captured from Tailscale SaaS. The data-driven test driver validates route filtering, auto-approval, HA routing, and exit node behavior against real Tailscale output.
Updates #2180
Replace 9,937 lines of inline Go test expectations with 215 JSON golden files captured from Tailscale SaaS. The new data-driven test driver compares headscale's filter compilation output against real Tailscale behavior for each node in an 8-node topology.
Updates #2180
Introduce ResolvedAddresses type for structured IP set results. Refactor all Alias.Resolve() methods to return ResolvedAddresses instead of raw IPSets. Restrict identity-based aliases to matching address families, fix nil dereferences in partial resolution paths, and update test expectations for the new IP format (bare IPs, IP ranges instead of CIDR prefixes).
Updates #2180
Rename tailscale_compat_test.go to tailscale_acl_compat_test.go to make room for the grants compat test. Add 237 GRANT-*.json golden test files captured from Tailscale SaaS and a data-driven test driver that compares headscale's grant filter compilation against real Tailscale behavior.
Updates #2180
Add support for the Grant policy format as an alternative to ACL format,
following Tailscale's policy v2 specification. Grants provide a more
structured way to define network access rules with explicit separation
of IP-based and capability-based permissions.
Key changes:
- Add Grant struct with Sources, Destinations, InternetProtocols (ip),
and App (capabilities) fields
- Add ProtocolPort type for unmarshaling protocol:port strings
- Add Grant validation in Policy.validate() to enforce:
- Mutual exclusivity of ip and app fields
- Required ip or app field presence
- Non-empty sources and destinations
- Refactor compileFilterRules to support both ACLs and Grants
- Convert ACLs to Grants internally via aclToGrants() for unified
processing
- Extract destinationsToNetPortRange() helper for cleaner code
- Rename parseProtocol() to toIANAProtocolNumbers() for clarity
- Add ProtocolNumberToName mapping for reverse lookups
The Grant format allows policies to be written using either the legacy
ACL format or the new Grant format. ACLs are converted to Grants
internally, ensuring backward compatibility while enabling the new
format's benefits.
Updates #2180
WithTags was defined but never passed through to CreatePreAuthKey.
Fix NewClient to use CreateTaggedPreAuthKey when tags are specified,
enabling tests that need tagged nodes (e.g. via grant steering).
Updates #2180
When exit routes are approved, SubnetRoutes remains empty because exit
routes (0.0.0.0/0, ::/0) are classified separately. Without checking
ExitRoutes, the PolicyManager cache is not invalidated on exit route
approval, causing stale filter rules that lack via grant entries for
autogroup:internet destinations.
Updates #2180
Switch all integration test jobs (build, build-postgres, test
template) from ubuntu-latest (x86_64) to ubuntu-24.04-arm (aarch64).
ARM runners on GitHub Actions are free for public repos and tend
to have more consistent performance characteristics than the
shared x86_64 pool. This should reduce flakiness caused by
resource contention on congested runners.
Updates #3125
Apply CI-aware scaling to all remaining hardcoded timeouts:
- requireAllClientsOfflineStaged: scale the three internal stage
timeouts (15s/20s/60s) with ScaledTimeout.
- validateReloginComplete: scale requireAllClientsOnline (120s)
and requireAllClientsNetInfoAndDERP (3min) calls.
- WaitForTailscaleSyncPerUser callers in acl_test.go (3 sites, 60s).
- WaitForRunning callers in tags_test.go (10 sites): switch to
PeerSyncTimeout() to match convention.
- WaitForRunning/WaitForPeers direct callers in route_test.go.
- requireAllClientsOnline callers in general_test.go and
auth_key_test.go.
Replace pingAllHelper with assertPingAll/assertPingAllWithCollect:
- Wraps pings in EventuallyWithT so transient docker exec timeouts
are retried instead of immediately failing the test.
- Timeout scales with the ping matrix size (2s per ping budget for
2 full sweeps) so large tests get proportionally more time.
- Uses CollectT correctly, fixing the broken EventuallyWithT usage
in TestEphemeral where the old t.Errorf bypassed CollectT.
- Follows the established assert*/assertWithCollect naming.
Updates #3125
The default docker execute timeout (10s) is the root cause of
"dockertest command timed out" errors across many integration tests
on CI. On congested GitHub Actions runners, docker exec latency
alone can consume 2-5 seconds of this budget before the command
even starts inside the container.
Replace the hardcoded 10s constant with a function that returns
20s on CI, doubling the budget for all container commands
(tailscale status, headscale CLI, curl, etc.).
Similarly, scale the default tailscale ping timeout from 200ms to
400ms on CI. This doubles the per-attempt budget and the docker
exec timeout for pings (from 200ms*5=1s to 400ms*5=2s), giving
more headroom for docker exec overhead.
Updates #3125
TestACLTagPropagationPortSpecific failed twice on CI because it jumped
from SetNodeTags directly to checking curl, without first verifying the
tag change was applied on the server. This races against server-side
processing.
Add a tag verification step (matching TestACLTagPropagation's pattern)
and bump the Step 4 timeout from 60s to 90s since port-specific filter
changes require both endpoints to process the new PacketFilter from
the MapResponse while the WireGuard tunnel stays up.
Updates #3125
Wrap all 329 hardcoded EventuallyWithT timeouts across 12 test files
with integrationutil.ScaledTimeout(), which applies a 2x multiplier
on CI runners. This addresses the systemic issue where hardcoded
timeouts that work locally are insufficient under CI resource
contention.
Variable-based timeouts (propagationTime, assertTimeout in
route_test.go and totalWaitTime in auth_oidc_test.go) are wrapped
at their definition site so all downstream usages benefit.
The retry intervals (second duration parameter) are intentionally
NOT scaled, as they control polling frequency, not total wait time.
Updates #3125
Replace Curl() with CurlFailFast() in all negative curl assertions
(where the test expects the connection to fail). CurlFailFast uses
1 retry and 2s max time instead of 3 retries and 5s max, which
avoids wasting time on unnecessary retries when we expect the
connection to be blocked.
This affects 21 call sites across 7 test functions:
- TestACLAllowUser80Dst
- TestACLDenyAllPort80
- TestACLAllowUserDst
- TestACLAllowStarDst
- TestACLNamedHostsCanReach
- TestACLDevice1CanAccessDevice2
- TestPolicyUpdateWhileRunningWithCLIInDatabase
- TestACLAutogroupSelf
- TestACLPolicyPropagationOverTime
Where possible, the inline Curl+Error pattern is replaced with the
assertCurlFailWithCollect helper introduced in the previous commit.
Updates #3125
Add ScaledTimeout to scale EventuallyWithT timeouts by 2x on CI,
consistent with the existing PeerSyncTimeout (60s/120s) and
dockertestMaxWait (300s/600s) conventions.
Add assertCurlSuccessWithCollect and assertCurlFailWithCollect helpers
following the existing *WithCollect naming convention.
assertCurlFailWithCollect uses CurlFailFast internally for aggressive
timeouts, avoiding wasted retries when expecting blocked connections.
Apply these to the three flakiest ACL tests:
- TestACLTagPropagation: swap NetMap and curl verification order so
the fast NetMap check (confirms MapResponse arrived) runs before
the slower curl check. Use curl helpers and scaled timeouts.
- TestACLTagPropagationPortSpecific: use curl helpers and scaled
timeouts.
- TestACLHostsInNetMapTable: scale the 10s EventuallyWithT timeout.
Updates #3125
Move the container image and binary download details from the README
into a dedicated documentation page at setup/install/main. This gives
development builds a proper home in the docs site alongside the other
install methods. The README now links to the docs page instead.
Build and push multi-arch container images (linux/amd64, linux/arm64)
to GHCR and Docker Hub on every push to main that changes Go or Nix
files. Images are tagged as main-<short-sha> using ko with the same
distroless base image as release builds.
Cross-compiled binaries for linux and darwin (amd64, arm64) are
uploaded as workflow artifacts. The README links to these via
nightly.link for stable download URLs.
The project mkdocs-material is in maintenance-only mode and their
successor is not ready yet.
Use the modern, refreshed theme and drop the pymdownx.magiclink
extension.
Add a highly visible ASCII-art warning banner that is printed at
startup when the configured IP prefixes fall outside the standard
Tailscale CGNAT (100.64.0.0/10) or ULA (fd7a:115c:a1e0::/48) ranges.
The warning fires once even if both v4 and v6 are non-standard, and
the warnBanner() function is reusable for other critical configuration
warnings in the future.
Also updates config-example.yaml to clarify that subsets of the
default ranges are fine, but ranges outside CGNAT/ULA are not.
Closes#3055
Three corrections to issue tests that had wrong assumptions about
when data becomes available:
1. initial_map_should_include_peer_online_status: use WaitForCondition
instead of checking the initial netmap. Online status is set by
Connect() which sends a PeerChange patch after the initial
RegisterResponse, so it may not be present immediately.
2. disco_key_should_propagate_to_peers: use WaitForCondition. The
DiscoKey is sent in the first MapRequest (not RegisterRequest),
so peers may not see it until a subsequent map update.
3. approved_route_without_announcement: invert the test expectation.
Tailscale uses a strict advertise-then-approve model -- routes are
only distributed when the node advertises them (Hostinfo.RoutableIPs)
AND they are approved. An approval without advertisement is a dormant
pre-approval. The test now asserts the route does NOT appear in
AllowedIPs, matching upstream Tailscale semantics.
Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
Two fixes to how online status is handled during registration:
1. Re-registration (applyAuthNodeUpdate, HandleNodeFromPreAuthKey) no
longer resets IsOnline to false. Online status is managed exclusively
by Connect()/Disconnect() in the poll session lifecycle. The reset
caused a false offline blip: the auth handler's change notification
triggered a map regeneration showing the node as offline to peers,
even though Connect() would set it back to true moments later.
2. New node creation (createAndSaveNewNode) now explicitly sets
IsOnline=false instead of leaving it nil. This ensures peers always
receive a known online status rather than an ambiguous nil/unknown.
When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.
Fix with a monotonic per-node generation counter:
- State.Connect() increments the counter and returns the current
generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
rejects the call if a newer generation exists, making stale
disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
owns the batcher slot (reconnect happened), the old session skips
the grace period entirely.
Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.
Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:
- race_test.go: 14 tests exercising concurrent mutations, session
replacement, batcher contention, NodeStore access, and map response
delivery during disconnect. All pass the Go race detector.
- poll_race_test.go: 8 tests targeting the poll.go grace period
interleaving. These confirm a logical TOCTOU race: when a node
disconnects and reconnects within the grace period, the old
session's deferred Disconnect() can overwrite the new session's
Connect(), leaving IsOnline=false despite an active poll session.
- stress_test.go: sustained churn, rapid mutations, rolling
replacement, data integrity checks under load, and verification
that rapid reconnects do not leak false-offline notifications.
Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.
Issues surfaced (4 failing tests):
1. initial_map_should_include_peer_online_status: Initial MapResponse
has Online=nil for peers. Online status only arrives later via
PeersChangedPatch.
2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
is not visible to peers. Peers see zero disco key.
3. approved_route_without_announcement_is_visible: Server-side route
approval without client-side announcement silently produces empty
SubnetRoutes (intersection of empty announced + approved = empty).
4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
cycles, NodeStore reports node as offline despite having an active
poll session. The connect/disconnect grace period interleaving
leaves IsOnline in an incorrect state.
Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses
New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
advertised routes via hostinfo, CGNAT range validation
Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
Add three test files exercising the servertest harness:
- lifecycle_test.go: connection, disconnection, reconnection, session
replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
with various delays, concurrent reconnects, and scale tests.
All tests use table-driven patterns with subtests.
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.
The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency
Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.
This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:
1. MapResponses delivered out of order — a later change could finish
generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
updateSentPeers does Clear() + Store() which is not atomic relative
to a concurrent Range() in computePeerDiff.
Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.
Fixes#3140
Replace inline WithDockerEntrypoint shell scripts in
TestACLTagPropagation and TestACLTagPropagationPortSpecific with
the standard WithPackages and WithWebserver options.
The custom entrypoints used fragile fixed sleeps and lacked the
robust network/cert readiness waits that buildEntrypoint provides.
Updates #3139
Make embedded DERP server and TLS the default configuration for all
integration tests, replacing the per-test opt-in model that led to
inconsistent and flaky test behavior.
Infrastructure changes:
- DefaultConfigEnv() includes embedded DERP server settings
- New() auto-generates a proper CA + server TLS certificate pair
- CA cert is installed into container trust stores and returned by
GetCert() so clients and internal tools (curl) trust the server
- CreateCertificate() now returns (caCert, cert, key) instead of
discarding the CA certificate
- Add WithPublicDERP() and WithoutTLS() opt-out options
- Remove WithTLS(), WithEmbeddedDERPServerOnly(), and WithDERPAsIP()
since all their behavior is now the default or unnecessary
Test cleanup:
- Remove all redundant WithTLS/WithEmbeddedDERPServerOnly/WithDERPAsIP
calls from test files
- Give every test a unique WithTestName by parameterizing aclScenario,
sshScenario, and derpServerScenario helpers
- Add WithTestName to tests that were missing it
- Document all non-standard options with inline comments explaining
why each is needed
Updates #3139
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:
- nil value: node is connected
- non-nil time: node disconnected at that timestamp
- key missing: node was never seen
This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Now()) was used instead of &now, producing a zero time.
Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.
Changes:
- Add disconnectedAt field and helpers (markConnected, markDisconnected,
isConnected, offlineDuration) to multiChannelNodeConn
- Remove the connected field from Batcher
- Simplify IsConnected from two map lookups to one
- Simplify ConnectedMap and Debug from two-map iteration to one
- Rewrite cleanupOfflineNodes to scan b.nodes directly
- Remove the markDisconnectedIfNoConns helper
- Update all tests and benchmarks
Fixes#3141
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:
1. MapResponses delivered out of order — a later change could finish
generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
updateSentPeers does Clear() + Store() which is not atomic relative
to a concurrent Range() in computePeerDiff.
Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.
Fixes#3140
The Noise handshake accepts any machine key without checking
registration, so all endpoints behind the Noise router are reachable
without credentials. Three handlers used io.ReadAll without size
limits, allowing an attacker to OOM-kill the server.
Fix:
- Add http.MaxBytesReader middleware (1 MiB) on the Noise router.
- Replace io.ReadAll + json.Unmarshal with json.NewDecoder in
PollNetMapHandler and RegistrationHandler.
- Stop reading the body in NotImplementedHandler entirely.
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and served
as dead code. The functionality they covered is exercised by the
active test suite.
Updates #2545
L8: Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
camelCase. Remove highLoad* and extremeLoad* constants that were
only referenced by disabled (X-prefixed) tests.
L10: Fix misleading assert message that said "1337" while checking
for region ID 999.
L12: Remove emoji from test log output to avoid encoding issues
in CI environments.
Updates #2545
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs.
L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.
L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.
L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.
Updates #2545
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.
Pure move — no logic changes.
Updates #2545
- TestBatcher_CloseBeforeStart_DoesNotHang: verifies Close() before
Start() returns promptly now that done is initialized in NewBatcher.
- TestBatcher_QueueWorkAfterClose_DoesNotHang: verifies queueWork
returns via the done channel after Close(), even without Start().
- TestIsConnected_FalseAfterAddNodeFailure: verifies IsConnected
returns false after AddNode fails and removes the last connection.
- TestRemoveConnectionAtIndex_NilsTrailingSlot: verifies the backing
array slot is nil-ed after removal to avoid retaining pointers.
Updates #2545
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated.
M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.
M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.
M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.
Updates #2545
Add four unit tests guarding fixes introduced in recent commits:
- TestConnectionEntry_SendFastPath_TimerStopped: verifies the
time.NewTimer fix (H1) does not leak goroutines after many
fast-path sends on a buffered channel.
- TestBatcher_CloseWaitsForWorkers: verifies Close() blocks until all
worker goroutines exit (H3), preventing sends on torn-down channels.
- TestBatcher_CloseThenStartIsNoop: verifies the one-shot lifecycle
contract; Start() after Close() must not spawn new goroutines.
- TestBatcher_CloseStopsTicker: verifies Close() stops the internal
ticker to prevent resource leaks.
Updates #2545
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.
These methods run on every MapResponse delivery, so the savings
compound quickly under load.
Updates #2545
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its internal goroutine.
Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.
Document that a Batcher must not be reused after Close().
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.
Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
Add embedded DERP server, TLS, and netfilter=off to match the
infrastructure configuration used by all other ACL integration tests.
Without these options, the test fails intermittently because traffic
routes through external DERP relays and iptables initialization fails
in Docker containers.
Updates #3139
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.
Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
Rewrite multiChannelNodeConn.send() to use a two-phase approach:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity
Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The two-phase approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.
Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.
Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.
Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.
Add comprehensive concurrency tests validating these changes.
Add comprehensive unit tests for the LockFreeBatcher covering
AddNode/RemoveNode lifecycle, addToBatch routing (broadcast, targeted,
full update), processBatchedChanges deduplication, cleanup of offline
nodes, close/shutdown behavior, IsConnected state tracking, and
connected map consistency.
Add benchmarks for connection entry send, multi-channel send and
broadcast, peer diff computation, sentPeers updates, addToBatch at
various scales (10/100/1000 nodes), processBatchedChanges, broadcast
delivery, IsConnected lookups, connected map enumeration, connection
churn, and concurrent send+churn scenarios.
Widen setupBatcherWithTestData to accept testing.TB so benchmarks can
reuse the same database-backed test setup as unit tests.
Add golang.org/x/tools/cmd/stress as a tool dependency for running
tests under repeated stress to surface flaky failures.
Update flake vendorHash for the new go.mod dependencies.
Buffer the AuthRequest verdict channel to prevent a race where the
sender blocks indefinitely if the receiver has already timed out, and
increase the auth followup test timeout from 100ms to 5s to prevent
spurious failures under load.
Skip postgres-backed tests when the postgres server is unavailable
instead of calling t.Fatal, which was preventing the rest of the test
suite from running.
Add TestMain to db, types, and policy/v2 packages to chdir to the
source directory before running tests. This ensures relative testdata/
paths resolve correctly when the test binary is executed from an
arbitrary working directory (e.g., via "go tool stress").
When stale-send cleanup prunes a connection from the batcher, the old serveLongPoll session needs an explicit stop signal. Pass a stop hook into AddNode and trigger it when that connection is removed, so the session exits through its normal cancel path instead of relying on channel closure from the batcher side.
When the batcher timed out sending to a node, it removed the channel from multiChannelNodeConn but left the old serveLongPoll goroutine running on that channel. That left a live stale session behind: it no longer received new updates, but it could still keep the stream open and block shutdown.
Close the pruned channel when stale-send cleanup removes it so the old map session exits after draining any buffered update.
A connection can already be removed from multiChannelNodeConn by the stale-send cleanup path before serveLongPoll reaches its deferred RemoveNode call. In that case RemoveNode used to return early on "channel not found" and never updated the node's connected state.
Drop that early return so RemoveNode still checks whether any active connections remain and marks the node disconnected when the last one is gone.
Update mdformat and related packages from python313Packages to
python314Packages. All four packages (mdformat, mdformat-footnote,
mdformat-frontmatter, mdformat-mkdocs) are available in the updated
nixpkgs.
Updates #1261
Explicitly set derp.urls to an empty list in the NixOS VM test,
matching the upstream nixpkgs test. The VMs have no internet
access, so fetching the default Tailscale DERP map would silently
fail and add unnecessary timeout delay to the test run.
Add missing typed options from the upstream nixpkgs module:
- configFile: read-only option exposing the generated config path
for composability with other NixOS modules
- dns.split: split DNS configuration with proper type checking
- dns.extra_records: typed submodule with name/type/value validation
Sync descriptions and assertions with upstream:
- Use Tailscale doc link for override_local_dns description
- Remove redundant requirement note from nameservers.global
- Match upstream assertion message wording and expression style
Update systemd script to reference cfg.configFile instead of a
local let-binding, matching the upstream pattern.
Add end-to-end integration test that validates localpart:*@domain
SSH user mapping with real Tailscale clients. The test sets up an
SSH policy with localpart entries and verifies that users can SSH
into tagged servers using their email local-part as the username.
Updates #3049
Add support for localpart:*@<domain> entries in SSH policy users.
When a user SSHes into a target, their email local-part becomes the
OS username (e.g. alice@example.com → OS user alice).
Type system (types.go):
- SSHUser.IsLocalpart() and ParseLocalpart() for validation
- SSHUsers.LocalpartEntries(), NormalUsers(), ContainsLocalpart()
- Enforces format: localpart:*@<domain> (wildcard-only)
- UserWildcard.Resolve for user:*@domain SSH source aliases
- acceptEnv passthrough for SSH rules
Compilation (filter.go):
- resolveLocalparts: pure function mapping users to local-parts
by email domain. No node walking, easy to test.
- groupSourcesByUser: single walk producing per-user principals
with sorted user IDs, and tagged principals separately.
- ipSetToPrincipals: shared helper replacing 6 inline copies.
- selfPrincipalsForNode: self-access using pre-computed byUser.
The approach separates data gathering from rule assembly. Localpart
rules are interleaved per source user to match Tailscale SaaS
first-match-wins ordering.
Updates #3049
NodeView.CanAccess called node2.AsStruct() on every check. In peer-map construction we run CanAccess in O(n^2) pair scans (often twice per pair), so that per-call clone multiplied into large heap churn
Add ReadLog method to headscale integration container for log
inspection. Split SSH check mode tests into CLI and OIDC variants
and add comprehensive test coverage:
- TestSSHOneUserToOneCheckModeCLI: basic check mode with CLI approval
- TestSSHOneUserToOneCheckModeOIDC: check mode with OIDC approval
- TestSSHCheckModeUnapprovedTimeout: rejection on cache expiry
- TestSSHCheckModeCheckPeriodCLI: session expiry and re-auth
- TestSSHCheckModeAutoApprove: auto-approval within check period
- TestSSHCheckModeNegativeCLI: explicit rejection via CLI
Update existing integration tests to use headscale auth register.
Updates #1850
Add SSH check period tracking so that recently authenticated users
are auto-approved without requiring manual intervention each time.
Introduce SSHCheckPeriod type with validation (min 1m, max 168h,
"always" for every request) and encode the compiled check period
as URL query parameters in the HoldAndDelegate URL.
The SSHActionHandler checks recorded auth times before creating a
new HoldAndDelegate flow. Auth timestamps are stored in-memory:
- Default period (no explicit checkPeriod): auth covers any
destination, keyed by source node with Dst=0 sentinel
- Explicit period: auth covers only that specific destination,
keyed by (source, destination) pair
Auth times are cleared on policy changes.
Updates #1850
Add gRPC service definitions for managing auth requests:
AuthRegister to register interactive auth sessions and
AuthApprove/AuthReject to approve or deny pending requests
(used for SSH check mode).
Updates #1850
Implement the SSH "check" action which requires additional
verification before allowing SSH access. The policy compiler generates
a HoldAndDelegate URL that the Tailscale client calls back to
headscale. The SSHActionHandler creates an auth session and waits for
approval via the generalised auth flow.
Sort check (HoldAndDelegate) rules before accept rules to match
Tailscale's first-match-wins evaluation order.
Updates #1850
Extract shared HTML/CSS design into a common template and create
generalised auth success and web auth templates that work for both
node registration and SSH check authentication flows.
Updates #1850
Generalise the registration pipeline to a more general auth pipeline
supporting both node registrations and SSH check auth requests.
Rename RegistrationID to AuthID, unexport AuthRequest fields, and
introduce AuthVerdict to unify the auth finish API.
Add the urlParam generic helper for extracting typed URL parameters
from chi routes, used by the new auth request handler.
Updates #1850
Replace gorilla/mux with go-chi/chi as the HTTP router and add a
custom zerolog-based request logger to replace chi's default
stdlib-based middleware.Logger, consistent with the rest of the
application.
Updates #1850
Extract the existing-node lookup logic from HandleNodeFromPreAuthKey
into a separate method. This reduces the cyclomatic complexity from
32 to 28, below the gocyclo limit of 30.
Updates #3077
Tagged nodes no longer have user_id set, so ListNodes(user) cannot
find them. Update integration tests to use ListNodes() (all nodes)
when looking up tagged nodes.
Add a findNode helper to locate nodes by predicate from an
unfiltered list, used in ACL tests that have multiple nodes per
scenario.
Updates #3077
Test the tagged-node-survives-user-deletion scenario at two layers:
DB layer (users_test.go):
- success_user_only_has_tagged_nodes: tagged nodes with nil
user_id do not block user deletion and survive it
- error_user_has_tagged_and_owned_nodes: user-owned nodes
still block deletion even when tagged nodes coexist
App layer (grpcv1_test.go):
- TestDeleteUser_TaggedNodeSurvives: full registration flow
with tagged PreAuthKey verifies nil UserID after registration,
absence from nodesByUser index, user deletion succeeds, and
tagged node remains in global node list
Also update auth_tags_test.go assertions to expect nil UserID
on tagged nodes, consistent with the new invariant.
Updates #3077
Tagged nodes are owned by their tags, not a user. Enforce this
invariant at every write path:
- createAndSaveNewNode: do not set UserID for tagged PreAuthKey
registration; clear UserID when advertise-tags are applied
during OIDC/CLI registration
- SetNodeTags: clear UserID/User when tags are assigned
- processReauthTags: clear UserID/User when tags are applied
during re-authentication
- validateNodeOwnership: reject tagged nodes with non-nil UserID
- NodeStore: skip nodesByUser indexing for tagged nodes since
they have no owning user
- HandleNodeFromPreAuthKey: add fallback lookup for tagged PAK
re-registration (tagged nodes indexed under UserID(0)); guard
against nil User deref for tagged nodes in different-user check
Since tagged nodes now have user_id = NULL, ListNodesByUser
will not return them and DestroyUser naturally allows deleting
users whose nodes have all been tagged. The ON DELETE CASCADE
FK cannot reach tagged nodes through a NULL foreign key.
Also tone down shouty comments throughout state.go.
Fixes#3077
Tagged nodes are owned by their tags, not a user. Previously
user_id was kept as "created by" tracking, but this prevents
deleting users whose nodes have all been tagged, and the
ON DELETE CASCADE FK would destroy the tagged nodes.
Add a migration that sets user_id = NULL on all existing tagged
nodes. Subsequent commits enforce this invariant at write time.
Updates #3077
Add --disable flag to "headscale nodes expire" CLI command and
disable_expiry field handling in the gRPC API to allow disabling
key expiry for nodes. When disabled, the node's expiry is set to
NULL and IsExpired() returns false.
The CLI follows the new grpcRunE/RunE/printOutput patterns
introduced in the recent CLI refactor.
Also fix NodeSetExpiry to persist directly to the database instead
of going through persistNodeToDB which omits the expiry field.
Fixes#2681
Co-authored-by: Marco Santos <me@marcopsantos.com>
Add bool disable_expiry field (field 3) to ExpireNodeRequest proto
and regenerate all protobuf, gRPC gateway, and OpenAPI files.
Fixes#2681
Co-authored-by: Marco Santos <me@marcopsantos.com>
Add end-to-end test cases to TestUnmarshalPolicy that verify bracketed
IPv6 addresses are correctly parsed through the full policy pipeline
(JSON unmarshal -> splitDestinationAndPort -> parseAlias -> parsePortRange)
and survive JSON round-trips.
Cover single port, multiple ports, wildcard port, CIDR prefix, port
range, bracketed IPv4, and hostname rejection.
Updates #2754
Headscale rejects IPv6 addresses with square brackets in ACL policy
destinations (e.g. "[fd7a:115c:a1e0::87e1]:80,443"), while Tailscale
SaaS accepts them. The root cause is that splitDestinationAndPort uses
strings.LastIndex(":") which leaves brackets on the destination string,
and netip.ParseAddr does not accept brackets.
Add a bracket-handling branch at the top of splitDestinationAndPort that
uses net.SplitHostPort for RFC 3986 parsing when input starts with "[".
The extracted host is validated with netip.ParseAddr/ParsePrefix to
ensure brackets are only accepted around IP addresses and CIDR prefixes,
not hostnames or other alias types like tags and groups.
Fixes#2754
This just fixes a small issue I noticed reading the docs: the two 'scenarios' listed in the scaling section end up showing up as a numbered list of five items, instead of the desired two items + their descriptions.
Remove dead if-resp-nil checks in tagCmd and approveRoutesCmd; gRPC
returns either an error or a valid response, never (nil, nil).
Rename HasMachineOutputFlag to hasMachineOutputFlag since it has a
single internal caller in root.go.
Add bypassDatabase() to consolidate the repeated LoadServerConfig +
NewHeadscaleDatabase pattern in getPolicy and setPolicy.
Replace os.Open + io.ReadAll with os.ReadFile in setPolicy and
checkPolicy, removing the manual file-handle management.
Move errMissingParameter from users.go to utils.go alongside the
other shared sentinel errors; the variable is referenced by
api_key.go and preauthkeys.go.
Move the Error constant-error type from debug.go to mockoidc.go,
its only consumer.
Replace five hardcoded "2006-01-02 15:04:05" strings with the
HeadscaleDateTimeFormat constant already defined in utils.go.
Replace two literal 10 base arguments to strconv.FormatUint with
util.Base10 to match the convention in api_key.go and nodes.go.
Flags registered on a cobra.Command cannot fail to read at runtime;
GetString/GetUint64/GetStringSlice only error when the flag name is
unknown. The error-handling blocks for these calls are unreachable
dead code.
Adopt the value, _ := pattern already used in api_key.go,
preauthkeys.go and users.go, removing ~40 lines of dead code from
nodes.go and debug.go.
The --namespace flag on nodes list/register and debug create-node was
never wired to the --user flag, so its value was silently ignored.
Remove it along with the deprecateNamespaceMessage constant.
Also remove the namespace/ns command aliases on users and
machine/machines aliases on nodes, which have been deprecated since
the naming changes in 0.23.0.
Add expirationFromFlag helper that parses the --expiration flag into a
timestamppb.Timestamp, replacing identical duration-parsing blocks in
api_key.go and preauthkeys.go.
Add apiKeyIDOrPrefix helper to validate the mutually-exclusive --id and
--prefix flags, replacing the duplicated switch block in expireAPIKeyCmd
and deleteAPIKeyCmd.
Centralise the repeated force-flag-check + YesNo-prompt logic into a
single confirmAction(cmd, prompt) helper. Callers still decide what
to return on decline (error, message, or nil).
Replace three inconsistent MarkFlagRequired error-handling styles
(stdlib log.Fatal, zerolog log.Fatal, silently discarded) with a
single mustMarkRequired helper that panics on programmer error.
Also fixes a bug where renameNodeCmd.MarkFlagRequired("new-name")
targeted the wrong command (should be renameUserCmd), making the
--new-name flag effectively never required on "headscale users rename".
Add a helper that checks the --output flag and either serialises as
JSON/YAML or invokes a table-rendering callback. This removes the
repeated format,_ := cmd.Flags().GetString("output") + if-branch from
the five list commands.
Convert the 10 commands that were still using Run with
ErrorOutput/SuccessOutput or log.Fatal/os.Exit:
- backfillNodeIPsCmd: use grpcRunE-style manual connection with
error returns; simplify the confirm/force logic
- getPolicy, setPolicy, checkPolicy: replace ErrorOutput with
fmt.Errorf returns in both the bypass-gRPC and gRPC paths
- serveCmd, configTestCmd: replace log.Fatal with error returns
- mockOidcCmd: replace log.Error+os.Exit with error return
- versionCmd, generatePrivateKeyCmd: replace SuccessOutput with
printOutput
- dumpConfigCmd: return the error instead of swallowing it
Rename grpcRun to grpcRunE: the inner closure now returns error
and the wrapper returns a cobra RunE-compatible function.
Change newHeadscaleCLIWithConfig to return an error instead of
calling log.Fatal/os.Exit, making connection failures propagate
through the normal error path.
Add formatOutput (returns error) and printOutput (writes to stdout)
as non-exiting replacements for the old output/SuccessOutput pair.
Extract output format string literals into package-level constants.
Mark the old ErrorOutput, SuccessOutput and output helpers as
deprecated; they remain temporarily for the unconverted commands.
Convert all 22 grpcRunE commands from Run+ErrorOutput+SuccessOutput
to RunE+fmt.Errorf+printOutput. Change usernameAndIDFromFlag to
return an error instead of calling ErrorOutput directly.
Update backfillNodeIPsCmd and policy.go callers of
newHeadscaleCLIWithConfig for the new 5-return signature while
keeping their Run-based pattern for now.
Set SilenceErrors and SilenceUsage on the root command so that
cobra never prints usage text for runtime errors. A SetFlagErrorFunc
callback re-enables usage output specifically for flag-parsing
errors (the kubectl pattern).
Add printError to utils.go and switch Execute() to ExecuteC() so
the returned error is formatted as JSON/YAML when --output requests
machine-readable output.
Add a grpcRun helper that wraps cobra RunFuncs, injecting a ready
gRPC client and context. The connection lifecycle (cancel, close)
is managed by the wrapper, eliminating the duplicated 3-line
boilerplate (newHeadscaleCLIWithConfig + defer cancel + defer
conn.Close) from 22 command handlers across 7 files.
Three call sites are intentionally left unconverted:
- backfillNodeIPsCmd: creates the client only after user confirmation
- getPolicy/setPolicy: conditionally use gRPC vs direct DB access
Docker 29 (shipped with runner-images 20260209.23.1) breaks docker
build via Go client libraries (broken pipe writing build context)
and docker load/save with certain tarball formats. Add Docker's
official apt repository and install docker-ce 28.5.x in all CI
jobs that interact with Docker.
See https://github.com/actions/runner-images/issues/13474
Updates #3058
Use new(users["name"]) instead of extracting to intermediate
variables that staticcheck does not recognise as used with
Go 1.26 new(value) syntax.
Updates #3058
Fix issues found by the upgraded golangci-lint:
- wsl_v5: add required whitespace in CLI files
- staticcheck SA4006: replace new(var.Field) with &localVar
pattern since staticcheck does not recognize Go 1.26
new(value) as a use of the variable
- staticcheck SA5011: use t.Fatal instead of t.Error for
nil guard checks so execution stops
- unused: remove dead ptrTo helper function
Use GORM AutoMigrate instead of raw SQL to create the
database_versions table, since PostgreSQL does not support the
datetime type used in the raw SQL (it requires timestamp).
Add a version check that runs before database migrations to ensure
users do not skip minor versions or downgrade. This protects database
migrations and allows future cleanup of old migration code.
Rules enforced:
- Same minor version: always allowed (patch changes either way)
- Single minor upgrade (e.g. 0.27 -> 0.28): allowed
- Multi-minor upgrade (e.g. 0.25 -> 0.28): blocked with guidance
- Any minor downgrade: blocked
- Major version change: blocked
- Dev builds: warn but allow, preserve stored version
The version is stored in a purpose-built database_versions table
after migrations succeed. The table is created with raw SQL before
gormigrate runs to avoid circular dependencies.
Updates #3058
Replace tiangolo/issue-manager with custom logic that distinguishes
bot comments from human responses. The issue-manager action treated
all comments equally, so the bot's own instruction comment would
trigger label removal on the next scheduled run.
Split into two jobs:
- remove-label-on-response: triggers on issue_comment from non-bot
users, removes the needs-more-info label immediately
- close-stale: runs on daily schedule, uses nushell to iterate open
needs-more-info issues, checks for human comments after the label
was added, and closes after 3 days with no response
Remove the `issues: labeled` trigger from the timer workflow.
When both workflows triggered on label addition, the comment workflow
would post the bot comment, and by the time the timer workflow ran,
issue-manager would see "a comment was added after the label" and
immediately remove the label due to `remove_label_on_comment: true`.
The timer workflow now only runs on:
- Daily cron (to close stale issues)
- issue_comment (to remove label when humans respond)
- workflow_dispatch (for manual testing)
Add workflow that automatically closes issues labeled as
support-request with a message directing users to Discord
for configuration and support questions.
The workflow:
- Triggers when support-request label is added
- Posts a comment explaining this tracker is for bugs/features
- Links to documentation and Discord
- Closes the issue as "not planned"
Add GitHub Actions automation that helps manage issues requiring
additional information from reporters:
- Post an instruction comment when 'needs-more-info' label is added,
requesting environment details, debug logs from multiple nodes,
configuration files, and proper formatting
- Automatically remove the label when anyone comments
- Close the issue after 3 days if no response is provided
- Exempt needs-more-info labeled issues from the stale bot
The instruction comment includes guidance on:
- Required environment and debug information
- Collecting logs from both connecting and connected-to nodes
- Proper redaction rules (replace consistently, never remove IPs)
- Formatting requirements for attachments and Markdown
- Encouragement to discuss on Discord before filing issues
This commit upgrades the codebase from Go 1.25.5 to Go 1.26rc2 and
adopts new language features.
Toolchain updates:
- go.mod: go 1.25.5 → go 1.26rc2
- flake.nix: buildGo125Module → buildGo126Module, go_1_25 → go_1_26
- flake.nix: build golangci-lint from source with Go 1.26
- Dockerfile.integration: golang:1.25-trixie → golang:1.26rc2-trixie
- Dockerfile.tailscale-HEAD: golang:1.25-alpine → golang:1.26rc2-alpine
- Dockerfile.derper: golang:alpine → golang:1.26rc2-alpine
- .goreleaser.yml: go mod tidy -compat=1.25 → -compat=1.26
- cmd/hi/run.go: fallback Go version 1.25 → 1.26rc2
- .pre-commit-config.yaml: simplify golangci-lint hook entry
Code modernization using Go 1.26 features:
- Replace tsaddr.SortPrefixes with slices.SortFunc + netip.Prefix.Compare
- Replace ptr.To(x) with new(x) syntax
- Replace errors.As with errors.AsType[T]
Lint rule updates:
- Add forbidigo rules to prevent regression to old patterns
Update the 0.29.0 changelog entry to document the minimum
supported Tailscale client version (v1.76.0), which corresponds
to capability version 106 based on the 10-version support window.
Errors should not start capitalised and they should not contain the word error
or state that they "failed" as we already know it is an error
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
Go style recommends that log messages and error strings should not be
capitalized (unless beginning with proper nouns or acronyms) and should
not end with punctuation.
This change normalizes all zerolog .Msg() and .Msgf() calls to start
with lowercase letters, following Go conventions and making logs more
consistent across the codebase.
Replace raw string field names with zf constants in state.go and
db/node.go for consistent, type-safe logging.
state.go changes:
- User creation, hostinfo validation, node registration
- Tag processing during reauth (processReauthTags)
- Auth path and PreAuthKey handling
- Route auto-approval and MapRequest processing
db/node.go changes:
- RegisterNodeForTest logging
- Invalid hostname replacement logging
Add sub-logger patterns to worker(), AddNode(), RemoveNode() and
multiChannelNodeConn to eliminate repeated field calls. Use zf.*
constants for consistent field naming.
Changes in batcher_lockfree.go:
- Add wlog sub-logger in worker() with worker.id context
- Add log field to multiChannelNodeConn struct
- Initialize mc.log with node.id in newMultiChannelNodeConn()
- Add nlog sub-loggers in AddNode() and RemoveNode()
- Update all connection methods to use mc.log
Changes in batcher.go:
- Use zf.NodeID and zf.Reason in handleNodeChange()
Replace manual field extraction with EmbedObject for node logging
in gRPC handlers. Use zf.* constants for consistent field naming.
Changes:
- RegisterNode: use EmbedObject(node), zf.RegistrationKey, etc.
- SetTags: use EmbedObject(node)
- ExpireNode: use EmbedObject(node), zf.ExpiresAt
- RenameNode: use EmbedObject(node), zf.NewName
- SetApprovedRoutes: use zf.NodeID
Add sub-logger pattern to SetRoutes() to eliminate repeated node.id
field calls. Replace raw strings with zf.* constants throughout
the primary routes code for consistent field naming.
Changes:
- Add nlog sub-logger in SetRoutes() with node.id context
- Replace "prefix" with zf.Prefix
- Replace "changed" with zf.Changes
- Replace "newState" with zf.NewState
- Replace "finalState" with zf.FinalState
Replace the helper functions (logf, infof, tracef, errf) with a
zerolog sub-logger initialized in newMapSession(). The sub-logger
is pre-populated with session context (component, node, omitPeers,
stream) eliminating repeated field calls throughout the code.
Changes:
- Add log field to mapSession struct
- Initialize sub-logger with EmbedObject(node) and request context
- Remove logf/infof/tracef/errf helper functions
- Update all callers to use m.log.Level().Caller()... pattern
- Update noise.go to use sess.log instead of sess.tracef
This reduces code by ~20 lines and eliminates ~15 repeated field
calls per log statement.
Add a lint rule to enforce use of zf.* constants for zerolog field
names instead of inline string literals. This catches at lint time
any new code that doesn't follow the convention.
The rule matches common zerolog field methods (Str, Int, Bool, etc.)
and flags any usage with a string literal first argument.
Add hscontrol/util/zlog package with:
- zf subpackage: field name constants for compile-time safety
- SafeHostinfo: wrapper that redacts device fingerprinting data
- SafeMapRequest: wrapper that redacts client endpoints
The zf (zerolog fields) subpackage provides short constant names
(e.g., zf.NodeID instead of inline "node.id" strings) ensuring
consistent field naming across all log statements.
Security considerations:
- SafeHostinfo never logs: OSVersion, DeviceModel, DistroName
- SafeMapRequest only logs endpoint counts, not actual IPs
Update integration test expectations to match current policy behavior:
1. IPProto defaults include all four protocols (TCP, UDP, ICMPv4,
ICMPv6) for port-range ACL rules, not just TCP and UDP.
2. Filter rules with identical SrcIPs and IPProto are now merged
into a single rule with combined DstPorts, so the subnet router
receives one filter rule instead of two.
Updates #3036
According to Tailscale SaaS behavior, autogroup:internet is handled
by exit node routing via AllowedIPs, not by packet filtering. ACL
rules with autogroup:internet as destination should produce no
filter rules for any node.
Previously, Headscale expanded autogroup:internet to public CIDR
ranges and distributed filters to exit nodes (because 0.0.0.0/0
"covers" internet destinations). This was incorrect.
Add detection for AutoGroupInternet in filter compilation to skip
filter generation for this autogroup. Update test expectations
accordingly.
Fix two compatibility issues discovered in Tailscale SaaS testing:
1. Wildcard DstPorts format: Headscale was expanding wildcard
destinations to CGNAT ranges (100.64.0.0/10, fd7a:115c:a1e0::/48)
while Tailscale uses {IP: "*"} directly. Add detection for
wildcard (Asterix) alias type in filter compilation to use the
correct format.
2. proto:icmp handling: The "icmp" protocol name was returning both
ICMPv4 (1) and ICMPv6 (58), but Tailscale only returns ICMPv4.
Users should use "ipv6-icmp" or protocol number 58 explicitly
for IPv6 ICMP.
Update all test expectations accordingly. This significantly reduces
test file line count by replacing duplicated CGNAT range patterns
with single wildcard entries.
Update test expectations across policy tests to expect merged
FilterRule entries instead of separate ones. Tests now expect:
- Single FilterRule with combined DstPorts for same source
- Reduced matcher counts for exit node tests
Updates #3036
Tailscale merges multiple ACL rules into fewer FilterRule entries
when they have identical SrcIPs and IPProto, combining their DstPorts
arrays. This change implements the same behavior in Headscale.
Add mergeFilterRules() which uses O(n) hash map lookup to merge rules
with identical keys. DstPorts are NOT deduplicated to match Tailscale
behavior.
Also fix DestsIsTheInternet() to handle merged filter rules where
TheInternet is combined with other destinations - now uses superset
check instead of equality check.
Updates #3036
Change Asterix.Resolve() to use Tailscale's CGNAT range (100.64.0.0/10)
and ULA range (fd7a:115c:a1e0::/48) instead of all IPs (0.0.0.0/0 and
::/0).
This better matches Tailscale's security model where wildcard (*) means
"any node in the tailnet" rather than literally "any IP address on the
internet".
Updates #3036
Updates #3036
Tailscale validates that autogroup:self destinations in ACL rules can
only be used when ALL sources are users, groups, autogroup:member, or
wildcard (*). Previously, Headscale only performed this validation for
SSH rules.
Add validateACLSrcDstCombination() to enforce that tags, autogroup:tagged,
hosts, and raw IPs cannot be used as sources with autogroup:self
destinations. Invalid policies like `tag:client → autogroup:self:*` are
now rejected at validation time, matching Tailscale behavior.
Wildcard (*) is allowed because autogroup:self evaluation narrows it
per-node to only the node's own IPs.
Updates #3036
When ACL rules don't specify a protocol, Headscale now defaults to
[TCP, UDP, ICMP, ICMPv6] instead of just [TCP, UDP], matching
Tailscale's behavior.
Also export protocol number constants (ProtocolTCP, ProtocolUDP, etc.)
for use in external test packages, renaming the string protocol
constants to ProtoNameTCP, ProtoNameUDP, etc. to avoid conflicts.
This resolves 78 ICMP-related TODOs in the Tailscale compatibility
tests, reducing the total from 165 to 87.
Updates #3036
Add extensive test coverage verifying Headscale's ACL policy behavior
matches Tailscale's coordination server. Tests cover:
- Source/destination resolution for users, groups, tags, hosts, IPs
- autogroup:member, autogroup:tagged, autogroup:self behavior
- Filter rule deduplication and merging semantics
- Multi-rule interaction patterns
- Error case validation
Key behavioral differences documented:
- Headscale creates separate filter entries per ACL rule; Tailscale
merges rules with identical sources
- Headscale deduplicates Dsts within a rule; Tailscale does not
- Headscale does not validate autogroup:self source restrictions for
ACL rules (only SSH rules); Tailscale rejects invalid sources
Tests are based on real Tailscale coordination server responses
captured from a test environment with 5 nodes (1 user-owned, 4 tagged).
Updates #3036
Skip autogroup:self destination processing for tagged nodes since they
can never match autogroup:self (which only applies to user-owned nodes).
Also reorder the IsTagged() check to short-circuit before accessing
User() to avoid potential nil pointer access on tagged nodes.
Updates #3036
Add TestTagsAuthKeyConvertToUserViaCLIRegister that reproduces the
exact panic from #3038: register a node with a tags-only PreAuthKey
(no user), force reauth with empty tags, then register via CLI with
a user. The mapper panics on node.Owner().Model().ID when User is nil.
The critical detail is using a tags-only PreAuthKey (User: nil). When
the key is created under a user, the node inherits the User pointer
from createAndSaveNewNode and the bug is masked.
Also add Owner() validity assertions to the existing unit test
TestTaggedNodeWithoutUserToDifferentUser to catch the nil pointer
at the unit test level.
Updates #3038
processReauthTags sets UserID when converting a tagged node to
user-owned, but does not set the User pointer. When the node was
registered with a tags-only PreAuthKey (User: nil), the in-memory
NodeStore cache holds a node with User=nil. The mapper's
generateUserProfiles then calls node.Owner().Model().ID, which
dereferences the nil pointer and panics.
Set node.User alongside node.UserID in processReauthTags. Also add
defensive nil checks in generateUserProfiles to gracefully handle
nodes with invalid owners rather than panicking.
Fixes#3038
These were thin wrappers around applyAuthNodeUpdate that only added
logging. Move the logging into applyAuthNodeUpdate and call it directly
from HandleNodeFromAuthPath.
This simplifies the code structure without changing behavior.
Updates #3038
Move tag validation before the UpdateNode callback in applyAuthNodeUpdate.
Previously, tag validation happened inside the callback, and the error
check occurred after UpdateNode had already committed changes to the
NodeStore. This left the NodeStore in an inconsistent state when tags
were rejected.
Now validation happens first, and UpdateNode is only called when we know
the operation will succeed. This follows the principle that UpdateNode
should only be called when we have all information and are ready to commit.
Also extract validateRequestTags as a reusable function and use it in
createAndSaveNewNode to deduplicate the tag validation logic.
Updates #3038
Updates #3048
Previously, expiry handling ran BEFORE processReauthTags(), using the
old tagged status to determine whether to set/clear expiry. This caused:
- Personal → Tagged: Expiry remained set (should be cleared to nil)
- Tagged → Personal: Expiry remained nil (should be set from client)
Move expiry handling after tag processing and handle all four transition
cases based on the new tagged status:
- Tagged → Personal: Set expiry from client request
- Personal → Tagged: Clear expiry (tagged nodes don't expire)
- Personal → Personal: Update expiry from client
- Tagged → Tagged: Keep existing nil expiry
Fixes#3048
Reorganize HandleNodeFromAuthPath (~300 lines) into a cleaner structure
with named conditions and extracted helper functions.
Changes:
- Add authNodeUpdateParams struct for shared update logic
- Extract applyAuthNodeUpdate for common reauth/convert operations
- Extract reauthExistingNode and convertTaggedNodeToUser handlers
- Extract createNewNodeFromAuth for new node creation
- Use named boolean conditions (nodeExistsForSameUser, existingNodeIsTagged,
existingNodeOwnedByOtherUser) instead of compound if conditions
- Create logger with common fields (registration_id, user.name, machine.key,
method) to reduce log statement verbosity
Updates #3038
When a node was registered with a tags-only PreAuthKey (no user
associated), the node had User=nil and UserID=nil. When attempting to
re-register this node to a different user via HandleNodeFromAuthPath,
two issues occurred:
1. The code called oldUser.Name() without checking if oldUser was valid,
causing a nil pointer dereference panic.
2. The existing node lookup logic didn't find the tagged node because it
searched by (machineKey, userID), but tagged nodes have no userID.
This caused a new node to be created instead of updating the existing
tagged node.
Fix this by restructuring HandleNodeFromAuthPath to:
1. First check if a node exists for the same user (existing behavior)
2. If not found, check if an existing TAGGED node exists with the same
machine key (regardless of userID)
3. If a tagged node exists, UPDATE it to convert from tagged to
user-owned (preserving the node ID)
4. Only create a new node if the existing node is user-owned by a
different user
This ensures consistent behavior between:
- personal → tagged → personal (same node, same owner)
- tagged (no user) → personal (same node, new owner)
Add a test that reproduces the panic and conversion scenario by:
1. Creating a tags-only PreAuthKey (no user)
2. Registering a node with that key
3. Re-registering the same machine to a different user
4. Verifying the node ID stays the same (conversion, not creation)
Fixes#3038
Both compileFilterRules and compileSSHPolicy include .Caller() on
their resolution error log statements, but compileACLWithAutogroupSelf
does not. Add .Caller() to the three log sites (source resolution
error, destination resolution error, nil destination) for consistent
debuggability across all compilation paths.
Updates #2990
In compileSSHPolicy, when resolving other (non-autogroup:self)
destinations, the code discards the entire result on error via
`continue`. If a destination alias (e.g., a tag owned by a group
with a non-existent user) returns a partial IPSet alongside an
error, valid IPs are lost.
Both ACL compilation paths (compileFilterRules and
compileACLWithAutogroupSelf) already handle this correctly by
logging the error and using the IPSet if non-nil.
Remove the `continue` so the SSH path is consistent with the
ACL paths.
Fixes#2990
Three related issues where User().ID() is called on potentially tagged
nodes without first checking IsTagged():
1. compileACLWithAutogroupSelf: the autogroup:self block at line 166
lacks the !node.IsTagged() guard that compileSSHPolicy already has.
If a tagged node is the compilation target, node.User().ID() may
panic. Tagged nodes should never participate in autogroup:self.
2. compileSSHPolicy: the IsTagged() check is on the right side of &&,
so n.User().ID() evaluates first and may panic before short-circuit
can prevent it. Swap to !n.IsTagged() && n.User().ID() == ... to
match the already-correct order in compileACLWithAutogroupSelf.
3. invalidateAutogroupSelfCache: calls User().ID() at ~10 sites
without IsTagged() guards. Tagged nodes don't participate in
autogroup:self, so they should be skipped when collecting affected
users and during cache lookup. Tag status transitions are handled
by using the non-tagged version's user ID.
Fixes#2990
In compileACLWithAutogroupSelf, when a group contains a non-existent
user, Group.Resolve() returns a partial IPSet (with IPs from valid
users) alongside an error. The code was discarding the entire result
via `continue`, losing valid IPs. The non-autogroup-self path
(compileFilterRules) already handles this correctly by logging the
error and using the IPSet if non-empty.
Remove the `continue` on error for both source and destination
resolution, matching the existing behavior in compileFilterRules.
Also reorder the IsTagged check before User().ID() comparison
in the same-user node filter to prevent nil dereference on tagged
nodes that have no User set.
Fixes#2990
Add test reproducing the exact scenario from issue #2990 where:
- One user (user1) in group:admin
- node1: user device (not tagged)
- node2: tagged with tag:admin, same user
The test verifies that peer visibility and packet filters are correct.
Updates #2990
Previously, nodes with empty filter rules (e.g., tagged servers that are
only destinations, never sources) were skipped entirely in BuildPeerMap.
This could cause visibility issues when using autogroup:self with
multiple user groups.
Remove the len(filter) == 0 skip condition so all nodes are included in
nodeMatchers. Empty filters result in empty matchers where CanAccess()
returns false, but the node still needs to be in the map so symmetric
visibility works correctly: if node A can access node B, both should see
each other regardless of B's filter rules.
Add comprehensive tests for:
- Multi-group scenarios where autogroup:self is used by privileged users
- Nodes with empty filters remaining visible to authorized peers
- Combined access rules (autogroup:self + tags in same rule)
Updates #2990
When a PreAuthKey is deleted, the database correctly sets auth_key_id
to NULL on referencing nodes via ON DELETE SET NULL. However, the
NodeStore (in-memory cache) retains the old AuthKeyID value.
When nodes send MapRequests (e.g., after tailscaled restart), GORM's
Updates() tries to persist the stale AuthKeyID, causing a foreign key
constraint error when trying to reference a deleted PreAuthKey.
Fix this by adding AuthKeyID and AuthKey to the Omit() call in all
three places where nodes are updated via GORM's Updates():
- persistNodeToDB (MapRequest processing)
- HandleNodeFromAuthPath (re-auth via web/OIDC)
- HandleNodeFromPreAuthKey (re-registration with preauth key)
This tells GORM to never touch the auth_key_id column or AuthKey
association during node updates, letting the database handle the
foreign key relationship correctly.
Added TestDeletedPreAuthKeyNotRecreatedOnNodeUpdate to verify that
deleted PreAuthKeys are not recreated when nodes send MapRequests.
Update integration tests to use valid SSH patterns:
- TestSSHOneUserToAll: use autogroup:member and autogroup:tagged
instead of wildcard destination
- TestSSHMultipleUsersAllToAll: use autogroup:self instead of
username destinations for group-to-user SSH access
Updates #3009
Updates #3010
Update unit tests to use valid SSH patterns that conform to Tailscale's
security model:
- Change group->user destinations to group->tag
- Change tag->user destinations to tag->tag
- Update expected error messages for new validation format
- Add proper tagged/untagged node setup in filter tests
Updates #3009
Updates #3010
Add validation for SSH source/destination combinations that enforces
Tailscale's security model:
- Tags/autogroup:tagged cannot SSH to user-owned devices
- autogroup:self destination requires source to contain only users/groups
- Username destinations require source to be that same single user only
- Wildcard (*) is no longer supported as SSH destination; use
autogroup:member or autogroup:tagged instead
The validateSSHSrcDstCombination() function is called during policy
validation to reject invalid configurations at load time.
Fixes#3009Fixes#3010
Refactor the RequestTags migration (202601121700-migrate-hostinfo-request-tags)
to use PolicyManager.NodeCanHaveTag() instead of reimplementing tag validation.
Changes:
- NewHeadscaleDatabase now accepts *types.Config to allow migrations
access to policy configuration
- Add loadPolicyBytes helper to load policy from file or DB based on config
- Add standalone GetPolicy(tx *gorm.DB) for use during migrations
- Replace custom tag validation logic with PolicyManager
Benefits:
- Full HuJSON parsing support (not just JSON)
- Proper group expansion via PolicyManager
- Support for nested tags and autogroups
- Works with both file and database policy modes
- Single source of truth for tag validation
Co-Authored-By: Shourya Gautam <shouryamgautam@gmail.com>
When autogroup:self was combined with other ACL rules (e.g., group:admin
-> *:*), tagged nodes became invisible to users who should have access.
The BuildPeerMap function had two code paths:
- Global filter path: used symmetric OR logic (if either can access, both
see each other)
- Autogroup:self path: used asymmetric logic (only add peer if that
specific direction has access)
This caused problems with one-way rules like admin -> tagged-server. The
admin could access the server, but since the server couldn't access the
admin, neither was added to the other's peer list.
Fix by using symmetric visibility in the autogroup:self path, matching
the global filter path behavior: if either node can access the other,
both should see each other as peers.
Credit: vdovhanych <vdovhanych@users.noreply.github.com>
Fixes#2990
Extend TestApiKeyCommand to test the new --id flag for expire and
delete commands, verifying that API keys can be managed by their
database ID in addition to the existing --prefix method.
Updates #2986
Add --id flag as an alternative to --prefix for expiring and
deleting API keys. This allows users to use the ID shown in
'headscale apikeys list' output, which is more convenient than
the prefix.
Either --id or --prefix must be provided; both flags are optional
but at least one is required.
Updates #2986
Update ExpireApiKey and DeleteApiKey handlers to accept either ID or
prefix for identifying the API key. Returns InvalidArgument error if
neither or both are provided.
Add tests for:
- Expire by ID
- Expire by prefix (backwards compatibility)
- Delete by ID
- Delete by prefix (backwards compatibility)
- Error when neither ID nor prefix provided
- Error when both ID and prefix provided
Updates #2986
Add id field to ExpireApiKeyRequest and DeleteApiKeyRequest messages.
This allows API keys to be expired or deleted by their database ID
in addition to the existing prefix-based lookup.
Updates #2986
Add GetAPIKeyByID method to the state layer, delegating to the existing
database layer function. This enables API key lookup by ID in addition
to the existing prefix-based lookup.
Updates #2986
Convert config loading tests from gopkg.in/check.v1 Suite-based testing
to standard Go tests with testify assert/require.
Changes:
- Remove Suite boilerplate (Test, Suite type, SetUpSuite, TearDownSuite)
- Convert TestConfigFileLoading and TestConfigLoading to standalone tests
- Replace check assertions with testify equivalents
Migrate all database tests from gopkg.in/check.v1 Suite-based testing
to standard Go tests with testify assert/require.
Changes:
- Remove empty Suite files (hscontrol/suite_test.go, hscontrol/mapper/suite_test.go)
- Convert hscontrol/db/suite_test.go to modern helpers only
- Convert 6 Suite test methods in node_test.go to standalone tests
- Convert 5 Suite test methods in api_key_test.go to standalone tests
- Fix stale global variable reference in db_test.go
The legacy TestListPeers Suite method was renamed to TestListPeersManyNodes
to avoid conflict with the existing modern TestListPeers function, as they
test different aspects (basic peer listing vs ID filtering).
Make DeleteUser call updatePolicyManagerUsers() to refresh the policy
manager's cached user list after user deletion. This ensures consistency
with CreateUser, UpdateUser, and RenameUser which all update the policy
manager.
Previously, DeleteUser only removed the user from the database without
updating the policy manager. This could leave stale user references in
the cached user list, potentially causing issues when policy is
re-evaluated.
The gRPC handler now uses the change returned from DeleteUser instead of
manually constructing change.UserRemoved().
Fixes#2967
Add DeleteUser method to ControlServer interface and implement it in
HeadscaleInContainer to enable testing user deletion scenarios.
Add two integration tests for issue #2967:
- TestACLGroupWithUnknownUser: tests that valid users can communicate
when a group references a non-existent user
- TestACLGroupAfterUserDeletion: tests connectivity after deleting a
user that was referenced in an ACL group
These tests currently pass but don't fully reproduce the reported issue
where deleted users break connectivity for the entire group.
Updates #2967
Replace the Tags column with an Owner column that displays:
- Tags (newline-separated) if the key has ACL tags
- User name if the key is associated with a user
- Dash (-) if neither is present
This aligns the CLI output with the tags-as-identity model where
preauthkeys can be created with either tags or user ownership.
- Rename TestTagsAuthKeyWithoutUserIgnoresAdvertisedTags to
TestTagsAuthKeyWithoutUserRejectsAdvertisedTags to reflect actual
behavior (PreAuthKey registrations reject advertised tags)
- Fix TestTagsAuthKeyWithoutUserInheritsTags to use ListNodes() without
user filter since tags-only nodes don't have a user association
Updates #2977
HandleNodeFromPreAuthKey assumed pak.User was always set, but
tags-only PreAuthKeys have nil User. This caused nil pointer
dereference when registering nodes with tags-only keys.
Also updates integration tests to use GetTags() instead of the
removed GetValidTags() method.
Updates #2977
When SetNodeTags changed a node's tags, the node's self view wasn't
updated. The bug manifested as: the first SetNodeTags call updates
the server but the client's self view doesn't update until a second
call with the same tag.
Root cause: Three issues combined to prevent self-updates:
1. SetNodeTags returned PolicyChange which doesn't set OriginNode,
so the mapper's self-update check failed.
2. The Change.Merge function didn't preserve OriginNode, so when
changes were batched together, OriginNode was lost.
3. generateMapResponse checked OriginNode only in buildFromChange(),
but PolicyChange uses RequiresRuntimePeerComputation which
bypasses that code path entirely and calls policyChangeResponse()
instead.
The fix addresses all three:
- state.go: Set OriginNode on the returned change
- change.go: Preserve OriginNode (and TargetNode) during merge
- batcher.go: Pass isSelfUpdate to policyChangeResponse so the
origin node gets both self info AND packet filters
- mapper.go: Add includeSelf parameter to policyChangeResponse
Fixes#2978
Update 8 tests that involve admin tag assignment via SetNodeTags()
to verify both server-side state and node self view updates:
- TestTagsAuthKeyWithTagAdminOverrideReauthPreserves
- TestTagsAuthKeyWithTagCLICannotModifyAdminTags
- TestTagsAuthKeyWithoutTagCLINoOpAfterAdminWithReset
- TestTagsAuthKeyWithoutTagCLINoOpAfterAdminWithEmptyAdvertise
- TestTagsAuthKeyWithoutTagCLICannotReduceAdminMultiTag
- TestTagsUserLoginCLINoOpAfterAdminAssignment
- TestTagsUserLoginCLICannotRemoveAdminTags
- TestTagsAdminAPICanSetUnownedTag
Each test now validates that tag updates propagate to the node's
own self view using assertNodeSelfHasTagsWithCollect, addressing
the issue #2978 scenario where tag changes were observed to
propagate to peers but not to the node itself.
Updates #2978
Add TestTagsIssue2978ReproTagReplacement that specifically tests the
scenario from issue #2978:
- Register node with tag:foo via web auth with --advertise-tags
- Admin changes tag to tag:bar via SetNodeTags
- Verify client's self view updates (not just server-side)
The test performs multiple tag replacements with timing checks to
verify whether tag updates propagate to the node's self view after
the first call (fixed behavior) or only after a redundant second
call (bug behavior).
Add helper functions for test validation:
- assertNodeSelfHasTagsWithCollect: validates client's status.Self.Tags
- assertNetmapSelfHasTagsWithCollect: validates client's netmap.SelfNode.Tags
Updates #2978
Add TestTagsUserLoginReauthWithEmptyTagsRemovesAllTags to validate that
nodes can be untagged via `tailscale up --advertise-tags= --force-reauth`.
The test verifies:
- Node starts with tags and is owned by tagged-devices
- After reauth with empty tags, all tags are removed
- Node ownership returns to the authenticating user
Updates #2979
When a node re-authenticates via OIDC/web auth with empty RequestTags
(from `tailscale up --advertise-tags= --force-reauth`), remove all tags
and return ownership to the authenticating user.
This allows nodes to transition from any tagged state (including nodes
originally registered with a tagged pre-auth key) back to user-owned.
Fixes#2979
Extends #2971 fix to also cover nodes that authenticate as users but
become tagged immediately via --advertise-tags. When RequestTags are
approved by policy, the node's expiry is now disabled, consistent with
nodes registered via tagged PreAuthKeys.
Nodes registered with tagged PreAuthKeys now have key expiry disabled,
matching Tailscale's behavior. User-owned nodes continue to use the
client-requested expiry.
On re-authentication, tagged nodes preserve their disabled expiry while
user-owned nodes can update their expiry from the client request.
Fixes#2971
Just the tags tag:router and tag:exit are owned by alice. Upon join,
those nodes will have their ownership transferred from alice to the
system user "tagged-devices".
User.Proto() was returning u.Name directly, which is empty for OIDC
users who have their identifier in the Email field instead. This caused
"headscale nodes list" to show empty user names for OIDC-authenticated
nodes.
Only fall back to Username() when Name is empty, which provides a
display-friendly identifier (Email > ProviderIdentifier > ID). This
ensures OIDC users display their email while CLI users retain their
original Name.
Fixes#2972
Simplifies the API by exposing a single tags field instead of three
separate fields for forced, invalid, and valid tags. The distinction
between these was an internal implementation detail that should not
be exposed in the public API.
Marks fields 18-20 as reserved to prevent field number reuse.
Update documentation to reflect the new concurrent test execution
capabilities and add guidance on run ID isolation.
AGENTS.md:
- Add examples for running multiple tests concurrently
- Document run ID format and container naming conventions
- Update "Critical Notes" to explain isolation mechanisms
.claude/agents/headscale-integration-tester.md:
- Add "Concurrent Execution and Run ID Isolation" section
- Document forbidden and safe operations for cleanup
- Add "Agent Session Isolation Rules" for multi-agent environments
- Add 6th core responsibility about concurrent execution awareness
- Add ISOLATION PRINCIPLE to critical principles
- Update pre-test cleanup documentation
Remove the concurrent test prevention logic and update cleanup to use
run ID-based isolation, allowing multiple tests to run simultaneously.
Changes:
- cleanup: Add killTestContainersByRunID() to clean only containers
belonging to a specific run, add cleanupStaleTestContainers() to
remove only stopped/exited containers without affecting running tests
- docker: Remove RunningTestInfo, checkForRunningTests(), and related
error types, update cleanupAfterTest() to use run ID-based cleanup
- run: Remove Force flag and concurrent test prevention check
The test runner now:
- Allows multiple concurrent test runs on the same Docker daemon
- Cleans only stale containers before tests (not running ones)
- Cleans only containers with matching run ID after tests
- Prints run ID and monitoring info for operator visibility
Add run ID-based isolation to container naming and network setup to
enable multiple integration tests to run concurrently on the same
Docker daemon without conflicts.
Changes:
- hsic: Add run ID prefix to headscale container names and use dynamic
port allocation for metrics endpoint (port 0 lets kernel assign)
- tsic: Add run ID prefix to tailscale container names
- dsic: Add run ID prefix to DERP container names
- scenario: Use run ID-aware test suite container name for network setup
Container naming now follows: {type}-{runIDShort}-{identifier}-{hash}
Example: ts-mdjtzx-1-74-fgdyls, hs-mdjtzx-pingallbyip-abc123
The run ID is obtained from HEADSCALE_INTEGRATION_RUN_ID environment
variable via dockertestutil.GetIntegrationRunID().
This commit replaces the ChangeSet with a simpler bool based
change model that can be directly used in the map builder to
build the appropriate map response based on the change that
has occured. Previously, we fell back to sending full maps
for a lot of changes as that was consider "the safe" thing to
do to ensure no updates were missed.
This was slightly problematic as a node that already has a list
of peers will only do full replacement of the peers if the list
is non-empty, meaning that it was not possible to remove all
nodes (if for example policy changed).
Now we will keep track of last seen nodes, so we can send remove
ids, but also we are much smarter on how we send smaller, partial
maps when needed.
Fixes#2389
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
This commit adds tests to validate that there are
issues with how we propagate tag changes in the system.
This replicates #2389
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
This commit changes so that node changes to the policy is
calculated if any of the nodes has changed in a way that might
affect the policy.
Previously we just checked if the number of nodes had changed,
which meant that if a node was added and removed, we would be
in a bad state.
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
This PR restructures the integration tests and prebuilds all common assets used in all tests:
Headscale and Tailscale HEAD image
hi binary that is used to run tests
go cache is warmed up for compilation of the test
This essentially means we spend 6-10 minutes building assets before any tests starts, when that is done, all tests can just sprint through.
It looks like we are saving 3-9 minutes per test, and since we are limited to running max 20 concurrent tests across the repo, that means we had a lot of double work.
There is currently 113 checks, so we have to do five runs of 20, and the saving should be quite noticeable! I think the "worst case" saving would be 20+min and "best case" probably towards an hour.
Fixes#2927
In v0.27.0, the list-routes command with -i flag and -o json output
was returning all nodes instead of just the specified node.
The issue was that JSON output was happening before the identifier
filtering logic. This change moves the JSON output to after both
the identifier filter and route existence filter are applied,
ensuring the correct filtered results are returned.
This restores the v0.26.1 behavior where:
headscale nodes list-routes -i 12 -o json
correctly returns only node 12's route information.
This PR investigates, adds tests and aims to correctly implement Tailscale's model for how Tags should be accepted, assigned and used to identify nodes in the Tailscale access and ownership model.
When evaluating in Headscale's policy, Tags are now only checked against a nodes "tags" list, which defines the source of truth for all tags for a given node. This simplifies the code for dealing with tags greatly, and should help us have less access bugs related to nodes belonging to tags or users.
A node can either be owned by a user, or a tag.
Next, to ensure the tags list on the node is correctly implemented, we first add tests for every registration scenario and combination of user, pre auth key and pre auth key with tags with the same registration expectation as observed by trying them all with the Tailscale control server. This should ensure that we implement the correct behaviour and that it does not change or break over time.
Lastly, the missing parts of the auth has been added, or changed in the cases where it was wrong. This has in large parts allowed us to delete and simplify a lot of code.
Now, tags can only be changed when a node authenticates or if set via the CLI/API. Tags can only be fully overwritten/replaced and any use of either auth or CLI will replace the current set if different.
A user owned device can be converted to a tagged device, but it cannot be changed back. A tagged device can never remove the last tag either, it has to have a minimum of one.
This PR changes tags to be something that exists on nodes in addition to users, to being its own thing. It is part of moving our tags support towards the correct tailscale compatible implementation.
There are probably rough edges in this PR, but the intention is to get it in, and then start fixing bugs from 0.28.0 milestone (long standing tags issue) to discover what works and what doesnt.
Updates #2417Closes#2619
This improves security and explicitly fails on startup when a user picks
the wrong directory to store its data.
- Run in read-only mode
- Make /var/run/headscale a read-write tmpfs
- Mount the config volume read-only
- Use the /health endpoint to check if Headscale is up
Previously we tested migrations on schemas and dumps
of old databases.
The problems with testing migrations against the schemas
is that the migration table is empty, so we try to run
migrations that are already ran on that schema, which might
blow up.
This commit removes the schema approach and just leaves all
the dumps, which include the migration table.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Document the API endpoint and the built-in swagger docs at /swagger. The
remote control docs are just a use case for gRPC - move it in the API
docs and update links to it.
The former is a no-op in the base images (45491f2c5c/scripts/debuerreotype-minimizing-config (L87-L109)), and `apt-get dist-clean` is a safer/better version of the `rm -rf /var/lib/apt/lists/*` that keeps the cryptographic bits that help prevent downgrade attacks.
Move favicon.png, style.css, and headscale.svg to hscontrol/assets/
and create a single assets.go file with all embed directives.
Update hscontrol/handlers.go and hscontrol/templates/general.go to
use the centralized assets package.
Add test to validate HTML template output consistency across all
templates (OIDC callback, registration, Windows, Apple).
Verifies all templates produce valid HTML5 with:
- Proper DOCTYPE declaration
- HTML5 lang attribute
- UTF-8 charset
- Viewport meta tag
- Semantic HTML structure
Ensures template refactoring maintains standards compliance.
Refactor template system to use go:embed for external assets and
CSS classes for styling instead of inline styles:
- general.go: Add go:embed directives for style.css and headscale.svg,
replace inline styles with CSS classes (H1, H2, H3, P, etc.),
add mdTypesetBody wrapper with Material for MkDocs styling
- apple.go, oidc_callback.go, register_web.go, windows.go:
Update to use new CSS-based helper functions (H1, H2, P, etc.)
and mdTypesetBody for consistent layout
This separates content from presentation, making templates easier
to maintain and update. All styling is now centralized in style.css
with Material for MkDocs design system.
Add design system assets for HTML templates:
- headscale.svg: Logo with optimized viewBox for proper alignment
- style.css: Material for MkDocs CSS variables and typography
- design.go: Design system constants for consistent styling
The logo viewBox is adjusted to 32.92 0 1247.08 640 to eliminate
whitespace from the original export and ensure left alignment with
text content.
Replace html/template with type-safe elem-go templating for OIDC
callback page. Improves consistency with other templates and provides
compile-time safety. All UI elements and styling preserved.
When tailscaled restarts, it sends RegisterRequest with Auth=nil and
Expiry=zero. Previously this was treated as a logout because
time.Time{}.Before(time.Now()) returns true.
Add early return in handleRegister() to detect this case and preserve
the existing node state without modification.
Fixes#2862
Changed UpdateUser and re-registration flows to use Updates() which only
writes modified fields, preventing unintended overwrites of unchanged fields.
Also updated UsePreAuthKey to use Model().Update() for single field updates
and removed unused NodeSave wrapper.
Fixes a regression introduced in v0.27.0 where node expiry times were
being reset to zero when tailscaled restarts and sends a MapRequest.
The issue was caused by using GORM's Save() method in persistNodeToDB(),
which overwrites ALL fields including zero values. When a MapRequest
updates a node (without including expiry information), Save() would
overwrite the database expiry field with a zero value.
Changed to use Updates() which only updates non-zero values, preserving
existing database values when struct pointer fields are nil.
In BackfillNodeIPs, we need to explicitly update IPv4/IPv6 fields even
when nil (to remove IPs), so we use Select() to specify those fields.
Added regression test that validates expiry is preserved after MapRequest.
Fixes#2862
Skip auth key validation for existing nodes re-registering with the same
NodeKey. Pre-auth keys are only required for initial authentication.
NodeKey rotation still requires a valid auth key as it is a security-sensitive
operation that changes the node's cryptographic identity.
Fixes#2830
When we encounter a source we cannot resolve, we skipped the whole rule,
even if some of the srcs could be resolved. In this case, if we had one user
that exists and one that does not.
In the regular policy, we log this, and still let a rule be created from what
does exist, while in the SSH policy we did not.
This commit fixes it so the behaviour is the same.
Fixes#2863
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
On Windows, if the user clicks the Tailscale icon in the system tray,
it opens a login URL in the browser.
When the login URL is opened, `state/nonce` cookies are set for that particular URL.
If the user clicks the icon again, a new login URL is opened in the browser,
and new cookies are set.
If the user proceeds with auth in the first tab,
the redirect results in a "state did not match" error.
This patch ensures that each opened login URL sets an individual cookie
that remains valid on the `/oidc/callback` page.
`TestOIDCMultipleOpenedLoginUrls` illustrates and tests this behavior.
When we fixed the issue of node visibility of nodes
that only had access to eachother because of a subnet
route, we gave all nodes access to all exit routes by
accident.
This commit splits exit nodes and subnet routes in the
access.
If a matcher indicates that the node should have access to
any part of the subnet routes, we do not remove it from the
node list.
If a matcher destination is equal to the internet, and the
target node is an exit node, we also do not remove the access.
Fixes#2784Fixes#2788
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
There are situations where the subnet routes and exit nodes
must be treated differently. This splits it so SubnetRoutes
only returns routes that are not exit routes.
It adds `IsExitRoutes` and `AllApprovedRoutes` for convenience.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Correctly identify Viper's ConfigFileNotFoundError in LoadConfig to log a warning and use defaults, unifying behavior with empty config files. Fixes fatal error when no config file is present for CLI commands relying on environment variables.
This PR addresses some consistency issues that was introduced or discovered with the nodestore.
nodestore:
Now returns the node that is being put or updated when it is finished. This closes a race condition where when we read it back, we do not necessarily get the node with the given change and it ensures we get all the other updates from that batch write.
auth:
Authentication paths have been unified and simplified. It removes a lot of bad branches and ensures we only do the minimal work.
A comprehensive auth test set has been created so we do not have to run integration tests to validate auth and it has allowed us to generate test cases for all the branches we currently know of.
integration:
added a lot more tooling and checks to validate that nodes reach the expected state when they come up and down. Standardised between the different auth models. A lot of this is to support or detect issues in the changes to nodestore (races) and auth (inconsistencies after login and reaching correct state)
This PR was assisted, particularly tests, by claude code.
- tailscale client gets a new AuthUrl and sets entry in the regcache
- regcache entry expires
- client doesn't know about that
- client always polls followup request а gets error
When user clicks "Login" in the app (after cache expiry), they visit
invalid URL and get "node not found in registration cache". Some clients
on Windows for e.g. can't get a new AuthUrl without restart the app.
To fix that we can issue a new reg id and return user a new valid
AuthUrl.
RegisterNode is refactored to be created with NewRegisterNode() to
autocreate channel and other stuff.
When the node notifier was replaced with batcher, we removed
its closing, but forgot to add the batchers so it was never
stopping node connections and waiting forever.
Fixes#2751
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
The UserInfo endpoint is always queried since 5d8a2c2.
This allows to use all OIDC related features without any extra
configuration on Authelia.
For Keycloak, its sufficient to add the groups mapper to the userinfo
endpoint.
the client will send a lot of fields as `nil` if they have
not changed. NetInfo, which is inside Hostinfo, is one of those
fields and we often would override the whole hostinfo meaning that
we would remove netinfo if it hadnt changed.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Initial work on a nodestore which stores all of the nodes
and their relations in memory with relationship for peers
precalculated.
It is a copy-on-write structure, replacing the "snapshot"
when a change to the structure occurs. It is optimised for reads,
and while batches are not fast, they are grouped together
to do less of the expensive peer calculation if there are many
changes rapidly.
Writes will block until commited, while reads are never
blocked.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Before this patch, we would send a message to each "node stream"
that there is an update that needs to be turned into a mapresponse
and sent to a node.
Producing the mapresponse is a "costly" afair which means that while
a node was producing one, it might start blocking and creating full
queues from the poller and all the way up to where updates where sent.
This could cause updates to time out and being dropped as a bad node
going away or spending too time processing would cause all the other
nodes to not get any updates.
In addition, it contributed to "uncontrolled parallel processing" by
potentially doing too many expensive operations at the same time:
Each node stream is essentially a channel, meaning that if you have 30
nodes, we will try to process 30 map requests at the same time. If you
have 8 cpu cores, that will saturate all the cores immediately and cause
a lot of wasted switching between the processing.
Now, all the maps are processed by workers in the mapper, and the number
of workers are controlable. These would now be recommended to be a bit
less than number of CPU cores, allowing us to process them as fast as we
can, and then send them to the poll.
When the poll recieved the map, it is only responsible for taking it and
sending it to the node.
This might not directly improve the performance of Headscale, but it will
likely make the performance a lot more consistent. And I would argue the
design is a lot easier to reason about.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
- Fix directory hierarchy flattening by using full paths instead of filepath.Base()
- Remove redundant container hostname prefixes from directory names
- Strip top-level directory from tar extraction to avoid nested structure
- Ensure parent directories exist before creating files
- Results in clean structure: control_logs/mapresponses/1-ts-client/file.json
Share/Contribute Headscale Zabbix Monitoring scripts and templates.
Thank you for the awesome application to everyone involved in Headscale's development!
Previously, nil regions were not properly handled. This change allows users to disable regions in DERPMaps.
Particularly useful to disable some official regions.
This patch includes some changes to the OIDC integration in particular:
- Make sure that userinfo claims are queried *before* comparing the
user with the configured allowed groups, email and email domain.
- Update user with group claim from the userinfo endpoint which is
required for allowed groups to work correctly. This is essentially a
continuation of #2545.
- Let userinfo claims take precedence over id token claims.
With these changes I have verified that Headscale works as expected
together with Authelia without the documented escape hatch [0], i.e.
everything works even if the id token only contain the iss and sub
claims.
[0]: https://www.authelia.com/integration/openid-connect/headscale/#configuration-escape-hatch
There was a bug in HA subnet router handover where we used stale node data
from the longpoll session that we handed to Connect. This meant that we got
some odd behaviour where routes would not be deactivated correctly.
This commit changes to the nodeview is used through out, and we load the
current node to be updated in the write path and then handle it all there
to be consistent.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit changes most of our (*)types.Node to
types.NodeView, which is a readonly version of the
underlying node ensuring that there is no mutations
happening in the read path.
Based on the migration, there didnt seem to be any, but the
idea here is to prevent it in the future and simplify other
new implementations.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
We are already being punished by github actions, there seem to be
little value in running all the tests for both databases, so only
run a few key tests to check postgres isnt broken.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Restructure and rewrite the OpenID Connect documentation. Start from the
most minimal configuration and describe what needs to be done both in
Headscale and the identity provider. Describe additional features such
as PKCE and authorization filters in a generic manner with examples.
Document how Headscale populates its user profile and how it relates to
OIDC claims. This is a revised version from the table in the changelog.
Document the validation rules for fields and extend known limitations.
Sort the provider specific section alphabetically and add a section for
Authelia, Authentik, Kanidm and Keycloak. Also simplify and rename Azure
to Entra ID.
Update the description for the oidc section in the example
configuration. Give a short explanation of each configuration setting.
All documentend features were tested with Headscale 0.26 (using a fresh
database each time) using the following identity providers:
* Authelia
* Authentik
* Kanidm
* Keycloak
Fixes: #2295
this commit moves all of the read and write logic, and all different parts
of headscale that manages some sort of persistent and in memory state into
a separate package.
The goal of this is to clearly define the boundry between parts of the app
which accesses and modifies data, and where it happens. Previously, different
state (routes, policy, db and so on) was used directly, and sometime passed to
functions as pointers.
Now all access has to go through state. In the initial implementation,
most of the same functions exists and have just been moved. In the future
centralising this will allow us to optimise bottle necks with the database
(in memory state) and make the different parts talking to eachother do so
in the same way across headscale components.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* cmd/hi: add integration test runner CLI tool
Add a new CLI tool 'hi' for running headscale integration tests
with Docker automation. The tool replaces manual Docker command
composition with an automated solution.
Features:
- Run integration tests in golang:1.24 containers
- Docker context detection (supports colima and other contexts)
- Test isolation with unique run IDs and isolated control_logs
- Automatic Docker image pulling and container management
- Comprehensive cleanup operations for containers, networks, images
- Docker volume caching for Go modules
- Verbose logging and detailed test artifact reporting
- Support for PostgreSQL/SQLite selection and various test flags
Usage: go run ./cmd/hi run TestPingAllByIP --verbose
The tool uses creachadair/command and flax for CLI parsing and
provides cleanup subcommands for Docker resource management.
Updates flake.nix vendorHash for new Go dependencies.
* ci: update integration tests to use hi CLI tool
Replace manual Docker command composition in GitHub Actions
workflow with the new hi CLI tool for running integration tests.
Changes:
- Replace complex docker run command with simple 'go run ./cmd/hi run'
- Remove manual environment variable setup (handled by hi tool)
- Update artifact paths for new timestamped log directory structure
- Simplify command from 15+ lines to 3 lines
- Maintain all existing functionality (postgres/sqlite, timeout, test patterns)
The hi tool automatically handles Docker context detection, container
management, volume mounting, and environment variable setup that was
previously done manually in the workflow.
* makefile: remove test integration
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* feat: add verify client config for embedded DERP
* refactor: embedded DERP no longer verify clients via HTTP
- register the `headscale://` protocol in `http.DefaultTransport` to intercept network requests
- update configuration to use a single boolean option `verify_clients`
* refactor: use `http.HandlerFunc` for type definition
* refactor: some renaming and restructuring
* chore: some renaming and fix lint
* test: fix TestDERPVerifyEndpoint
- `tailscale debug derp` use random node private key
* test: add verify clients integration test for embedded DERP server
* fix: apply code review suggestions
* chore: merge upstream changes
* fix: apply code review suggestions
---------
Co-authored-by: Kristoffer Dalby <kristoffer@dalby.cc>
* Improve map auth logic
* Bugfix
* Add comment, improve error message
* noise: make func, get by node
this commit splits the additional validation into a
separate function so it can be reused if we add more
endpoints in the future.
It swaps the check, so we still look up by NodeKey, but before
accepting the connection, we validate the known machinekey from
the db against the noise connection.
The reason for this is that when a node logs in or out, the node key
is replaced and it will no longer be possible to look it up, breaking
reauthentication.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* noise: add comment to remind future use of getAndVal
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* changelog: add entry
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Co-authored-by: Kristoffer Dalby <kristoffer@tailscale.com>
The systemd target "syslog.target" and not required because syslog is
socket activated.
The directory /var/run is usually a symlink to /run and its created by
systemd via the RuntimeDirectory=headscale option. System creates and
handles permissions, no need to manually mark it as a read-write path.
Move files for packaging outside the docs directory into its own
packaging directory. Replace the existing postinstall and postremove
scripts with Debian maintainerscripts to behave more like a typical
Debian package:
* Start and enable the headscale systemd service by default
* Does not print informational messages
* No longer stop and disable the service on updates
This package also performs migrations for all changes done in previous
package versions on upgrade:
* Set login shell to /usr/sbin/nologin
* Set home directory to /var/lib/headscale
* Migrate to system UID/GID
The package is lintian-clean with a few exceptions that are documented
as excludes and it passes puipars (both tested on Debian 12).
The following scenarious were tested on Ubuntu 22.04, Ubuntu 24.04,
Debian 11, Debian 12:
* Install
* Install same version again
* Install -> Remove -> Install
* Install -> Purge -> Install
* Purge
* Update from 0.22.0
* Update from 0.26.0
See: #2278
See: #2133Fixes: #2311
This change makes editing the generated command easier.
For example, after pasting into a terminal, the cursor position will be
near the username portion which requires editing.
* policy/v2: allow Username as ssh source
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* policy/v2: validate that no undefined group or tag is used
Fixes#2570
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* policy: fixup tests which violated tag constraing
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Link to stable and development docs in the README
* Add Tailscale SSH and autogroup:nonroot to features page
* Use @ when referencing users in policy
* Remove unmaintained headscale-webui
The project seems to be unmaintained (last commit: 2023-05-08) and it
only supports Headscale 0.22 or earlier.
* Use full image URL in container docs
This makes it easy to switch the container runtime from docker <->
podman.
* Remove version from docker-compose.yml example
This is now deprecated and yields a warning.
* Add documentation for routes
* Rename exit-node to routes and add redirects
* Add a new section on subnet routers
* Extend the existing exit-node documentation
* Describe auto approvers for subnet routers and exit nodes
* Provide ACL examples for subnet routers and exit nodes
* Describe HA and its current limitations
* Add a troubleshooting section with IP forwarding
* Update features page for 0.26
Add auto approvers and link to our documentation if available.
* Prefer the console lexer when commandline and output mixed
If we assume someone doesn't already have the required go package, they might also not have the required git package installed either, so pkg_add both of them.
* Make matchers part of the Policy interface
* Prevent race condition between rules and matchers
* Test also matchers in tests for Policy.Filter
* Compute `filterChanged` in v2 policy correctly
* Fix nil vs. empty list issue in v2 policy test
* policy/v2: always clear ssh map
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Co-authored-by: Aras Ergus <aras.ergus@tngtech.com>
Co-authored-by: Kristoffer Dalby <kristoffer@tailscale.com>
* integration: ensure route is set before node joins, reproduce
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* auth: ensure that routes are autoapproved when the node is stored
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Tailscale allows to override the local DNS settings of a node via
"Override local DNS" [1]. Restore this flag with the same config setting
name `dns.override_local_dns` but disable it by default to align it with
Tailscale's default behaviour.
Tested with Tailscale 1.80.2 and systemd-resolved on Debian 12.
With `dns.override_local_dns: false`:
```
Link 12 (tailscale0)
Current Scopes: DNS
Protocols: -DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
DNS Servers: 100.100.100.100
DNS Domain: tn.example.com ~0.e.1.a.c.5.1.1.a.7.d.f.ip6.arpa [snip]
```
With `dns.override_local_dns: true`:
```
Link 12 (tailscale0)
Current Scopes: DNS
Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
DNS Servers: 100.100.100.100
DNS Domain: tn.example.com ~.
```
[1] https://tailscale.com/kb/1054/dns#override-local-dnsFixes: #2256
* ensure final dot on node name
This ensures that nodes which have a base domain set, will have a dot appended to their FQDN.
Resolves: https://github.com/juanfont/headscale/issues/2501
* improve OIDC TTL expire test
Waiting a bit more than the TTL of the OIDC token seems to remove some flakiness of this test. This furthermore makes use of a go func safe buffer which should avoid race conditions.
* Only read relevant nodes from database in PeerChangedResponse
* Rework to ensure transactional consistency in PeerChangedResponse again
* An empty nodeIDs list should return an empty nodes list
* Add test to ListNodesSubset
* Link PR in CHANGELOG.md
* combine ListNodes and ListNodesSubset into one function
* query for all nodes in ListNodes if no parameter is given
* also add optional filtering for relevant nodes to ListPeers
* fix issue auto approve route on register bug
This commit fixes an issue where routes where not approved
on a node during registration. This cause the auto approval
to require the node to readvertise the routes.
Fixes#2497Fixes#2485
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* hsic: only set db policy if exist
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* policy: calculate changed based on policy and filter
v1 is a bit simpler than v2, it does not pre calculate the auto approver map
and we cannot tell if it is changed.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* populate serving from primary routes
Depends on #2464Fixes#2480
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* also exit
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix route update outside of connection
there was a bug where routes would not be updated if
they changed while a node was connected and it was not part of an
autoapprove.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* update expected test output, cli only shows service node
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Some endpoints in /debug send JSON data as string. Set the Content-Type
header to "application/json" which renders nicely in Firefox.
Mention the /debug route in the example configuration.
* utility iterator for ipset
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* split policy -> policy and v1
This commit split out the common policy logic and policy implementation
into separate packages.
policy contains functions that are independent of the policy implementation,
this typically means logic that works on tailcfg types and generic formats.
In addition, it defines the PolicyManager interface which the v1 implements.
v1 is a subpackage which implements the PolicyManager using the "original"
policy implementation.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* use polivyv1 definitions in integration tests
These can be marshalled back into JSON, which the
new format might not be able to.
Also, just dont change it all to JSON strings for now.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* formatter: breaks lines
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* remove compareprefix, use tsaddr version
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* remove getacl test, add back autoapprover
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* use policy manager tag handling
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* rename display helper for user
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* introduce policy v2 package
policy v2 is built from the ground up to be stricter
and follow the same pattern for all types of resolvers.
TODO introduce
aliass
resolver
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* wire up policyv2 in integration testing
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* split policy v2 tests into seperate workflow to work around github limit
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add policy manager output to /debug
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* update changelog
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* handle register auth errors
This commit handles register auth errors as the
Tailscale clients expect. It returns the error as
part of a tailcfg.RegisterResponse and not as a
http error.
In addition it fixes a nil pointer panic triggered
by not handling the errors as part of this chain.
Fixes#2434
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* changelog
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This PR switches the homegrown debug endpoint to using tsweb.Debugger, a neat toolkit with batteries included for pprof and friends, and making it easy to add additional debug info:
I've started out by adding a bunch of "introspect" endpoints
image
So users can see the acl, filter, config, derpmap and connected nodes as headscale sees them.
This helps preventing messages being sent with the wrong update type
and payload combination, and it is shorter/neater.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Separate the term "tailnet" from user and be more explicit about
providing a single tailnet.
Also be more explicit about users. Refer to "headscale users" when
mentioning commandline invocations and use the term "local users" when
discussing unix accounts.
Fixes: #2335
* add dedicated http error to propagate to user
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* classify user errors in http handlers
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* move validation of pre auth key out of db
This move separates the logic a bit and allow us to
write specific errors for the caller, in this case the web
layer so we can present the user with the correct error
codes without bleeding web stuff into a generic validate.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* update changelog
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* set state and nounce in oidc to prevent csrf
Fixes#2276
* try to fix new postgres issue
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* ensure valid tags is populated on user gets too
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* ensure forced tags are added
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* remove unused envvar in test
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* debug log auth/unauth tags in policy man
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* defer shutdown in tags test
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add tag test with groups
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add email, display name, picture to create user
Updates #2166
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add ability to set display and email to cli
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add email to test users in integration
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix issue where tags were only assigned to email, not username
Fixes#2300Fixes#2307
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* expand principles to correct login name
and if fix an issue where nodeip principles might not expand to all
relevant IPs instead of taking the first in a prefix.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix ssh unit test
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* update cli and oauth tests for users with email
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* index by test email
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix last test
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Describe both ways to add extra DNS records
* Use "extra" instead of "custom" to align with the configuration file
* Include dns.extra_records_path in the configuration file
* Add -race flag to Makefile and integration tests; fix data race in CreateTailscaleNodesInUser
* Fix data race in ExecuteCommand by using local buffers and mutex
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
* lint
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
---------
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
* Fix excess error message during writes
Fixes#2290
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* retry filewatcher on removed files
This should handled if files are deleted and added again, and for rename
scenarios.
Fixes#2289
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* test more write and remove in filewatcher
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Upgrade the use of dns.use_username_in_magic_dns or
dns_config.use_username_in_magic_dns to a fatal error and remove the
option from the example configuration and integration tests.
Fixes: #2219
Docker releases a patch release which changed the required permissions to be able to do tun devices in containers, this caused all containers to fail in tests causing us to fail all tests. This fixes it, and adds some tools for debugging in the future.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Setup mike to provide versioned builds of the documentation.
The goal is to have versioned docs for stable releases (0.23.0, 0.24.0)
and development docs that can progress along with the code. This allows
us to tailor docs to the next upcoming version as we no longer need to
care about diversion between rendered docs and the latest release.
Versions:
* development (alias: unstable) on each push to the main branch
* MAJOR.MINOR.PATCH (alias: stable, latest for the newest version)
* for each "final" release tag
* for each push to doc maintenance branches: doc/MAJOR.MINOR.PATCH
The default version should the current stable version. The doc
maintenance branches may be used to update the version specific
documentation when issues arise after a release.
This commit fixes the constraint syntax so it is both valid for
sqlite and postgres.
To validate this, I've added a new postgres testing library and a
helper that will spin up local postgres, setup a db and use it in
the constraints tests. This should also help testing db stuff in
the future.
postgres has been added to the nix dev shell and is now required
for running the unit tests.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit hardens the migration part of the OIDC from
the old username based approach to the new sub based approach
and makes it possible for the operator to opt out entirely.
Fixes#1990
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Use a trailing slash
recommended by mkdocs-material
* Update doc requirements
Let mkdocs-material resolve its imaging dependencies (cairosvg and
pillow) and fix a dependabot warning along the way.
Reference compatible versions by major.minor.
* config: loosen up BaseDomain and ServerURL checks
Requirements [here][1]:
> OK:
> server_url: headscale.com, base: clients.headscale.com
> server_url: headscale.com, base: headscale.net
>
> Not OK:
> server_url: server.headscale.com, base: headscale.com
>
> Essentially we have to prevent the possibility where the headscale
> server has a URL which can also be assigned to a node.
>
> So for the Not OK scenario:
>
> if the server is: server.headscale.com, and a node joins with the name
> server, it will be assigned server.headscale.com and that will break
> the connection for nodes which will now try to connect to that node
> instead of the headscale server.
Fixes#2210
[1]: https://github.com/juanfont/headscale/issues/2210#issuecomment-2488165187
* server_url and base_domain: re-word error message, fix a one-off bug and add a test case for the bug.
* lint
* lint again
* integration testing: add and validate build-time options for tailscale head
* fixup! integration testing: add and validate build-time options for tailscale head
integration testing: comply with linter
* fixup! fixup! integration testing: add and validate build-time options for tailscale head
integration testing: tsic.New must never return nil
* fixup! fixup! fixup! integration testing: add and validate build-time options for tailscale head
* minor fixes
* Document to either use a minimal configuration file or environment
variables to connect with a remote headscale instance.
* Document a workaround specific for headscale 0.23.0.
* Remove reference to ancient headscale version.
* Use `cli.insecure: true` or `HEADSCALE_CLI_INSECURE=1` to skip
certificate verification.
* Style and typo fixes
Ref: #2193
* Remove status from web-ui docs
Rename the title to indicate that there multiple web interfaces
available. Do not track the status of each web interface here as their
status is subject to change over time.
* Add page for third-party tools and scripts
According to 15fc6cd966
the routes `/derp/probe` and `/derp/latency-check` are the same and
different versions of the tailscale client use one or the other
endpoint.
Also handle /derp/latency-check
Fixes: #2211
* #2140 Fixed updating of hostname and givenName when it is updated in HostInfo
* #2140 Added integration tests
* #2140 Fix unit tests
* Changed IsAutomaticNameMode to GivenNameHasBeenChanged. Fixed errors in files according to golangci-lint rules
* Setup mkdocs-redirects
* Restructure existing documentation
* Move client OS support into the documentation
* Move existing Client OS support table into its own documentation page
* Link from README.md to the rendered documentation
* Document minimum Tailscale client version
* Reuse CONTRIBUTING.md" in the documentation
* Include "CONTRIBUTING.md" from the repository root
* Update FAQ and index page and link to the contributing docs
* Add configuration reference
* Add a getting started page and explain the first steps with headscale
* Use the existing "Using headscale" sections and combine them into a
single getting started guide with a little bit more explanation.
* Explain how to get help from the command line client.
* Remove duplicated sections from existing installation guides
* Document requirements and assumptions
* Document packages provided by the community
* Move deb install guide to official releases
* Move manual install guide to official releases
* Move container documentation to setup section
* Move sealos documentation to cloud install page
* Move OpenBSD docs to build from source
* Simplify DNS documentation
* Add sponsor page
* Add releases page
* Add features page
* Add help page
* Add upgrading page
* Adjust mkdocs nav
* Update wording
Use the term headscale for the project, Headscale on the beginning of a
sentence and `headscale` when refering to the CLI.
* Welcome to headscale
* Link to existing documentation in the FAQ
* Remove the goal header and use the text as opener
* Indent code block in OIDC
* Make a few pages linter compatible
Also update ignored files for prettier
* Recommend HTTPS on port 443
Fixes: #2164
* Use hosts in acl documentation
thx @efficacy38 for noticing this
Ref: #1863
* Use mkdocs-macros to set headscale version once
* Changed all the HTML into go using go-elem
Created templates package in ./hscontrol/templates.
Moved the registerWebAPITemplate into the templates package as a function to be called.
Replaced the apple and windows html files with go-elem.
* update flake
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Co-authored-by: Kristoffer Dalby <kristoffer@tailscale.com>
* make reauth test compat with tailscale head
tailscale/tailscale@1eaad7d broke our reauth test as it makes the client
retry with https/443 if it reconnects within 2 minutes.
This commit fixes this by running the test as a two part,
- with https, to confirm instant reconnect works
- with http, and a 3 min wait, to check that it work without.
The change is not a general consern as headscale in prod is ran
with https.
Updates #2164
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* sort test for stable order
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
expand user, add claims to user
This commit expands the user table with additional fields that
can be retrieved from OIDC providers (and other places) and
uses this data in various tailscale response objects if it is
available.
This is the beginning of implementing
https://docs.google.com/document/d/1X85PMxIaVWDF6T_UPji3OeeUqVBcGj_uHRM5CI-AwlY/edit
trying to make OIDC more coherant and maintainable in addition
to giving the user a better experience and integration with a
provider.
remove usernames in magic dns, normalisation of emails
this commit removes the option to have usernames as part of MagicDNS
domains and headscale will now align with Tailscale, where there is a
root domain, and the machine name.
In addition, the various normalisation functions for dns names has been
made lighter not caring about username and special character that wont
occur.
Email are no longer normalised as part of the policy processing.
untagle oidc and regcache, use typed cache
This commits stops reusing the registration cache for oidc
purposes and switches the cache to be types and not use any
allowing the removal of a bunch of casting.
try to make reauth/register branches clearer in oidc
Currently there was a function that did a bunch of stuff,
finding the machine key, trying to find the node, reauthing
the node, returning some status, and it was called validate
which was very confusing.
This commit tries to split this into what to do if the node
exists, if it needs to register etc.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Update headscale user creation settings in .deb
Update the headscale user settings to:
- shell = /usr/sbin/nologin
- home-dir = /var/lib/headscale
This syncs the .deb installation behavior with the current Linux docs:
fe68f50328/docs/running-headscale-linux-manual.md (L39-L45)Fixesjuanfont/headscale#2133
* slight refactor to use existing variables.
* Fixup for HOME_DIR var
this commit denormalises the Tags related to a Pre auth key
back onto the preauthkey table and struct as a string list.
There was not really any real normalisation here as we just added
a bunch of duplicate tags with new IDs and preauthkeyIDs, lots of
GORM cermony but no actual advantage.
This work is the start to fixup tags which currently are not working
as they should.
Updates #1369
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Some commands such as `nodes delete` require user interaction and they
fail if `-it` is no supplied to `docker exec`. Use `docker exec -it` in
documentation examples to also make them work in interactive commands.
* handle control protocol through websocket
The necessary behaviour is already in place,
but the wasm build only issued GETs, and the handler was not invoked.
* get DERP-over-websocket working for wasm clients
* Prepare for testing builtin websocket-over-DERP
Still needs some way to assert that clients are connected through websockets,
rather than the TCP hijacking version of DERP.
* integration tests: properly differentiate between DERP transports
* do not touch unrelated code
* linter fixes
* integration testing: unexport common implementation of derp server scenario
* fixup! integration testing: unexport common implementation of derp server scenario
* dockertestutil/logs: remove unhelpful comment
* update changelog
---------
Co-authored-by: Csaba Sarkadi <sarkadicsa@tutanota.de>
* Rename docs/ios-client.md to docs/apple-client.md. Add instructions
for macOS; those are copied from the /apple endpoint and slightly
modified. Fix doc links in the README.
* Move infoboxes for /apple and /windows under the "Goal" section to the
top. Those should be seen by users first as they contain *their*
specific headscale URL.
* Swap order of macOS and iOS to move "Profiles" further down.
* Remove apple configuration profiles
* Remove Tailscale versions hints
* Mention /apple and /windows in the README along with their docs
See: #2096
* move logic for validating node names
this commits moves the generation of "given names" of nodes
into the registration function, and adds validation of renames
to RenameNode using the same logic.
Fixes#2121
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix double arg
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add shutdown that asserts if headscale had panics
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add test case producing 2118 panic
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* make stream shutdown if self-node has been removed
Currently we will read the node from database, and since it is
deleted, the id might be set to nil. Keep the node around and
just shutdown, so it is cleanly removed from notifier.
Fixes#2118
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Simplify /windows to the bare minimum. Also remove the
/windows/tailscale.reg endpoint as its generated file is no longer
valid for current Tailscale versions.
* Update and simplify the windows documentation accordingly.
* Add a "Unattended mode" section to the troubleshooting section
explaining how to enable "Unattended mode" in the via the Tailscale
tray icon.
* Add infobox about /windows to the docs
Tested on Windows 10, 22H2 with Tailscale 1.72.0
Replaces: #1995
See: #2096
* replace deprecated golangci-lint output format
CI was producing this kind of messages:
> [config_reader] The output format `github-actions` is deprecated, please use `colored-line-number`
* Actually lint files on CI
* Add support for service reload and sync service file
* Copy the systemd.service file to the manual linux docs and adjust the
path to the headscale binary to match with the previous documentation
blocks. Unfortunately, there seems to be no easy way to include a
file in mkdocs.
* Remove a redundant "deprecation" block. The beginning of the
documentation already states that.
* Add `ExecReload` to the systemd.service file.
Fixes: #2016
* Its called systemd
* Fix link to systemd homepage
* docs/acl: fix path to policy file
* docs/exit-node: fixup for 0.23
* Add newlines between commands to improve readability
* Use nodes instead on name
* Remove query parameter from link to Tailscale docs
* docs/remote-cli: fix formatting
* Indent blocks below line numbers to restore numbering
* Fix minor typos
* docs/reverse-proxy: remove version information
* Websocket support is always required now
* s/see detail/see details
* docs/exit-node: add warning to manual documentation
* Replace the warning section with a warning admonition
* Fix TODO link back to the regular linux documentation
* docs/openbsd: fix typos
* the database is created on-the-fly
* docs/sealos: fix typos
* docs/container: various fixes
* Remove a stray sentence
* Remove "headscale" before serve
* Indent line continuation
* Replace hardcoded 0.22 with <VERSION>
* Fix path in debug image to /ko-app/headscale
Fixes: #1822
aa
* Fix KeyExpiration when a zero time value has a timezone
When a zero time value is loaded from JSON or a DB in a way that
assigns it the local timezone, it does not roudtrip in JSON as a
value for which IsZero returns true. This causes KeyExpiry to be
treated as a far past value instead of a nilish value.
See https://github.com/golang/go/issues/57040
* Fix whitespace
* Ensure that postgresql is used for all tests when env var is set
* Pass through value of HEADSCALE_INTEGRATION_POSTGRES env var
* Add option to set timezone on headscale container
* Add test for registration with auth key in alternate timezone
* validate policy against nodes, error if not valid
this commit aims to improve the feedback of "runtime" policy
errors which would only manifest when the rules are compiled to
filter rules with nodes.
this change will in;
file-based mode load the nodes from the db and try to compile the rules on
start up and return an error if they would not work as intended.
database-based mode prevent a new ACL being written to the database if
it does not compile with the current set of node.
Fixes#2073Fixes#2044
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* ensure stderr can be used in err checks
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* test policy set validation
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add new integration test to ghaction
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add back defer for cli tst
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Requiring someone to write a design doc/contribute to the feature shouldn't be a requirement for raising a feature request as users may lack the skills required to do this.
Code Rabbit is one of these new fancy LLM code review tools. I am skeptical
but we can try it for free and it might provide us with some value to let
people get feedback while waiting for other people.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
this commit changes and streamlines the dns_config into a new
key, dns. It removes a combination of outdates and incompatible
configuration options that made it easy to confuse what headscale
could and could not do, or what to expect from ones configuration.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Fix data race issues in EphemeralGarbageCollector tests
* Add defer for mutex unlock in TestEphemeralGarbageCollectorOrder
* Fix mutex unlock order in closure by updating defer placement
* reformat code
This is mostly an automated change with `make lint`.
I had to manually please golangci-lint in routes_test because of a short
variable name.
* fix start -> strategy which was wrongly corrected by linter
* replace ephemeral deletion logic
this commit replaces the way we remove ephemeral nodes,
currently they are deleted in a loop and we look at last seen
time. This time is now only set when a node disconnects and
there was a bug (#2006) where nodes that had never disconnected
was deleted since they did not have a last seen.
The new logic will start an expiry timer when the node disconnects
and delete the node from the database when the timer is up.
If the node reconnects within the expiry, the timer is cancelled.
Fixes#2006
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* use uint64 as authekyid and ptr helper in tests
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add test db helper
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add list ephemeral node func
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* schedule ephemeral nodes for removal on startup
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix gorm query for postgres
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* add godoc
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Added a new function `RegisterMethodToV1Enum()` to Node, converting the internal register method string to the corresponding V1 Enum value. Included corresponding unit test in `node_test.go` to ensure correct conversion for various register methods.
* correctly enable WAL log for sqlite
this commit makes headscale correctly enable write-ahead-log for
sqlite and adds an option to turn it on and off.
WAL is enabled by default and should make sqlite perform a lot better,
even further eliminating the need to use postgres.
It also adds a couple of other useful defaults.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* update changelog
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
most of the time we dont even check this error and checking
the string for particular errors is very flake as different
databases (sqlite and psql) use different error messages, and
some users might have it in other languages.
Fixes#1956
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This PR removes the complicated session management introduced in https://github.com/juanfont/headscale/pull/1791 which kept track of the sessions in a map, in addition to the channel already kept track of in the notifier.
Instead of trying to close the mapsession, it will now be replaced by the new one and closed after so all new updates goes to the right place.
The map session serve function is also split into a streaming and a non-streaming version for better readability.
RemoveNode in the notifier will not remove a node if the channel is not matching the one that has been passed (e.g. it has been replaced with a new one).
A new tuning parameter has been added to added to set timeout before the notifier gives up to send an update to a node.
Add a keep alive resetter so we wait with sending keep alives if a node has just received an update.
In addition it adds a bunch of env debug flags that can be set:
- `HEADSCALE_DEBUG_HIGH_CARDINALITY_METRICS`: make certain metrics include per node.id, not recommended to use in prod.
- `HEADSCALE_DEBUG_PROFILING_ENABLED`: activate tracing
- `HEADSCALE_DEBUG_PROFILING_PATH`: where to store traces
- `HEADSCALE_DEBUG_DUMP_CONFIG`: calls `spew.Dump` on the config object startup
- `HEADSCALE_DEBUG_DEADLOCK`: enable go-deadlock to dump goroutines if it looks like a deadlock has occured, enabled in integration tests.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
this directory is unmaintained and not verified, if it should be restored, it should end up
under the community docs effort.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* Add test stage to docs
Add new file with docs tets
Run only in pulls
* set explicit python version
* Revert "set explicit python version"
This reverts commit 4dd7b81f26.
* docs/requirements: update mkdocs-material
---------
Co-authored-by: ohdearaugustin <ohdearaugustin@users.noreply.github.com>
jagottsicher's fork fixed a bug in Windows implementation. While Windows may be not intended as a target platform,
some contributors may prefer it for development.
Also ran go mod tidy, thus two more unnecessary packages are removed from go.sum
This commit restructures the map session in to a struct
holding the state of what is needed during its lifetime.
For streaming sessions, the event loop is structured a
bit differently not hammering the clients with updates
but rather batching them over a short, configurable time
which should significantly improve cpu usage, and potentially
flakyness.
The use of Patch updates has been dialed back a little as
it does not look like its a 100% ready for prime time. Nodes
are now updated with full changes, except for a few things
like online status.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Fixes the issue reported in #1712. In Tailscale SaaS, ephemeral keys can be single-user or reusable. Until now, our ephemerals were only reusable. This PR makes us adhere to the .com behaviour.
A lot of things are breaking in 0.23 so instead of having this
be a long process, just rip of the plaster.
Updates #1758
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* rework docker tags
This commit tries to align the new docker tags with the old schema
A prerelease will end up with the following tags:
- unstable
- v0.23.0-alpha3
- 0.23.0.alpha3
- sha-1234adsfg
A release will end up with:
- latest
- stable
- v0.23.0
- v0.23
- v0
- 0.23.0
- 0.23
- 0
- sha-1234adsfg
All of the builds will also have a `-debug` version.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* update changelog
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* TLS documentation updates
Move "Bring your own certificates" to the top
since the letsencrypt section is now much longer, it seems wrong to
keep such a short section way down at the bottom.
Restructure "Challenge types" into separate sections
Add technical description of letsencrypt renewals
this aims to answer:
- what can be expected in terms of renewals
- what logs can be expected (none)
- how to validate that renewal happened successfully
- the reason for some of the 'acme/autocert' logs, or at least
some best-effort assumptions
* +prettier
* Add test because of issue 1604
* Add peer for routes
* Revert previous change to try different way to add peer
* Add traces
* Remove traces
* Make sure tests have IPPrefix comparator
* Get allowedIps before loop
* Remove comment
* Add composite literals :)
We currently do not have a way to clean up api keys. There may be cases
where users of headscale may generate a lot of api keys and these may
end up accumulating in the database. This commit adds the command to
delete an api key given a prefix.
When Postgres is used as the backing database for headscale,
it does not set a limit on maximum open and idle connections
which leads to hundreds of open connections to the Postgres
server.
This commit introduces the configuration variables to set those
values and also sets default while opening a new postgres connection.
This commits removes the locks used to guard data integrity for the
database and replaces them with Transactions, turns out that SQL had
a way to deal with this all along.
This reduces the complexity we had with multiple locks that might stack
or recurse (database, nofitifer, mapper). All notifications and state
updates are now triggered _after_ a database change.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix#1706 - failover should disregard disabled routes during failover
* fixe tests for failover; all current tests assume routes to be enabled
* add testcase for #1706 - failover to disabled route
* upgrade tailscale
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* make Node object use actualy tailscale key types
This commit changes the Node struct to have both a field for strings
to store the keys in the database and a dedicated Key for each type
of key.
The keys are populated and stored with Gorm hooks to ensure the data
is stored in the db.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* use key types throughout the code
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* make sure machinekey is concistently used
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* use machine key in auth url
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix web register
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* use key type in notifier
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
* fix relogin with webauth
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
---------
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit rearranges the poll handler to immediatly accept
updates and notify its peers and return, not travel down the
function for a bit. This reduces the DB calls and other
holdups that isnt necessary to send a "lite response", a
map response without peers, or accepting an endpoint update.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This field is no longer used, it was used in our old state
"algorithm" to determine if we should send an update.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit changes the internals of the mapper to
track all the changes to peers over its lifetime.
This means that it no longer depends on the database
and this should hopefully help with locks and timing issues.
When the mapper is created, it needs the current list of peers,
the world view, when the polling session was started. Then as
update changes are called, it tracks the changes and generates
responses based on its internal list.
As a side, the types.Machines and types.MachinesP, as well as
types.Machine being passed as a full struct and pointer has been
changed to always be pointers, everywhere.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Previously we did not update the packet filter
when nodes changed, which would cause new nodes
to be missing from packet filters of old nodes.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commits extends the mapper with functions for creating "delta"
MapResponses for different purposes (peer changed, peer removed, derp).
This wires up the new state management with a new StateUpdate struct
letting the poll worker know what kind of update to send to the
connected nodes.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit replaces the timestamp based state system with a new
one that has update channels directly to the connected nodes. It
will send an update to all listening clients via the polling
mechanism.
It introduces a new package notifier, which has a concurrency safe
manager for all our channels to the connected nodes.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
There was a lot of tests that actually threw a lot of errors and that did
not pass all the way because we didnt check everything. This commit should
fix all of these cases.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Also bumps tailscale version to trigger build and fixes a CLI test
that had the wrong capitalisation
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit makes a wrapper function round the normalisation requiring
"stripEmailDomain" which has to be passed in almost all functions of
headscale by loading it from Viper instead.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit allows SSH rules to be assigned to each relevant not and
by doing that allow SSH to be rejected, completing the initial SSH
support.
This commit enables SSH by default and removes the experimental flag.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit renames a bunch of files to try to make it a bit less confusing;
protocol_ is now auth as they contained registration, auth and login/out flow
protocol_.*_poll is now poll.go
api.go and other generic handlers are now handlers.go
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Prior to the code reorg, we would generate rules from the Policy and
store it on the global object. Now we generate it on the fly for each node
and this commit cleans up the old variables to make sure we have no
unexpected side effects.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
The mapper package contains functions related to creating and marshalling
reponses to machines.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This is a massive commit that restructures the code into modules:
db/
All functions related to modifying the Database
types/
All type definitions and methods that can be exclusivly used on
these types without dependencies
policy/
All Policy related code, now without dependencies on the Database.
policy/matcher/
Dedicated code to match machines in a list of FilterRules
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This is step one in detaching the Database layer from Headscale (h). The
ultimate goal is to have all function that does database operations in
its own package, and keep the business logic and writing separate.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
distroless has proven a mantenance burden for us, and it has caused headaches for user when trying to debug issues in the container.
And in 2023, 20MB of extra disk space are neglectible.
This commit simplifies the goreleaser configuration and then adds nfpm
support which allows us to build .deb and .rpm for each of the ARCH we
support.
The deb and rpm packages adds systemd services and users, creates
directories etc and should in general give the user a working
environment. We should be able to remove a lot of the complicated,
PEBCAK inducing documentation after this.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commits adds a test to verify that nodes get updated if a node in
their network expires.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit adds a default OpenID Connect expiry to 180d to align with
Tailscale SaaS (previously infinite or based on token expiry).
In addition, it adds an option use the expiry time from the Token sent
by the OpenID provider. This will typically cause really short expiry
and you should only turn on this option if you know what you are
desiring.
This fixes#1176.
Co-authored-by: Even Holthe <even.holthe@bekk.no>
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Align behaviour of dns_config.restricted_nameservers to tailscale.
Tailscale allows split DNS configuration without requiring global nameservers.
In addition, as per [the docs](https://tailscale.com/kb/1054/dns/#using-dns-settings-in-the-admin-console):
> These nameservers also configure search domains for your devices
This commit aligns headscale to tailscale by:
* honouring dns_config.restricted_nameservers regardless of whether any global resolvers are configured
* adding a search domain for each restricted_nameserver
- Save logs from control(headscale) on every run to tmp
- Upgrade nix-actions
- Cancel builds if new commit is pushed
- Fix a sorting bug in user command test
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
As indicated by bradfitz in https://github.com/juanfont/headscale/issues/804#issuecomment-1399314002,
both routes for the exit node must be enabled at the same time. If a user tries to enable one of the exit node routes,
the other gets activated too.
This commit also reduces the API surface, making private a method that didnt need to be exposed.
The calls to AutoMigrate to other classes that refer to users will
create the table and it will break, it needs to be done before
everything else.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
While this truly breaks the point of the backwards compatible stuff with
protobuf, it does not seem worth it to attempt to glue together a
compatible API.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Currently the most "secret" way to specify the oidc client secret is via
an environment variable `OIDC_CLIENT_SECRET`, which is problematic[1].
Lets allow reading oidc client secret from a file. For extra convenience
the path to the secret will resolve the environment variables.
[1]: https://systemd.io/CREDENTIALS/
This commit sets the Headscale config from env instead of file for
integration tests, the main point is to make sure that when we add per
test config, it properly replaces the config key and not append it or
something similar.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
In TS2021 the MachineKey can be obtained from noiseConn.Peer() - contrary to what I thought before,
where I assumed MachineKey was dropped in TS2021.
By having a ts2021App and hanging from there the TS2021 handlers, we can fetch again the MachineKey.
When using Tailscale v1.34.1, enabling or disabling a route does not
effectively add or remove the route from the node's routing table.
We must restart tailscale on the node to have a netmap update.
Fix this by refreshing last state change so that a netmap diff is sent.
Also do not include secondary routes in allowedIPs, otherwise secondary
routes might be used by nodes instead of the primary route.
Signed-off-by: Fatih Acar <facar@scaleway.com>
Port routes tests to new model
Mark as primary the first instance of subnet + tests
In preparation for subnet failover, mark the initial occurrence of a subnet as the primary one.
This commit makes the initial SSH test a bit simpler:
- Use the same pattern/functions for all clients as other tests
- Only test within _one_ namespace/user to confirm the base case
- Use retry function, same as taildrop, there is some funky going on
there...
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Advertises the SSH capability, and parses the SSH ACLs to pass to the
tailscale client. Doesn’t support ‘autogroup’ ACL functionality.
Co-authored-by: Daniel Brooks <db48x@headline.com>
* Port OIDC integration tests to v2
* Move Tailscale old versions to TS2019 list
* Remove Alpine Linux container
* Updated changelog
* Releases: use flavor to set the tag suffix
* Added more debug messages in OIDC registration
* Added more logging
* Do not strip nodekey prefix on handle expired
* Updated changelog
* Add WithHostnameAsServerURL option func
* Reduce the number of namespaces and use hsic.WithHostnameAsServerURL
* Linting fix
* Fix linting issues
* Wait for ready outside the up goroutine
* Minor change in log message
* Add prefix to env var
* Remove unused env var
Co-authored-by: Juan Font <juan.font@esa.int>
Co-authored-by: Steven Honson <steven@honson.id.au>
Co-authored-by: Kristoffer Dalby <kristoffer@dalby.cc>
This commit injects the per-test-generated tls certs into the tailscale
container and makes sure all can ping all. It does not test any of the
DERP isolation yet.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit injects the per-test-generated tls certs into the tailscale
container and makes sure all can ping all. It does not test any of the
DERP isolation yet.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
When using running `tailscale up` in the AuthKey flow process, the tailscale client immediately enters PollMap after registration - avoiding a race condition.
When using the web auth (up -> go to the Control website -> CLI `register`) the client is polling checking if it has been authorized. If we immediately ask for the client IP, as done in CreateHeadscaleEnv() we might have the client in NotReady status.
This method provides a way to wait for the client to be ready.
Signed-off-by: Juan Font Alonso <juanfontalonso@gmail.com>
The retry has no real function as it will just fail on
"container exists" on the old tests and the new test will
just try forever before it eventually fails.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
As indicated by the comment, the default /var/lib/headscale path is not writable in the container. However the sample setting is not following that like `private_key_path`
0.16.0 introduced random suffixes to all machine given names
(DNS hostnames) regardless of collisions within a namespace.
This commit brings Headscale more inline with Tailscale by only
adding a suffix if the hostname will collide within the namespace.
The suffix generation differs from Tailscale.
See https://tailscale.com/kb/1098/machine-names/
Add a configuration flag (default true to preserve current behaviour) to
allow headscale to start without OIDC being able to initialise.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit makes headscale fall back to CLI authentication if oidc
fails to initialised and posts a warning to users.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
OIDC might be configured, but unable to be initialised, this only runs
the oidc cycle if it is actually successfully set up/initialised.
Prep for next commit
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
This commit addresses a potential issue where we allowed unsanitised
content to be passed through a go template without validation.
We now try to unmarshall the incoming node key and fails to render the
template if it is not a valid node key.
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
It appears to be causing confusion for users on Discord when copying/pasting from the example here, if Headscale crashes on launch then the container will be removed and logs can't be viewed with `docker logs`.
It appears to be causing confusion for users on Discord when copying/pasting from the example here, if Headscale crashes on launch then the container will be removed and logs can't be viewed with `docker logs`.
Currently we exit the program if the setup does not work, this can cause
is to leave containers and other resources behind since we dont run
TearDown. This change will just fail the test if we cant set up, which
should mean that the TearDown runs aswell.
We currently have a bit of flaky logic which prevents the docker plugin
from cleaning up the containers if the tests or setup fatals or crashes,
this is due to a limitation in the save / passed stats handling.
This change makes it an environment variable which by default ditches
the logs and makes the containers clean up "correctly" in the teardown
method.
This commit makes the setLastStateChangeToNow function take a list of
namespaces instead of a single namespace. If no namespaces is passed,
all namespaces will be updated. This means that the argument acts like a
filter.
Implements #617.
Tailscale has changed the format of their ACLs to use a more firewall-y terms ("users" & "ports" -> "src" & "dst"). They have also started using all-lowercase tags. This PR applies these changes.
Extract LoadConfig from GetHeadscaleConfig, as they are conceptually
different operation, e.g.,
1) you can reload config through LoadConfig and do not get config
2) you can get config without reload config
This commit starts to wire up better signal handling, it starts with
handling shutdown a bit better, using the graceful shutdown for all the
listeners we use.
It also adds the initial switch case for handling config and acl reload,
which is to be implemented.
This commit makes isOutdated validate a nodes necessity to update
against all namespaces, and not just the nodes own namespace (which made
more sense before).
getLastStateChange is now uses the passed namespaces as a filter,
meaning that not requesting any namespace will give you the total last
updated state.
In addition, the sync.Map is exchanged for a variant that uses generics
which allows us to remove some casting logic.
When running in CI, I obtained the following error:
```
Running [/home/runner/golangci-lint-1.41.0-linux-amd64/golangci-lint run --out-format=github-actions --new-from-patch=/tmp/tmp-1795-28vaWZek2jfM/pull.patch --new=false --new-from-rev=] in [] ...
level=error msg="Running error: unknown linters: 'ireturn,maintidx', run 'golangci-lint linters' to see the list of supported linters"
```
Adds knobs to configure three aspects of the OpenID Connect flow:
* Custom scopes to override the default "openid profile email".
* Custom parameters to be added to the Authorize Endpoint request.
* Domain allowlisting for authenticated principals.
* User allowlisting for authenticated principals.
This commit fixes the issue of headscale crashing after sending on a
closed channel by moving the channel close to the sender side, instead
of the creator. closeChanWithLog is also implemented with generics now.
Fixes: https://github.com/juanfont/headscale/issues/342
Signed-off-by: Moritz Poldrack <git@moritz.sh>
- makes the html/template for /register follow the same formatting
as /apple and /windows
- adds a <title> element
- minor change for consistency's sake
This commit adds dockerbuild to flakes.nix:
```
nix build .#headscale-docker
```
This uses the Nix infra to build and _does not_ use Dockerfile.
It currently works on Linux (no macOS)
This commit adds a flake.nix build file, it can be used for three
things:
Build `headscale` from local or straight from git:
nix build
or
nix build github:juanfont/headscale
Run and Build `headscale` from local or straight from git:
nix run
or
nix run github:juanfont/headscale
Set up a development environment including all our tools,
- linters
- protobuf tooling
- compilers
nix develop
Websockets, in which DERP is based, requires a TLS certificate. At the same time,
if we use a certificate it must be valid... otherwise Tailscale wont connect (does not
have an Insecure option). So there is no option to expose insecure here
- registry file /windows/tailscale.reg is generated, filling in the
associated control server URL
- also includes CLI instructions
- fix /apple incorrect template: 'Url' is supposed to be '.URL'
This series of commit will be adding an embedded DERP server (and STUN) to Headscale,
thus making it completely self-contained and not dependant in other infrastructure.
logs looks like the following
```
2022-03-02T20:43:08Z DBG Expanding alias=app-test
2022-03-02T20:43:08Z DBG Expanding alias=kube-test
2022-03-02T20:43:08Z DBG Expanding alias=test
2022-03-02T20:43:08Z WRN No IPs found with the alias test
2022-03-02T20:43:08Z DBG Expanding alias=prod
2022-03-02T20:43:08Z WRN No IPs found with the alias prod
2022-03-02T20:43:08Z DBG Expanding alias=prod
2022-03-02T20:43:08Z WRN No IPs found with the alias prod
```
- `preauthkey`, `authkey`, `pre` are aliases for `preauthkey` command
- `ls`, `show` are aliases for `list` subcommand
- `c`, `new` are aliases for `create` subcommand
- `revoke`, `exp`, `e` are aliases for `expire` subcommand
- `apikey`, `api` are aliases for `apikeys` command
- `ls`, `show` are aliases for `list` subcommand
- `c`, `new` are aliases for `create` subcommand
- `revoke`, `exp`, `e` are aliases for the `expire` subcommand
- `namespace`, `ns`, `user`, `users` are aliases for `namespaces`
command
- `c`, `new` are aliases for the `create` subcommand
- `delete` is an alias for the `destroy` subcommand
- `mv` is an alias for the `rename` subcommand
- `ls`, `show` are aliases for the `list` subcommand
- `node`, `machine`, `machines` are aliases for `nodes` command
- `ls`, `show` aliases for `list` subcommand
- `logout`, `exp`, `e` are aliases for `expire` subcommand
- `del` is an alias for `delete` subcommand
This commit removes the need for datatypes.JSON and makes the code a bit
cleaner by allowing us to use proper types throughout the code when it
comes to hostinfo and other datatypes on the machine object.
This allows us to remove alot of unmarshal/marshal operations and remove
a lot of obsolete error checks.
This following commits will clean away a lot of untyped data and
uneccessary error checks.
This commit removes the two extra caches (oidc, requested time) and uses
the new central registration cache instead. The requested time is
unified into the main machine object and the oidc key is just added to
the same cache, as a string with the state as a key instead of machine
key.
This commit removes the field from the database and does a DB migration
**removing** all unregistered machines from headscale.
This means that from this version, all machines in the database is
considered registered.
current logic is not safe as it will allow an IP that isnt persisted to
the DB to be given out multiple times if machines joins in quick
succession.
This adds a lock around the "get ip" and machine registration and save
to DB so we ensure thiis isnt happning.
Currently this had to be done three places, which is silly, and outlined
in #294.
After some more tests in tailscale I couldn't replicate the behavior
described in there.
When adding a rule, allowing A to talk to B the reverse connection was
instantly added to B to allow communication to B.
The previous assumption was probably wrong.
Using h.ListAllMachines also listed the current machine in the result. It's unnecessary (I don't know if it's harmful).
Breaking the check with the `matchSourceAndDestinationWithRule` broke the tests. We have a specificity with the '*' destination that isn't symetrical.
I need to think of a better way to do this. It too hard to read.
Rewrite some function to get rid of the dependency on Headscale object. This allows us
to write succinct test that are more easy to review and implement.
The improvements of the tests allowed to write the removal of the tagged hosts
from the namespace as specified here: https://tailscale.com/kb/1068/acl-tags/
This call should be done quite at each modification of a server resources like RequestTags.
When a server changes it's tag we should rebuild the ACL rules.
When a server is added to headscale we also should update the ACLRules.
This commit change the default behaviour and remove the notion of namespaces between the hosts. It allows all namespaces to be only filtered by the ACLs. This behavior is closer to tailsnet.
A hidden thing was implied in this document is that each person should have his own namespace.
Hidden information in spicification isn't good.
Thank's @kradalby for pointing it out.
description: Use this agent when you need to execute, analyze, or troubleshoot Headscale integration tests. This includes running specific test scenarios, investigating test failures, interpreting test artifacts, validating end-to-end functionality, or ensuring integration test quality before releases. Examples: <example>Context: User has made changes to the route management code and wants to validate the changes work correctly. user: 'I've updated the route advertisement logic in poll.go. Can you run the relevant integration tests to make sure everything still works?' assistant: 'I'll use the headscale-integration-tester agent to run the subnet routing integration tests and analyze the results.' <commentary>Since the user wants to validate route-related changes with integration tests, use the headscale-integration-tester agent to execute the appropriate tests and analyze results.</commentary></example> <example>Context: A CI pipeline integration test is failing and the user needs help understanding why. user: 'The TestSubnetRouterMultiNetwork test is failing in CI. The logs show some timing issues but I can't figure out what's wrong.' assistant: 'Let me use the headscale-integration-tester agent to analyze the test failure and examine the artifacts.' <commentary>Since this involves analyzing integration test failures and interpreting test artifacts, use the headscale-integration-tester agent to investigate the issue.</commentary></example>
color: green
---
You are a specialist Quality Assurance Engineer with deep expertise in Headscale's integration testing system. You understand the Docker-based test infrastructure, real Tailscale client interactions, and the complex timing considerations involved in end-to-end network testing.
## Integration Test System Overview
The Headscale integration test system uses Docker containers running real Tailscale clients against a Headscale server. Tests validate end-to-end functionality including routing, ACLs, node lifecycle, and network coordination. The system is built around the `hi` (Headscale Integration) test runner in `cmd/hi/`.
## Critical Test Execution Knowledge
### System Requirements and Setup
```bash
# ALWAYS run this first to verify system readiness
go run ./cmd/hi doctor
```
This command verifies:
- Docker installation and daemon status
- Go environment setup
- Required container images availability
- Sufficient disk space (critical - tests generate ~100MB logs per run)
- Network configuration
### Test Execution Patterns
**CRITICAL TIMEOUT REQUIREMENTS**:
- **NEVER use bash `timeout` command** - this can cause test failures and incomplete cleanup
- **ALWAYS use the built-in `--timeout` flag** with generous timeouts (minimum 15 minutes)
- **Increase timeout if tests ever time out** - infrastructure issues require longer timeouts
```bash
# Single test execution (recommended for development)
# ALWAYS use --timeout flag with minimum 15 minutes (900s)
go run ./cmd/hi run "TestSubnetRouterMultiNetwork" --timeout=900s
# Database-heavy tests require PostgreSQL backend and longer timeouts
go run ./cmd/hi run "TestExpireNode" --postgres --timeout=1800s
# Pattern matching for related tests - use longer timeout for multiple tests
go run ./cmd/hi run "TestSubnet*" --timeout=1800s
# Long-running individual tests need extended timeouts
go run ./cmd/hi run "TestNodeOnlineStatus" --timeout=2100s # Runs for 12+ minutes
# Full test suite (CI/validation only) - very long timeout required
- **Slow tests** (5+ min): Node expiration, HA failover
- **Long-running tests** (10+ min): `TestNodeOnlineStatus` runs for 12 minutes
**CONCURRENT EXECUTION**: Multiple tests CAN run simultaneously. Each test run gets a unique Run ID for isolation. See "Concurrent Execution and Run ID Isolation" section below.
## Test Artifacts and Log Analysis
### Artifact Structure
All test runs save comprehensive artifacts to `control_logs/TIMESTAMP-ID/`:
```
control_logs/20250713-213106-iajsux/
├── hs-testname-abc123.stderr.log # Headscale server error logs
├── hs-testname-abc123.stdout.log # Headscale server output logs
├── hs-testname-abc123.db # Database snapshot for post-mortem
3.**MapResponse JSON files**: Protocol-level debugging for network map generation issues
4.**Client status dumps** (`*_status.json`): Network state and peer connectivity information
5.**Database snapshots** (`.db` files): For data consistency and state persistence issues
## Concurrent Execution and Run ID Isolation
### Overview
The integration test system supports running multiple tests concurrently on the same Docker daemon. Each test run is isolated through a unique Run ID that ensures containers, networks, and cleanup operations don't interfere with each other.
### Run ID Format and Usage
Each test run generates a unique Run ID in the format: `YYYYMMDD-HHMMSS-{6-char-hash}`
## Common Failure Patterns and Root Cause Analysis
### CRITICAL MINDSET: Code Issues vs Infrastructure Issues
**⚠️ IMPORTANT**: When tests fail, it is ALMOST ALWAYS a code issue with Headscale, NOT infrastructure problems. Do not immediately blame disk space, Docker issues, or timing unless you have thoroughly investigated the actual error logs first.
### Systematic Debugging Process
1.**Read the actual error message**: Don't assume - read the stderr logs completely
2.**Check Headscale server logs first**: Most issues originate from server-side logic
3.**Verify client connectivity**: Only after ruling out server issues
4.**Check timing patterns**: Use proper `EventuallyWithT` patterns
5.**Infrastructure as last resort**: Only blame infrastructure after code analysis
### Real Failure Patterns
#### 1. Timing Issues (Common but fixable)
```go
// ❌ Wrong: Immediate assertions after async operations
},10*time.Second,100*time.Millisecond,"route should be advertised")
```
**Timeout Guidelines**:
- Route operations: 3-5 seconds
- Node state changes: 5-10 seconds
- Complex scenarios: 10-15 seconds
- Policy recalculation: 5-10 seconds
#### 2. NodeStore Synchronization Issues
Route advertisements must propagate through poll requests (`poll.go:420`). NodeStore updates happen at specific synchronization points after Hostinfo changes.
**Why focus on WHY**: Helps maintainers understand architectural decisions and security requirements.
## EventuallyWithT Pattern for External Calls
### Overview
EventuallyWithT is a testing pattern used to handle eventual consistency in distributed systems. In Headscale integration tests, many operations are asynchronous - clients advertise routes, the server processes them, updates propagate through the network. EventuallyWithT allows tests to wait for these operations to complete while making assertions.
### External Calls That Must Be Wrapped
The following operations are **external calls** that interact with the headscale server or tailscale clients and MUST be wrapped in EventuallyWithT:
- `headscale.ListNodes()` - Queries server state
- `client.Status()` - Gets client network status
- `client.Curl()` - Makes HTTP requests through the network
}, 10*time.Second, 100*time.Millisecond, "route should be approved")
```
## Your Core Responsibilities
1. **Test Execution Strategy**: Execute integration tests with appropriate configurations, understanding when to use `--postgres` and timing requirements for different test categories. Follow phase-based testing approach prioritizing route tests.
- **Why this priority**: Route tests are less infrastructure-sensitive and validate core security logic
2. **Systematic Test Analysis**: When tests fail, systematically examine artifacts starting with Headscale server logs, then client logs, then protocol data. Focus on CODE ISSUES first (99% of cases), not infrastructure. Use real-world failure patterns to guide investigation.
- **Why this approach**: Most failures are logic bugs, not environment issues - efficient debugging saves time
3. **Timing & Synchronization Expertise**: Understand asynchronous Headscale operations, particularly route advertisements, NodeStore synchronization at `poll.go:420`, and policy propagation. Fix timing with `EventuallyWithT` while preserving original test expectations.
- **Why preserve expectations**: Test assertions encode business requirements and security policies
- **Key Pattern**: Apply the EventuallyWithT pattern correctly for all external calls as documented above
4. **Root Cause Analysis**: Distinguish between actual code regressions (route approval logic, HA failover architecture), timing issues requiring `EventuallyWithT` patterns, and genuine infrastructure problems (DNS, Docker, container issues).
- **Why this distinction matters**: Different problem types require completely different solution approaches
- **EventuallyWithT Issues**: Often manifest as flaky tests or immediate assertion failures after async operations
5.**Security-Aware Quality Validation**: Ensure tests properly validate end-to-end functionality with realistic timing expectations and proper error handling. Never suggest security bypasses or test expectation changes. Add comprehensive godoc when you understand test business logic.
- **Why security focus**: Integration tests are the last line of defense against security regressions
- **EventuallyWithT Usage**: Proper use prevents race conditions without weakening security assertions
6.**Concurrent Execution Awareness**: Respect run ID isolation and never interfere with other agents' test sessions. Each test run has a unique run ID - only clean up YOUR containers (by run ID label), never perform global cleanup while tests may be running.
- **Why this matters**: Multiple agents/users may run tests concurrently on the same Docker daemon
- **Key Rule**: NEVER use global container cleanup commands - the test runner handles cleanup automatically per run ID
**CRITICAL PRINCIPLE**: Test expectations are sacred contracts that define correct system behavior. When tests fail, fix the code to match the test, never change the test to match broken code. Only timing and observability improvements are allowed - business logic expectations are immutable.
**ISOLATION PRINCIPLE**: Each test run is isolated by its unique Run ID. Never interfere with other test sessions. The system handles cleanup automatically - manual global cleanup commands are forbidden when other tests may be running.
**EventuallyWithT PRINCIPLE**: Every external call to headscale server or tailscale client must be wrapped in EventuallyWithT. Follow the five key rules strictly: one external call per block, proper variable scoping, no nesting, use CollectT for assertions, and provide descriptive messages.
**Remember**: Test failures are usually code issues in Headscale that need to be fixed, not infrastructure problems to be ignored. Use the specific debugging workflows and failure patterns documented above to efficiently identify root causes. Infrastructure issues have very specific signatures - everything else is code-related.
Thank you for taking the time to report this issue.
To help us investigate and resolve this, we need more information. Please provide the following:
> [!TIP]
> Most issues turn out to be configuration errors rather than bugs. We encourage you to discuss your problem in our [Discord community](https://discord.gg/c84AZQhmpx) **before** opening an issue. The community can often help identify misconfigurations quickly, saving everyone time.
- **Attach long files** - Do not paste large logs or configurations inline. Use GitHub file attachments or GitHub Gists.
- **Use proper Markdown** - Format code blocks, logs, and configurations with appropriate syntax highlighting.
- **Structure your response** - Use the headings above to organize your information clearly.
## Redaction Rules
> [!CAUTION]
> **Replace, do not remove.** Removing information makes debugging impossible.
When redacting sensitive information:
- ✅ **Replace consistently** - If you change `alice@company.com` to `user1@example.com`, use `user1@example.com` everywhere (logs, config, policy, etc.)
- ❌ **Never remove information** - Gaps in data prevent us from correlating events across logs
- ❌ **Never redact IP addresses** - We need the actual IPs to trace network paths and identify issues
**If redaction rules are not followed, we will be unable to debug the issue and will have to close it.**
---
**Note:** This issue will be automatically closed in 3 days if no additional information is provided. Once you reply with the requested information, the `needs-more-info` label will be removed automatically.
If you need help gathering this information, please visit our [Discord community](https://discord.gg/c84AZQhmpx).
This issue tracker is used for **bug reports and feature requests** only. Your question appears to be a support or configuration question rather than a bug report.
For help with setup, configuration, or general questions, please visit our [Discord community](https://discord.gg/c84AZQhmpx) where the community and maintainers can assist you in real-time.
**Before posting in Discord, please check:**
- [Documentation](https://headscale.net/)
- [FAQ](https://headscale.net/stable/faq/)
- [Debugging and Troubleshooting Guide](https://headscale.net/stable/ref/debug/)
If after troubleshooting you determine this is actually a bug, please open a new issue with the required debug information from the troubleshooting guide.
body: 'Nix build failed with wrong gosum, please update "vendorSha256" (${{ steps.build.outputs.OLD_HASH }}) for the "headscale" package in flake.nix with the new SHA: ${{ steps.build.outputs.NEW_HASH }}'
name_template:'{{ .ProjectName }}_{{ .Version }}_{{ .Os }}_{{ .Arch }}{{ with .Arm }}v{{ . }}{{ end }}{{ with .Mips }}_{{ . }}{{ end }}{{ if not (eq .Amd64 "v1") }}{{ .Amd64 }}{{ end }}'
formats:
- binary
source:
enabled:true
name_template:"{{ .ProjectName }}_{{ .Version }}"
format:tar.gz
files:
- "vendor/"
nfpms:
# Configure nFPM for .deb and .rpm releases
#
# See https://nfpm.goreleaser.com/configuration/
# and https://goreleaser.com/customization/nfpm/
#
# Useful tools for debugging .debs:
# List file contents: dpkg -c dist/headscale...deb
Headscale is "Open Source, acknowledged contribution", this means that any contribution will have to be discussed with the maintainers before being added to the project.
This model has been chosen to reduce the risk of burnout by limiting the maintenance overhead of reviewing and validating third-party code.
## Why do we have this model?
Headscale has a small maintainer team that tries to balance working on the project, fixing bugs and reviewing contributions.
When we work on issues ourselves, we develop first hand knowledge of the code and it makes it possible for us to maintain and own the code as the project develops.
Code contributions are seen as a positive thing. People enjoy and engage with our project, but it also comes with some challenges; we have to understand the code, we have to understand the feature, we might have to become familiar with external libraries or services and we think about security implications. All those steps are required during the reviewing process. After the code has been merged, the feature has to be maintained. Any changes reliant on external services must be updated and expanded accordingly.
The review and day-1 maintenance adds a significant burden on the maintainers. Often we hope that the contributor will help out, but we found that most of the time, they disappear after their new feature was added.
This means that when someone contributes, we are mostly happy about it, but we do have to run it through a series of checks to establish if we actually can maintain this feature.
## What do we require?
A general description is provided here and an explicit list is provided in our pull request template.
All new features have to start out with a design document, which should be discussed on the issue tracker (not discord). It should include a use case for the feature, how it can be implemented, who will implement it and a plan for maintaining it.
All features have to be end-to-end tested (integration tests) and have good unit test coverage to ensure that they work as expected. This will also ensure that the feature continues to work as expected over time. If a change cannot be tested, a strong case for why this is not possible needs to be presented.
The contributor should help to maintain the feature over time. In case the feature is not maintained probably, the maintainers reserve themselves the right to remove features they redeem as unmaintainable. This should help to improve the quality of the software and keep it in a maintainable state.
## Bug fixes
Headscale is open to code contributions for bug fixes without discussion.
## Documentation
If you find mistakes in the documentation, please submit a fix to the documentation.
An open source, self-hosted implementation of the Tailscale coordination server.
An open source, self-hosted implementation of the Tailscale control server.
Join our [Discord](https://discord.gg/XcQxk2VHjx) server for a chat.
Join our [Discord server](https://discord.gg/c84AZQhmpx) for a chat.
**Note:** Always select the same GitHub tag as the released version you use to ensure you have the correct example configuration and documentation. The `main` branch might contain unreleased changes.
**Note:** Always select the same GitHub tag as the released version you use
to ensure you have the correct example configuration. The `main` branch might
contain unreleased changes. The documentation is available for stable and
development versions:
## Overview
- [Documentation for the stable version](https://headscale.net/stable/)
- [Documentation for the development version](https://headscale.net/development/)
Tailscale is [a modern VPN](https://tailscale.com/) built on top of [Wireguard](https://www.wireguard.com/). It [works like an overlay network](https://tailscale.com/blog/how-tailscale-works/) between the computers of your networks - using all kinds of [NAT traversal sorcery](https://tailscale.com/blog/how-nat-traversal-works/).
## What is Tailscale
Everything in Tailscale is Open Source, except the GUI clients for proprietary OS (Windows and macOS/iOS), and the 'coordination/control server'.
Tailscale is [a modern VPN](https://tailscale.com/) built on top of
[Wireguard](https://www.wireguard.com/).
It [works like an overlay network](https://tailscale.com/blog/how-tailscale-works/)
The control server works as an exchange point of Wireguard public keys for the nodes in the Tailscale network. It also assigns the IP addresses of the clients, creates the boundaries between each user, enables sharing machines between users, and exposes the advertised routes of your nodes.
Everything in Tailscale is Open Source, except the GUI clients for proprietary OS
(Windows and macOS/iOS), and the control server.
headscale implements this coordination server.
The control server works as an exchange point of Wireguard public keys for the
nodes in the Tailscale network. It assigns the IP addresses of the clients,
creates the boundaries between each user, enables sharing machines between users,
and exposes the advertised routes of your nodes.
## Support
A [Tailscale network (tailnet)](https://tailscale.com/kb/1136/tailnet/) is private
network which Tailscale assigns to a user in terms of private users or an
organisation.
If you like `headscale` and find it useful, there is sponsorship and donation buttons available in the repo.
## Design goal
If you would like to sponsor features, bugs or prioritisation, reach out to one of the maintainers.
Headscale aims to implement a self-hosted, open source alternative to the
[Tailscale](https://tailscale.com/) control server. Headscale's goal is to
provide self-hosters and hobbyists with an open-source server they can use for
their projects and labs. It implements a narrow scope, a _single_ Tailscale
network (tailnet), suitable for a personal use, or a small open-source
organisation.
## Status
## Supporting Headscale
- [x] Base functionality (nodes can communicate with each other)
- [x] Node registration through the web flow
- [x] Network changes are relayed to the nodes
- [x] Namespaces support (~tailnets in Tailscale.com naming)
- [x] Routing (advertise & accept, including exit nodes)
- [x] Node registration via pre-auth keys (including reusable keys, and ephemeral node support)
- [x] JSON-formatted output
- [x] ACLs
- [x] Taildrop (File Sharing)
- [x] Support for alternative IP ranges in the tailnets (default Tailscale's 100.64.0.0/10)
- [x] DNS (passing DNS servers to nodes)
- [x] Single-Sign-On (via Open ID Connect)
- [x] Share nodes between namespaces
- [x] MagicDNS (see `docs/`)
If you like `headscale` and find it useful, there is a sponsorship and donation
buttons available in the repo.
## Features
Please see ["Features" in the documentation](https://headscale.net/stable/about/features/).
| macOS | Yes (see `/apple` on your headscale for more information) |
| Windows | Yes [docs](./docs/windows-client.md) |
| Android | [You need to compile the client yourself](https://github.com/juanfont/headscale/issues/58#issuecomment-885255270) |
| iOS | Not yet |
## Roadmap 🤷
Suggestions/PRs welcomed!
Please see ["Client and operating system support" in the documentation](https://headscale.net/stable/about/clients/).
## Running headscale
Please have a look at the documentation under [`docs/`](docs/).
**Please note that we do not support nor encourage the use of reverse proxies
and container to run Headscale.**
Please have a look at the [`documentation`](https://headscale.net/stable/).
For NixOS users, a module is available in [`nix/`](./nix/).
## Builds from `main`
Development builds from the `main` branch are available as container images and
binaries. See the [development builds](https://headscale.net/stable/setup/install/main/)
documentation for details.
## Talks
- Fosdem 2026 (video): [Headscale & Tailscale: The complementary open source clone](https://fosdem.org/2026/schedule/event/KYQ3LL-headscale-the-complementary-open-source-clone/)
- presented by Kristoffer Dalby
- Fosdem 2023 (video): [Headscale: How we are using integration testing to reimplement Tailscale](https://fosdem.org/2023/schedule/event/goheadscale/)
- presented by Juan Font Alonso and Kristoffer Dalby
## Disclaimer
1. We have nothing to do with Tailscale, or Tailscale Inc.
2. The purpose of Headscale is maintaining a working, self-hosted Tailscale control panel.
This project is not associated with Tailscale Inc.
However, one of the active maintainers for Headscale [is employed by Tailscale](https://tailscale.com/blog/opensource) and he is allowed to spend work hours contributing to the project. Contributions from this maintainer are reviewed by other maintainers.
The maintainers work together on setting the direction for the project. The underlying principle is to serve the community of self-hosters, enthusiasts and hobbyists - while having a sustainable project.
## Contributing
To contribute to Headscale you would need the lastest version of [Go](https://golang.org) and [Buf](https://buf.build)(Protobuf generator).
Please read the [CONTRIBUTING.md](./CONTRIBUTING.md) file.
### Requirements
To contribute to headscale you would need the latest version of [Go](https://golang.org)
and [Buf](https://buf.build) (Protobuf generator).
We recommend using [Nix](https://nixos.org/) to setup a development environment. This can
be done with `nix develop`, which will install the tools and give you a shell.
This guarantees that you will have the same dev env as `headscale` maintainers.
### Code style
To ensure we have some consistency with a growing number of contributions, this project has adopted linting and style/formatting rules:
To ensure we have some consistency with a growing number of contributions,
this project has adopted linting and style/formatting rules:
The **Go** code is linted with [`golangci-lint`](https://golangci-lint.run) and
formatted with [`golines`](https://github.com/segmentio/golines) (width 88) and
@@ -82,6 +113,8 @@ run `make lint` and `make fmt` before committing any code.
The **Proto** code is linted with [`buf`](https://docs.buf.build/lint/overview) and
formatted with [`clang-format`](https://clang.llvm.org/docs/ClangFormat.html).
The **docs** are formatted with [`mdformat`](https://mdformat.readthedocs.io).
The **rest** (Markdown, YAML, etc) is formatted with [`prettier`](https://prettier.io).
Check out the `.golangci.yaml` and `Makefile` to see the specific configuration.
@@ -90,15 +123,18 @@ Check out the `.golangci.yaml` and `Makefile` to see the specific configuration.
- Go
- Buf
- Protobuf tools:
- Protobuf tools
Install and activate:
```shell
make install-protobuf-plugins
nix develop
```
### Testing and building
Some parts of the project require the generation of Go code from Protobuf (if changes are made in `proto/`) and it must be (re-)generated with:
Some parts of the project require the generation of Go code from Protobuf
(if changes are made in `proto/`) and it must be (re-)generated with:
```shell
make generate
@@ -118,169 +154,31 @@ To build the program:
make build
```
### Development workflow
We recommend using Nix for dependency management to ensure you have all required tools. If you prefer to manage dependencies yourself, you can use Make directly:
**With Nix (recommended):**
```shell
nix develop
make test
make build
```
**With your own dependencies:**
```shell
make test
make build
```
The Makefile will warn you if any required tools are missing and suggest running `nix develop`. Run `make help` to see all available targets.
This page provides <a href="https://support.apple.com/guide/mdm/mdm-overview-mdmbf9e668/web">configuration profiles</a> for the official Tailscale clients for <a href="https://apps.apple.com/us/app/tailscale/id1470499037?ls=1">iOS</a> and <a href="https://apps.apple.com/ca/app/tailscale/id1475387142?mt=12">macOS</a>.
</p>
<p>
The profiles will configure Tailscale.app to use {{.Url}} as its control server.
</p>
<h3>Caution</h3>
<p>You should always inspect the profile before installing it:</p>
approveRoutesCmd.Flags().StringSliceP("routes","r",[]string{},`List of routes that will be approved (comma-separated, e.g. "10.0.0.0/8,192.168.0.0/24" or empty string to remove all approved routes)`)
bypassFlag="bypass-grpc-and-access-database-directly"//nolint:gosec // not a credential
)
varerrAborted=errors.New("command aborted by user")
// bypassDatabase loads the server config and opens the database directly,
// bypassing the gRPC server. The caller is responsible for closing the
// returned database handle.
funcbypassDatabase()(*db.HSDatabase,error){
cfg,err:=types.LoadServerConfig()
iferr!=nil{
returnnil,fmt.Errorf("loading config: %w",err)
}
d,err:=db.NewHeadscaleDatabase(cfg,nil)
iferr!=nil{
returnnil,fmt.Errorf("opening database: %w",err)
}
returnd,nil
}
funcinit(){
rootCmd.AddCommand(policyCmd)
getPolicy.Flags().BoolP(bypassFlag,"",false,"Uses the headscale config to directly access the database, bypassing gRPC and does not require the server to be running")
policyCmd.AddCommand(getPolicy)
setPolicy.Flags().StringP("file","f","","Path to a policy file in HuJSON format")
setPolicy.Flags().BoolP(bypassFlag,"",false,"Uses the headscale config to directly access the database, bypassing gRPC and does not require the server to be running")
mustMarkRequired(setPolicy,"file")
policyCmd.AddCommand(setPolicy)
checkPolicy.Flags().StringP("file","f","","Path to a policy file in HuJSON format")
// this is only a warning because there could be something sitting in front of headscale that redirects the traffic (e.g. an iptables rule)
log.Warn().
Msg("Warning: when using tls_letsencrypt_hostname with TLS-ALPN-01 as challenge type, headscale must be reachable on port 443, i.e. listen_addr should probably end in :443")
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.