[Bug] Re-enabling a fallback node's redundant routes made both main and fallback node routes primary=false #1091

Closed
opened 2025-12-29 02:28:13 +01:00 by adam · 2 comments

Originally created by @rgarrigue on GitHub (Aug 27, 2025).

Is this a support request?

  - [x] This is not a support request

Is there an existing issue for this?

  - [x] I have searched the existing issues

Current Behavior

I have 2 nodes with identical routes, the "main" and the "fallback", for redundancy purposes. They were properly routing traffic to a database, the "main" routes being listed with primary=true and the "fallback" with primary=false.

I then added one more node, with routes to the "main" and "fallback". Out of laziness I ran a for loop to enable all the routes, including the already enabled ones, and we then lost connections through the VPN to the database. As shown in the screenshot below, the "main" routes 1️⃣ and the "fallback" routes 2️⃣ have primary=false.

![Image](https://github.com/user-attachments/assets/00dd4bb7-67b1-4838-8806-e5224f1c8706)
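A minimal sketch of such a loop, assuming the 0.25.x `headscale routes enable -r <route-id>` syntax (the route IDs are placeholders, not the real ones):

```bash
# Blindly enable every route, whether or not it is already enabled.
# IDs come from "headscale routes list"; the ones below are placeholders.
for id in 1 2 3 4 5 6; do
  headscale routes enable -r "$id"
done
```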

I disabled the "fallback" routes 1️⃣ 2️⃣, but the "main" routes didn't turn primary=true 3️⃣.

![Image](https://github.com/user-attachments/assets/fa8bca23-47e9-4c7e-ba13-3118eed2746e)

I enabled the "main" routes 1️⃣ and that did turn its routes to primary=true 2️⃣; the connection to the database was back ✔️.

![Image](https://github.com/user-attachments/assets/3614fefa-1940-420d-baf0-f95bc9f05ddd)

After that I enabled the "fallback" routes again, and primary stayed true on the "main" and false on the "fallback".

Expected Behavior

I expect traffic/routing to keep working if I enable already enabled routes. In other words, enabling an already enabled route should be idempotent, even if it is a seemingly useless thing to do.

Steps To Reproduce

  1. Add two tailscale nodes advertising identical routes
  2. Enable them both
  3. Check the state of the routes: one should be primary=true, the other primary=false
  4. Enable them both again
  5. Check the state of the routes: both end up with primary=false (see the command sketch after this list)
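A condensed command sketch of these steps, assuming the 0.25.x CLI (the login server, subnet, and route IDs are placeholders):

```bash
# On both nodes: advertise the same subnet to headscale.
tailscale up --login-server=https://vpn.example.com --advertise-routes=10.0.0.0/24

# On the headscale server: enable both routes, check, then enable them again.
headscale routes list           # note the two route IDs (say 1 and 2)
headscale routes enable -r 1
headscale routes enable -r 2
headscale routes list           # expected: one route primary=true, the other primary=false
headscale routes enable -r 1    # enabling again should be a no-op...
headscale routes enable -r 2
headscale routes list           # observed on 0.25.1: both routes primary=false
```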

Environment

"main" instance (which also happens to run headscale)
- OS: Debian 11
- Headscale version: 0.25.1
- Tailscale version: 1.80.2

"fallback" instance (which runs tailscale only)
- OS: Debian 13
- Tailscale version: 1.86.2

Runtime environment

  - [ ] Headscale is behind a (reverse) proxy
  - [ ] Headscale runs in a container

Debug information

There is no policy on my Headscale.

All clients lost and regained connectivity, so I believe it isn't about the clients, and there is a lot of sensitive information in the tailscale dumps, so I'll provide them only if deemed really relevant.

Headscale configuration

server_url: https://vpn.REDACTED
listen_addr: 0.0.0.0:443
metrics_listen_addr: 127.0.0.1:9090 # No monitoring
grpc_listen_addr: 127.0.0.1:50443 # No remote CLI
grpc_allow_insecure: false
noise:
  private_key_path: /var/lib/headscale/noise_private.key
prefixes:
  v6: fd7a:115c:a1e0::/48
  v4: 100.64.0.0/10
  allocation: random
derp:
  server:
    enabled: false
    region_id: 999
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"
    stun_listen_addr: "0.0.0.0:3478"
    private_key_path: /var/lib/headscale/derp_server_private.key
    automatically_add_embedded_derp_region: true
    ipv4: 1.2.3.4
    ipv6: 2001:db8::1
  urls:
    - https://controlplane.tailscale.com/derpmap/default
  paths: []
  auto_update_enabled: true
  update_frequency: 24h
disable_check_updates: true
ephemeral_node_inactivity_timeout: 30m
database:
  type: sqlite
  debug: false
  gorm:
    prepare_stmt: true
    parameterized_queries: true
    skip_err_record_not_found: true
    slow_threshold: 1000
  sqlite:
    path: /var/lib/headscale/db.sqlite
    write_ahead_log: true
acme_url: https://acme-v02.api.letsencrypt.org/directory
acme_email: "ops@REDACTED"
tls_letsencrypt_hostname: "vpn.REDACTED"
tls_letsencrypt_cache_dir: /var/lib/headscale/cache
tls_letsencrypt_challenge_type: HTTP-01
tls_letsencrypt_listen: ":http"
log:
  format: text
  level: info
policy:
  mode: file
  path: ""
dns:
  magic_dns: true
  base_domain: REDACTED.internal
  nameservers:
    global:
      - 9.9.9.9
      - 149.112.112.112
      - 2620:fe::fe
      - 2620:fe::9
    split:
      {}
  search_domains: []
unix_socket: /var/run/headscale/headscale.sock
unix_socket_permission: "0770"
oidc:
  only_start_if_oidc_is_available: true
  issuer: "https://accounts.google.com"
  client_id: "REDACTED.apps.googleusercontent.com"
  client_secret: "GOCSPX-REDACTED"
  expiry: 30d
  use_expiry_from_token: false
  scope: ["openid", "profile", "email"]
  extra_params:
    domain_hint: REDACTED
  allowed_domains:
    - REDACTED
  allowed_groups: []
  allowed_users: []
  strip_email_domain: true
logtail:
  enabled: false
randomize_client_port: false # default static port 41641
adam added the bug label 2025-12-29 02:28:13 +01:00
adam closed this issue 2025-12-29 02:28:13 +01:00

@nblock commented on GitHub (Aug 27, 2025):

Please try to reproduce with the current version of Headscale: 0.26.1.


@rgarrigue commented on GitHub (Aug 29, 2025):

Hello,

It seems the issue is gone with 0.26.1

![Image](https://github.com/user-attachments/assets/b5c31ff8-5bb0-4fd3-a2cc-a42db31bc595)

Sorry for not trying it out first.

Thanks for Headscale :-)

Reference: starred/headscale#1091