[Feature] Configurable Handling of Hostname Conflicts for Ephemeral Nodes to Preserve MagicDNS Discoverability #1112

Open
opened 2025-12-29 02:28:20 +01:00 by adam · 3 comments

Originally created by @oneingan on GitHub (Oct 18, 2025).

Use case

In environments with ephemeral nodes (e.g., short-lived VMs or containers that frequently restart or redeploy), nodes often re-register with the same hostname. The current behavior of appending a random suffix disrupts MagicDNS discoverability during the overlap period before the old node expires, making it harder to reliably resolve and connect to the active node by its intended hostname.

Description

Add a new server configuration option (e.g., ephemeral_conflict_resolution) that applies only to ephemeral nodes. Possible values:

  • "suffix" (default, current behavior: append random suffix to new registration).
  • "overwrite" (delete the old conflicting entry and register the new one with the clean hostname).
  • "rename_old" (append a suffix or timestamp to the old entry, and assign the clean hostname to the new registration).

Non-ephemeral nodes continue with the existing suffix-on-conflict behavior. This ensures clean, predictable MagicDNS names for active ephemeral nodes without affecting persistent setups.
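
As a rough illustration, the option could be modeled server-side as a small validated enum. This is only a sketch; the type and constant names (EphemeralConflictMode, ModeSuffix, etc.) are hypothetical, not existing Headscale identifiers:

```go
package main

import "fmt"

// EphemeralConflictMode enumerates the proposed values. The type and
// constant names are hypothetical, for illustration only.
type EphemeralConflictMode string

const (
	ModeSuffix    EphemeralConflictMode = "suffix"     // default: suffix the new registration
	ModeOverwrite EphemeralConflictMode = "overwrite"  // delete the old node, keep the clean name
	ModeRenameOld EphemeralConflictMode = "rename_old" // suffix the old node, keep the clean name
)

// parseMode validates the value read from config.yaml, defaulting to
// "suffix" when the option is unset.
func parseMode(s string) (EphemeralConflictMode, error) {
	switch m := EphemeralConflictMode(s); m {
	case ModeSuffix, ModeOverwrite, ModeRenameOld:
		return m, nil
	case "":
		return ModeSuffix, nil
	default:
		return "", fmt.Errorf("invalid ephemeral_conflict_resolution: %q", s)
	}
}

func main() {
	mode, err := parseMode("rename_old")
	fmt.Println(mode, err) // rename_old <nil>
}
```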

Contribution

  • [ ] I can write the design doc for this feature
  • [x] I can contribute this feature

How can it be implemented?

  • Introduce a new config field in config.yaml under server settings, e.g., ephemeral_conflict_resolution: suffix|overwrite|rename_old.
  • In the registration handler (handleRegistration), check whether the node is ephemeral and whether a hostname conflict exists.
  • If a conflict exists and the mode is "overwrite": delete the old node entry (via DeleteNode or similar).
  • If the mode is "rename_old": update the old node's given_name and fqdn with a suffix (e.g., -old-<timestamp>), then register the new node cleanly.
  • Fall back to the suffix behavior for non-ephemeral nodes or when the option is unset; a sketch of this dispatch follows the list.
  • Add tests for each mode, covering registration overlaps and MagicDNS resolution.
  • Update documentation in docs/ to explain the new config and use cases.
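
To make the dispatch concrete, here is a minimal self-contained sketch. The in-memory node store, appendRandomSuffix, and all other identifiers are hypothetical stand-ins, not actual Headscale code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// node and nodesByName are in-memory stand-ins for Headscale's node store,
// used only to keep this sketch self-contained.
type node struct {
	ID        uint64
	GivenName string
}

var nodesByName = map[string]*node{}

// EphemeralConflictMode mirrors the hypothetical config enum sketched above.
type EphemeralConflictMode string

const (
	ModeSuffix    EphemeralConflictMode = "suffix"
	ModeOverwrite EphemeralConflictMode = "overwrite"
	ModeRenameOld EphemeralConflictMode = "rename_old"
)

// appendRandomSuffix approximates the current behavior of suffixing the
// incoming registration.
func appendRandomSuffix(name string) string {
	return fmt.Sprintf("%s-%04x", name, rand.Intn(0x10000))
}

// resolveEphemeralConflict returns the given name the registering node
// should receive under the configured mode. The map operations stand in
// for whatever DeleteNode/rename persistence Headscale actually uses.
func resolveEphemeralConflict(mode EphemeralConflictMode, wanted string, isEphemeral bool) string {
	old, conflict := nodesByName[wanted]
	if !conflict {
		return wanted // no conflict: keep the clean name
	}
	if !isEphemeral || mode == ModeSuffix {
		return appendRandomSuffix(wanted) // existing behavior
	}
	switch mode {
	case ModeOverwrite:
		delete(nodesByName, wanted) // stand-in for DeleteNode
		return wanted
	case ModeRenameOld:
		parked := fmt.Sprintf("%s-old-%d", wanted, time.Now().Unix())
		old.GivenName = parked // stand-in for updating given_name and fqdn
		nodesByName[parked] = old
		delete(nodesByName, wanted)
		return wanted
	default:
		return appendRandomSuffix(wanted)
	}
}

func main() {
	nodesByName["web"] = &node{ID: 1, GivenName: "web"}
	fmt.Println(resolveEphemeralConflict(ModeRenameOld, "web", true)) // prints "web"
}
```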
adam added the enhancement label 2025-12-29 02:28:20 +01:00

@kradalby commented on GitHub (Oct 19, 2025):

I will object to this configuration; it would take a lot of special handling and be error-prone. It sounds like we would be solving the wrong problem: instead of making the system that creates the ephemeral nodes more robust by programmatically talking to Headscale and finding the node that was added, we would be adding complexity to Headscale.

The system provisioning the node should contact Headscale, look up the node and the hostname it ended up with, and report that back to its own tooling. Alternatively, the hostname can be retrieved as part of the tailscale login process. If needed, the nodes could be programmatically renamed to the desired name, for example via the API as sketched below.
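
For illustration, a provisioning system could perform the rename over Headscale's REST API. This is a hedged sketch: it assumes the rename endpoint POST /api/v1/node/<id>/rename/<new-name> exposed by recent Headscale releases (verify against your version), and the environment variables are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// renameNode asks Headscale to rename a node via the REST API. The endpoint
// shape (POST /api/v1/node/{id}/rename/{newName}) matches recent Headscale
// releases; check it against your server version before relying on it.
func renameNode(hsURL, apiKey, nodeID, newName string) error {
	url := fmt.Sprintf("%s/api/v1/node/%s/rename/%s", hsURL, nodeID, newName)
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("rename failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: rename node 42 back to the clean hostname after provisioning.
	if err := renameNode(os.Getenv("HS_URL"), os.Getenv("HS_API_KEY"), "42", "tf-example"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```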


@oneingan commented on GitHub (Oct 21, 2025):

Thank you for the feedback. I agree that adding the proposed configuration could introduce unnecessary complexity to Headscale. My issue stems from the fact that managing devices programmatically via the Terraform provider currently isn't feasible for my use case.

That said, I noticed Tailscale upstream handles hostname conflicts by appending simple incremental suffixes (e.g., -1, -2, -3). This seems like a lightweight change that could align Headscale more closely with Tailscale's behavior while addressing the ephemeral node discoverability issue. I'd be happy to contribute to implementing this.
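
For comparison, here is a toy sketch of incremental-suffix selection. This is not how Headscale currently generates names; the lookup map stands in for a check against registered nodes:

```go
package main

import "fmt"

// nextAvailableName mimics Tailscale's upstream conflict handling by
// appending the smallest free incremental suffix (-1, -2, ...) instead of
// a random one. `taken` is a stand-in for the registered-node lookup.
func nextAvailableName(wanted string, taken map[string]bool) string {
	if !taken[wanted] {
		return wanted
	}
	for i := 1; ; i++ {
		candidate := fmt.Sprintf("%s-%d", wanted, i)
		if !taken[candidate] {
			return candidate
		}
	}
}

func main() {
	taken := map[string]bool{"web": true, "web-1": true}
	fmt.Println(nextAvailableName("web", taken)) // web-2
}
```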


@Sharpie commented on GitHub (Oct 27, 2025):

I bumped into the same issue of getting stable names from MagicDNS when combining Terraform and Tailscale. Another user suggested a solution that uses a terraform_data resource to send a DELETE to the API when a VM is destroyed:

https://github.com/tailscale/terraform-provider-tailscale/issues/68#issuecomment-3420463533

I was able to adapt the pattern to Headscale using https://github.com/awlsring/terraform-provider-headscale :

```terraform
# === Boilerplate ===
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.68"
    }

    headscale = {
      source  = "awlsring/headscale"
      version = "~> 0.4"
    }
  }
}

variable "headscale_url" { }
variable "headscale_api_key" { sensitive = true }

provider "headscale" {
  endpoint = var.headscale_url
  api_key  = var.headscale_api_key
}

variable "digitalocean_token" { sensitive = true }

provider "digitalocean" {
  token = var.digitalocean_token
}

variable "ssh_admin_key" {}
variable "ssh_admin_private_key" { sensitive = true }

resource "digitalocean_ssh_key" "admin-key" {
  name       = "keyhole-admin"
  public_key = var.ssh_admin_key
}


# === Example ===
resource "headscale_pre_auth_key" "tf-example" {
  user = "1"
}

resource "digitalocean_droplet" "tf-example" {
  name   = "tf-example"
  image  = "ubuntu-25-04-x64"
  region = "nyc1"
  size   = "s-1vcpu-512mb-10gb"

  ssh_keys = [digitalocean_ssh_key.admin-key.id]

  user_data = <<-EOF
    #cloud-config
    runcmd:
      - ['sh', '-c', 'curl -fsSL https://tailscale.com/install.sh | sh']
      # Set sysctl settings for IP forwarding (useful when configuring an exit node)
      - ['sh', '-c', "echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf && echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf && sudo sysctl -p /etc/sysctl.d/99-tailscale.conf" ]
      - ['tailscale', 'up',
           '--hostname=tf-example',
           '--auth-key=${headscale_pre_auth_key.tf-example.key}',
           '--login-server=${var.headscale_url}']
  EOF

  connection {
    type        = "ssh"
    user        = "root"
    host        = "tf-example"
    private_key = var.ssh_admin_private_key
  }

  # Blocks Terraform until cloud-init makes the node reachable via 
  # the tailnet. This ensures data.headscale_device is populated.
  provisioner "remote-exec" { inline = ["uptime"] }
}

data "headscale_device" "tf-example" {
  name = digitalocean_droplet.tf-example.name
}

resource "terraform_data" "reset_tf-example" {
  lifecycle {
    replace_triggered_by = [digitalocean_droplet.tf-example]
  }

  input = [
    var.headscale_api_key,
    var.headscale_url,
    data.headscale_device.tf-example.id
  ]

  provisioner "local-exec" {
    when       = destroy
    on_failure = continue

    environment = {
      HS_API_KEY   = self.input[0]
      HS_URL       = self.input[1]
      HS_DEVICE_ID = self.input[2]
    }

    command = <<-EOS
      curl -sS -X DELETE \
        -H "Authorization: Bearer $HS_API_KEY" \
        "$HS_URL/api/v1/node/$HS_DEVICE_ID"
    EOS
  }
}
```