[Feature] Configurable Handling of Hostname Conflicts for Ephemeral Nodes to Preserve MagicDNS Discoverability #1112

Open
opened 2025-12-29 02:28:20 +01:00 by adam · 3 comments

Originally created by @oneingan on GitHub (Oct 18, 2025).

Use case

In environments with ephemeral nodes (e.g., short-lived VMs or containers that frequently restart or redeploy), nodes often re-register with the same hostname. The current behavior of appending a random suffix disrupts MagicDNS discoverability during the overlap period before the old node expires, making it harder to reliably resolve and connect to the active node by its intended hostname.

Description

Add a new server configuration option (e.g., ephemeral_conflict_resolution) that applies only to ephemeral nodes. Possible values:

  • "suffix" (default, current behavior: append random suffix to new registration).
  • "overwrite" (delete the old conflicting entry and register the new one with the clean hostname).
  • "rename_old" (append a suffix or timestamp to the old entry, and assign the clean hostname to the new registration).

Non-ephemeral nodes continue with the existing suffix-on-conflict behavior. This ensures clean, predictable MagicDNS names for active ephemeral nodes without affecting persistent setups.
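
As a rough illustration, the option could be modeled server-side as a small validated enum. This is only a sketch; the type and constant names (EphemeralConflictMode, ModeSuffix, etc.) are hypothetical, not existing Headscale identifiers:

```go
package main

import "fmt"

// EphemeralConflictMode enumerates the proposed values. The type and
// constant names are hypothetical, for illustration only.
type EphemeralConflictMode string

const (
	ModeSuffix    EphemeralConflictMode = "suffix"     // default: suffix the new registration
	ModeOverwrite EphemeralConflictMode = "overwrite"  // delete the old node, keep the clean name
	ModeRenameOld EphemeralConflictMode = "rename_old" // suffix the old node, keep the clean name
)

// parseMode validates the value read from config.yaml, defaulting to
// "suffix" when the option is unset.
func parseMode(s string) (EphemeralConflictMode, error) {
	switch m := EphemeralConflictMode(s); m {
	case ModeSuffix, ModeOverwrite, ModeRenameOld:
		return m, nil
	case "":
		return ModeSuffix, nil
	default:
		return "", fmt.Errorf("invalid ephemeral_conflict_resolution: %q", s)
	}
}

func main() {
	mode, err := parseMode("rename_old")
	fmt.Println(mode, err) // rename_old <nil>
}
```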

Contribution

  • [ ] I can write the design doc for this feature
  • [x] I can contribute this feature

How can it be implemented?

  • Introduce a new config field in config.yaml under server settings, e.g., ephemeral_conflict_resolution: suffix|overwrite|rename_old.
  • In the registration handler (handleRegistration), check whether the node is ephemeral and whether a hostname conflict exists.
  • If a conflict exists and the mode is "overwrite": delete the old node entry (via DeleteNode or similar).
  • If the mode is "rename_old": update the old node's given_name and fqdn with a suffix (e.g., -old-<timestamp>), then register the new node cleanly.
  • Fall back to the suffix behavior for non-ephemeral nodes or when the option is unset; a sketch of this dispatch follows the list.
  • Add tests for each mode, covering registration overlaps and MagicDNS resolution.
  • Update documentation in docs/ to explain the new config and use cases.
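
To make the dispatch concrete, here is a minimal self-contained sketch. The in-memory node store, appendRandomSuffix, and all other identifiers are hypothetical stand-ins, not actual Headscale code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// node and nodesByName are in-memory stand-ins for Headscale's node store,
// used only to keep this sketch self-contained.
type node struct {
	ID        uint64
	GivenName string
}

var nodesByName = map[string]*node{}

// EphemeralConflictMode mirrors the hypothetical config enum sketched above.
type EphemeralConflictMode string

const (
	ModeSuffix    EphemeralConflictMode = "suffix"
	ModeOverwrite EphemeralConflictMode = "overwrite"
	ModeRenameOld EphemeralConflictMode = "rename_old"
)

// appendRandomSuffix approximates the current behavior of suffixing the
// incoming registration.
func appendRandomSuffix(name string) string {
	return fmt.Sprintf("%s-%04x", name, rand.Intn(0x10000))
}

// resolveEphemeralConflict returns the given name the registering node
// should receive under the configured mode. The map operations stand in
// for whatever DeleteNode/rename persistence Headscale actually uses.
func resolveEphemeralConflict(mode EphemeralConflictMode, wanted string, isEphemeral bool) string {
	old, conflict := nodesByName[wanted]
	if !conflict {
		return wanted // no conflict: keep the clean name
	}
	if !isEphemeral || mode == ModeSuffix {
		return appendRandomSuffix(wanted) // existing behavior
	}
	switch mode {
	case ModeOverwrite:
		delete(nodesByName, wanted) // stand-in for DeleteNode
		return wanted
	case ModeRenameOld:
		parked := fmt.Sprintf("%s-old-%d", wanted, time.Now().Unix())
		old.GivenName = parked // stand-in for updating given_name and fqdn
		nodesByName[parked] = old
		delete(nodesByName, wanted)
		return wanted
	default:
		return appendRandomSuffix(wanted)
	}
}

func main() {
	nodesByName["web"] = &node{ID: 1, GivenName: "web"}
	fmt.Println(resolveEphemeralConflict(ModeRenameOld, "web", true)) // prints "web"
}
```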
adam added the enhancement label 2025-12-29 02:28:20 +01:00

@kradalby commented on GitHub (Oct 19, 2025):

I will object to this configuration; it would take a lot of special handling and be error-prone. It sounds like we would be solving the wrong problem: instead of making the system that creates the ephemeral nodes more robust by programmatically talking to Headscale and finding the node that was added, we would be adding complexity to Headscale.

The system provisioning the node should contact Headscale, look up the node and the hostname it ended up with, and report that back to its own tooling. Alternatively, the hostname can be retrieved as part of the tailscale login process. If needed, the nodes could be programmatically renamed to the desired name, for example via the API as sketched below.
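
For illustration, a provisioning system could perform the rename over Headscale's REST API. This is a hedged sketch: it assumes the rename endpoint POST /api/v1/node/<id>/rename/<new-name> exposed by recent Headscale releases (verify against your version), and the environment variables are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// renameNode asks Headscale to rename a node via the REST API. The endpoint
// shape (POST /api/v1/node/{id}/rename/{newName}) matches recent Headscale
// releases; check it against your server version before relying on it.
func renameNode(hsURL, apiKey, nodeID, newName string) error {
	url := fmt.Sprintf("%s/api/v1/node/%s/rename/%s", hsURL, nodeID, newName)
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("rename failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: rename node 42 back to the clean hostname after provisioning.
	if err := renameNode(os.Getenv("HS_URL"), os.Getenv("HS_API_KEY"), "42", "tf-example"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```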


@oneingan commented on GitHub (Oct 21, 2025):

Thank you for the feedback. I agree that adding the proposed configuration could introduce unnecessary complexity to Headscale. My issue stems from the fact that managing devices programmatically via the Terraform provider currently isn't feasible for my use case.

That said, I noticed Tailscale upstream handles hostname conflicts by appending simple incremental suffixes (e.g., -1, -2, -3). This seems like a lightweight change that could align Headscale more closely with Tailscale's behavior while addressing the ephemeral node discoverability issue. I'd be happy to contribute to implementing this.
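
For comparison, here is a toy sketch of incremental-suffix selection. This is not how Headscale currently generates names; the lookup map stands in for a check against registered nodes:

```go
package main

import "fmt"

// nextAvailableName mimics Tailscale's upstream conflict handling by
// appending the smallest free incremental suffix (-1, -2, ...) instead of
// a random one. `taken` is a stand-in for the registered-node lookup.
func nextAvailableName(wanted string, taken map[string]bool) string {
	if !taken[wanted] {
		return wanted
	}
	for i := 1; ; i++ {
		candidate := fmt.Sprintf("%s-%d", wanted, i)
		if !taken[candidate] {
			return candidate
		}
	}
}

func main() {
	taken := map[string]bool{"web": true, "web-1": true}
	fmt.Println(nextAvailableName("web", taken)) // web-2
}
```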


@Sharpie commented on GitHub (Oct 27, 2025):

I bumped into the same issue of getting stable names from MagicDNS when combining Terraform and Tailscale. Another user suggested a solution that uses a terraform_data resource to send a DELETE to the API when a VM is destroyed:

https://github.com/tailscale/terraform-provider-tailscale/issues/68#issuecomment-3420463533

I was able to adapt the pattern to Headscale using https://github.com/awlsring/terraform-provider-headscale :

```terraform
# === Boilerplate ===
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.68"
    }

    headscale = {
      source  = "awlsring/headscale"
      version = "~> 0.4"
    }
  }
}

variable "headscale_url" { }
variable "headscale_api_key" { sensitive = true }

provider "headscale" {
  endpoint = var.headscale_url
  api_key  = var.headscale_api_key
}

variable "digitalocean_token" { sensitive = true }

provider "digitalocean" {
  token = var.digitalocean_token
}

variable "ssh_admin_key" {}
variable "ssh_admin_private_key" { sensitive = true }

resource "digitalocean_ssh_key" "admin-key" {
  name       = "keyhole-admin"
  public_key = var.ssh_admin_key
}


# === Example ===
resource "headscale_pre_auth_key" "tf-example" {
  user = "1"
}

resource "digitalocean_droplet" "tf-example" {
  name   = "tf-example"
  image  = "ubuntu-25-04-x64"
  region = "nyc1"
  size   = "s-1vcpu-512mb-10gb"

  ssh_keys = [digitalocean_ssh_key.admin-key.id]

  user_data = <<-EOF
    #cloud-config
    runcmd:
      - ['sh', '-c', 'curl -fsSL https://tailscale.com/install.sh | sh']
      # Set sysctl settings for IP forwarding (useful when configuring an exit node)
      - ['sh', '-c', "echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf && echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf && sudo sysctl -p /etc/sysctl.d/99-tailscale.conf" ]
      - ['tailscale', 'up',
           '--hostname=tf-example',
           '--auth-key=${headscale_pre_auth_key.tf-example.key}',
           '--login-server=${var.headscale_url}']
  EOF

  connection {
    type        = "ssh"
    user        = "root"
    host        = "tf-example"
    private_key = var.ssh_admin_private_key
  }

  # Blocks Terraform until cloud-init makes the node reachable via 
  # the tailnet. This ensures data.headscale_device is populated.
  provisioner "remote-exec" { inline = ["uptime"] }
}

data "headscale_device" "tf-example" {
  name = digitalocean_droplet.tf-example.name
}

resource "terraform_data" "reset_tf-example" {
  lifecycle {
    replace_triggered_by = [digitalocean_droplet.tf-example]
  }

  input = [
    var.headscale_api_key,
    var.headscale_url,
    data.headscale_device.tf-example.id
  ]

  provisioner "local-exec" {
    when       = destroy
    on_failure = continue

    environment = {
      HS_API_KEY   = self.input[0]
      HS_URL       = self.input[1]
      HS_DEVICE_ID = self.input[2]
    }

    command = <<-EOS
      curl -sS -X DELETE \
        -H "Authorization: Bearer $HS_API_KEY" \
        "$HS_URL/api/v1/node/$HS_DEVICE_ID"
    EOS
  }
}
```