XenServer

Troubleshoot clustered pools

XenServer pools that use GFS2 to thin provision their shared block storage are clustered. These pools behave differently to pools that use shared file-based storage or LVM with shared block storage. As a result, there are some specific issues that might occur in XenServer clustered pools and GFS2 environments.

Use the following information to troubleshoot minor issues that might occur when using this feature.

All my hosts can ping each other, but I can’t create a cluster. Why?

The clustering mechanism uses specific ports. If your hosts can’t communicate on these ports (even if they can communicate on other ports), you can’t enable clustering for the pool.

Ensure that the hosts in the pool can communicate on the following ports:

  • TCP: 8892, 8896, 21064
  • UDP: 5404, 5405 (not multicast)

If there are any firewalls or similar between the hosts in the pool, ensure that these ports are open.
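
For example, you can run a rough reachability check of the TCP ports from one pool member to another. This is only a sketch: it assumes a tool such as nc is available in the control domain, it doesn't exercise the UDP corosync traffic, and <peer_address> is a placeholder for the IP address of another host in the pool:

# Quick check that the clustering TCP ports are reachable on a peer host
for port in 8892 8896 21064; do
    nc -zv <peer_address> $port
done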

If you have previously configured HA in the pool, disable HA before enabling clustering.
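
For example, to disable HA from the xe CLI before you enable clustering:

xe pool-ha-disable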

Why am I getting an error when I try to join a new host to an existing clustered pool?

When clustering is enabled on a pool, every pool membership change must be agreed by every member of the cluster before it can succeed. If a cluster member isn’t contactable, operations that change cluster membership (such as host add or host remove) fail.

To add your new host to the clustered pool:

  1. Ensure that all of your hosts are online and can be contacted.

  2. Ensure that the hosts in the pool can communicate on the following ports:

    • TCP: 8892, 8896, 21064
    • UDP: 5404, 5405 (not multicast)
  3. Ensure that the joining host has an IP address allocated on the NIC that joins the cluster network of the pool. (For an example of how to check this, see the sketch after these steps.)

  4. Ensure that no host in the pool is offline when a new host is trying to join the clustered pool.

  5. If an offline host cannot be recovered, mark it as dead to remove it from the cluster. For more information, see A host in my clustered pool is offline and I can’t recover it. How do I remove the host from my cluster?
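
For step 3, you can check the joining host's PIF configuration from the xe CLI. This is only a sketch: <host_uuid>, <pif_uuid>, <ip_address>, and <netmask> are placeholders, and the static assignment shown is one possibility among others:

# List the PIFs on the joining host and check their IP configuration
xe pif-list host-uuid=<host_uuid> params=uuid,device,network-name-label,IP,IP-configuration-mode

# If the PIF on the cluster network has no address, assign one (static example)
xe pif-reconfigure-ip uuid=<pif_uuid> mode=static IP=<ip_address> netmask=<netmask>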

What do I do if some members of the clustered pool aren’t joining the cluster automatically?

This issue might be caused by members of the clustered pool losing synchronization.

To resync the members of the clustered pool, use the following command:

xe cluster-pool-resync cluster-uuid=<cluster_uuid>
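
If you don't have the cluster UUID to hand, you might be able to look it up from any pool member. A minimal sketch, assuming the cluster list command is available in your release:

xe cluster-list --minimal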

If the issue persists, you can try to reattach the GFS2 SR. You can do this by using the xe CLI or XenCenter.

Reattach the GFS2 SR by using the xe CLI:

  1. Detach the GFS2 SR from the pool. On each host, run the xe CLI command xe pbd-unplug uuid=<uuid_of_pbd>. (To find the PBD UUIDs, see the example after these steps.)

  2. Disable the clustered pool by using the command xe cluster-pool-destroy cluster-uuid=<cluster_uuid>

    If the preceding command is unsuccessful, you can forcibly disable a clustered pool by running xe cluster-host-force-destroy uuid=<cluster_host> on every host in the pool.

  3. Enable the clustered pool again by using the command xe cluster-pool-create network-uuid=<network_uuid> [cluster-stack=cluster_stack] [token-timeout=token_timeout] [token-timeout-coefficient=token_timeout_coefficient]

  4. Reattach the GFS2 SR by running the command xe pbd-plug uuid=<uuid_of_pbd> on each host.
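
To find the PBD UUIDs used in steps 1 and 4, you can list the PBDs that connect the GFS2 SR to each host. A minimal sketch; <sr_uuid> is a placeholder for the UUID of your GFS2 SR:

# One PBD exists per host; note the UUID of each
xe pbd-list sr-uuid=<sr_uuid> params=uuid,host-uuid,currently-attached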

Alternatively, to use XenCenter to reattach the GFS2 SR:

  1. In the pool Storage tab, right-click on the GFS2 SR and select Detach….
  2. From the toolbar, select Pool > Properties.
  3. In the Clustering tab, deselect Enable clustering.
  4. Click OK to apply your change.
  5. From the toolbar, select Pool > Properties.
  6. In the Clustering tab, select Enable clustering and choose the network to use for clustering.
  7. Click OK to apply your change.
  8. In the pool Storage tab, right-click on the GFS2 SR and select Repair.

How do I know if my host has self-fenced?

If your host self-fenced, it might have rejoined the cluster when it restarted. To see if a host has self-fenced and recovered, you can check the /var/opt/xapi-clusterd/boot-times file to see the times the host started. If there are start times in the file that you did not expect to see, the host has self-fenced.
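
For example, you can compare the start times that the cluster daemon recorded with the reboot history kept by the control domain:

# Start times recorded by xapi-clusterd
cat /var/opt/xapi-clusterd/boot-times

# Reboot history, for comparison
last reboot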

Why is my host offline? How can I recover it?

There are many possible reasons for a host to go offline. Depending on the reason, the host might or might not be recoverable.

The following reasons for a host to be offline are more common and can be addressed by recovering the host:

  • Clean shutdown
  • Forced shutdown
  • Temporary power failure
  • Reboot

The following reasons for a host to be offline are less common:

  • Permanent host hardware failure
  • Permanent host power supply failure
  • Network partition
  • Network switch failure

These issues can be addressed by replacing hardware or by marking failed hosts as dead.

A host in my clustered pool is offline and I can’t recover it. How do I remove the host from my cluster?

You can tell the cluster to forget the host. This action removes the host from the cluster permanently and decreases the number of live hosts required for quorum.

To remove an unrecoverable host, use the following command:

xe host-forget uuid=<host_uuid>

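If you need to look up the UUID of the offline host first, you can list the hosts in the pool, for example:

xe host-list params=uuid,name-label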

Note:

If the host isn’t offline, this command can cause data loss. You’re asked to confirm the action before the command proceeds.

After a host is forgotten, it can’t be added back into the cluster. To add this host back into the cluster, you must do a fresh installation of XenServer on the host.

I’ve repaired a host that was marked as dead. How do I add it back into my cluster?

A XenServer host that has been marked as dead can’t be added back into the cluster. To add this system back into the cluster, you must do a fresh installation of XenServer. This fresh installation appears to the cluster as a new host.

What do I do if my cluster keeps losing quorum and its hosts keep fencing?

If one or more of the XenServer hosts in the cluster gets into a fence loop through continuously losing and regaining quorum, you can boot the host with the nocluster kernel command-line argument. Connect to the physical or serial console of the host and edit its boot arguments in grub.

Example:

/boot/grub/grub.cfg
menuentry 'XenServer' {
        search --label --set root root-oyftuj
        multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=192M,below=4G console=vga vga=mode-0x0311
        module2 /boot/vmlinuz-4.4-xen root=LABEL=root-oyftuj ro nolvm hpet=disable xencons=hvc console=hvc0 console=tty0 quiet vga=785 splash plymouth.ignore-serial-consoles nocluster
        module2 /boot/initrd-4.4-xen.img
}
menuentry 'XenServer (Serial)' {
        search --label --set root root-oyftuj
        multiboot2 /boot/xen.gz com1=115200,8n1 console=com1,vga dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=192M,below=4G
        module2 /boot/vmlinuz-4.4-xen root=LABEL=root-oyftuj ro nolvm hpet=disable console=tty0 xencons=hvc console=hvc0 nocluster
        module2 /boot/initrd-4.4-xen.img
}

What happens when the pool coordinator gets restarted in a clustered pool?

In most cases, the behavior when the pool coordinator is shut down or restarted in a clustered pool is the same as that when another pool member shuts down or restarts.

How the host is shut down or restarted can affect the quorum of the clustered pool. For more information about quorum, see Quorum.

The only difference in behavior depends on whether HA is enabled in your pool:

  • If HA is enabled, a new coordinator is selected and general service is maintained.
  • If HA is not enabled, there is no coordinator for the pool. Running VMs on the remaining hosts continue to run. Most administrative operations aren’t available until the coordinator restarts.

Why has my pool vanished after a host in the clustered pool is forced to shut down?

If you shut down a host normally (not forcibly), it’s temporarily removed from quorum calculations until it’s turned back on. However, if you forcibly shut down a host or it loses power, that host still counts towards quorum calculations. For example, if you had a pool of 3 hosts and forcibly shut down 2 of them, the remaining host fences because it no longer has quorum.

Try to always shut down hosts in a clustered pool cleanly. For more information, see Manage your clustered pool.

Why did all hosts within the clustered pool restart at the same time?

All hosts in an active cluster are considered to have lost quorum when the number of contactable hosts in the pool is less than these values:

  • For a pool with an even number of hosts: n/2
  • For a pool with an odd number of hosts: (n+1)/2

The letter n indicates the total number of hosts in the clustered pool. For more information about quorum, see Quorum.

In this situation, all hosts self-fence and you see all hosts restarting.

To diagnose why the pool lost quorum, the following information can be useful:

  • In XenCenter, check the Notifications section for the time of the issue to see whether self-fencing occurred.
  • On the cluster hosts, check /var/opt/xapi-clusterd/boot-times to see whether a reboot occurred at an unexpected time.
  • Check Crit.log for any self-fencing messages.
  • Review the dlm_tool status command output for fencing information.

    Example dlm_tool status output:

     dlm_tool status
    
     cluster nodeid 1 quorate 1 ring seq 8 8
     daemon now 4281 fence_pid 0
     node 1 M add 3063 rem 0 fail 0 fence 0 at 0 0
     node 2 M add 3066 rem 0 fail 0 fence 0 at 0 0

When collecting logs for debugging, collect diagnostic information from all hosts in the cluster. If a single host has self-fenced, the other hosts in the cluster are more likely to have useful information.

Collect full server status reports for the hosts in your clustered pool. For more information, see XenServer host logs.

Why can’t I recover my clustered pool when I have quorum?

If you have a clustered pool with an even number of hosts, the number of hosts required to achieve quorum is one more than the number of hosts required to retain quorum. For more information about quorum, see Quorum.

If you are in an even-numbered pool and have recovered half of the hosts, you must recover one more host before you can recover the cluster.

Why do I see an Invalid token error when changing the cluster settings?

When updating the configuration of your cluster, you might receive the following error message about an invalid token ("[[\"InternalError\",\"Invalid token\"]]").

You can resolve this issue by completing the following steps:

  1. (Optional) Back up the current cluster configuration by collecting a server status report that includes the xapi-clusterd and system logs.

  2. Use XenCenter to detach the GFS2 SR from the clustered pool.

    In the pool Storage tab, right-click on the GFS2 SR and select Detach….

  3. On any host in the cluster, run this command to forcibly destroy the cluster:

    xe cluster-pool-force-destroy cluster-uuid=<uuid>
    
  4. Use XenCenter to reenable clustering on your pool.

    1. From the toolbar, select Pool > Properties.
    2. In the Clustering tab, select Enable clustering and choose the network to use for clustering.
    3. Click OK to apply your change.
  5. Use XenCenter to reattach the GFS2 SR to the pool.

    In the pool Storage tab, right-click on the GFS2 SR and select Repair.