Citrix Hypervisor

Cope with machine failures

This section provides details of how to recover from various failure scenarios. All failure recovery scenarios require the use of one or more of the backup types listed in Backup.

Member failures

In the absence of HA, master nodes detect the failures of members by receiving regular heartbeat messages. If no heartbeat has been received for 600 seconds, the master assumes the member is dead. There are two ways to recover from this problem:

  • Repair the dead host (for example, by physically rebooting it). When the connection to the member is restored, the master marks the member as alive again.

  • Shut down the host and instruct the master to forget about the member node by using the xe host-forget CLI command (a minimal sketch follows the warning below). Once the member has been forgotten, all the VMs that were running on it are marked as offline and can be restarted on other Citrix Hypervisor servers.

    It is important to ensure that the Citrix Hypervisor server is actually offline, otherwise VM data corruption might occur.

    Do not split your pool into multiple pools of a single host by using xe host-forget. The resulting pools might all map the same shared storage and corrupt VM data.

Warning:

  • If you are going to use the forgotten host as an active host again, perform a fresh installation of the Citrix Hypervisor software.
  • Do not use the xe host-forget command if HA is enabled on the pool. Disable HA first, then forget the host, and then re-enable HA.
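
The following is a minimal sketch of the second approach, run on the pool master. The host name-label and UUID placeholder are hypothetical; substitute the values for the failed member:

    # Find the UUID of the failed member
    xe host-list name-label=failed-member params=uuid

    # Instruct the master to forget the failed member
    xe host-forget uuid=<host-uuid>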

When a member Citrix Hypervisor server fails, there might be VMs still registered in the running state. If you are certain that the member Citrix Hypervisor server is down, use the xe vm-reset-powerstate CLI command to set the power state of the VMs to halted. See vm-reset-powerstate for more details.

Warning:

Incorrect use of this command can lead to data corruption. Only use this command if necessary.
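
As an illustration only, the affected VMs can be identified and reset as follows; the host and VM UUIDs shown are placeholders:

    # List VMs still marked as running on the failed host
    xe vm-list resident-on=<host-uuid> power-state=running params=uuid,name-label

    # Force the power state of one affected VM back to halted
    xe vm-reset-powerstate vm=<vm-uuid> --force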

Before you can start VMs on another Citrix Hypervisor server, you must also release the locks on VM storage. Each disk in an SR can be used by only one host at a time, so it is essential to make the disks accessible to other Citrix Hypervisor servers after a host has failed. To do so, run the following script on the pool master for each SR that contains disks of any affected VMs:

    /opt/xensource/sm/resetvdis.py host_UUID SR_UUID master

You need only supply the third string (“master”) if the failed host was the SR master at the time of the crash. (The SR master is the pool master or a Citrix Hypervisor server using local storage.)

Warning:

Be sure that the host is down before running this command. Incorrect use of this command can lead to data corruption.
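
For example, a minimal sketch of this step, run on the pool master with placeholder UUIDs:

    # Find the UUID of the failed host and of the affected SRs
    xe host-list params=uuid,name-label
    xe sr-list params=uuid,name-label

    # Release the VDI locks held by the failed host on one SR
    # (append "master" only if the failed host was the SR master)
    /opt/xensource/sm/resetvdis.py <host-uuid> <sr-uuid> master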

If you attempt to start a VM on another Citrix Hypervisor server before running the resetvdis.py script, then you receive the following error message: VDI <UUID> already attached RW.

Master failures

Every member of a resource pool contains all the information necessary to take over the role of master. When a master node fails, the following sequence of events occurs:

  1. If HA is enabled, another master is elected automatically.

  2. If HA is not enabled, each member waits for the master to return.

If the master comes back up at this point, it re-establishes communication with its members, and operation returns to normal.

If the master is dead, choose one of the members and run the command xe pool-emergency-transition-to-master on it. Once it has become the master, run the command xe pool-recover-slaves; the members then point to the new master.
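
For example, run the following on the member chosen to become the new master:

    # Promote this member to pool master
    xe pool-emergency-transition-to-master

    # Point the remaining members at the new master
    xe pool-recover-slaves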

If you repair or replace the server that was the original master, you can simply bring it up, install the Citrix Hypervisor software, and add it to the pool. Because the Citrix Hypervisor servers in the pool are required to be homogeneous, there is no need to make the replaced server the master.

When a member Citrix Hypervisor server is transitioned to being a master, check that the default pool storage repository is set to an appropriate value. You can do this check by using the xe pool-param-list command and verifying that the default-SR parameter points to a valid storage repository.
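
For example, where the pool UUID shown is a placeholder:

    # Find the pool UUID
    xe pool-list params=uuid

    # List the pool parameters and check that default-SR points to a valid SR
    xe pool-param-list uuid=<pool-uuid>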

Pool failures

In the unfortunate event that your entire resource pool fails, you must recreate the pool database from scratch. Be sure to regularly back up your pool metadata using the xe pool-dump-database CLI command (see pool-dump-database).
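
For example, a routine backup taken on the pool master might look like the following; the file name is a placeholder:

    # Dump the pool database to a backup file
    xe pool-dump-database file-name=pool-backup.db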

To restore a completely failed pool:

  1. Install a fresh set of hosts. Do not pool them up at this stage.

  2. For the host nominated as the master, restore the pool database from your backup using the xe pool-restore-database command (see pool-restore-database).

  3. Connect to the master host using XenCenter and ensure that all your shared storage and VMs are available again.

  4. Perform a pool join operation on the remaining freshly installed member hosts, and start up your VMs on the appropriate hosts (a minimal sketch of steps 2 and 4 follows this list).
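
The following sketch illustrates steps 2 and 4; the file name, address, and password are placeholders:

    # On the nominated master: restore the pool database from the backup
    xe pool-restore-database file-name=pool-backup.db

    # On each freshly installed member host: join the restored pool
    xe pool-join master-address=<master-ip> master-username=root master-password=<password>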

Cope with failure due to configuration errors

If the physical host machine is operational but the software or host configuration is corrupted:

  1. Run the following command to restore host software and configuration:

    xe host-restore host=host file-name=hostbackup
    <!--NeedCopy-->
    
  2. Reboot to the host installation CD and select Restore from backup.

Physical machine failure

If the physical host machine has failed, use the appropriate procedure from the following list to recover.

Warning:

Any VMs that were running on a failed member (or on the failed master) are still marked as Running in the database. This behavior is a safety measure: simultaneously starting a VM on two different hosts would lead to severe disk corruption. If you are sure that the machines (and VMs) are offline, you can reset the VM power state to Halted:

xe vm-reset-powerstate vm=vm_uuid --force

VMs can then be restarted using XenCenter or the CLI.
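
For example, a halted VM can be restarted from the CLI as follows; the VM and host names are placeholders:

    # Start the VM, optionally on a specific surviving host
    xe vm-start vm=<vm-name-or-uuid> on=<host-name>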

To replace a failed master with a still running member:

  1. Run the following commands:

    xe pool-emergency-transition-to-master
    xe pool-recover-slaves
    <!--NeedCopy-->
    
  2. If the commands succeed, restart the VMs.

To restore a pool with all hosts failed:

  1. Run the command:

    xe pool-restore-database file-name=backup
    <!--NeedCopy-->
    

    Warning:

    This command only succeeds if the target machine has an appropriate number of appropriately named NICs.

  2. If the target machine has a different view of the storage from the original machine, modify the storage configuration using the pbd-destroy command. Next, use the pbd-create command to recreate the storage configuration. See pbd commands for documentation of these commands. A minimal sketch follows this list.

  3. If you have created a new storage configuration, use pbd-plug or the Storage > Repair Storage Repository menu item in XenCenter to use the new configuration.

  4. Restart all VMs.
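
The following sketch illustrates steps 2 and 3, assuming an NFS SR and placeholder UUIDs; the device-config keys differ for other SR types:

    # Find the PBD that describes the stale storage configuration
    xe pbd-list sr-uuid=<sr-uuid>

    # Remove the stale PBD (unplug it first with pbd-unplug if it is attached)
    xe pbd-destroy uuid=<old-pbd-uuid>

    # Recreate the PBD with the storage configuration seen by this host
    xe pbd-create host-uuid=<host-uuid> sr-uuid=<sr-uuid> \
        device-config:server=<nfs-server> device-config:serverpath=<export-path>

    # Plug the new PBD so that the SR becomes usable again
    xe pbd-plug uuid=<new-pbd-uuid>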

To restore a VM when VM storage is not available:

  1. Run the following command:

    xe vm-import filename=backup metadata=true
    <!--NeedCopy-->
    
  2. If the metadata import fails, run the command:

    xe vm-import filename=backup metadata=true --force
    <!--NeedCopy-->
    

    This command attempts to restore the VM metadata on a ‘best effort’ basis.

  3. Restart all VMs.
