Cope with machine failures

This section provides details of how to recover from various failure scenarios. All failure recovery scenarios require the use of one or more of the backup types listed in Backup.

Member failures

In the absence of HA, the master node detects the failure of a member through its regular heartbeat messages. If no heartbeat has been received for 600 seconds, the master assumes the member is dead. There are two ways to recover from this failure:

  • Repair the dead host (e.g. by physically rebooting it). When the connection to the member is restored, the master will mark the member as alive again.

  • Shut down the host and instruct the master to forget about the member node using the xe host-forget CLI command. Once the member has been forgotten, all the VMs which were running there will be marked as offline and can be restarted on other XenServer hosts. Note that it is very important to ensure that the XenServer host is actually offline, otherwise VM data corruption might occur. Be careful not to split your pool into multiple pools of a single host by using xe host-forget, since this could result in them all mapping the same shared storage and corrupting VM data. A sketch of the host-forget sequence appears after the warning below.

Warning:

  • If you are going to use the forgotten host as an active host again, perform a fresh installation of the XenServer software.
  • Do not use the xe host-forget command if HA is enabled on the pool. Disable HA first, then forget the host, and then re-enable HA.
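
For example, a sequence along these lines (the UUID is a placeholder) identifies the dead member and removes it from the pool:

    # List all hosts in the pool to find the UUID of the dead member
    xe host-list
    # Remove the member from the pool (only once it is confirmed offline)
    xe host-forget uuid=<member_host_uuid>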

When a member XenServer host fails, there may be VMs still registered in the running state. If you are sure that the member XenServer host is down, use the xe vm-reset-powerstate CLI command to set the power state of the VMs to halted. See vm-reset-powerstate for more details; a sketch follows the warning below.

Warning:

Incorrect use of this command can lead to data corruption. Only use this command if absolutely necessary.
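
For example, assuming the UUID of the failed host is known, a sequence such as the following (UUIDs are placeholders) lists the VMs still marked as running there and resets their power state:

    # List VMs still registered as running on the failed host
    xe vm-list resident-on=<host_uuid> power-state=running
    # Mark each affected VM as halted (only if the host is definitely down)
    xe vm-reset-powerstate vm=<vm_uuid> --force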

Before you can start VMs on another XenServer host, you are also required to release the locks on VM storage. Each disk in an SR can only be used by one host at a time, so it is essential to make the disks accessible to other XenServer hosts once a host has failed. To do so, run the following script on the pool master for each SR that contains disks of any affected VMs:

    /opt/xensource/sm/resetvdis.py host_UUID SR_UUID master

You need only supply the third string (“master”) if the failed host was the SR master (the pool master, or a XenServer host using local storage) at the time of the crash.

Warning:

Be absolutely sure that the host is down before executing this command. Incorrect use of this command can lead to data corruption.

If you attempt to start a VM on another XenServer host before running the script above, then you will receive the following error message: VDI <UUID> already attached RW.
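
For example, a sequence along the following lines (all UUIDs are placeholders) finds the SR UUID and releases the locks:

    # Find the UUID of the SR containing the affected VMs' disks
    xe sr-list
    # Release the VDI locks held by the failed host; append "master" only
    # if the failed host was the SR master at the time of the crash
    /opt/xensource/sm/resetvdis.py <host_uuid> <sr_uuid> master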

Master failures

Every member of a resource pool contains all the information necessary to take over the role of master if required. When a master node fails, the following sequence of events occurs:

  1. If HA is enabled, another master is elected automatically.

  2. If HA is not enabled, each member will wait for the master to return.

If the master comes back up at this point, it re-establishes communication with its members, and operation returns to normal.

If the master is really dead, choose one of the members and run the command xe pool-emergency-transition-to-master on it. Once it has become the master, run the command xe pool-recover-slaves and the members will then point to the new master.

If you repair or replace the server that was the original master, you can simply bring it up, install the XenServer host software, and add it to the pool. Because the XenServer hosts in the pool are required to be homogeneous, there is no need to make the replaced server the master.

When a member XenServer host is transitioned to being a master, you should also check that the default pool storage repository is set to an appropriate value. This can be done using the xe pool-param-list command and verifying that the default-SR parameter is pointing to a valid storage repository.
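
For example, a check along these lines (the UUIDs are placeholders) verifies, and if necessary corrects, the default SR:

    # Verify that default-SR points to a valid storage repository
    xe pool-param-list uuid=<pool_uuid>
    # If necessary, point the pool at a valid default SR
    xe pool-param-set uuid=<pool_uuid> default-SR=<sr_uuid>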

Pool failures

In the unfortunate event that your entire resource pool fails, you will need to recreate the pool database from scratch. Be sure to regularly back up your pool-metadata using the xe pool-dump-database CLI command (see pool-dump-database).
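
For example, a backup can be taken on the master with a command along these lines (the file name is a placeholder):

    # Dump the pool database to a backup file
    xe pool-dump-database file-name=pool-database-backup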

To restore a completely failed pool:

  1. Install a fresh set of hosts. Do not pool them up at this stage.

  2. For the host nominated as the master, restore the pool database from your backup using the xe pool-restore-database command (see pool-restore-database).

  3. Connect to the master host using XenCenter and ensure that all your shared storage and VMs are available again.

  4. Perform a pool join operation on the remaining freshly installed member hosts, and start up your VMs on the appropriate hosts. A sketch of the corresponding commands follows this list.
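
As a sketch of steps 2 and 4, assuming placeholder file names and credentials:

    # On the nominated master: restore the pool database from backup
    xe pool-restore-database file-name=pool-database-backup
    # On each freshly installed member host: join the restored pool
    xe pool-join master-address=<master_address> master-username=<user> master-password=<password>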

Cope with failure due to configuration errors

If the physical host machine is operational but the software or host configuration is corrupted:

  1. Run the following command to restore host software and configuration:

    xe host-restore host=host file-name=hostbackup
    
  2. Reboot to the host installation CD and select Restore from backup.
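
The backup file used above would typically have been created earlier with the xe host-backup command, for example (the file name is a placeholder):

    # Create the host backup that xe host-restore later consumes
    xe host-backup host=host file-name=hostbackup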

Physical machine failure

If the physical host machine has failed, use the appropriate procedure listed below to recover.

Warning:

Any VMs running on a previous member (or the previous host) which have failed will still be marked as Running in the database. This is for safety: simultaneously starting a VM on two different hosts would lead to severe disk corruption. If you are sure that the machines (and VMs) are offline you can reset the VM power state to Halted:

  xe vm-reset-powerstate vm=vm_uuid --force

VMs can then be restarted using XenCenter or the CLI.

To replace a failed master with a still running member:

  1. Run the following commands:

    xe pool-emergency-transition-to-master
    xe pool-recover-slaves

  2. If the commands succeed, restart the VMs.

To restore a pool with all hosts failed:

  1. Run the command:

    xe pool-restore-database file-name=backup
    

    Warning:

    This command will only succeed if the target machine has an appropriate number of appropriately named NICs.

  2. If the target machine has a different view of the storage (for example, a block-mirror with a different IP address) than the original machine, modify the storage configuration using the pbd-destroy command and then the pbd-create command to recreate storage configurations. See pbd commands for documentation of these commands; a sketch of the sequence follows this list.

  3. If you have created a new storage configuration, use the pbd-plug command or the Storage > Repair Storage Repository menu item in XenCenter to use the new configuration.

  4. Restart all VMs.
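
For steps 2 and 3, a command sequence along these lines (the UUIDs and device-config keys are placeholders that depend on the SR type) recreates and plugs the storage configuration:

    # Find the stale PBD for the SR and remove it
    xe pbd-list sr-uuid=<sr_uuid>
    xe pbd-destroy uuid=<old_pbd_uuid>
    # Recreate the PBD with the corrected storage configuration
    xe pbd-create host-uuid=<host_uuid> sr-uuid=<sr_uuid> device-config:<key>=<value>
    # Plug the new PBD so the SR becomes available again
    xe pbd-plug uuid=<new_pbd_uuid>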

To restore a VM when VM storage is not available:

  1. Run the following command:

    xe vm-import filename=backup metadata=true
    
  2. If the metadata import fails, run the command:

    xe vm-import filename=backup metadata=true --force
    

    This command attempts to restore the VM metadata on a ‘best effort’ basis.

  3. Restart all VMs.
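
Halted VMs can be restarted from the CLI, for example:

    # List VMs that are currently halted
    xe vm-list power-state=halted
    # Start a VM, optionally on a specific host
    xe vm-start vm=<vm_name_or_uuid> on=<host>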