Cope with machine failures
This section provides details of how to recover from various failure scenarios. All failure recovery scenarios require the use of one or more of the backup types listed in Backup.
In the absence of HA, the master node detects the failure of a member by the absence of its regular heartbeat messages. If no heartbeat is received for 600 seconds, the master assumes the member is dead. There are two ways to recover from this problem:
Repair the dead host (e.g. by physically rebooting it). When the connection to the member is restored, the master will mark the member as alive again.
Shut down the host and instruct the master to forget about the member node by using the xe host-forget CLI command. Once the member has been forgotten, all the VMs which were running there are marked as offline and can be restarted on other Citrix Hypervisor servers. It is very important to ensure that the Citrix Hypervisor server is actually offline, otherwise VM data corruption might occur. Be careful not to split your pool into multiple pools of a single host by using xe host-forget, since this could result in them all mapping the same shared storage and corrupting VM data.
- If you are going to use the forgotten host as an active host again, perform a fresh installation of the Citrix Hypervisor software.
- Do not use the xe host-forget command if HA is enabled on the pool. Disable HA first, then forget the host, and then re-enable HA.
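Taken together, the forget procedure might look like the following sketch. The UUID values are placeholders, and the HA disable and re-enable steps apply only if HA is configured on the pool:

```shell
# Placeholder UUIDs for illustration only.
HOST_UUID=<failed-host-uuid>
HEARTBEAT_SR_UUID=<heartbeat-sr-uuid>

# If HA is enabled, disable it before forgetting the host.
xe pool-ha-disable

# Remove the dead member from the pool database.
# Make certain the host is genuinely offline first.
xe host-forget uuid=$HOST_UUID

# Re-enable HA afterwards, if it was in use.
xe pool-ha-enable heartbeat-sr-uuids=$HEARTBEAT_SR_UUID
```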
When a member Citrix Hypervisor server fails, there may be VMs still registered in the running state. If you are sure that the member Citrix Hypervisor server is definitely down, use the xe vm-reset-powerstate CLI command to set the power state of the VMs to halted. See vm-reset-powerstate for more details.
Incorrect use of this command can lead to data corruption. Only use this command if absolutely necessary.
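As a concrete sketch, assuming vm_uuid identifies a VM that was registered as running on the failed host:

```shell
# Force the recorded power state of the VM back to halted.
# Only run this once the failed host is confirmed offline.
xe vm-reset-powerstate vm=<vm_uuid> --force
```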
Before you can start VMs on another Citrix Hypervisor server, you must also release the locks on VM storage. Each disk in an SR can only be used by one host at a time, so it is essential to make the disk accessible to other Citrix Hypervisor servers once a host has failed. To do so, run the following script on the pool master for each SR that contains disks of any affected VMs:
/opt/xensource/sm/resetvdis.py host_UUID SR_UUID master
You need only supply the third string (“master”) if the failed host was the SR master (the pool master, or a Citrix Hypervisor server using local storage) at the time of the crash.
Be absolutely sure that the host is down before executing this command. Incorrect use of this command can lead to data corruption.
If you attempt to start a VM on another Citrix Hypervisor server before running the script above, then you will receive the following error message:
VDI <UUID> already attached RW.
Every member of a resource pool contains all the information necessary to take over the role of master if required. When a master node fails, the following sequence of events occurs:
If HA is enabled, another master is elected automatically.
If HA is not enabled, each member will wait for the master to return.
If the master comes back up at this point, it re-establishes communication with its members, and operation returns to normal.
If the master is really dead, choose one of the members and run the command xe pool-emergency-transition-to-master on it. Once it has become the master, run the command xe pool-recover-slaves and the members will now point to the new master.
If you repair or replace the server that was the original master, you can simply bring it up, install the Citrix Hypervisor server software, and add it to the pool. Since the Citrix Hypervisor servers in the pool are required to be homogeneous, there is no real need to make the replaced server the master.
When a member Citrix Hypervisor server is transitioned to being a master, also check that the default pool storage repository is set to an appropriate value. Do this by using the xe pool-param-list command and verifying that the default-SR parameter points to a valid storage repository.
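One way to perform this check is sketched below. It uses xe pool-param-get to read the single parameter, and the UUID lookup assumes the host belongs to exactly one pool:

```shell
# Look up the pool UUID, then read its default-SR parameter.
POOL_UUID=$(xe pool-list --minimal)
SR_UUID=$(xe pool-param-get uuid=$POOL_UUID param-name=default-SR)

# Confirm that the SR it names still exists and is valid.
xe sr-list uuid=$SR_UUID
```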
In the unfortunate event that your entire resource pool fails, you must recreate the pool database from scratch. Be sure to regularly back up your pool metadata using the xe pool-dump-database CLI command (see pool-dump-database).
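A minimal backup invocation might look like the following sketch; the file name and the off-host copy destination are placeholders:

```shell
# Dump the pool database to a dated file on the pool master.
xe pool-dump-database file-name=pool-db-$(date +%Y%m%d).dump

# Copy the dump off the host: a backup stored only on the
# pool master is lost along with it. (Destination is a placeholder.)
scp pool-db-*.dump backup-server:/backups/
```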
To restore a completely failed pool:
Install a fresh set of hosts. Do not join them into a pool at this stage.
For the host nominated as the master, restore the pool database from your backup using the xe pool-restore-database command (see pool-restore-database).
Connect to the master host using XenCenter and ensure that all your shared storage and VMs are available again.
Perform a pool join operation on the remaining freshly installed member hosts, and start up your VMs on the appropriate hosts.
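The steps above can be sketched as follows; the file name, master address, and credentials are placeholders:

```shell
# On the host nominated as the master, restore the pool database.
xe pool-restore-database file-name=pool-db-backup.dump

# On each remaining freshly installed host, join the pool.
xe pool-join master-address=<master-ip> master-username=root master-password=<password>
```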
Cope with failure due to configuration errors
If the physical host machine is operational but the software or host configuration is corrupted:
Run the following command to restore host software and configuration:
xe host-restore host=host file-name=hostbackup
Reboot to the host installation CD and select Restore from backup.
Physical machine failure
If the physical host machine has failed, use the appropriate procedure listed below to recover.
Any VMs running on a previous member (or the previous host) which have failed will still be marked as Running in the database. This is for safety: simultaneously starting a VM on two different hosts would lead to severe disk corruption. If you are sure that the machines (and VMs) are offline, you can reset the VM power state to halted:
xe vm-reset-powerstate vm=vm_uuid --force
VMs can then be restarted using XenCenter or the CLI.
To replace a failed master with a still running member:
Run the following commands:
xe pool-emergency-transition-to-master
xe pool-recover-slaves
If the commands succeed, restart the VMs.
To restore a pool with all hosts failed:
Run the command:
xe pool-restore-database file-name=backup
This command will only succeed if the target machine has an appropriate number of appropriately named NICs.
If the target machine has a different view of the storage (for example, a block-mirror with a different IP address) than the original machine, modify the storage configuration using the pbd-destroy command and then the pbd-create command to recreate storage configurations. See pbd commands for documentation of these commands.
If you have created a new storage configuration, use pbd-plug or the Storage > Repair Storage Repository menu item in XenCenter to use the new configuration.
Restart all VMs.
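As an illustration, the storage reconfiguration described above might look like the sketch below for an iSCSI SR whose target has moved. The UUIDs are placeholders, and the device-config keys shown (target, targetIQN, SCSIid) apply to iSCSI SRs; other SR types take different keys:

```shell
# Find and remove the stale PBD for the SR.
xe pbd-list sr-uuid=<sr-uuid>
xe pbd-unplug uuid=<old-pbd-uuid>
xe pbd-destroy uuid=<old-pbd-uuid>

# Recreate the PBD with the new storage configuration, then plug it.
xe pbd-create host-uuid=<host-uuid> sr-uuid=<sr-uuid> \
    device-config:target=<new-ip> device-config:targetIQN=<iqn> \
    device-config:SCSIid=<scsi-id>
xe pbd-plug uuid=<new-pbd-uuid>
```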
To restore a VM when VM storage is not available:
Run the following command:
xe vm-import filename=backup metadata=true
If the metadata import fails, run the command:
xe vm-import filename=backup metadata=true --force
This command attempts to restore the VM metadata on a ‘best effort’ basis.
Restart all VMs.