Citrix Hypervisor high availability enables VMs to restart automatically in the event of an underlying hardware failure or loss of any server. High availability is about making sure important VMs are always running in a resource pool. With high availability enabled, if one of your servers fails, its VMs restart on other servers in the same pool. This capability allows essential services to be restored with minimal service interruption in the event of system or component failure.
If the pool master server fails, Citrix Hypervisor high availability selects a new server to take over as master. Any server in a pool can be a master server. Citrix Hypervisor replicates the pool database constantly across all nodes. It also backs up the database to shared storage on the heartbeat SR for extra safety.
There are two key aspects to Citrix Hypervisor high availability:
- Reliably detecting server failure
- Computing a failure plan to enable swift recovery
Heartbeats for availability
Detecting server failure reliably is difficult since you need to remotely distinguish between a server disappearing for a while versus catastrophic failure. If high availability incorrectly decides a master server has broken down and elects a new master, there might be unpredictable results if the original server returns. Similarly, if a network issue causes the pool to split into two equal halves, we must ensure only one half accesses the shared storage and not both simultaneously. Citrix Hypervisor solves all these problems by having two mechanisms: a storage heartbeat and a network heartbeat.
When you enable high availability in a pool, you nominate an iSCSI, Fibre Channel, or NFS storage repository to be the heartbeat SR. Citrix Hypervisor automatically creates a couple of small virtual disks in this SR. The first disk is used by every server in the resource pool as a shared quorum disk. Each server allocates itself a unique block in the shared disk and regularly writes to the block to indicate that it is alive. When high availability starts up, all servers exchange data over both network and storage channels. This exchange tells each server which other servers it can see over both channels, and shows which I/O paths are working and which are not. The information is exchanged until a fixed point is reached and all servers in the pool agree about what they can see. When this agreement happens, high availability is enabled and the pool is protected. This arming process can take a few minutes to settle for larger pools, but is only required when you first enable high availability.
After high availability is active, each server regularly writes storage updates to the heartbeat virtual disk, and network packets over the management interface. Ensure that network adapters are bonded for resilience, and that storage interfaces are using dynamic multipathing where it is supported. This configuration ensures that any single adapter or wiring failures do not result in any availability issues.
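The per-server block on the quorum disk can be pictured with a small model. This is only an illustrative sketch, not the on-disk format or the timing that Citrix Hypervisor uses: the class name `QuorumDisk`, the `timeout` value, and the use of plain timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumDisk:
    """Toy model of a shared heartbeat disk: one block per server."""

    timeout: float = 30.0  # hypothetical staleness threshold, in seconds
    blocks: dict = field(default_factory=dict)  # server name -> time of last write

    def beat(self, server: str, now: float) -> None:
        # Each server only ever writes to its own block.
        self.blocks[server] = now

    def alive(self, now: float) -> set:
        # A server counts as alive if its block was refreshed recently enough.
        return {s for s, t in self.blocks.items() if now - t <= self.timeout}


disk = QuorumDisk(timeout=30.0)
disk.beat("host-a", now=0.0)
disk.beat("host-b", now=0.0)
disk.beat("host-a", now=25.0)  # host-b stops writing after t=0
print(disk.alive(now=40.0))    # only host-a is still within the timeout
```

In this model a server that stops writing simply ages out of the live set, which is the signal the rest of the pool acts on.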
The worst-case scenario for high availability is one where a server is thought to be off-line but is still writing to the shared storage. This scenario can result in corruption of persistent data. Citrix Hypervisor uses server fencing to prevent this situation. The server is automatically powered off and isolated from accessing any shared resources in the pool. Fencing prevents the failing server from writing to shared disks. This behavior prevents damage to the stored data during an automated failover, when protected virtual machines are being moved to other servers in the pool.
Servers self-fence (that is, power off and restart) in the event of any heartbeat failure unless any of the following hold true:
- The storage heartbeat is present for all servers but the network has partitioned (so that there are now two groups of servers). In this case, all servers that are members of the largest network partition stay running, and the servers in the smaller network partition self-fence. The assumption here is that the network outage has isolated the VMs, and they must be restarted on a server with working networking. If the network partitions are the same size, then only one of them self-fences according to a stable selection function.
- If the storage heartbeat goes away but the network heartbeat remains, the servers check to see if they can see all other servers over the network. If this condition holds true, the servers remain running on the assumption that the storage heartbeat server has gone away. This action doesn’t compromise VM safety, but any network glitches result in fencing, since that would mean both heartbeats have disappeared.
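The two exceptions above can be read as a single decision function. The sketch below is a simplified interpretation of the rules as described here, not the actual high availability daemon logic. The `HeartbeatView` fields and the `wins_tiebreak` flag (standing in for the stable selection function) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatView:
    storage_ok: bool               # this server still sees the storage heartbeat
    my_partition_size: int         # servers reachable over the network, including itself
    pool_size: int                 # total servers in the pool
    largest_other_partition: int   # size of the biggest partition this server is not in
    wins_tiebreak: bool = False    # stand-in for the stable selection function


def should_self_fence(view: HeartbeatView) -> bool:
    sees_everyone = view.my_partition_size == view.pool_size
    if view.storage_ok and sees_everyone:
        return False  # both heartbeats are healthy
    if view.storage_ok:
        # Network partition with storage heartbeat intact: the larger side survives.
        if view.my_partition_size != view.largest_other_partition:
            return view.my_partition_size < view.largest_other_partition
        return not view.wins_tiebreak  # equal halves: exactly one side survives
    # Storage heartbeat lost: stay up only if every other server is still visible.
    return not sees_everyone


# A 5-server pool split 3/2 with the storage heartbeat still present.
print(should_self_fence(HeartbeatView(True, 3, 5, 2)))  # False: larger partition stays up
print(should_self_fence(HeartbeatView(True, 2, 5, 3)))  # True: smaller partition fences
```

The last branch reflects the point made above: once the storage heartbeat is gone, any network glitch means both heartbeats are lost, so the server fences.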
Capacity planning for failure
The heartbeat system gives us reliable notification of server failure, and so we move on to the second step of high availability: capacity planning for failure.
A resource pool consists of several servers (say, 32), each with potentially different amounts of memory and a different number of running VMs. Citrix Hypervisor high availability dynamically computes a failure plan that calculates the actions to be taken on any server failure. This failure plan ensures that no single server failure makes it impossible to restart its VMs on another server (for example, due to insufficient memory on other servers). In addition to dealing with failure of a single server, Citrix Hypervisor high availability can deal with the loss of multiple servers in a pool. For example, high availability can handle the case where a network partition takes out an entire group of servers.
In addition to calculating what actions are taken, the failure plan considers the number of server failures that can be tolerated in the pool. There are two important considerations involved in calculating the high availability plan for a pool:
Maximum failure capacity. This value is the maximum number of servers that can fail before there are insufficient resources to run all the protected VMs in the pool. To calculate the maximum failure capacity, Citrix Hypervisor considers:
- The restart priorities of the VMs in the pool
- The number of servers in the pool
- The server CPU and memory capacity
Server failure limit. You define this value as part of the high availability configuration. It specifies the number of server failures to allow in the pool, within the plan. For example, when the server failure limit for a pool is 3, Citrix Hypervisor calculates a failover plan that allows any 3 servers to fail while all protected VMs can still run in the pool. You can configure the server failure limit to a value that is lower than the maximum failure capacity, making it less likely that the pool becomes overcommitted. This configuration can be useful in an environment with RBAC enabled. For example, this setting allows RBAC users with lower permissions than Pool Operator to bring more VMs online without breaking the high availability plan. For more information, see the High availability and Role-Based Access Control (RBAC) section.
A system alert is generated when the maximum failure capacity value falls below the value specified for the server failure limit.
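To make the relationship between the two values concrete, the following sketch brute-forces a maximum failure capacity for a tiny pool and compares it with a server failure limit. It is a deliberately simplified model, not the planner Citrix Hypervisor uses: it considers memory only (no CPU or restart priorities), treats every listed VM as protected, and places displaced VMs with a simple "most free memory first" heuristic. The function names and the pool layout are assumptions made for the example.

```python
from itertools import combinations


def can_restart(hosts, failed):
    """hosts maps name -> (total_memory_gib, [memory of each protected VM])."""
    free = {name: total - sum(vms)
            for name, (total, vms) in hosts.items() if name not in failed}
    # Place the biggest displaced VMs first, each on the survivor with most free memory.
    displaced = sorted((mem for name in failed for mem in hosts[name][1]), reverse=True)
    for mem in displaced:
        if not free:
            return False
        target = max(free, key=free.get)
        if free[target] < mem:
            return False
        free[target] -= mem
    return True


def max_failure_capacity(hosts):
    """Largest k such that any k servers can fail and every VM still fits (in this model)."""
    capacity = 0
    for k in range(1, len(hosts)):
        if all(can_restart(hosts, set(combo)) for combo in combinations(hosts, k)):
            capacity = k
        else:
            break
    return capacity


pool = {"a": (16, [4, 4]), "b": (16, [4]), "c": (16, [8])}
capacity = max_failure_capacity(pool)  # 1: losing both a and b leaves too little memory on c
server_failure_limit = 2
if capacity < server_failure_limit:
    print(f"alert: capacity {capacity} is below the server failure limit {server_failure_limit}")
```

Because the placement step is a heuristic, this model can underestimate the true capacity. It is meant to show the shape of the calculation rather than its exact result.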
When high availability is first enabled on a pool, a failure plan is calculated based on the resources available then. Citrix Hypervisor high availability dynamically calculates a new failure plan in response to events which would affect the pool, for example, starting a new VM. If a new plan cannot be calculated due to insufficient resources across the pool, the pool becomes overcommitted. Examples of insufficient resources might be not enough free memory or changes to virtual disks and networks that affect which VMs might be restarted on which servers.
High availability restart priority is used to determine which VMs to start when a pool is overcommitted. When you configure the restart priority for the VMs you want to protect in the HA Configuration dialog box or in the Configure HA wizard, the maximum failure capacity for the pool is recalculated dynamically. This information enables you to try various combinations of VM restart priorities depending on your business needs. You can see if the maximum failure capacity is appropriate to the level of protection you need for the critical VMs in the pool.
If you attempt to start or resume a VM and that action would cause the pool to be overcommitted, a warning is displayed in XenCenter. The message can also be sent to an email address, if configured. You are given the option to cancel the operation, or proceed anyway, causing the pool to become overcommitted.
Working with an HA-enabled pool
The best practice for high availability is not to make configuration changes to the pool while high availability is enabled. High availability is intended to be the “2am safeguard” that restarts servers in the event of a problem when there isn’t a human administrator nearby. If you are actively making configuration changes in the pool, such as applying software updates, disable high availability during these changes.
- If you try to shut down a protected VM from XenCenter, XenCenter offers the option of removing the VM from the failure plan and then shutting it down. This option ensures that accidental VM shutdowns do not result in downtime, but that you can still stop a protected VM if you really want to.
- If you must reboot a server when high availability is enabled, XenCenter automatically uses the VM restart priorities to determine if this reboot invalidates the pool failure plan. If it doesn’t affect the plan, then the server is shut down normally. If the plan is violated, but the maximum failure capacity is greater than 1, XenCenter provides the option of lowering the pool’s server failure limit by 1. This action reduces the overall resilience of the pool, but always ensures that at least one server failure is tolerated. When the server comes back up, the plan is automatically recalculated and the original server failure limit is restored if appropriate.
- When you install software updates using the Install Update wizard, you must disable high availability on the pool by selecting Turn HA off. You can re-enable high availability after the update has been installed. If you do not disable high availability, the update does not proceed. Monitor the pool manually while updates are being installed to ensure that server failures do not disrupt the operation of the pool.
- When high availability is enabled, some operations that can compromise the plan for restarting VMs might be disabled, such as removing a server from a pool. To perform these operations, temporarily disable high availability or shut down the protected VMs before continuing.
High availability and role-based access control (RBAC)
In Citrix Hypervisor environments where role-based access control (RBAC) is implemented, not all users are permitted to change a pool’s high availability configuration settings. For example, VM Operators do not have sufficient permissions to adjust the failover capacity for an HA-enabled pool. If starting a VM reduces the maximum number of allowed server failures to a value lower than the current value, a VM Operator cannot start the VM. Only Pool Administrator or Pool Operator-level users can configure the number of server failures allowed.
In this case, the Pool Administrator or Pool Operator can set the server failure limit to a number that is lower than the maximum number of failures allowed. This setting creates slack capacity and so ensures that less privileged users can start up new VMs. It reduces the pool’s failover capacity without threatening the failure plan.
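The effect of slack capacity on less privileged users can be sketched as a simple check. This is an illustration of the rule described above, not the product's RBAC implementation. The function `may_start_vm` and its parameters are hypothetical, and the role names are used only as plain strings.

```python
# Roles that, per the text above, can configure the number of server failures allowed.
HA_CONFIG_ROLES = {"Pool Administrator", "Pool Operator"}


def may_start_vm(role: str, capacity_after_start: int, server_failure_limit: int) -> bool:
    """Return True if starting the VM is permitted for this role (simplified model)."""
    if capacity_after_start >= server_failure_limit:
        return True  # the failure plan still holds, so any role may proceed
    # Starting the VM would break the plan: only roles that can change HA settings may go on.
    return role in HA_CONFIG_ROLES


# Maximum failure capacity is 3, but the limit is set to 2, leaving one unit of slack.
print(may_start_vm("VM Operator", capacity_after_start=2, server_failure_limit=2))  # True
print(may_start_vm("VM Operator", capacity_after_start=1, server_failure_limit=2))  # False
```

With the limit set below the maximum failure capacity, the first start succeeds for a VM Operator because the plan for 2 failures is still satisfiable.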