HA is a set of automatic features designed to plan for, and safely recover from, issues which take down XenServer hosts or make them unreachable. For example physically disrupted networking or host hardware failures.
HA is designed to be used in conjunction with storage multipathing and network bonding to create a system which is resilient to, and can recover from hardware faults. HA should always be used with multipathed storage and bonded networking.
Firstly, HA ensures that in the event of a host becoming unreachable or unstable, VMs which were known to be running on that host are shut down and able to be restarted elsewhere. This avoids the scenario where VMs are started (manually or automatically) on a new host and at some point later, the original host is able to recover. This scenario could lead two instances of the same VM running on different hosts, and a corresponding high probability of VM disk corruption and data loss.
Secondly, HA recovers administrative control of a pool in the event that the pool master becomes unreachable or unstable. HA ensures that administrative control is restored automatically without any manual intervention.
Optionally, HA can also automate the process of restarting VMs on hosts which are known to be in a good state without manual intervention. These VMs can be scheduled for restart in groups to allow time to start services. This allows infrastructure VMs to started before their dependent VMs (For example, a DHCP server before its dependent SQL server).
A pool is overcommitted if the VMs that are currently running could not be restarted elsewhere following a user-defined number of host failures.
This would happen if there was not enough free memory across the pool to run those VMs following failure. However there are also more subtle changes which can make HA guarantees unsustainable: changes to Virtual Block Devices (VBDs) and networks can affect which VMs may be restarted on which hosts. Currently it is not possible for XenServer to check all actions before they occur and determine if they will cause violation of HA demands. However an asynchronous notification is sent if HA becomes unsustainable.
XenServer dynamically maintains a failover plan which details what to do if a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of HA configuration. This determines the number of failures that is allowed without any loss of service. For example, if a resource pool consisted of 16 hosts, and the tolerated failures is set to 3, the pool calculates a failover plan that allows for any 3 hosts to fail and still be able to restart VMs on other hosts. If a plan cannot be found, then the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. Alerts are sent (either through XenCenter or e-mail) if changes (for example the addition on new VMs to the pool) cause your pool to become overcommitted.
If you attempt to start or resume a VM and that action causes the pool to be overcommitted, a warning alert is raised. This warning is displayed in XenCenter and is also available as a message instance through the Xen API. The message may also be sent to an email address if configured. You will then be allowed to cancel the operation, or proceed anyway. Proceeding causes the pool to become overcommitted. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.
If a server failure occurs such as the loss of network connectivity or a problem with the control stack is encountered, the XenServer host self-fences to ensure that the VMs are not running on two servers simultaneously. When a fence action is taken, the server immediately and abruptly restarts, causing all VMs running on it to be stopped. The other servers will detect that the VMs are no longer running and the VMs will be restarted according to the restart priorities assign to them. The fenced server will enter a reboot sequence, and when it has restarted it will try to re-join the resource pool.