
Design Decision: Disaster Recovery Planning

  • Contributed By: Michael Shuster, Gerhard Krenn. Special Thanks To: Kevin Nardone, Martin Zugec, Jeremy Ingram, Rob Zylowski, Zeljko Macanovic

Overview

This guide assists with disaster recovery (DR) architecture planning and considerations for both on-premises and cloud deployments of Citrix Virtual Apps and Desktops.
DR is a broad and significant topic in its own right. Citrix acknowledges this document isn't a comprehensive guide to an overall DR strategy. It does not consider all aspects of DR and sometimes takes a layman’s perspective on various DR concepts.

This guide is intended to address the following considerations, which have a significant impact on an organization’s Citrix architecture, and to provide architectural guidance related to them:

  • Business Requirements
  • Disaster Recovery versus High Availability
  • Disaster Recovery Tier Classifications
  • Disaster Recovery Options
  • Disaster Recovery in the Cloud
  • Operational Considerations

Business Requirements

Aligning to the “Business Layer” of the Citrix design methodology, gathering precise business requirements and known constraints for service recovery is the starting point in developing any recovery plan for Citrix. This step establishes the scope and requirements that inform the DR strategy (discussed later) best suited to meet the business and functional requirements within the constraints determined by the company or IT department.

Here are a few examples of useful questions to discuss when doing DR planning. They are important for identifying business requirements and potential constraints. This list is a set of common baseline questions, and organizations will have other questions to answer to align with their requirements. These questions are discussed in more detail in the Disaster Recovery Options section later in the document:

  • Backup Strategy. Backups can support DR, but DR is not a replacement for backups, which are still needed during DR either as part of the plan or as a hedge against corrupted replicated data. What backup strategy is used today for servers? Including:

    • Backup frequency?
    • Retention period?
    • Offsite storage?
    • Backup testing methodology?
  • Recovery Time Objective (RTO). Will the Citrix platform be immediately available in the event of a disaster or brought online within a specific time period? (See Disaster Recovery Tier Classifications). RTO varies based on application or service criticality classification, but generally, Citrix’s RTO must match or be superior to the lowest RTO of any app hosted on Citrix, and Citrix’s dependencies (network, storage, compute, and so on) must similarly match or be superior to Citrix’s, or the downstream RTOs of hosted apps can be jeopardized. It is worth including the application back-ends that Citrix-hosted apps connect to in the discussion to align expectations.

  • Recovery Point Objective (RPO). What degree of data loss is considered tolerable for the organization? This can vary depending on the classification of the applications in question. How old can the recovered data for the service be? 0 minutes? An hour? A month? Within the context of the Citrix infrastructure, this consideration typically applies only to database changes and user data (profiles, folder redirection, and so on). As with RTO, it is worth considering the application back-ends for Citrix-hosted apps.

  • Scope of Recovery. This consideration is not limited to the Citrix infrastructure; it can include user data or key application servers with which Citrix-hosted application clients interface. It is important to identify whether there is going to be a disparity between the time to recover the Citrix platform and the time to recover applications hosted on Citrix. The time delta can prolong an outage, as only part of the solution is usable until all dependencies are recovered.

  • Use Cases. Citrix platforms often service many use cases, each with different levels of business criticality. Does recovery cover all Citrix use cases, or only key use cases upon which the business’s operational success is fully dependent? The answer has a large impact on infrastructure scoping and capacity projections. Tiering and prioritization of use cases is recommended to compartmentalize the enablement of DR if it is a net new capability for the Citrix environment.

  • Capabilities. Are there any key capabilities that do not need to be included in DR, which can reduce cost? An example would be the use of persistent desktops versus non-persistent; some VDI use cases can be served by Hosted Shared Desktops or even specific application silos. Organizations have at times indicated component redundancy (ADCs, Controllers, StoreFront, SQL, and so on) is not considered necessary in DR. Such decisions can be viewed as a risk if there is a prolonged outage of production.

  • Existing DR. Does an existing DR strategy or plan exist for other Citrix environments and other infrastructure services? Does it recover into new routable subnets or into a “bubble” mirroring production networks? The answer can help give visibility into current DR approaches, tools, or precedents.

  • Technology Capabilities. This consideration need not dictate the final recovery strategy for Citrix. However, it's important to understand:

    • Recovery: What storage replication technologies and VM recovery technologies (VMware SRM, Veeam, Zerto, Azure Site Recovery (ASR), and so on) are available within the organization? Some Citrix components, alongside Citrix dependencies, can use them.

    • Citrix: What Citrix technologies are in use for image management and access? Some components do not lend themselves well to certain recovery strategies. Non-persistent environments managed by MCS or PVS make poor candidates for VM replication technologies because of their tight integration with storage and networking.

  • Data Criticality. Are user profiles or user data considered critical to recovery? For persistent VDI, if this data is unavailable when DR is invoked, would this cause a significant impact on productivity? Or can a non-persistent profile or VDI be used as a temporary solution? This decision can drive up costs and solution complexity.

  • Types of Disasters. This decision is fairly significant but can also be pre-established through a corporate BC or DR plan. This requirement often dictates recovery location. Is the strategy intended to accommodate recovery of service at the application level only? Between hardware racks? Within a metro location? Between two geographies? Two countries? Within a cloud provider region or an entire cloud provider’s infrastructure (multi-region)? Between cloud providers (multi-cloud)?

  • Client Users. Where are the users located who access the services in production? This decision can have implications as to where service is restored to, including corporate network connectivity which can require manual modifications during a DR event. In addition, the answer dictates access tier considerations.

  • Network Bandwidth. How much traffic (ICA, application, file services) does each user session consume on average? This decision applies both to cloud and on-prem recovery.

  • Fallback (or Failback). Does the organization have pre-existing plans for how to return to service in production once the DR situation has been resolved? How does the organization recover to a normal state? How is new data that might have been created in DR reconciled and combined with production?

Constraints

Constraints limit DR design options or their configurations. They come in many forms but can include:

  • Regulatory or Corporate Policy. This decision can dictate what technologies can or can't be used for replication or recovery, how they are used, RTO, data sovereignty, or minimum distance between facilities.

  • Infrastructure. Is there a directive on the type of hardware to be used, the network bandwidth available, and so on? These considerations can impact RPO or can even present risks. For example:

    • Undersized firewalls or network pipes in DR can ultimately lead to outages as network dependencies can't handle the DR workload.
    • In the case of compute, undersized hypervisor hosts or use of different hypervisors entirely can impact options.
    • Regarding networking, the recovery site might have more constrained network throughput capability than production when servicing the WAN.
    • In cloud environments, especially when scaling into thousands of seats, this potential risk can quickly become a significant bottleneck because of component service limitations such as virtual firewalls, VPN gateways, and cloud-to-WAN uplinks (AWS Direct Connect, Azure ExpressRoute, Google Cloud Interconnect, and so on).
    • Also, in cloud environments, appropriate consideration must be given to the parity of services available in the cloud regions designated for DR. Lessons learned from the field tell us that although a public cloud region can initially seem appealing from a hosting cost perspective, important features such as Availability Zones, instance types, or particular storage features, for example, are not available in all cloud regions. A good rule of thumb is to assess whether the features and capabilities planned for use are available in all regions planned for deployment.
  • Budget. DR solutions come with costs that can conflict with project budgets. More often than not, the shorter the RTO, the higher the cost.

  • Geography. If a designated DR facility has been identified, considerations such as latency from production to DR facilities, in addition to users connecting to DR must be understood.

  • Data Residency or Data Classification. This decision can limit options for locales and technologies or recovery methods.

  • Cloud Recovery. Is there a mandate for infrastructure to be recovered in the cloud versus in an on-prem facility?

  • Capacity. Does the DR location have sufficient capacity to handle the load required to meet the DR requirements? If recovering to the cloud, is the volume of workloads needing to recover realistic within the RTO if not pre-provisioned or covered with guaranteed capacity entitlements?

  • Application Readiness. Do the applications hosted on Citrix that have back-end dependencies have a DR plan, and how do their RTOs line up with Citrix’s target RTO? Citrix can be designed for rapid recovery, but if core applications do not have a recovery plan, or have one with an extended RTO, this gap likely limits the usefulness of the Citrix platform in DR.

  • Network Security. Does the organization have security policies that might dictate which traffic segments require encryption in transit? Does this consideration vary depending on the network link being traversed? The answer might require the use of SSL VDA, ICA Proxy, GRE, or IPsec VPN encapsulation of ICA traffic in DR scenarios if network flows for DR differ from production.

Misconceptions on Disaster Recovery (DR) Design

One of the most common misconceptions for Citrix DR designs is that a single control plane spanning two data centers makes up DR. It does not.
A single Citrix Site or PVS Farm spanning data centers, even nearby ones, constitutes a high availability design but not a DR design.
Some organizations choose this path as a business decision, valuing simplified management over DR capability. StoreFront Server Groups spanning data centers are supported only between well-connected (low latency) data centers. Similarly, PVS has documented latency maximums to consider when deploying across data centers as outlined in CTX220651.

The following diagram is an example of a multi-data center HA Citrix architecture that is not, however, a DR platform for the previously mentioned reasons. This reference architecture is used by organizations who treat two physically separate data center facilities as a single logical data center because of their close proximity to each other and low latency, high bandwidth interconnectivity. A DR plan and environment for Citrix would still be recommended, as this architecture alone is unlikely to satisfy the needs of both HA and DR.

design-decisions_cvad-disaster-recovery_001.png

In addition to the HA conceptual reference above, HA architectures based on zones are available for organizations who want to provide cross-geo redundancy but do not require the platform to be fully recoverable. The zones concept allows for various intra-site redundancy capabilities which can solve challenges for multi-location deployments.

In a Site architecture that uses the zone preference feature, identical Citrix resources are deployed across several zones and aggregated into a single Delivery Group. Zone preference (with the ability to fail back to other zones for a given resource) can be controlled based on the application, the user’s home zone, or the user's location. Refer to Zone Preference Internals for more information on this subject. The VDA registration auto-update feature allows VDAs to maintain a locally cached, up-to-date list of all Controllers within a Site. This function allows VDAs within a satellite zone to fail over registration to Controllers in the primary zone in the event their local Delivery Controllers or Cloud Connectors become unavailable. StoreFront Server Groups are present in each zone location (or at least in two zones), as is recommended, so users can continue accessing resources from their local access tier in the event the primary zone is unavailable.

design-decisions_cvad-disaster-recovery_002-01.png

In contrast to these supported HA options, the following diagram illustrates unsupported or ill-advised component architectures spanning geographically distant data centers or cloud regions. These architectures provide neither effective HA nor DR because of the distance and latency between components. Platform stability issues can also result from latency in such a deployment. Furthermore, stretching administrative boundaries between sites does not align with DR principles. We have seen similar conceptual designs by organizations in the past.

design-decisions_cvad-disaster-recovery_002.png

Another common misconception concerns the depth of consideration that a Citrix DR solution must include. Citrix Virtual Apps and Desktops at its core is primarily a presentation and delivery platform. The Citrix control plane depends on other components (networking, SQL, Active Directory, DNS, RDS CALs, DHCP, and so on) whose own availability and recovery design must meet the DR requirements of the Citrix platform. Any shared administrative domain of technologies on which Citrix is dependent can impact the efficacy of the Citrix DR solution.

Of similar importance, and occasionally not fully accounted for by organizations, are recovery considerations for file services (user data and business data on shares, and so on) and application back-ends with which Citrix-hosted applications and desktops interface. Referencing the earlier point on application readiness, a Citrix DR platform can be designed to recover in short timespans. However, if these core use case dependencies do not have a recovery plan with an RTO similar to Citrix's, or no recovery plan at all, the result can be Citrix failing over in DR as expected but users being unable to perform their job functions because these dependencies remain unavailable. Take for instance a hospital that hosts its core EMR application on Citrix. A DR event occurs, Citrix fails over to an “always on” Citrix platform in DR and users are connecting, but the EMR application is recovered via more manual means which can take hours or days to complete. Clinical staff would likely be forced to use offline processes (pen and paper) during this time. Such an outcome would not be congruent with the overall expectation of the business for recovery time or user experience.

The next section will discuss DR versus HA in more detail.

Disaster Recovery (DR) versus High Availability (HA)

Understanding HA versus DR is critical in aligning with organizational needs and meeting recovery objectives. HA is not synonymous with DR, but DR can use HA. This guide interprets HA and DR as follows:

  • HA is regarded as providing fault tolerance for a system (service, application, and so on) or component. A system can fail over to another system with minimal disruption to a user. The solution can range from a simple clustered or load-balanced application to a more complex active/standby “hot” or “always on” infrastructure that mirrors the production configuration and is always available. Failover tends to be automated in these configurations, which can also be termed a “geo-redundant” deployment.
  • DR is concerned primarily with the recovery of service (due to significant app-level failure or catastrophic physical failure of the data center) to an alternate data center (or cloud platform). DR tends to involve automated or manual recovery processes. Steps and procedures are documented to orchestrate recovery. It is not concerned with redundancy or fault tolerance of the components and is generally a broader-scope strategy that withstands multiple types of failures.

While HA tends to be embedded into design specifications and solutions deployment, DR is largely concerned with the orchestration planning of IT personnel and infrastructure resources to invoke recovery of the service.

Can HA include DR? Yes. For enterprise mission-critical IT services, this concept is common. Take, for instance, the second example in the earlier HA description; coupled with appropriate recovery of other non-Citrix components related to the solution, it would be regarded as a highly available DR solution wherein a service (Citrix) fails over to an opposing data center. That architecture can be implemented as Active-Passive or Active-Active in various multi-site iterations, such as Active-Passive for all users, Active-Active with users preferring one data center over another, or Active-Active with users being load balanced between two data centers without preference. It is important to note that when designing such solutions, standby capacity must be considered, accounted for, and load monitored on an ongoing basis to ensure that capacity remains available to accommodate DR if it is required.

It is also essential that DR components be kept up to date with production to maintain the integrity of the solution. This activity is often overlooked by organizations who design and deploy such a solution with the best of intentions, then start consuming more platform resources in production, and forget to increase available capacity to keep the DR integrity of the solution.

In the context of Citrix, spanning Citrix administrative domains (Citrix Site, PVS Farm, and so on) between two data centers, such as two geo-localized facilities, would not constitute DR. For some components, such as StoreFront Server Groups per published guidance, and similarly PVS because of its database latency limitations per published guidance, such a deployment is also not recommended. Supportability constraints between geos also extend to Citrix Site and Zone design, because of latency maximums between satellite controllers and the databases per published guidance.

As many Citrix components share dependencies such as databases, stretching administrative boundaries between the two data centers would not protect against several key failure scenarios. If databases became corrupt, the failure domain would impact application services at both facilities. To regard an HA Citrix solution as being sufficient for DR, we recommend that the second facility does not share key dependencies or administrative boundaries. For instance, create separate Sites, Farms, and Server Groups for each data center in the solution. By keeping the recovery platform as independent as possible, we reduce the risk of component-level failures affecting both production and DR environments. This consideration also extends beyond Citrix; using different service accounts, file services, DNS, NTP, hypervisor management, certificate authorities (FAS dependency), and authentication services (AD, RADIUS, and so on) between production and DR environments is also recommended to reduce single points of failure.

Disaster Recovery Tier Classifications

Tier classifications for DR are an important aspect of an organization's DR strategy as they provide clarity into application or service criticality which in turn dictates the RTO (and thus costs for accomplishing that level of recovery). Alternatively, some organizations establish supported RTO (backed by technology solutions or architectural models) and assign tier classifications to them. Generally, the shorter the RTO, the higher the DR solution cost. Being able to break down various inter-dependencies into different classifications (based on business criticality and RTO) can help optimize cost-sensitive DR cases.

The following is a set of sample DR tier classifications to serve as a reference when assessing the Citrix infrastructure services, its dependencies, and critical applications or use cases (associated with VDA workloads) hosted on Citrix. DR tiers are outlined in order of recovery priority with 0 being the most critical. Organizations are encouraged to apply or develop a DR tiering classification aligning with their own recovery objectives and classification needs.

Disaster Recovery Tier 0 (Example)

Tier 0 - Recovery Objectives

Recovery Time Objective: 0

Recovery Point Objective: 0

Tier 0 - Classification

Core IT Infrastructure

  • Network and security infrastructure
  • Network links
  • Hypervisors and dependencies (compute, storage)

Core IT Services

  • Active Directory
  • DNS
  • DHCP
  • File Services
  • RDS Licensing
  • CA Services (if using FAS)
  • SQL Servers (on-prem WEM or Citrix Virtual Apps and Desktops, PVS, Session Recording)
  • Critical End User Services (Citrix)

Tier 0 - Notes

This classification is largely for core infrastructure components. These components are always available in the DR location, as they are dependencies for other tiers, and are not placed in an isolated network segment. They need to be provisioned and maintained alongside their production equivalents. If Citrix is hosting critical applications, it’s likely regarded as a Tier 0 platform. In such scenarios, the Citrix infrastructure is deployed “hot” in DR.

Disaster Recovery Tier 1 (Example)

Tier 1 - Recovery Objectives

Recovery Time Objective: 15 Minutes

Recovery Point Objective: 1 Hour

Tier 1 - Classification

Critical Applications

  • Websites and web apps
  • ERPs and CRM applications
  • Line-of-business applications

Citrix Use Cases

  • Management
  • Call Centers
  • Customer Service or Sales
  • IT Engineering and Support

Tier 1 - Notes

Applications or virtual desktops that the business depends on to carry on core business activities would typically be contained in this tier. They would likely also employ a similar “hot standby” DR architecture as Tier 0 or benefit from an automated replication and failover solution. If provisioning into the cloud, considerations discussed later need to be accounted for, as they can impact RTO targets.

Disaster Recovery Tier 2 (Example)

Tier 2 - Recovery Objectives

Recovery Time Objective: 4 Hours

Recovery Point Objective: 1 Day

Tier 2 - Classification

Non-critical Applications

Non-critical Citrix use cases

User Data that does not impact the functionality of Tier 1 or Tier 0 apps

Tier 2 - Notes

Applications or use cases which are key to business operations, but whose short-term unavailability is unlikely to cause serious financial, reputational, or operational impacts. These applications are either recovered from backups or recovered as the lowest priority by automated recovery tools.

Disaster Recovery Tier 3 (Example)

Tier 3 - Recovery Objectives

Recovery Time Objective: 5 Days

Recovery Point Objective: 1 Week

Tier 3 - Classification

Infrequently Used Applications

Tier 3 - Notes

Applications with negligible outage impact that can remain unavailable for up to one week. These applications are likely recovered from backups.

Disaster Recovery Tier 4 (Example)

Tier 4 - Recovery Objectives

Recovery Time Objective: 30 Days

Recovery Point Objective: 24 Hours or None

Tier 4 - Classification

Non-production environments

Tier 4 - Notes

Applications, infrastructure, and VDIs whose outages also have a negligible impact on business operations and can be restored over time. Depending on their nature, they can have an extended RPO or none at all. These workloads can be recovered from backups or built brand new in DR and are regarded as the last tier to be recovered.

Why is Tier classification important for Citrix DR planning?

Tier classification is important for Citrix for the following reasons:

  • How critical is the Citrix infrastructure to business operations? Even if Citrix itself is not regarded as essential, if an application it hosts is, the Citrix infrastructure becomes classified as critical.
  • The use cases or core business applications that Citrix hosts. Do they have different DR tier classifications?

For the first question, many enterprise Citrix deployments tend to be classified as Tier 0 because of the delivery of critical apps or desktops, placing them in the “always on” tier alongside network, Active Directory, DNS, and hypervisor infrastructure. This is not the case, however, for every Citrix use case. Treating every Citrix use case as Tier 0 when some can fall into Tier 1 or higher can impact the overall cost and complexity of the DR process.

The second question stresses classification per Citrix use case and is most significant in cloud environments, which are discussed later in more detail. In the cloud, there are significant cost differences between accommodating “always on” application or desktop platforms versus applications or desktops regarded as less business critical. Such considerations can also influence application or use case isolation (silos) in production to take advantage of deployment flexibility in a DR platform.

When establishing a DR design for Citrix, bringing the discussion beyond the scope of Citrix itself helps set expectations for business units. For example, a Citrix environment is considered an “always on” service and is made highly available in an alternate data center, yet the critical back-ends that Citrix’s hosted applications depend upon remain unavailable for several hours while recovery is invoked. This gap creates a recovery time disparity between the two platforms and can provide a misleading user experience during the recovery: Citrix is available immediately, but critical applications are not functional. Setting expectations at the outset offers appropriate visibility to all stakeholders on the recovery experience. In some situations, a customer might want to keep Citrix on hot standby (always on) in the opposing facility but manually control the failover of the access tier to avoid misunderstandings on platform availability.
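To make the relationship between hosted use cases and the platform’s required RTO concrete, the following is a minimal sketch in Python. It reuses the sample tier values from the tables above and assumes a hypothetical inventory of use cases; adapt both to your own tiering model.

```python
from datetime import timedelta

# Tier table mirroring the sample classifications above; values are illustrative only.
DR_TIERS = {
    0: {"rto": timedelta(0),          "rpo": timedelta(0)},
    1: {"rto": timedelta(minutes=15), "rpo": timedelta(hours=1)},
    2: {"rto": timedelta(hours=4),    "rpo": timedelta(days=1)},
    3: {"rto": timedelta(days=5),     "rpo": timedelta(weeks=1)},
    4: {"rto": timedelta(days=30),    "rpo": timedelta(hours=24)},
}

# Hypothetical inventory of use cases hosted on Citrix and their assigned tiers.
hosted_use_cases = {"EMR": 1, "Call Center": 1, "Back-office reporting": 2, "Training lab": 4}

def required_platform_rto(use_cases: dict[str, int]) -> timedelta:
    """The Citrix platform's RTO must match or beat the most aggressive
    (lowest) RTO of any use case it hosts."""
    return min(DR_TIERS[tier]["rto"] for tier in use_cases.values())

print(required_platform_rto(hosted_use_cases))  # 0:15:00 for this example inventory
```

The same logic extends to Citrix’s dependencies (network, storage, authentication, and so on), whose RTOs must in turn match or beat the platform’s.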

Disaster Recovery Options

In this section, common Citrix recovery strategies are outlined, including their pros and cons and key considerations. Other recovery capabilities or variations of the following themes for Citrix are possible; this section focuses on some of the most common. In addition, this section illustrates how responses to the core questions indicated earlier influence the DR design.

Correlating to several of the earlier DR questions, the following question topics have Citrix DR design implications as follows:

  • Backup Strategy and Recovery Time Objective (RTO). If Citrix infrastructure or applications served from Citrix are considered mission critical, Citrix likely must employ an “always on” model where hot standby or Active-Active Citrix infrastructure is present at two or more data centers. This architecture results in multiple independent Citrix control planes being created at each data center (separate Citrix Sites, PVS Farms if applicable, StoreFront Server Groups, and so on). Spanning the control planes for the Citrix infrastructure does not constitute DR (for example, spanning a Site across data centers or using satellite zones); doing so when the business expects a DR platform for Citrix runs counter to that mandate.
  • Scope of Recovery. If Citrix is deployed in a hot standby (always on) capacity in DR but the core business application back-ends take, for example, 8 hours to recover, it might not make sense to employ automated access tier failover. Manual failover of the access tier can be more appropriate.
  • Use Cases. If only critical workloads or core applications must be rapidly recovered in Citrix to maintain business operations in a DR scenario, this scope can likely reduce DR costs from a capacity perspective. If multiple use cases of differing priorities need recovery, classifying the importance of each use case against business impact per the recovery tiering strategy discussed earlier might not reduce capacity costs, but it provides IT staff a more focused recovery priority order.
  • Capabilities. If certain components such as HDX Insight and Session Recording are not considered critical to DR platform operation, their omission in the DR environment removes some complexity and maintenance overhead. Similarly, if many use cases in a DR scenario can subsist on a simpler and more generic Citrix delivery option in DR, this can also reduce complexity and costs. For example, using a Hosted Shared Desktop instead of Pooled VDI or Persistent VDI if technically feasible (in the context of rapid and cost-effective service recovery in a DR scenario), or aggregating more use cases into one, provided doing so is not detrimental to business operations.

    design-decisions_cvad-disaster-recovery_000.png

  • Existing DR. If the existing DR strategy of the organization, for example, recovers Citrix and other application infrastructure using data replication and orchestration tools, most Citrix components can be suitable for this model. If the size of the platform and reliance on single-image management technology are production platform requirements, then such technologies are often not suitable; a hybrid approach of a hot standby Citrix platform and perhaps replication of master images can be more appropriate.
  • Data Criticality. If profiles are considered critical for recovery in DR, appropriate replication technology can be necessary. Many organizations are less concerned with profiles in DR scenarios and accept them being created again. This consideration will also apply to user data accessed in Citrix (folder redirection, mapped drives); RPO and RTO for this data can be a business decision. Furthermore, if many persistent VDIs are considered important enough to remain intact in DR (versus requiring users to reinstall their software, and so on) then these VMs must be considered for replication, which can incur extra costs to support recovery.
  • Types of Disasters. The extent to which a customer wants to protect against failure can dictate various architectural changes. If the customer only wants HA of the Citrix infrastructure within the data center or cloud region, this can merely require ensuring management components are both redundant and operating on opposing infrastructure. As an example, a StoreFront server pair can use VMware anti-affinity rules to operate on different hosts in a cluster, or on different clusters entirely within the data center, or perhaps as part of different availability sets. This situation seldom requires creating entirely redundant control planes (multiple Citrix Sites and StoreFront Server Groups, for example). However, if DR is intended to encompass multiple data centers regardless of region, then redundant control planes using local dependencies at each data center (AD, DNS, SQL, hypervisor, and so on) are more appropriate. If the customer is global and employs multiple data centers to service Citrix (or plans to) with respective application back-ends local to these data centers, then a geo-localized Active-Active HA-DR architecture is more likely to be appropriate. This architecture provides users with the most optimal user experience, when possible, by using geo-localized Citrix infrastructures which can, if need be, fail over to a backup data center in a secondary preference order.
  • Client Users. Beyond the earlier point about user, app, and data localization considerations, some client user networks can be relatively locked down with security devices (proxy, firewall, and so on) which can restrict outbound communication to the Internet or even the WAN. It is important to consider whether this state applies to the client networks and ensure that new IPs for Citrix services (such as StoreFront VIP and VDA IPs, or Citrix Gateway IP) are accounted for in their local security configurations so that further recovery delays do not occur due to local LAN security restrictions when invoking DR. From a preparedness perspective, in the event of DR, will client access change in some fashion? In some DR scenarios, for example, an organization can assume that the WAN is unavailable and all users must access Citrix resources via the Internet. Such steps would need documentation in the DR and readiness plans, with prerequisites established for end users (regarding supported Citrix client details and assumptions on corporate or BYOD device access) to remove barriers for users returning to service, further reducing the burden on the support desk.
  • Network Bandwidth. The amount of network bandwidth used by VDA traffic (ICA, applications, file services) must be factored into DR facility network sizing and firewalls. This consideration is especially important in cloud environments where limits on VPN gateways and virtual firewall capacity exist. Monitoring production traffic from VDAs to determine an average value is important to size networking effectively. Where network constraints exist, organizations might need to use different network configurations to accommodate the expected DR traffic load if and when invoked.
  • Fallback (or Failback). If user data has changed in DR, or if VDA images were updated while in DR, the organization must plan for failback to propagate these changes back to production if the production environment proves recoverable. For user data, it can be as simple as reversing the replication direction and recovering; similarly for Citrix infrastructure if not using single image management technologies. If using MCS or PVS, master images or vDisks can be manually replicated back to production and VDAs updated accordingly.

The following list outlines various common recovery options for Citrix. Adaptations of each exist in the field (organizations sometimes opt to implement only portions of these options to limit DR to critical use cases, or use a mix of options to balance costs among use case tiers); however, for the sake of comparison, we are outlining basic versions of each. The options are organized starting with the simplest (often higher RTO and lower cost) through the more advanced (often lower RTO but higher cost). The ideal option for a given organization aligns with the recovery time for hosted applications or use cases, in addition to the IT skills, budget, and infrastructure available.

In addition, many options indicate that Citrix technologies integrated with network and storage, such as NetScaler and single-image management technologies, are not suited for methods other than “always on” recovery models. It is not that recovering them by other means is technically impossible, but the level of complexity involved can make recovery riskier and more prone to human error.

Recovery Option - Recover from Offline Backup

In this option, organizations rely solely on traditional backup solutions to restore the Citrix platform to service in another location. There is no standby infrastructure to fail over to. Multiple, often manual tasks are required in the recovery plan to restore the core services Citrix depends on and later restore Citrix itself. The outage time of Citrix is dictated by the speed at which components can be restored in the DR location. This model is not ideal for advanced Citrix deployments and is best suited for smaller, simpler infrastructures and organizations that can withstand an extended period of downtime.

design-decisions_cvad-disaster-recovery_003.png

Pros and Cons

Pros:

  • Lower maintenance costs compared to replication or hot-standby solutions

Cons:

  • High downtime impact
  • Larger, more detailed recovery plan (DR orchestration) documentation
  • Extended recovery time
  • Relies on integrity and age of backups
  • A higher degree of human error if manual reconfiguration is necessary (networking, and so on)
  • Unsuitable for Pooled, Fast Clone MCS or PVS
  • Unsuitable for NetScaler VPX because of networking (and needs to be rebuilt using backups of nsconfig directory and ns.conf files)
  • Considerations must be given to sufficiently extending backup retention scope to accommodate other modern-day threats such as time bombing and ransomware

Use Case and Assumptions

Useful for less mature IT organizations and organizations with limited IT operations budgets that can allow for extended outages to recover core business services. Assumes backups and the recovery process are tested regularly to ensure that the backups are good and the recovery process is well understood by those who have to perform it during a real event.

Recovery Option - Recover via Replication

This option is similar to the prior option but uses replication instead of backup from which to restore infrastructure. Replication can reduce some aspects of manual tasks in the recovery sequence and potentially allow RTO to be reduced (and thus restore service faster) but remains reliant on many manual tasks. It is worth noting that replication is not a replacement for backup. Corrupt or compromised data still gets replicated to DR and is thus not useful in such scenarios. Backups must still be considered as a fallback. As with the prior option, this model is not ideal for advanced Citrix deployments and is best suited for smaller, simpler infrastructures and organizations that can withstand an extended period of downtime.

design-decisions_cvad-disaster-recovery_004.png

Pros and Cons

Pros:

  • Replication is likely automated and aligns with RTO and RPOs
  • Likely uses less complex technologies as compared to automated recovery solutions

Cons:

  • Relies on administrative intervention
  • Larger, more detailed recovery plan (DR orchestration) documentation
  • A higher degree of human error if manual reconfiguration is necessary (networking, and so on)
  • Unsuitable for Pooled, Fast Clone MCS, or PVS. Recreation of Machine Catalogs must be factored into the projected RTO. However, by creating dummy Machine Catalogs in DR or scaling out VDA instances in DR and performing an “Update Catalog” action applying a replicated master image, this RTO can be shortened
  • Unsuitable for NetScaler VPX due to networking and thus better suited to employ hot standby NetScaler and assume the cost of doing so

Use Case and Assumptions

Useful for less mature IT organizations and organizations with limited IT operations budgets. This solution relies on storage replication technologies from the SAN vendor or the hypervisor vendor (vSphere Replication, Nutanix Replication, and so on) to replicate VMs to another facility over the WAN.

Recovery Option - Replication with Automated Recovery

In this option, automated recovery is introduced for several components of the solution to restore the Citrix infrastructure and its dependencies in the DR location. Technologies that orchestrate automated recovery further reduce manual steps and minimize human intervention (and potential human error), and can further improve RTO by streamlining the recovery of service. It is expected that intervention is still required to fail over to the DR location by way of DNS record modifications. It is worth noting that replication is not a replacement for backup. Corrupt or compromised data still gets replicated to DR and is thus not useful in such scenarios. Backups must still be considered as a fallback. While this model is more mature, it remains unsuitable for use with advanced Citrix configurations or for organizations that can't tolerate much downtime.

design-decisions_cvad-disaster-recovery_005.png

Pros and Cons

Pros:

  • Lower maintenance costs compared to hot-standby solutions
  • Replication is likely automated and aligns to RTO and RPOs
  • Recovery plans tend to be automated
  • Less administrative intervention and human error

Cons:

  • Relies on more advanced technologies such as VMware SRM, Veeam, Zerto, Nutanix Disaster Recovery Orchestration, XenServer Disaster Recovery, and Azure Site Recovery (ASR) to orchestrate recovery and modify network parameters (unless network segments are stretched)
  • Unsuitable for Pooled, Fast Clone MCS or PVS. Recreation of Machine Catalogs must be factored into the projected RTO. However, by creating dummy Machine Catalogs in DR or scaling out VDA instances in DR and performing an “Update Catalog” action applying a replicated master image, this RTO can be shortened
  • Unsuitable for NetScaler VPX due to networking and thus better suited to employ hot standby NetScaler

Use Case and Assumptions

Useful for enterprise organizations with appropriate resources and budget for DR facilities. This solution relies on the same storage replication of the previous option but includes DR orchestration technologies to recover VMs in a particular order, adjust NIC configurations (if needed), and so on.

Recovery Option – Hot Standby (Active-Passive) with Manual Failover

We now start to look at options that use hot standby infrastructure. Although more costly, the advantages of an appropriately designed hot standby model drastically reduce the number of steps required to orchestrate recovery to the DR location, reduce the risk of technology and human errors, and can significantly reduce recovery time. These models do require maintaining separate, independent Citrix infrastructures, which creates more overhead (ongoing maintenance and testing) but accommodates more advanced Citrix deployments that use single image management technologies. In this option, failover relies on administrator intervention to redirect traffic to the alternate location through DNS changes, reducing the cost and complexity of automated failover of the access tier. This option best suits organizations that have sufficient budget to accommodate standby Citrix infrastructure, rely on advanced Citrix configurations, and require a low RTO but manually controlled recovery.

design-decisions_cvad-disaster-recovery_006.png

Pros and Cons

Pros:

  • Short recovery time as the platform is "always on"
  • Supports storage and network-dependent components such as NetScaler, MCS, PVS
  • Reduced recovery plan (DR orchestration) documentation

Cons:

  • Relies on administrative intervention to fail over the URL or direct users to a backup URL (albeit manual orchestration is sometimes intentional)
  • Higher costs due to having "hot" hardware in DR sitting on standby
  • Higher administrative overhead to keep standby platform's configurations and updates in sync with production

Use Case and Assumptions

Useful for enterprise organizations with appropriate resources and budget for DR facilities. Can use a hot standby “fully scaled” platform or a “scale on-demand” platform. The latter can be appealing for cloud recovery to reduce operating costs, with caveats.

At the time of failover, administrators update the DNS entry for one or more access URLs to point to one or more DR IPs for Citrix Gateway and StoreFront, or users are advised by formal communication to begin using a “backup” or “DR” URL.

This manual option can be useful for scenarios where application back-ends require a longer recovery time and where automatic failover would add confusion for users if Citrix were fully available but applications were not.

This model assumes a mature IT organization and enough WAN and compute infrastructure are available to support failover.
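As an illustration of the manual DNS cutover described above, the following is a minimal Python sketch an operator might use to confirm that the access URL now resolves to the DR Citrix Gateway VIP after the record change. The FQDN and VIP are hypothetical placeholders, only the standard library is used, and client-side caching means old answers can persist until the record's TTL expires.

```python
import socket

# Hypothetical values; substitute your own access FQDN and the known DR VIP.
ACCESS_FQDN = "citrix.example.com"
DR_GATEWAY_VIP = "203.0.113.10"

def resolves_to_dr(fqdn: str, dr_vip: str) -> bool:
    """Check whether the access URL currently resolves to the DR Citrix Gateway VIP.
    Resolver caching means the old record can persist until its TTL expires."""
    addresses = {info[4][0] for info in socket.getaddrinfo(fqdn, 443)}
    return dr_vip in addresses

if __name__ == "__main__":
    status = "DR" if resolves_to_dr(ACCESS_FQDN, DR_GATEWAY_VIP) else "production (or cached)"
    print(f"{ACCESS_FQDN} currently resolves to {status}")
```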

Recovery Option – Hot Standby (Active-Passive) with Automated Failover

Similar to the prior hot standby option, this option includes automated failover, typically via DNS technologies that detect failures in production and redirect traffic to the DR location. The automated nature puts further emphasis on the IT teams to ensure that configurations are duly maintained in DR and that application and data dependencies remain accessible from DR during regular operation, to avoid impact to users who might end up in the DR site as a result of an isolated temporary service disruption in production. Capacity management and monitoring (ensuring DR can accommodate the same number of users on Citrix as production) become increasingly critical due to the automated failover capability. This option best suits organizations that have sufficient budget to accommodate standby Citrix infrastructure, rely on advanced Citrix configurations, and require an RTO near zero.

design-decisions_cvad-disaster-recovery_007.png

Pros and Cons

Pros:

  • Short recovery time as the platform is "always on"
  • Supports storage and network-dependent components such as NetScaler, MCS, PVS
  • Minimal recovery plan (DR orchestration) documentation
  • Easier for end-users as URLs fail over automatically
  • Supports EPA Scans at Citrix Gateway with GSLB

Cons:

  • Higher costs due to having “hot” hardware in DR sitting on standby and NetScaler licensing. Cost burden can be reduced to an extent if the standby capacity is used by development or other less critical non-production workloads which can be powered down during a DR event to prioritize production recovery
  • Higher access tier complexity
  • Higher administrative overhead to keep standby platform’s configurations and updates in sync with production

Use Case and Assumptions

A common configuration with enterprise organizations that allows for an automated failover to the DR site via NetScaler GSLB (Advanced or higher needed). This model assumes a mature IT organization and enough WAN and compute infrastructure to support failover. This model also assumes that application and user data dependencies are in alignment with the latest active site versions/updates and recoverable at the DR facility in a similarly automated manner to reduce prolonged impact of service to the end user and confusion due to partial solution functionality.

Recovery Option – Active-Active with Automated Failover

This option builds off the former, but during regular operation, both production and DR locations are in active use. Capacity must be managed and monitored tightly, as in such a model it is easy to overlook when either location has surpassed a threshold (say 40–45% usage) beyond which it can no longer accommodate the full user load during a DR event. As both environments are used actively, routine validation of most components is built in. However, components that require failover to the DR location (apps, storage arrays, and so on) must still undergo routine periodic failover testing. As the model uses two or more active locations, consideration must be given to performance tolerances and variances between locations, especially if the locations span significant geographic distances. Some performance deviations can be minimized in this model by homing users to one location over another (localizing the VDAs) during regular operation and failing over to other VDAs if their local location is unavailable. This model is the most complex and most robust, and it accommodates the most advanced Citrix deployments. It is ideal for organizations who need an RTO of zero and desire a model that sees only 50% of users fail over during a DR event versus 100%.
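A minimal sketch of the capacity headroom check described above, assuming a two-site Active-Active deployment; the threshold and session figures are hypothetical placeholders.

```python
# Minimal sketch of the Active-Active headroom check described above.
# All figures are hypothetical; substitute real per-site capacity and measured load.

SITE_CAPACITY_SESSIONS = {"DC-A": 5000, "DC-B": 5000}   # maximum concurrent sessions per site
CURRENT_LOAD_SESSIONS  = {"DC-A": 2100, "DC-B": 1900}   # measured concurrent sessions per site
ALERT_THRESHOLD = 0.45  # warn once a site exceeds ~45% usage, per the guidance above

total_load = sum(CURRENT_LOAD_SESSIONS.values())

for site, capacity in SITE_CAPACITY_SESSIONS.items():
    usage = CURRENT_LOAD_SESSIONS[site] / capacity
    # During a DR event, the surviving site must carry the full user load on its own.
    can_absorb_failover = total_load <= capacity
    print(f"{site}: {usage:.0%} used, "
          f"{'OK' if usage <= ALERT_THRESHOLD else 'over threshold'}, "
          f"{'can' if can_absorb_failover else 'cannot'} absorb the full load alone")
```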

design-decisions_cvad-disaster-recovery_008.png

Pros and Cons

Pros:

  • Short recovery time as the platform is "always on"
  • Supports storage and network-dependent components such as NetScaler, MCS, PVS
  • Minimal recovery plan (DR orchestration) documentation
  • Seamless to end-users
  • Supports EPA Scans at Citrix Gateway with GSLB (NetScaler 13.0+ firmware releases from 2022 onward can accommodate EPA scans in Active-Active GSLB configurations, previously functional only in Active-Passive configurations)

Cons:

  • Higher costs due to having “hot” hardware in DR sitting on standby and NetScaler licensing. Cost burden can be reduced to an extent if the standby capacity is used by development or other less critical non-production workloads which can be powered down during a DR event to prioritize production recovery
  • Highest access tier complexity
  • Higher administrative overhead to keep standby platform’s configurations and updates in sync with production
  • Relies on administrators to monitor and adjust resource and hardware capacity at all data centers to ensure that as the platform grows, DR capacity integrity is not impacted

Use Case and Assumptions

A more advanced but common configuration with enterprise organizations that allows access tier URLs to operate in an Active-Active manner via NetScaler GSLB (NetScaler Advanced or higher needed). This functionality is useful in environments where data centers are in close proximity to each other, or in situations where data centers can be remote but with the means to pin users to preferred data centers (often driven by advanced StoreFront configurations and GSLB to a lesser extent) for multi-site scenarios.

This model assumes a mature IT organization and enough WAN and compute infrastructure to support failover. This model also assumes that application and user data dependencies are in alignment with the latest site versions/updates and recoverable at the DR facility in a similarly automated manner to reduce prolonged impact of service to the end-user and confusion due to partial solution functionality.

Disaster Recovery in Public Cloud

Disaster recovery involving on-prem to cloud platforms, or cloud to cloud, comes with its own set of challenges and considerations which often do not present themselves in on-prem recovery scenarios.

The following key considerations should be addressed during DR design planning to avoid missteps that can render a DR plan that uses cloud infrastructure invalid, cost prohibitive, or unable to meet target capacity in the event of DR.

Consideration – Network Throughput

Impact Areas

Availability, Performance, Cost

Details

Organizations can underestimate and thus undersize the throughput juncture points in their cloud solutions, including virtual firewalls, VPN gateways, and WAN uplinks (AWS Direct Connect, Azure ExpressRoute, GCP Cloud Interconnect, and so on), which can have a significant deleterious effect on performance if the Citrix platform is to recover in the cloud and be accessible over the WAN. Also, be mindful of VPN gateway limits if in use with a cloud provider. For example, AWS Transit Gateways presently have a maximum limit of 1.25 Gbps. This limit can require scaling out gateways or perhaps using multiple VPCs if these are critical to the cloud architecture. Azure and GCP VPN gateways can have higher limits. Many virtual firewalls have licensed limits on the throughput that they can process, or maximums even at their highest license level. This constraint can require scaling out the number of firewalls and load-balancing them in some manner.

Recommendations

When establishing throughput sizing calculations, assume the full DR capacity load. Capture the following data per concurrent user:

  • Estimated ICA session bandwidth
  • Estimated application communication bandwidth per session if they traverse security boundaries
  • Estimated bandwidth for file services per session if they traverse security boundaries

For the previous metrics, it can be useful to gather data on current traffic patterns to and from VDAs in production. It is also important to consider what other data flows unrelated to Citrix are anticipated to use these network paths. Be sure to engage the network and security teams in planning Citrix DR to ensure any bandwidth estimates traversing security zones and network segments are understood and can be accommodated.
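The following is a minimal sketch of that sizing exercise, using hypothetical per-user figures; replace them with averages measured from production VDA traffic and compare the result against the virtual firewall, VPN gateway, and WAN uplink limits discussed above.

```python
# Minimal sketch of the throughput sizing described above; all per-user figures
# are hypothetical placeholders to be replaced with measured production averages.

CONCURRENT_USERS_DR = 3000          # full DR capacity load
PER_USER_KBPS = {
    "ica_session": 150,             # estimated ICA session bandwidth
    "app_traffic": 100,             # app traffic crossing security boundaries
    "file_services": 80,            # file services traffic crossing security boundaries
}
NON_CITRIX_OVERHEAD_MBPS = 500      # other flows expected on the same network paths

per_user_kbps = sum(PER_USER_KBPS.values())
citrix_mbps = CONCURRENT_USERS_DR * per_user_kbps / 1000
total_mbps = citrix_mbps + NON_CITRIX_OVERHEAD_MBPS

print(f"Estimated Citrix DR load: {citrix_mbps:.0f} Mbps")
print(f"Estimated total path load: {total_mbps:.0f} Mbps")
# Compare total_mbps against firewall, VPN gateway, and WAN uplink throughput limits.
```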

Consideration – Authentication Services

Impact Areas

Availability, Performance, Security

Details

Authentication services are critical to the operation of Citrix platforms. Most continue to rely on Microsoft Active Directory Domain Services (AD DS). If using Citrix Federated Authentication Service (FAS), Microsoft Certificate Authorities (CAs) or supported third-party CAs are also critical to the ongoing operation of the platform. That is, if they are down, Citrix is down.

Some organizations continue to resist placing authentication services infrastructure in public clouds, often citing security concerns. As a result, authentication and brokering performance can be impeded by backhauling to writeable on-premises Domain Controllers (DCs) during regular operation, and the availability and reachability of DCs and CAs in a DR scenario must not be overlooked.

Recommendations

Implement sufficient compensating controls to address the concerns of the security team and place writeable DCs and CAs (if using FAS) near the Citrix infrastructure in the cloud. If this is non-negotiable, ensure that recovery is planned for, and reachability validated, for these servers if the production data center fails or is unreachable.

Consideration – Windows Desktop OS Licensing

Impact Areas

Cost

Details

There are potentially complex licensing considerations for Microsoft Desktop OS instances running on different cloud platforms. Microsoft revised their cloud licensing rights in August 2019, which can affect VDI cost implications in some deployment scenarios. Microsoft licensing for Office 365 / Microsoft 365 apps has also been a changing landscape regarding support on public clouds and OS types.

Recommendations

Licensing of Microsoft products is a continuously evolving matter. Refer to Microsoft and the cloud provider (if relevant) for the current details on licensing and supportability statements when determining the use case architecture. If there is a potential cost challenge for VDI in the DR solution, consider supplementing with Hosted Shared Desktops (augmentation of RDS CALs can be required), if feasible, as they can provide greater flexibility at a lower cost of operation. Restrictions or future support changes might require reevaluating the platform used for DR entirely.

Consideration – Timing of VDA Scale Out (Before or During DR)

Impact Areas

Cost, Availability

Details

Organizations are drawn to the cloud by the appeal of paying for capacity only when it is needed. This solution can reduce DR costs dramatically by not paying for reserved infrastructure whether it is used or not.

However, at large scales, a cloud provider might not commit to an SLA for powering on hundreds or thousands of VMs at once. This becomes particularly challenging if the VDA footprint for DR is anticipated to run into hundreds or thousands of instances. Cloud providers tend to keep bulk capacity for various instance sizes on hand; however, this capacity can vary from moment to moment. If a disaster affecting a geographical area occurs, there can be significant contention from other tenants also requesting on-demand capacity.

In the case of Azure, do not confuse reserved instances with reserved (on-demand) capacity. If choosing reserved instances as a hedge against resource availability in a DR region, it would be prudent to deploy them, as Microsoft does not guarantee that the capacity is available on demand. With on-demand capacity reservations, Microsoft sets aside the capacity outright (at pay-as-you-go rates). Ultimately, it would be prudent to assume that only running VDAs are guaranteed to be available.

Key questions to answer that can influence decisions:

  • Is always 100% DR capacity required?
  • What types of disasters are being planned for? If a regional disaster is one, would it be wise to use a single cloud region or a cloud region close to production?
  • Do we host critical workloads only (that is, only a subset of production)?
  • Is it viable to scale out at the time of DR? If so, have the use cases been prioritized by DR Tiers to better understand the varying RTOs of each use case to support a phased scale out of capacity?
  • Can we scale up the OS instances supporting app or hosted shared desktop use cases to save on provisioning time and power off these instances to conserve operating costs?

Recommendations

Citrix recommends first engaging with your cloud provider to determine the viability of powering on the anticipated capacity within the RTO timeframe, and whether it can be met with on-demand instances. To hedge against capacity availability limitations for VDAs in a DR scenario, Citrix recommends provisioning VDAs across as many availability zones as possible.

At large scales, it can be worthwhile provisioning across various cloud regions, and adjusting the architecture accordingly. Some cloud providers have suggested employing varying sizes of VM instance types to further mitigate VM exhaustion.

Some organizations want to diversify vendor risk and adopt a multi-cloud deployment wherein use cases are provisioned across two different cloud providers to hedge against an entire cloud platform failing (DNS, routing, breach, and so on).
This scenario does increase complexity, and performance or feature parity between clouds is not guaranteed, but these potentially minor caveats can be eclipsed by the perceived value of platform diversification. Where possible, it would be prudent to pre-provision VDA instances and update them regularly (keeping them online or offline is at the discretion and risk tolerance of the customer). Provisioning is a resource- and time-intensive process, and scaling out VDA DR capacity on demand can be expedited to a degree by pre-provisioning.

If the organization has little appetite for capacity availability risk, it can require applying reserved or dedicated compute capacity and budgeting accordingly to guarantee resource availability. It is possible to combine on-demand and reserved/dedicated models by referencing the DR recovery tier wherein certain use cases can require VDAs to be immediately available, while others can have the flexibility to be recovered over a longer RTO of several days or weeks if they are less critical to sustaining the business.
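To illustrate the phased, tier-driven scale-out approach described above, the following is a minimal sketch in Python. The power_on() and is_registered() helpers are hypothetical placeholders (in practice they would wrap the cloud provider SDK or the Citrix DaaS REST API), and the tier-to-machine mapping and thresholds are illustrative only.

```python
# Minimal sketch of a phased, tier-based VDA power-on driver for DR scale out.
# power_on() and is_registered() are hypothetical placeholders; in practice they
# would wrap your hypervisor/cloud SDK or the Citrix DaaS REST API.
import time

# Example tier map: lower tier number = more critical, shorter RTO (illustrative values).
DR_TIERS = {
    1: ["vda-fin-01", "vda-fin-02"],    # business-critical apps, RTO in minutes
    2: ["vda-corp-01", "vda-corp-02"],  # standard apps, RTO in hours
    3: ["vda-dev-01"],                  # non-critical, RTO in days
}

def power_on(machine: str) -> None:
    """Hypothetical: request power-on via your cloud/hypervisor API."""
    print(f"Requesting power-on for {machine}")

def is_registered(machine: str) -> bool:
    """Hypothetical: check whether the VDA has registered with the broker."""
    return True

def scale_out_by_tier(min_ready_ratio: float = 0.8, poll_seconds: int = 60) -> None:
    """Power on DR capacity one tier at a time, waiting until enough VDAs
    in the current tier register before starting the next tier."""
    for tier in sorted(DR_TIERS):
        machines = DR_TIERS[tier]
        for m in machines:
            power_on(m)
        # Wait for the tier to reach the readiness threshold before moving on,
        # which spreads capacity requests against the cloud provider over time.
        while sum(is_registered(m) for m in machines) / len(machines) < min_ready_ratio:
            time.sleep(poll_seconds)
        print(f"Tier {tier} ready; proceeding to next tier")

if __name__ == "__main__":
    scale_out_by_tier()
```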

Consideration – Application and User Data

Impact Areas

Cost, Availability, Performance

Details

The location of user data and application back-ends can have a notable impact on the performance, and sometimes the availability, of the Citrix DR environment. Some customer scenarios use a multi-data center DR approach where not all application back-ends or user data (such as home and departmental drives) can be recovered to the cloud alongside Citrix. This gap can result in unanticipated latency which can impact the performance or even functionality of the applications. From a throughput perspective, it can also add strain to available network and security appliance bandwidth.

Recommendations

Where possible, keep application and user data local to the Citrix platform in DR to keep performance as optimal as possible by reducing latency and bandwidth demands across the WAN. Determine what actually needs to be recoverable in DR (FSLogix Office Containers might not be critical, and some use cases work fine with a new profile being created) and ensure that application packages and containers are available in the DR site. Determining the most appropriate and cost-effective replication solution is critical. This topic is discussed further later in this guide.

Disaster Recovery Planning for Citrix Cloud

There are several notable differences between on-prem or “traditional” deployment of Citrix Virtual Apps and Desktops, versus Citrix DaaS provided by Citrix Cloud regarding DR planning:

  • Citrix manages most control components for the partner/customer, removing significant DR requirements for the Citrix Site and its components from their responsibility.
  • The deployment of a DR environment for Citrix resources merely requires a customer to deploy Citrix Cloud Connectors in the recovery “Resource Location”, and optionally StoreFront and NetScaler (for Citrix Gateway).
  • Citrix Cloud's unique service architecture is geographically redundant and resilient by design.
  • Access tier DR is not required if using Citrix Workspace and Citrix Gateway services.

Beyond these core differences, many of the DR considerations from prior sections still require partner/customer planning, as they retain responsibility for the Citrix VDAs, user data, application back-ends, and Citrix access tier if Citrix Gateway Service and Citrix Workspace service are not used from Citrix Cloud.

This section covers key topics to assist organizations in defining an appropriate DR strategy for Citrix Cloud.

Citrix DaaS Simplifies Disaster Recovery

The following conceptual diagram outlines the architecture of Citrix DaaS, in addition to the separation of responsibility between Citrix-managed components and partner/customer-managed components. Not illustrated here are the WEM, Analytics, and Citrix Gateway “services”, which are elective Citrix Cloud components related to Citrix DaaS and would also fall under “Managed by Citrix”.

[Diagram: tech-briefs_cvads_cloud.png – Citrix DaaS conceptual architecture and separation of responsibilities]

As illustrated in the diagram, a significant portion of the Control components requiring recovery decisions falls under Citrix's management scope. Being a cloud-based service, the Citrix DaaS architecture is highly resilient within the Citrix Cloud region. This resilience is part of Citrix Cloud's “Secret Sauce” and is covered by Citrix Cloud's SLAs.

Service availability management responsibilities are as follows:

  • Citrix. Control Plane and access “services” if in use (Citrix Workspace, Citrix Gateway Service).
  • Customer. Resource Location components including Cloud Connectors, VDAs, application back-ends, infrastructure dependencies (AD, DNS, and so on), and access tier (StoreFront, NetScaler) if not using Citrix Cloud access tier.

Organizations gain the following advantages regarding disaster recovery on Citrix DaaS:

  • Lower administrative burden due to fewer components to manage, and fewer independent configurations to replicate and maintain between locations.
  • Reduced chances of human error and configuration discrepancies between Citrix deployments due to the centralized configuration of the “Citrix Site” in the Cloud.
  • Streamlined operations through simplified resource management for production and disaster recovery deployments: fewer Citrix Sites and components to configure and maintain, no access tier to manage between locations (optional), and less complex disaster recovery logic for Citrix resources.
  • Reduced operational costs, with fewer server components to deploy and maintain, plus a single pane of glass view of environment trends across deployments through the centralization of the monitoring database.

Citrix DaaS Disaster Recovery Considerations

Although many components for recovery planning are removed from the organization’s scope of management, organizations remain accountable for the planning and management of DR and high availability (optional) for the components located within the Resource Location.

The most significant difference in how we address availability comes down to how we interpret and configure Resource Locations. Within Citrix DaaS itself, Resource Locations are presented as Zones. With Zone Preference we can manage failover between Resource Locations based on the logic we specify. Using Zone Preference within a traditionally deployed Citrix Virtual Apps and Desktops Site would be considered a high-availability design but not a valid DR design. In the context of Citrix Cloud, this option is a valid DR solution.

Most of the Disaster Recovery Options discussed earlier apply to Citrix DaaS, so there are numerous options to fit organizational recovery goals and budgets.

When planning DR for Citrix Cloud's Citrix DaaS service, several key guiding principles from an infrastructure planning perspective need to be understood:

  • Resource Locations. Production and DR locations are set up as independent Resource Locations in Citrix Cloud.
  • Cloud Connectors. Each Resource Location must have a minimum of two Cloud Connectors deployed. For clarity, Cloud Connectors are not a component that must be “recovered” manually or automatically during a DR event. They should be treated as “hot standby” components and kept online within each location at all times.
  • Customer-Managed Access Controllers (Optional). Organizations can elect to deploy their own NetScaler for Citrix Gateway and StoreFront servers and not consume Citrix Workspace or Citrix Gateway Service for several reasons. These can include:

    • Custom authentication flows
    • Increased branding capabilities
    • Greater HDX traffic routing flexibility
    • Greater insight into HDX network performance (HDX Insight)
    • Auditing of ICA connections and integration into SIEM platforms
    • Ability to continue to operate if the Cloud Connector's connection to Citrix Cloud is severed, by using the Local Host Cache function of the Cloud Connectors with StoreFront

    As with the Cloud Connectors, it is recommended to keep StoreFront and Citrix Gateway components deployed as “hot standby” in the recovery location rather than recovering them during a DR event.

Operation Considerations

Maintaining the DR platform is essential to maintaining its integrity to avoid unforeseen issues when the platform is required. The following guidelines are recommended regarding operating and maintaining the DR Citrix environment:

  • The Citrix DR Platform is Not Non-Prod. Organizations with a “hot standby” environment can be tempted to cut corners and treat DR as a test platform. This treatment undermines the integrity of the solution. In fact, DR would likely be the last platform to promote changes to, to ensure that its utility is not affected if maintenance goes catastrophically wrong in a manner that was not exhibited in non-prod environments.
  • Patching and Maintenance. Routine maintenance in lockstep with production when using “hot standby” Citrix platforms is essential to maintaining a functional DR platform. Keeping DR in sync with production regarding OS, Citrix product, and application patches is important. To mitigate risk, allowing several days to a week between patching production and patching DR is suggested.
  • Periodic Testing. Regardless of whether DR involves replication of production to a recovery facility or the use of “hot standby” environments, it is important to periodically test (ideally once or twice a year) the DR plan to ensure that the teams are familiar with recovery processes and any flaws in the platform or workflows are identified and remediated. Practice makes perfect. A plan can work on paper, but testing can reveal that RTO and RPO objectives are unachievable, that an order of operations needs adjustment, or that an oversight must be accounted for. For example, the actual time to sync and fail over storage might result in an unexpected amount of data loss, requiring adjustment to expectations or replication configuration.
  • Capacity Management. True for both Active-Passive and Active-Active “always on” Citrix environments, capacity or use case changes in production must be considered for DR as well. Especially when Active-Active models are used, it is easy for resource utilization to creep beyond, say, a 50% steady-state utilization threshold at each environment, only for a DR event to occur and the surviving platform to become overwhelmed, either performing poorly or failing outright due to the overload. Capacity should be reviewed monthly or quarterly (see the sketch after this list).
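The following minimal sketch illustrates the capacity headroom check described in the Capacity Management item above, assuming two sites and session-based capacity figures pulled from monitoring. The site names, capacities, and safety margin are illustrative placeholders, not recommended values.

```python
# Minimal sketch of an active-active capacity headroom check (illustrative numbers).
# Assumes the surviving sites must absorb the full concurrent session load, with a
# small safety margin; the figures would come from your monitoring data.
def surviving_site_ok(total_sessions: int, site_capacities: dict[str, int],
                      failed_site: str, safety_margin: float = 0.10) -> bool:
    """Return True if the remaining sites can host all sessions with a safety margin."""
    remaining = sum(cap for site, cap in site_capacities.items() if site != failed_site)
    return total_sessions <= remaining * (1 - safety_margin)

capacities = {"dc-east": 2000, "dc-west": 2000}
# 2,200 concurrent sessions is ~55% of combined capacity; losing one site leaves
# only 2,000 seats, so the check fails and capacity needs to be revisited.
print(surviving_site_ok(2200, capacities, failed_site="dc-east"))  # False
```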

Data Recovery Considerations

In this section, we will briefly cover several concepts and considerations related to dealing with user profiles, VDA images, and applications (App-V, App Layers, MSIX, and so on) in multi-location and DR scenarios.

Replication versus Backup

Data availability and data integrity are both important factors when considering the recovery of critical data sets such as core infrastructure, images, user data, and apps. While replication features typically offered in public cloud solutions provide convenient mechanisms to ensure the resilience of critical data, replication does not address data corruption or DR scenarios where even the replicated data has been compromised. Replication is not a replacement for backups, and where financially feasible, both must be used. Replication can assist in the rapid recovery of data, while backups can assist in recovering from data loss or corruption (to the extent dictated by the retention policy and backup interval). Whether on-premises or in the cloud, numerous backup options are available.

While on the topic, it is prudent to consider backup retention policies carefully in the current cybersecurity landscape, as ransomware attacks now employ “time bombing”, where data sets are compromised but appear normal while the ransomware remains dormant until a specific date. By playing the “long game”, cybercriminals can render backups with shorter retention periods useless upon recovery. Extending retention periods increases costs but improves the probability of recovering critical data.

What Do I Recover? Classifying Data Criticality

While much of this article has focused on availability and recovery options for the Citrix infrastructure, consideration must also be given to the availability of various data sets to align business expectations with cost. For example, if user profiles are unavailable in the DR location, will this merely be an inconvenience or a significant impediment to business operations? The answer to this question directly impacts the complexity and costs of the infrastructure components. The following is a list of considerations for common data associated with a Citrix platform:

  • User Profiles. Includes Citrix Profile Management (CPM) profiles or containers, User Layers, or FSLogix containers. Organizations must determine whether they must be recovered in DR, and their RTO and RPO. For some use cases, the cost to recover profiles in DR is not worth it if creating a new profile is merely an inconvenience, with 10-15 minutes of extra user burden to set it up in DR. For other use cases, recovery is non-negotiable. Office Containers might not be worth replicating since they are caches of online data. This differentiation within the user profile/user data discussion allows a reduction in storage and replication costs by electing not to replicate these containers (and makes the case to separate FSLogix profile containers from Office containers).
  • Images / Layers. In most cases, there is no real decision to be had on whether to recover images or not (master images, PVS vDisks, App Layering layers) but how they are to be recovered. Are they replicated automatically? Or are they maintained independently between production and DR?
  • Applications. Here we specifically are referring to containerized applications (elastic layers, MSIX, App-V). As with images, the decision tends to be more about “how” we recover and not “if”.

Replication Options

Whether in a public cloud or on-premises, multiple replication technologies are available to choose from. This section covers many common options but is not exhaustive. Note that in many scenarios where replication is used, some degree of data loss is inevitable unless the DR location allows synchronous replication. Otherwise, recovery options usually carry an RPO consideration, with 15 minutes often being the lowest achievable (depending on platform/tool/scenario). The more frequent the replication interval, the higher the cost, typically.

User profiles (of all forms), User Layers, apps, and images might be able to withstand a day’s worth of data loss for many organizations, reducing the amount of cross-site traffic and bandwidth consumption. In some use cases, user profile data loss is inconsequential to the business in the context of DR and replication costs. Regardless of the option chosen, both IT and the business must be aware of expectations in terms of potential data loss during an unplanned DR event. Planned DR events allow the opportunity to complete a full replication of data before failing over.
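As a rough illustration of how change rates and replication intervals translate into expectations, the following back-of-the-envelope sketch estimates the average replication bandwidth for a given daily change rate and the worst-case data-loss window for an asynchronous interval. The figures are placeholders, not sizing guidance.

```python
# Back-of-the-envelope sketch: average replication bandwidth from the daily change
# rate, and the worst-case data-loss window from the replication interval.
# Numbers are illustrative, not sizing guidance.
def avg_replication_mbps(changed_gb_per_day: float) -> float:
    """Average sustained throughput needed to keep up with the daily change rate."""
    return changed_gb_per_day * 8 * 1024 / (24 * 60 * 60)

changed_gb_per_day = 500        # e.g., profile container and user data churn
replication_interval_min = 15   # asynchronous replication interval

print(f"~{avg_replication_mbps(changed_gb_per_day):.0f} Mbps average replication traffic")
print(f"Worst-case unplanned RPO: ~{replication_interval_min} minutes (plus failover time)")
```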

In public clouds, native file services solutions such as Azure NetApp Files, Azure Files, and AWS FSx are commonplace for storing user data and often have built-in replication capabilities within a region or to other regions. If planning for production and DR within a cloud platform and replicating data to other regions, be mindful when selecting regions as to what their region pairs are, to ensure that the data can be replicated to the region you desire. Azure maintains a list of cross-replication region pairs here. Azure NetApp Files also has its own cross-region replication pair list here. If deploying in Azure regions that are not paired, data may only be recoverable to a region that introduces more latency to your VDAs.

In some instances, this limitation can be overcome by not relying on such technologies and using application-level replication capabilities such as FSLogix Cloud Cache (if FSLogix is used). AWS FSx, in contrast, does not restrict replication via region pairs, allowing greater choice. Be aware of cloud replication options that do not retain SMB connections. Azure, for example, when using geo-redundant storage (GRS), does not maintain SMB handles or leases, which can impact users depending on the failure scenario and require a logout/login to remount the affected shares for apps, layers, or profiles.

Third-party options, notably NetApp Cloud Volumes OnTap (CVO) are also available on major public clouds (including GCP) and are popular for organizations experienced with NetApp and that desire a similar management and capability experience in clouds.

As with the Azure GRS example, be aware that many technologies that involve file share recovery between data centers or public cloud regions do not retain SMB connections. This can result in users needing to log out and back in upon failover to remount the file shares for technologies that do not automatically retry access, and can also require manual recovery of the share/volume. This may not be noticeable in a full DR scenario where users are logging into a different VDA than production to recover productivity.

On-premises, Citrix Professional Services has long recommended organizations use native filer platforms (that is, IP network file services provided by SANs or NAS devices) where available based on their robust feature sets and perceived lower overhead as compared to Windows-based file server solutions. This is especially important when planning storage replication functions for a DR site.

If hardware/appliance filer solutions are unavailable, Windows-based solutions are a fallback (be sure they are tuned per Microsoft guidance and duly backed up), with various options available, though they are likely to be more limited in performance and capacity scalability than purpose-built filer solutions:

  • Standalone File Server. A basic Windows file server for hosting Citrix data needs including profile and app containers, user profiles, Microsoft Folder Redirection, and so on provided the environment is not 24/7 and can withstand outages.
    • Pros: Simple to set up and maintain
    • Cons: No HA; no replication ability unless layering on another technology such as DFS-R, Storage Replica, VMware Site Recovery Manager (SRM), Veeam, robocopy, and so on; reboots (planned or unplanned) will interrupt user profiles, user data (such as home drives and Microsoft Folder Redirection), mounted app packages, and Personal vDisk access (likely resulting in crashed PVS targets)
  • Windows File Cluster. A mainstay technology in Windows environments and relies on Windows Failover Clustering. Intended for creating high availability for file shares within a local data center.
    • Pros: Robust and mature solution ideal for Citrix data needs for various profile and container types, allows failover between nodes without interrupting SMB sessions of clients
    • Cons: Not intended as a DR solution, more of an HA solution. Can be more complex to set up and maintain depending on the shared storage architecture used, not ideal for spanning data centers unless being recovered using a VM-level recovery orchestration tool such as VMware SRM, or using DFS-R or even robocopy to replicate changes between two clusters
  • Distributed File System (DFS). DFS Replication (DFS-R) and DFS Namespace (DFS-N) are two of the oldest file server replication and availability options in Windows, initially used to power Active Directory replication functions. They can be configured independently of each other but are often configured together. Its prevalence within long-lived Microsoft-based organizations makes it a familiar solution to many admins. However, it has limited supported applicability to Citrix environments.
    • Pros: Familiarity to admins, easy to set up, can span data centers, not reliant on complex cluster configurations (but can be layered on top of them)
    • Cons: DFS has many limitations and is not ideal for most Citrix-related data replication use cases. DFS-N and DFS-R are unsupported by Microsoft in combination with each other for various common user data and profile scenarios, but can be supported in active-passive (one-way replication) scenarios with manual failover (which requires users to log out and log back in to re-establish file server access). Not suitable for hosting app packages/containers unless it is acceptable for users to log out and back in
  • Storage Replica. Storage Replica enables replication of volumes between servers or clusters in multi-data center scenarios. It can be configured in three forms including stretch clusters, cluster-to-cluster, and server-to-server. Both asynchronous and synchronous replication are possible.
    • Pros: Supports automatic failover when used with Storage Spaces Direct to form a stretch cluster. May be suitable (subject to testing failover performance) for user profiles, user data, profile containers, app containers, and layers. If not using stretch clusters, may be suitable for FSLogix containers only
    • Cons: If not using stretch clusters, failover must be performed manually and will sever SMB sessions requiring a logout/login to re-establish access to user profiles and data. Not suitable for hosting app packages/containers if not using stretch clusters
  • Scale-Out File Server. Scale-Out File Servers are ideal for use cases with larger files such as VM disks, or app data that is not particularly write intensive.
    • Pros: SMB continuous availability (reduces downtime when failing over between nodes), ideal for VHDs, app containers, layers, profile containers, can span data centers
    • Cons: Not recommended by Microsoft for user profiles (roaming profiles, CPM) or Microsoft Folder Redirection

Depending on the data set, it may not be critical to keep the same namespace path between production and DR data centers. Keeping independent paths for user profiles, app package shares, Elastic or User Layers, profile containers, and so on may reduce complexity. Evaluate implications on a per-dataset type basis, including considerations for reverse replication and failback if not operating in an active-active scenario. In active-active scenarios, homing users to one regional data center over another would be ideal, to avoid mounting apps and data containers between sites and adding latency.

Recovery Considerations for App Layering

For organizations running App Layering, review the Citrix reference architecture for App Layering. There are several components where backups are appropriate, including the appliance, Elastic Layer shares, and User Layer shares.

  • Appliance backups ensure that all the layers remain available even if the appliance is somehow destroyed or corrupted.
  • Elastic Layers are mounted from an SMB share at logon. It is critical to ensure that the file server/services used for Elastic Layers are highly available by using appropriate clustering. Using native hardware (or cloud) file share solutions or Windows-based solutions that support continuous SMB availability is important. Using multiple DFS-R targets is not sufficient for this use case because if a target fails, the mounted VHD files can't be remapped to another target until another logon happens. Similarly, Storage Replica in several configurations is not suitable for similar reasons as discussed earlier.
  • User Layers are more complicated from a DR perspective. These writable VHDs are mounted by Windows at logon and locked by the Windows desktop.
    • As with Elastic Layers, native hardware or cloud file share solutions are optimal due to limitations of common Windows-based methods such as DFS-R or robocopy. Understand that User Layers tend to be larger, with blocks changing frequently; replication methods that replicate at the block level, rather than copying the entire file when it changes, are superior. Otherwise, organizations may see untenable amounts of User Layer replication traffic between sites.
    • Due to the difficulty of replicating large disks or the inherent cost implications, organizations most often opt to provide DR desktops for users without a replicated User Layer.

Recovery Considerations for Profiles

Recovery of profiles requires a suitable storage replication and access strategy regardless of whether the profiles are located in on-premises data centers or in the public cloud. This section is intended to build upon the prior considerations for recovery discussed earlier on a per-profile solution basis. For each use case, be sure to ask “Does the profile in fact need to be recovered in a DR scenario?” If it does not, costs and complexity can likely be greatly reduced.

General Recommendations:

  • Confirm if the profile data must be replicated at all, or if only a portion of it does. For example, if using FSLogix profile containers, you may want to split it into a profile and an Office container (potentially on different shares for granular replication control if relying on storage replication) to provide flexibility in replicating the profile containers but not Office containers (which during a DR event can be created on demand).
  • Lead with using purpose-built “native” file services (filer) platforms on-premises or in the cloud. Numerous Windows options are available but are typically less robust (both in features and scalability), have more overhead, and have more caveats. Local redundancy is strongly recommended in addition to cross-site redundancy to avoid some of the common pitfalls of failover scenarios that can interrupt SMB connections.
  • If replicating VHDs (which tend to be large files) with a high rate of change such as User Layers and profile containers, consider replication solutions that can replicate at the block level and do not trigger a full copy of the file on change otherwise storage providers and network throughput can become overwhelmed.
  • Consider the RPO interval. The shorter it is, the higher the cost and load on storage and network systems. An asynchronous replication option that balances data loss tolerance and replication interval is optimal.
  • Design the solution so that profiles are local to the VDAs in either location to improve performance.
  • If using CPM and planning to use DFS, refer to the CPM DR documentation for DFS considerations.
  • Subject to testing, consider Cloud Cache to simplify replication and recovery, as the feature handles replication of profile data at the application level and allows administrators to set a preferred list of container target shares (one per file server storage provider) on a per-data center basis (typically via GPO / OU structures). Other considerations for Cloud Cache are provided by Microsoft and should be reviewed during planning. Cloud Cache does have drawbacks, and certain scenarios can result in outages, so organizations are encouraged to review the potential risks to understand whether the rewards of handling replication and recovery of profiles at the application level outweigh them. A configuration sketch follows this list.
  • Plan for failback which may entail reversal of replication from DR to production.
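As the configuration sketch referenced in the Cloud Cache item above, the following Windows-only snippet sets an ordered list of Cloud Cache providers so that VDAs prefer their local share first and the DR share second. In practice this is typically delivered via GPO per OU rather than a script; the share paths are placeholders, and the value name and format should be verified against current FSLogix documentation.

```python
# Minimal sketch (Windows-only, requires admin rights) of setting FSLogix Cloud Cache
# providers so each data center's VDAs prefer their local share first. Share names
# are placeholders; verify value names and format against current FSLogix docs.
import winreg

# Ordered provider list: local (production) share first, DR share second.
providers = (
    r"type=smb,connectionString=\\filer-prod\fslogix;"
    r"type=smb,connectionString=\\filer-dr\fslogix"
)

key_path = r"SOFTWARE\FSLogix\Profiles"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "Enabled", 0, winreg.REG_DWORD, 1)
    # Cloud Cache providers are tried in the order listed; used instead of VHDLocations.
    winreg.SetValueEx(key, "CCDLocations", 0, winreg.REG_SZ, providers)
```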

Recovery Considerations for Application Packages

When we refer to application packages, we refer to App-V, MSIX, Elastic Layers, and similar application packages collectively. These are characteristically read-intensive but not write-intensive and as compared to profiles, are not updated as frequently. The following are some general recommendations for this data type:

  • Lead with using purpose-built “native” file services (filer) platforms on-premises or in the cloud.
  • Consider maintaining separate file shares for the applications (that is, not using the same path or namespace) and pointing the VDAs to use their local file share. Keep the file shares in sync manually or via robocopy.
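Building on the robocopy suggestion in the preceding bullet, the following is a minimal Windows-only sketch of a one-way mirror from the production application package share to its DR counterpart. The share paths and log location are placeholders; note that /MIR deletes destination files that no longer exist in the source, so test before scheduling.

```python
# Minimal sketch (Windows-only) of a one-way robocopy mirror between the production
# and DR application-package shares. Share paths are placeholders; /MIR deletes
# destination files that no longer exist in the source, so test before scheduling.
import subprocess

SRC = r"\\filer-prod\apppackages"
DST = r"\\filer-dr\apppackages"

result = subprocess.run(
    [
        "robocopy", SRC, DST,
        "/MIR",          # mirror the source, including deletions
        "/R:2", "/W:5",  # limit retries/wait time for locked files
        "/MT:16",        # multithreaded copy
        "/NP",           # no per-file progress in the log
        "/LOG+:C:\\Logs\\apppkg-replication.log",
    ],
    check=False,  # robocopy exit codes 0-7 indicate success/partial copy, 8+ are errors
)
if result.returncode >= 8:
    raise RuntimeError(f"robocopy reported failures (exit code {result.returncode})")
```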

Recovery Considerations for Images

Master images and PVS vDisks must also be considered in the DR recovery strategy. Master images if lost, result in considerable effort to rebuild. App Layering falls into this classification but has been covered already. The following are some general recommendations to consider:

  • Replicate master images and PVS vDisks between data centers or cloud regions if relying on the same image between production and DR. Consider manual replication to avoid the possibility of a corrupt image being pushed to DR.
  • Back up master images and PVS vDisks (including version files and manifest files), ideally to an offsite location rather than production.
  • Use Azure Compute Gallery (formerly Shared Image Gallery) if using Azure, to replicate images across regions, subscriptions, and tenants, providing a shared image repository.
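As a sketch of the Azure Compute Gallery approach in the preceding bullet, the following snippet publishes an image version and replicates it to a second region by invoking the Azure CLI from Python. The resource group, gallery, image names, regions, and subscription ID are placeholders, and the current `az sig image-version create` syntax should be verified against Azure CLI documentation before use.

```python
# Minimal sketch of publishing a master image version to an Azure Compute Gallery
# and replicating it to a DR region by calling the Azure CLI. All names, the image
# ID, and the regions are placeholders; verify current `az sig` syntax before use.
import subprocess

cmd = [
    "az", "sig", "image-version", "create",   # on Windows the entry point may be "az.cmd"
    "--resource-group", "rg-citrix-images",
    "--gallery-name", "cvadGallery",
    "--gallery-image-definition", "win11-multi-vda",
    "--gallery-image-version", "2024.05.01",
    # Replicate to the production and DR regions so MCS catalogs in either
    # region can provision from a local copy of the image.
    "--target-regions", "westeurope", "northeurope",
    "--managed-image", "/subscriptions/<sub-id>/resourceGroups/rg-citrix-images/"
                       "providers/Microsoft.Compute/images/win11-multi-vda-golden",
]
subprocess.run(cmd, check=True)
```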

Summary

We have covered a fair bit of ground on the topic of disaster recovery in Citrix. The following points summarize the core messages and takeaways of this guide for architects and customers alike to consider when planning their Citrix recovery strategy:

  • Understanding the business requirements and technological capabilities of a customer environment influences the disaster recovery strategy required for Citrix. The recovery strategy chosen must align to business objectives.
  • High availability is not synonymous with disaster recovery. However, disaster recovery can include high availability.
  • Sharing administrative boundaries between physical locations does not in itself constitute a disaster recovery solution.
  • Developing a disaster recovery tiering classification for an organization’s technology portfolio provides visibility, flexibility, and potential cost savings when developing a recovery strategy.
  • RTO requirements for the Citrix infrastructure and the applications hosted on Citrix are the most significant influencing factor in defining the disaster recovery design. A shorter RTO often requires a higher solution cost.
  • Disaster recovery architectures for Citrix which do not employ a “hot” disaster recovery platform present limits on the technologies a customer can use, such as Citrix MCS, NetScaler, and Provisioning.
  • A disaster recovery strategy for Citrix must consider the recovery and recovery time of user data and application back-ends. Citrix can be engineered to recover rapidly; however, users can nonetheless be unable to work if application and data dependencies are not available in a similar timeframe.
  • Disaster recovery in cloud environments has a different set of considerations not present in on-premises environments which organizations must review to avoid unforeseen bottlenecks or operational impacts when invoking disaster recovery in cloud environments.
  • It is imperative that disaster recovery components are kept up-to-date to match production updates and configurations to maintain the integrity of the platform.
  • Platforms that use an active-active configuration for Citrix between locations must be mindful of capacity monitoring to ensure that, in the event of a disaster, sufficient resources are available to support the projected load at one or more surviving locations.
  • Organizations must test their disaster recovery plan for Citrix periodically to validate its operation and ability to serve its core purpose.
  • Citrix DaaS significantly simplifies many aspects of the DR architecture and allows Resource Location redundancy via Zone Preference configurations.
  • Plan and choose your replication options wisely for user data, application packages, layers, vDisks, and profile containers to ensure that their limitations are known, and their capabilities align with business requirements.
  • Do not confuse replication with backup. They go hand-in-hand, and deeper consideration should be given to backup retention because of the prevalence of ransomware “time bombing” backups.

Sources

The goal of this article is to assist you with planning your own implementation. To make this task easier, we would like to provide you with source diagrams that you can adapt for your own needs: source diagrams.

