Design Decision: The Economics of Delivering Citrix DaaS in Azure on AMD Compute
This paper is a joint effort between Microsoft, Citrix, and AMD to help consumers make better decisions on selecting instance types within Azure for hosting their Citrix Workloads. The goals of this document include determining the most efficient instance types to host Citrix DaaS workloads and providing guidance for customers when selecting AMD compute instances, with or without GPUs, within Azure.
Azure is Microsoft’s cloud environment where physical assets, such as computers, disk drives, and networking, are virtualized and made available online or via direct connections to Microsoft data centers worldwide. Each data center location is in a region. Regions are available across the continents, with dedicated regions for government access. Each region is a collection of zones isolated from each other within the region. This distribution of resources provides several benefits, including redundancy in case of failure and reduced latency by locating resources closer to clients.
By provisioning Citrix desktop and application workloads on Microsoft Azure, businesses can avoid expenses for internal infrastructure and rely instead on Azure to supply the necessary compute, networking, and storage resources for user workloads.
Citrix Desktop as a Service (DaaS) secures the delivery of Windows, Linux, Web, and SaaS applications and desktops to any device, empowering today’s modern digital workspace. Citrix DaaS provides advanced management and scalability, a rich multimedia experience over any network, and self-service applications with universal device support across a full range of endpoints—including desktops, laptops, thin clients, tablets, and smartphones.
With application and desktop virtualization technologies, it is easy for customers to manage resources centrally and apply the optimal combination of local and hosted delivery models to match user requirements. Hosted resources can be provisioned on Microsoft Azure for single-session and multi-session scenarios.
As a hybrid cloud solution, Citrix DaaS allows organizations to choose the workload deployment option that best aligns with their enterprise cloud strategy. When deployed on the Microsoft Azure Platform, Citrix DaaS gives IT departments the flexibility of delivering infrastructure services for Windows and Linux applications and desktops with the elastic scale of the public cloud. At the same time, organizations can also integrate one or more on-premises environments for optimal flexibility.
Citrix DaaS is hosted in Citrix Cloud, a control plane of services running within Microsoft Azure. It is used in this series of tests to control and manage workloads. The numbers herein focus on the scalability and performance of a single virtual machine instance running Citrix’s Multi-session OS Virtual Delivery Agent (VDA) on Windows Server 2019 and Windows 10 Multi-session. To learn more about Citrix DaaS, please visit here.
AMD is a leader in high-performance computing, delivering technologies to accelerate a full range of data center workloads from general-purpose computing to cloud-native computing. All the instances tested in this document ran AMD processors and graphics cards within Microsoft Azure.
AMD EPYC™ processors power the most energy-efficient x86 servers, delivering exceptional performance and density to reduce server energy costs. AMD EPYC™ CPUs help minimize environmental impacts from data center operations while advancing your company’s sustainability objectives. AMD EPYC™ also comes with an all-in-one feature set across each processor series, so you have the I/O, memory, and memory bandwidth to accomplish your goals, regardless of the number of cores you choose.
The Das_v5 instances in this study use the third Gen AMD EPYC™ 7763v processor in a multi-threaded configuration with up to 256 MB L3 cache. These virtual machines offer a combination of vCPUs and memory to meet the requirements associated with Citrix DaaS workloads.
The NVv4-series instances in this study use AMD’s Radeon™ Instinct™ MI25 GPU and are optimized for VDI and remote visualization. The MI25 GPU supports Single Root I/O Virtualization (SR-IOV) passthrough, which makes it possible to securely share a GPU with up to 8 virtual machine guests through time-sliced multiplexing. When the GPU is shared with multiple virtual machines, each guest gains full use of the GPU for a few milliseconds, numerous times per second. NVv4 offers the right size for workloads requiring large and small GPU allocations.
The NVv4-series instances come in 4 different sizes:
|Size||vCPU||GPU||GPU memory (GB)||Memory (GB)|
These instances allow the operating system to access GPU acceleration for rendering and hardware encoding (the latter helps accelerate remote protocols such as Microsoft RDP and Citrix HDX).
The load was simulated during the test run using Login Enterprise to generate an artificial workload in the user session on a single host. The test run data was then used to analyze the scalability of different AMD Microsoft Azure instances.
Login VSI helps organizations proactively manage the performance, cost, and capacity of their virtual desktops and applications. The Login Enterprise platform is 100% agentless and can be used in all major VDI and DaaS environments, including Citrix and Microsoft. With Login VSI, IT teams can plan and maintain successful digital workplaces with less cost, fewer disruptions, and lower risk.
The Login Enterprise appliance provides two scores that help determine the recommended number of users for the instance:
End-user experience score (EUX): The EUX Score quantifies the user experience within their virtual session using metrics and performance counters gathered during the session. EUX scores range from 1.0 (Poorest experience) to 10.0 (Best experience). Generally speaking, scores below 5.5 indicate that the user experience is unacceptable.
Maximum Recommended Users (VSImax): The VSImax scoring process involves using individual session metrics to determine the number of simultaneous users that can run a given workload on the host. The VSImax value used in Login Enterprise is not comparable with the VSImax scores of previous versions of Login VSI Classic.
The algorithms that generate the EUX and VSImax values can change between versions as Login VSI strives to provide more accurate results. To maintain consistent results for comparison within this study, all the tests were completed using the same version. A test configuration was deemed successful when the test run was completed without failures, and the same VSImax value was received consistently at least three times.
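The acceptance rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of Login Enterprise: it assumes that "consistently at least three times" means the same VSImax value appearing in at least three failure-free runs, and the `LoadRun` record, the `accepted_vsimax` helper, and the sample values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadRun:
    """One load-test run (hypothetical record, not a Login Enterprise type)."""
    vsimax: int
    failed: bool = False


def accepted_vsimax(runs: list[LoadRun], required: int = 3) -> Optional[int]:
    """Return the VSImax value seen in at least `required` failure-free runs.

    Returns None when no value has repeated often enough yet.
    """
    counts = Counter(run.vsimax for run in runs if not run.failed)
    if not counts:
        return None
    value, hits = counts.most_common(1)[0]
    return value if hits >= required else None


# Illustrative values only, not results from this study.
runs = [LoadRun(21), LoadRun(22), LoadRun(22), LoadRun(23, failed=True), LoadRun(22)]
print(accepted_vsimax(runs))
```

Under this reading, a configuration keeps being re-run until one VSImax value has been reproduced the required number of times.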
Two different workloads were used to assess the scalability of the instance types. The first workload was run at a 1080p resolution (1920x1080) using Login Enterprise’s Knowledge Worker workload, simulating a robust Microsoft Office user. Knowledge Worker is the most commonly used workload for evaluating scalability. This workload includes the following applications run in a loop:
- Microsoft Word
- Microsoft PowerPoint
- Microsoft Outlook
- Microsoft Excel
- Microsoft Edge, viewing a 1080p video
The Office applications had hardware acceleration enabled for the tests to take advantage of the GPU.
The second workload was a custom-created GPU-specific workload run at a 4K resolution (3840x2160) designed to generate GPU cycles within the session. This custom workload aimed to determine how effective GPU sharing was within a session. The workload consisted of only the following two applications:
- Microsoft Edge (the Citrix default), viewing a 4K video at 10 frames per second (fps).
- Microsoft 3D Viewer, rendering four different 3D images sequentially while changing speed and lighting effects.
Because the NVv4 instances support GPU hardware encoding but not decoding, we can expect the 4K video portion of the test to consume more CPU cycles and take longer than it would if GPU hardware decoding were supported on the host. Citrix policies were used to disable media redirection, so the GPU was not used for decoding. Performance can be increased by enabling the redirection feature to avoid CPU decoding on the host side.
Multiple times during a user session, a short scoring app runs a set of instructions and records the time it takes to run each step. Those metrics are used to generate the EUX score and the VSImax value for the test run.
These scores are then used as input along with other performance indicators to generate an average EUX score for the test. The following graph provides a sample EUX scoring for a 22-user run on the D8as_v5 instance type. With this run, the average EUX score was 7.4, and the VSImax score was greater than 22.
The theoretical maximum EUX score is 10. However, the score is most useful when compared to itself by using the same workload across different virtual machine configurations or user loads. For instance, the following chart shows the same number of test users, 22, but with a lower EUX score of 7.3 and a VSImax value of 15. One interesting observation is that the EUX score dropped during the login storm, but returned quickly after the logins were completed.
For the scalability testing, the infrastructure VMs were configured as follows:
- One Login Enterprise virtual appliance running version 4.11.2.
- Four Login Enterprise launchers.
- One Citrix Cloud Connector.
- One Active Directory domain controller that also acted as the profile server and DNS server.
- Citrix virtual application workloads running on a single Windows Server 2019 Datacenter instance or a single Windows 10 Multi-Session instance with the following:
  - Citrix Server Multi-session OS VDA 2203.0.2000.2076 (CU2) running at functional level 2106 (or later).
  - Microsoft Office M365 Business.
  - Microsoft Defender with default settings.
  - Latest Windows updates available at the time of testing.
  - Out-of-the-box settings were used unless otherwise specified.
- The Citrix Cloud DaaS service provided and managed the Delivery Controller, SQL Server, Workspace service (StoreFront equivalent), license server, and Studio management console. An Active Directory domain controller and Cloud Connector were installed separately within the Azure tenant.
The following figure depicts the test architecture.
This architectural design is for testing purposes only and does not reflect what a production environment with redundant components will look like. Administrators can refer to the Best Practices and Architecture documents on Citrix Tech Zone.
Scalability Testing Results
For this study, we focused on the AMD instance types between 4 and 32 vCPUs. The main drivers for this decision were cost and efficiency. In the Session host virtual machine sizing guidelines, Microsoft recommends limiting the VM size to between 4 vCPUs and 24 vCPUs for the following reasons:
“For multi-session, having multiple users on a two-core VM leads to the UI and apps becoming unstable, which lowers the quality of user experience. Four cores are the lowest recommended number of cores that a stable multi-session VM should have.”
“VMs shouldn’t have more than 32 cores: as the number of cores increases, the system’s synchronization overhead also increases. For most workloads, at around 16 cores, the return on investment gets lower, with most of the extra capacity offset by synchronization overhead. You’re likely to have more users from two 16-core VMs compared to one 32-core one”
With DaaS workloads, scaling up is less efficient than scaling out, so these guidelines made perfect sense and allowed us to focus the testing on areas with the best configurations for customers.
Six different AMD instance types were tested with two different operating systems. Since the workload is identical across the instance types, scalability can be derived from the results. The following table shows the workloads run on each instance type.
|Windows Server 2019||Windows 10 Multi-Session|
Before determining the cost efficiency of the AMD instance types, we needed to determine how many users can run successfully on an instance type with an ideal user experience. Fortunately, the VSImax score closely approximates how many users we can expect to host comfortably in a particular configuration.
We start with the Knowledge Worker workload, which is the one that provides the broadest use case for Citrix DaaS. The following graph shows the final scores received for Windows 10 Multi-session and Server 2019 workloads across the tested instance types.
From this data, these conclusions can be drawn:
- Scaling up the processors does not provide linear gains in the number of users. Given the synchronization overhead described earlier, which grows with the number of cores, this finding is expected.
- Windows 10 Multi-session is less efficient with resources than Server 2019. This finding is expected since Server 2019 is better optimized for hosting multiple users.
- Using GPU-enabled NV-series instances does not increase the number of users. This finding is also expected because of the overhead that GPUs add to the CPU.
Continuing with that line of thought, the chart below clearly shows that as the number of vCPUs in the instance type increases, the number of users per vCPU decreases. The conclusion for AMD processors is that scaling out is better than scaling up, since you get more users on several smaller machines than on a larger machine with the equivalent number of vCPUs.
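The users-per-vCPU comparison can be reproduced with simple arithmetic. The following Python sketch uses hypothetical instance names and made-up (vCPU, VSImax) pairs purely for illustration; they are not the values measured in this study, and the helper names are our own.

```python
def users_per_vcpu(vsimax: int, vcpus: int) -> float:
    """Density metric: recommended users divided by vCPU count."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return vsimax / vcpus


def users_at_equal_vcpus(vsimax: int, vcpus: int, total_vcpus: int) -> int:
    """Users supported when total_vcpus is split into VMs of this size."""
    return (total_vcpus // vcpus) * vsimax


# Hypothetical (vCPUs, VSImax) pairs for illustration; not the study's results.
sample = {"vm-4": (4, 12), "vm-8": (8, 22), "vm-16": (16, 40)}

for name, (vcpus, vsimax) in sample.items():
    density = users_per_vcpu(vsimax, vcpus)
    total = users_at_equal_vcpus(vsimax, vcpus, total_vcpus=16)
    print(f"{name}: {density:.2f} users/vCPU, {total} users on 16 vCPUs")
```

With these made-up inputs, four 4-vCPU machines host 48 users while a single 16-vCPU machine hosts 40, which is the scale-out advantage the paragraph describes.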
Turning to the GPU-intensive workload, we see similar trends in the VSImax users per instance type. The GPU-intensive workload was run only on the NV-series instances. The results are shown in the following graph.
We can draw two of the same conclusions from this data that we did for the Knowledge Worker workload.
- Scaling up the processors does not provide linear gains in the number of users. This finding is expected because synchronization overhead increases with the number of cores and the NVv4 series relies on CPU cycles to use the GPU. Part of this impact was due to the 4K video, which consumed CPU cycles that could not be offloaded to the GPU because redirection was not used.
- Windows 10 Multi-session is less efficient with resources than Server 2019. This finding is again expected since Server 2019 is better optimized for hosting multiple users.
However, diving further into the users per vCPU, the following chart shows that we do not have the same linear trend with Windows 10 as we do with Server 2019. In the case of Windows 10 multi-session, the 16-vCPU instance is the most efficient.
We can now divide the instance cost per hour by the expected user count to derive the cost per user hour metric. Not all regions have these instance types available, and pricing varies by region. The prices used reflect the costs in the Azure West US 2 region during August 2023. For simplicity, only the fully licensed Pay-as-you-go rate and the Hybrid benefit rate were used for this comparison. If you use Microsoft savings plans or reserved instance pricing, your actual costs will be lower.
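The cost per user hour metric described above is the hourly instance price divided by the expected user count. A minimal Python sketch follows; the prices and user count are placeholders chosen for easy arithmetic, not Azure list prices or results from this study, and `cost_per_user_hour` is our own helper name.

```python
def cost_per_user_hour(instance_cost_per_hour: float, users: int) -> float:
    """Hourly instance price divided by the number of users it hosts."""
    if users <= 0:
        raise ValueError("users must be positive")
    return instance_cost_per_hour / users


# Placeholder inputs for illustration only.
expected_users = 10          # e.g. the VSImax value for the instance
pay_as_you_go_rate = 0.40    # USD per hour, full license (hypothetical)
hybrid_benefit_rate = 0.20   # USD per hour, hybrid licensing (hypothetical)

for label, rate in [("Pay-as-you-go", pay_as_you_go_rate),
                    ("Hybrid benefit", hybrid_benefit_rate)]:
    cents = cost_per_user_hour(rate, expected_users) * 100
    print(f"{label}: {cents:.1f} cents per user hour")
```

Substituting the current price for your region and the VSImax value for your own workload gives a figure that can be compared directly across instance types.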
The following chart shows the cost per user hour by the operating system and licensing model for each AMD instance type that we tested under the Knowledge Worker workload.
Because Azure compute pricing is consistent, the costs follow our performance graphs as expected. The most cost-efficient instance type is the one with the best scalability, the D4as_v5, which comes in as low as 1.4 cents per user hour under the hybrid licensing model and 3 cents per user hour under the full license model on Windows Server 2019. A similar pattern occurs with the GPU-intensive workload, where the NV8as_v4 has the lowest cost per user hour on Server 2019: 7.8 cents with Hybrid and 13.9 cents with Full licensing, as shown in the following chart.
Our study does not end there. We still need to cover the aspect of user experience. The final step is to review the End User Experience scores to determine the most efficient instances in terms of cost and user experience. The following graph shows the average EUX score associated with the runs which received the final VSImax score we published for the Knowledge Worker workload.
One clear finding from the data is that the most cost-efficient instance type also produces the lowest EUX scores we tracked. As stated earlier, scores below 5.5 indicate an unacceptable user experience, so a score of 6.8x on the D4as_v5 is still respectable. However, the next column shows that the average EUX score jumps significantly: for an additional 0.2 cents per hour, we can provide our end users with a better experience. The GPU Intensive workload graph below shows a similar situation.
In this case, the NV32as_v4 instance type provides a slightly better score overall than the NV16as_v4 instance, so even though the NV16as_v4 is the most efficient performer, it may not provide the best end-user experience. The lower EUX score on the NV16as_v4 results from a higher user density per core than on the NV8as_v4 or NV32as_v4 instance types. Surprisingly, Windows 10 Multi-session has the best score for GPU-intensive workloads on the NV32as_v4. This is probably because the NV32as_v4 has more CPU cycles available during the 4K video playback part of the test, since redirection was not used.
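The trade-off discussed above, paying slightly more per user hour for a noticeably better EUX score, can be expressed as a simple selection rule. The Python sketch below is one possible formulation under our own assumptions: drop candidates below the 5.5 EUX threshold, then pick the highest EUX score among options that cost no more than a chosen premium over the cheapest one. The `Candidate` type, the `pick_instance` helper, and all sample figures are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    cents_per_user_hour: float
    avg_eux: float


def pick_instance(candidates: list[Candidate],
                  max_extra_cents: float,
                  min_eux: float = 5.5) -> Candidate:
    """Best EUX among acceptable options within a cost premium of the cheapest."""
    acceptable = [c for c in candidates if c.avg_eux >= min_eux]
    if not acceptable:
        raise ValueError("no candidate meets the minimum EUX score")
    cheapest = min(c.cents_per_user_hour for c in acceptable)
    affordable = [c for c in acceptable
                  if c.cents_per_user_hour <= cheapest + max_extra_cents]
    return max(affordable, key=lambda c: c.avg_eux)


# Hypothetical figures for illustration; not the study's measurements.
options = [
    Candidate("option-a", 2.0, 6.8),
    Candidate("option-b", 2.25, 7.4),
    Candidate("option-c", 3.5, 7.5),
]
print(pick_instance(options, max_extra_cents=0.5).name)
```

Setting `max_extra_cents` to 0 reduces the rule to pure cost efficiency, while a larger premium lets user experience weigh more heavily in the choice.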
Key Findings and Recommendations
After reviewing the test run data and analyzing the results, here are our key findings and recommendations for selecting Azure instances powered by AMD:
- Our testing results showed that Windows 10 multi-session instances accommodated 30% fewer users than Windows Server operating systems (Server 2019 or 2022).
- End User Experience (EUX) scores remained relatively consistent across the operating system types.
- While the most cost-efficient instance type for the Knowledge Worker workload is the D4as_v5, we recommend the D8as_v5 instance type to take advantage of the better user experience for a fraction of a cent more per hour.
- If the workload has graphics requiring hardware encoding or video playback, the NVv4 instance type provides a better user experience.
- A single GPU can be effectively shared across multiple sessions. However, which instance type to use depends mainly on the type of GPU workload.
As always, we recommend you do your own performance testing with workloads specific to your business. If you have limited insight into your workload or into which instance types you will ultimately use, you can use the information provided here to make a suitable estimate. Usually, the lower vCPU instance types provide a better balance of cost and performance.
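As a starting point for such an estimate, the number of session hosts can be approximated by dividing the user population by the users-per-host figure for the chosen instance type and rounding up. The Python sketch below is a rough illustration only: `hosts_needed` is a hypothetical helper, the inputs are made up, and the calculation ignores the redundancy and headroom a production design would add.

```python
def hosts_needed(total_users: int, users_per_host: int) -> int:
    """Smallest number of hosts that can seat total_users (ceiling division)."""
    if total_users < 0:
        raise ValueError("total_users must not be negative")
    if users_per_host <= 0:
        raise ValueError("users_per_host must be positive")
    return (total_users + users_per_host - 1) // users_per_host


# Hypothetical inputs: 500 concurrent users, 22 users per host.
print(hosts_needed(500, 22))
```

Multiplying the result by the hourly instance price then gives a first-order hourly compute cost to refine with your own testing.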