Product Documentation

Local Host Cache sizing and scaling

Jan 24, 2017

Introduction

Local Host Cache was introduced in XenApp and XenDesktop 7.12 to replace the connection leasing feature delivered in XenDesktop 7.6.  Local Host Cache covers more scenarios than connection leasing, but requires different design considerations.

Details on the connection leasing feature design considerations can be found at https://www.citrix.com/blogs/2014/11/11/xendesktop-7-6-connection-leasing-design-considerations/.

Highlights

  • Local Host Cache supports more use cases than connection leasing.
  • When operational, Local Host Cache requires more resources (CPU and memory) than connection leasing.
  • During outage mode, only a single broker per zone will handle VDA registrations and broker sessions.
  • An election process decides which broker will be active during outage, but does not take into account broker resources.
  • If any single broker in a zone could not handle all logons on its own during normal operation, it will not cope well in outage mode.
  • No site management is available during outage mode.
  • A highly available SQL Server is still the recommended design.
  • For intermittent database connectivity scenarios, it is still better to isolate the SQL Server and leave the site in outage mode until all underlying issues are fixed.
  • There is a limit of 5,000 VDAs per zone (not enforced).
  • There is no 14-day limit.
  • Pooled desktops are not supported in outage mode, in the default configuration.

Architecture

Connection leasing was introduced in XenDesktop 7.6 to allow continued access to resources during periods of site database outages.  It did not support VDI pooled desktops, and by default users must have connected to a resource within the previous 14 days.  A further restriction was that it attempted to connect users to the last desktop or application host they had connected to during normal operation; if that host was not available, the connection was not brokered.

The architecture of the two technologies (connection leasing and Local Host Cache) is very different, and they require different resources to operate.  Connection leasing creates individual XML lease files, which can require several GB of disk space, depending on the number of resources in a site.  Local Host Cache uses a local SQL Server database and is more efficient in disk space usage, but does require considerably more memory and CPU than connection leasing.  Both have phases of synchronization where details from the main site database are synced to the brokers (Controllers).   The connection leasing initial sync can cause considerable IOPS due to the sheer number of individual files being created on the file system.  Although Local Host Cache uses a SQL database that still requires IOPS, it has the advantage of SQL optimizing these writes.

In a connection leasing setup with multiple brokers, each broker has a copy of the XML leases and is able to broker connections during an outage, helping balance the load.  However, with Local Host Cache a single broker is elected to broker all connections and handle VDA registrations.  All VDAs in the site will re-register with this single broker, and as such, that broker will experience a higher demand for resources compared to a multi-broker site in normal operation, especially in sites with large numbers of VDAs.

Local Host Cache uses Microsoft LocalDB, which appears in Task Manager as the sqlservr.exe process.  It has been configured to use up to 1 GB of memory for database buffer pool caching.  However, the process will grow beyond this, as the SQL engine needs memory for itself and other smaller caches.  In general, the longer the outage and the more resources that are accessed during outage mode, the more LocalDB memory use will increase.  When site database connectivity is restored, sqlservr.exe will hold on to this memory rather than immediately releasing it back to the operating system.
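
To observe this behaviour on a broker, a minimal sketch such as the following can report the working set of the LocalDB process (this assumes the process name sqlservr, as noted above):

# Report the current memory (working set) of the LocalDB process
Get-Process -Name sqlservr -ErrorAction SilentlyContinue |
    Select-Object Name, Id, @{Name='WorkingSetMB'; Expression={[math]::Round($_.WorkingSet64 / 1MB)}}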

Effect of CPU sockets and cores during outage mode

In previous releases of XenApp and XenDesktop, administrators did not necessarily concern themselves with the CPU configuration of the broker (Controller) machine, that is, the number of sockets and cores in the physical or virtual machine.

Local Host Cache uses a runtime version of SQL Server called LocalDB, whose licensing limits it to the lesser of four cores or a single socket.  This can have a significant performance effect when the physical or virtual machine has been configured with multiple sockets that each have only one or two cores.  A broker machine with 4 sockets and one core per socket limits LocalDB to using a single core, whereas the same VM configured as a 1 socket, 4 core machine allows LocalDB to access all 4 cores (albeit sharing them with other processes).  During outage mode, LocalDB runs the same broker and SQL code as during normal operation.  Many of the SQL queries can be CPU intensive and have a direct impact on the performance of brokering during outage mode.
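
To check the socket and core layout that LocalDB will see on a broker, a quick sketch along these lines can be used (Win32_Processor returns one entry per socket):

# Show the number of sockets, and the cores in each socket
Get-CimInstance Win32_Processor |
    Select-Object DeviceID, NumberOfCores, NumberOfLogicalProcessors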

Other factors include the site configuration itself:

  • The number of published applications
  • The number of users being brokered
  • The rate at which users are attempting to launch sessions
  • Active Directory performance

As the total broker CPU utilization approaches 100%, brokering response time will increase, logons will take longer to process and some logon attempts may fail.
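
One simple way to watch for this during an outage is to sample total CPU utilization on the elected broker, for example with a sketch like this:

# Sample total broker CPU utilization every 10 seconds
Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 10 -Continuous |
    ForEach-Object { '{0:N1}% CPU at {1}' -f $_.CounterSamples[0].CookedValue, $_.Timestamp }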

Sites with multiple brokers

During site outage mode, only a single broker processes registration and logon requests.   In a multi-broker site an election process takes place to nominate the broker that will be active during the outage.  However, this election process does not consider the physical resources available to the brokers.  This means that in a site where brokers have different amounts of resources, the elected broker will not necessarily be the most powerful in terms of CPU or RAM, which could potentially lead to poor performance during outage mode.   It is important that each broker meets the additional requirements of Local Host Cache, in case it is elected.

Synchronization with the site database

The CitrixConfigSync Service handles the importing of data from the site database into a local copy on the brokers.  It monitors the site database for changes to the site configuration, and triggers a new import when changes occur.  A copy of the current local database is made before the import starts. The larger the number of resources (such as VDAs) in a site, the longer the import will take, but it should be less than ten minutes for a site with 5000 VDAs.

Database location

The local database is stored in:

C:\Windows\ServiceProfiles\NetworkService\HaDatabaseName.mdf

To ensure reliability, the CitrixConfigSync Service makes a backup of the last successfully synchronized database before starting a new site database sync. If for any reason the sync does not succeed, the backup is used until a successful sync completes. Do not copy the database manually.
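
To see the Local Host Cache database files and when they were last updated, something like the following sketch can be used (the exact set of files, such as any backup copies, may differ between releases):

# List the Local Host Cache database files, their sizes, and last write times
Get-ChildItem 'C:\Windows\ServiceProfiles\NetworkService' -Filter 'HaDatabaseName*' |
    Select-Object Name, @{Name='SizeMB'; Expression={[math]::Round($_.Length / 1MB, 1)}}, LastWriteTime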

Comparison of Local Host Cache with connection leasing

 

Disk space
  • Connection leasing: 2 GB recommended.
  • Local Host Cache: Depends on site configuration; see the PowerShell script in the Appendix to monitor size. For 50 RDS hosts with 125,000 users, 300 MB is used.

RAM
  • Connection leasing: 100 MB.
  • Local Host Cache: 3 GB in total: ~1 GB for SQL Server (LocalDB) and ~2 GB for the High Availability Service and the CitrixConfigSync Service.

Time to sync configuration
  • Connection leasing: IOPS dependent; 40,000 VDAs: ~26 minutes.
  • Local Host Cache: 5,000 VDAs: ~7 minutes.

Time to activate during outage
  • Connection leasing: 150 seconds: the 30 second default SQL timeout plus a 120 second wait period (configurable).
  • Local Host Cache: Depends on the number of VDAs and the last registration sync with the broker.  Only a single broker is available for VDA registration during outage mode, so for large numbers of VDAs it can be many minutes before all VDAs are registered.

Time to restore normal operations
  • Connection leasing: 120 seconds for leasing to deactivate, then VDAs must re-register with the brokers.
  • Local Host Cache: As above; VDAs must deregister from the secondary broker and re-register with the primary broker.

Number of VDAs supported
  • Connection leasing: 50,000.
  • Local Host Cache: 5,000. A site can have more than this, but the time needed to sync the site database will grow with the number of VDAs, and the performance of a single broker with a large number of VDAs might result in some connections not being brokered during the outage.

Site management during outage
  • Connection leasing: No.
  • Local Host Cache: No.

 

Configuring LocalDB memory limits

The LocalDB process has been limited to 1 GB of RAM for caching, which should not normally need to be changed.   This value can be changed if required; in order for the changes to take effect, the site must be in normal operation (not outage mode) and a site database sync must be forced.

To reduce the memory to 768 MB:

Step 1. Edit the file C:\Program Files\Citrix\Broker\Service\HighAvailabilityService.exe.config. 

Look for the <appSettings> section and add an entry:

<add key="MaxServerMemoryInMB" value="768" />

Step 2. To force a database sync, stop the High Availability Service. If sqlservr.exe is running, stop it by using Task Manager or PowerShell (a sketch follows).
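
For example, assuming the CitrixHighAvailabilityService service name used in the Appendix scripts:

# Stop the High Availability Service and, if present, the LocalDB process
Stop-Service -Name CitrixHighAvailabilityService
Stop-Process -Name sqlservr -Force -ErrorAction SilentlyContinue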

Step 3. Make a trivial change to the site, for example change the description of a Delivery Group by adding a “.”, then remove it again.  Then start the High Availability Service; that starts sqlservr.exe and a sync should occur.

Reducing the memory too much will affect performance and is not recommended.  Increasing it to 2 GB does not improve performance significantly, because CPU is more of a bottleneck than RAM.

Enabling or disabling Local Host Cache

Both connection leasing and Local Host Cache can be disabled, but only one of them can be active at a time.

Set-BrokerSite -ConnectionLeasingEnabled $False

Set-BrokerSite -LocalHostCacheEnabled $True
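
To confirm which feature is currently active, the site object can be inspected; this sketch assumes the corresponding properties are exposed by Get-BrokerSite, matching the Set-BrokerSite parameters above:

# Check which of the two features is currently enabled
Get-BrokerSite | Select-Object ConnectionLeasingEnabled, LocalHostCacheEnabled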

Limitations

Desktops must have been assigned before they can be used in outage mode.  Unassigned desktops will not be available for brokering. If an outage occurs before all assignments have been synchronized, this can lead to desktops being unavailable and reported as “in maintenance mode”, even though a user has actually been assigned a desktop.

Pooled desktops are not supported in outage mode, in the default configuration. There is a workaround, but it has potential security and performance implications. If you configure a Delivery Group containing pooled desktops to not restart on logoff, any powered-on pooled desktops in that group will be available in outage mode. However, after a user logs off, the desktop will not be in a clean state because the desktop is not restarted. This could be a security concern in any scenario. If the next user of that desktop is a local administrator of that desktop, data from a previous user might be accessible.  And although that risk is less of a concern for standard (non-admin) users, keep in mind that applications could behave improperly and cause performance issues over time. Important: Administrators should carefully consider the potential implications of using this workaround for using non-restarted pooled desktops in outage mode.
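
As a sketch of the workaround described above, one way to configure a pooled Delivery Group so that its desktops are not restarted (shut down) after use is shown below; this assumes power-managed machines, the ShutdownDesktopsAfterUse setting, and a hypothetical group name of 'Pooled Desktops':

# Keep pooled desktops powered on after logoff (consider the security implications described above)
Set-BrokerDesktopGroup -Name 'Pooled Desktops' -ShutdownDesktopsAfterUse $false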

As with connection leasing, no site changes can be made during outage; the database is effectively a snapshot of the main site database and is discarded every time a new synchronization occurs.

Performance comparison of 6 and 8 vCPU broker under stress conditions

A site was configured with 50 RDS workers and 5,075 VDI VDAs. Each RDS worker is capable of supporting 2,500 simulated users.  A launch rate of 20 users per second was set.  Different numbers of published applications were added to the site.

For the RDS workers, 100,000 users were launched; for VDI, 5,075.

Tests were conducted in both normal operation and Local Host Cache (outage) mode.

In the 6 vCPU system, there is very little CPU headroom, and a small number of exceptions (<10) occurred when the published app count was >0.

Windows Update was disabled by group policy because, during a number of runs, the tiworker.exe process (Windows Modules Installer Worker) was found to use nearly a whole core for extended periods of time.  This resulted in large numbers of launch failures, so the tests were rerun.  Although the broker processor is quite old, tiworker.exe will consume a full core even on a newer CPU, which would still affect the test.

Hypervisor configuration

  • XenServer 7.0
  • AMD Opteron 8431 2.4 GHz – 4x6 cores
  • 128 GB RAM
  • Read caching enabled
  • Windows Storage Server 2012R2 with SSD based storage

Broker configuration

  • 2 sockets with 3 cores each (6 vCPU) and 2 sockets with 4 cores each (8 vCPU)
  • 10 GB RAM
  • Windows Server 2016
  • Single Delivery Group for each VDA type
  • All applications visible to the user
  • XenDesktop 7.12 

StoreFront configuration

  • 6 StoreFront servers
  • Windows Server 2016
  • 4 vCPU
  • 10 GB RAM

NetScaler configuration

  • VPX 8000 version 11.063.16

SQL Server configuration

  • Intel E5-2630 2.3GHz 2x12 cores 
  • SQL Server 2012
  • 64 GB RAM

Note:  Only the site database was taken offline; the monitoring and configuration databases were still accessible during the test.  This means that the Monitor Service was consuming some CPU during the tests.

[Chart: logon response times versus number of published applications]

Generally, as the number of applications increases, the logon response time increases and outage response times are worse than normal operation.  Increasing the number of vCPUs improves response times.  Bear in mind the CPU used in the tests is fairly old and more modern CPUs will generally give better performance.

[Table: total logon times for 100,000 users]

The 'theoretical min' row is the absolute minimum time in which 100,000 users could log on if the environment processed 20 launches per second: 1 hour 23 minutes 20 seconds.  In these tests, the 0 applications row achieved 1:30:57 in the 6 vCPU case and 1:30:48 in the 8 vCPU case.  The performance of the Active Directory domain will have some impact on how quickly users are authenticated.
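
The theoretical minimum follows directly from the launch rate; as a quick check, assuming a constant 20 launches per second:

# 100,000 users at 20 launches per second = 5,000 seconds
[TimeSpan]::FromSeconds(100000 / 20)   # 01:23:20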

[Chart: LocalDB memory use during the tests]

Under normal conditions, LocalDB memory use will increase during operation and hang on to that memory unless there is pressure on the system; the more users that are processed, the larger the memory used, up to the limit of 1 GB.  In the first VDI test, LocalDB had not released memory from the previous RDS run. For subsequent VDI tests, the broker was rebooted before the test and less memory was used.

[Charts: broker CPU utilization, 6 vCPU and 8 vCPU configurations]

Note: Values have been normalized to total system CPU, so 33.33% represents two of the six cores on the 6 vCPU setup, and 31.25% represents 2-1/2 of the eight cores on the 8 vCPU setup.

The term “off” refers to normal site operation; “on” means Local Host Cache is active.

The requested launch rate of 20 users per second is quite demanding on the broker, even under normal operation.  Under normal operation, the 6 vCPU system has no headroom; broker CPU increases as the number of published applications increases, resulting in slower response times.  When Local Host Cache is active, the secondary broker (High Availability) service has to compete with LocalDB (SQL) for CPU resource, giving worse response times than normal operation.

In the 8 vCPU system, there is some headroom and the LocalDB will scale up, but in this environment, it peaked at 2-1/2 cores, despite there being a small amount of CPU still available.

During normal operation, LocalDB will not be consuming CPU unless a synchronization is occurring.

During outage, the primary Broker Service uses virtually no CPU, because brokering is handled by the High Availability Service.

Database size

For the 5,075 VDI configuration, the LocalDB database was around 40 MB; for the 100,000-user RDS configuration it varied between 100 and 300 MB, depending on the number of applications and logons. Because a copy of the database is taken before a new import starts, allow 1 GB of space for LocalDB.

Summary

During site database outage, Local Host Cache supports a wider range of resources and conditions than connection leasing, but requires more CPU and memory when operating. 

In multi-broker sites, any broker might be elected as the outage broker, and therefore all brokers must have enough resources to cope in outage mode.  No evaluation of broker resources is done, so in a site with brokers of differing capability, it is possible for the least powerful broker to be elected during an outage.

The layout of cores and sockets must be considered as part of the design of the brokers.

The number of published applications will have an effect on logon response times and the maximum logon throughput.

Brokers with insufficient CPU resource may result in failed launches.

An additional two cores and 2 GB of RAM is a good starting point for testing performance in Local Host Cache outage mode compared to connection leasing.

1 GB of disk space will be sufficient for the LocalDB database.

An overloaded broker will result in failed connections.

Appendix

The following PowerShell scripts can provide more information about the operation of Local Host Cache and the status of the site.

Force a database sync

# Stop the High Availability Service and, if present, the LocalDB process
Stop-Service -Name CitrixHighAvailabilityService
Stop-Process -Name sqlservr -Force -ErrorAction SilentlyContinue

# Make a trivial change to the site configuration so that a new sync is triggered
$dg1 = Get-BrokerAssignmentPolicyRule -MaxRecordCount 1
$desc = $dg1.Description
Set-BrokerAssignmentPolicyRule -Description ($desc + ".") -Name $dg1.Name
Start-Sleep 2
Set-BrokerAssignmentPolicyRule -Description $desc -Name $dg1.Name

# Restart the service; a new sync should start
Start-Service -Name CitrixHighAvailabilityService

Monitor database size

# Assumes that c:\logs exists and is writeable
while ($true) {
    try {
        $size = (Get-Item -Path C:\Windows\ServiceProfiles\NetworkService\HaDatabaseName.mdf).Length / 1KB
        Write-Output ("$(Get-Date -DisplayHint Time)  Size of LocalDB: $size KB") | Tee-Object -Append -FilePath 'c:\logs\dbsize.txt'
        Start-Sleep 10
    }
    catch {
        $e = $_.Exception.Message
        Write-Output ("$(Get-Date -DisplayHint Time)  Got exception $e") | Tee-Object -Append -FilePath 'c:\logs\dbsize.txt'
    }
}

Monitor sync time

# Measure how long it took to run a Local Host Cache secondary broker import
# The default is to scan for any imports over the last 24 hours
# Add a command line parameter to change this

param([UInt32]$timeWindowInHours = 24)

 

 

 

function Get-CssHistory ($timeWindowInHours) {

    $startTime =  (Get-Date).AddHours(-$timeWindowInHours)

    Write-Output ("Filtering by last $timeWindowInHours hours.  Going back to $startTime ")

   

    $update = $null

    Get-EventLog -LogName Application -Source 'Citrix ConfigSync Service' -After  (Get-Date).AddHours(-$timeWindowInHours) | sort Index | %{

        if ($_.InstanceId -eq 503) {

            # Update start

            if ($update) {

                # Previous update must have failed :(

                '{0} {1} {2} *** No completion recorded ***' -f $update.TimeGenerated,$update.EntryType, $update.InstanceId

                $update = $null

            }

            $update = $_

        } elseif ($_.InstanceId -eq 504) {

            # Update complete

            if (-not $update) {

                '{0} {1} {2} *** Update completion recorded when no update in progress ***' -f $_.TimeGenerated,$_.EntryType, $_.InstanceId

            } else {

                $elapsed = $_.TimeGenerated - $update.TimeGenerated

                '{0} {1} {2} Update took {3}' -f $update.TimeGenerated, $update.EntryType, $update.InstanceId, $elapsed

            }

            $update = $null

        } else {

            '{0} {1} {2} "{3}"' -f $_.TimeGenerated,$_.EntryType, $_.InstanceId, $_.Message

        }

    }

}

 

Get-CssHistory $timeWindowInHours

Monitor registered VDAs on primary and secondary broker

# Monitor the number of VDAs registered against the
# primary or secondary broker.
# To allow commands to be run against the secondary broker, set:

# HKLM\SOFTWARE\Citrix\DesktopServer\LHC
# EnableCssTestMode DWORD = 1

 

 

 

function dbstate()

{   # Set the SQL server where the site database is stored

    # Example values from the test environment; change these to match your site
    $sqlServer = "Cam-sql-1.chsys2.citrix.com"
    $dbname = 'JDElek81RTMSite'

    # Note you will need to change the Version= to match your SQL CLR types, for 2016 the Version is 13.0.0.0

    Add-Type -AssemblyName "Microsoft.SqlServer.Smo, Version=13.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91"

    $server = New-Object Microsoft.SqlServer.Management.Smo.Server ($sqlServer)

   

    if($server.Databases[$dbname]) {

        if ($server.Databases[$dbname].IsAccessible)

        {

            $dbState = ("$dbname is ONLINE")

        }

        else

        {

            $dbState = ("$dbname is OFFLINE")

        }

    }

    else {

           $dbState = ("$dbname not found on server $sqlserver")   

        }

        return $dbState

}

 

Add-PSSnapin Citrix*

 

 

Write-Output ("Setting Registry key to enable commands against 2ndry Broker")

# Set the registry to allow us to run commands against 2ndry broker   

$regkey = "HKLM:\SOFTWARE\Citrix\DesktopServer\LHC"      

$name = "EnableCssTestMode"

if ( -not (Test-Path $regkey ) ){

        New-Item $regkey -Force

 }

 

$exists = Get-ItemProperty -Path "$regkey" -Name "$name" -ErrorAction SilentlyContinue

If (($exists -ne $null) -and ($exists.Length -ne 0)) {

   Remove-ItemProperty $regkey -Name $name   

}

New-ItemProperty $regkey -Name $name -Value 1 -PropertyType "DWord"

 

$lastPrimaryCount =-1

$lastSecondaryCount =-1

 

# Port addresses of the primary and 2ndry brokers

$s = @{'AdminAddress'='localhost:89'}

$p = @{'AdminAddress'='localhost:80'}

$currentBroker =$p

$lastDBState = dbstate

 

Write-Output("Ready to start measuring registered VDA count.") |Tee-Object -file 'c:\logs\vdareg.txt'

Write-Output("Logging to c:\logs\vdareg.txt")

Write-Output ("$(Get-Date -DisplayHint Time) At start of test $lastDBState" ) |Tee-Object -file 'c:\logs\vdareg.txt' -Append

 

while ($true)

{

    $dbstatus = dbstate

    if ($lastDBState -ne $dbstatus)

    {

       Write-Output ("$(Get-Date -DisplayHint Time) $dbstatus" ) |Tee-Object -file 'c:\logs\vdareg.txt' -Append

       $lastDBState = $dbstatus

    }

   

    try {

    # Don't forget -AdminAddress localhost:80 for the primary broker, or localhost:89 for the secondary

    $lastPrimaryCount= $primarycount   

    $primarycount = ((Group-BrokerMachine  -erroraction SilentlyContinue -property 'RegistrationState' -DesktopGroupName 'simulated desktops' @p) |where {$_.Name -eq 'Registered'}).Count   

   

  }

      catch {

      $e = $_.Exception.message

      $lastprimaryCount =-1

      $primaryCount = -1

    }

 

    try {

        $lastSecondaryCount= $secondarycount        

        $secondarycount = ((Group-BrokerMachine -erroraction SilentlyContinue -property 'RegistrationState' -DesktopGroupName 'simulated desktops' @s) |where {$_.Name -eq 'Registered'}).Count   

   

        }

 

    catch {

      $e = $_.Exception.message

        # If there has been no sync yet, "Data store has not been configured " error will occur

        #Write-Output @($_.Exception.Message)

        Write-Output ("$(Get-Date -DisplayHint Time) Exception triggered getting 2ndry Broker data: $e") |Tee-Object -file 'c:\logs\vdareg.txt' -Append

        #Write-Output ( Unable to get 2ndry VDA count.  Has there been a sync yet? ")

        $lastsecondaryCount = -1

        $secondarycount = -1

    }

    $primarydiff = [math]::abs($primarycount - $lastprimaryCount)

    $secondarydiff = [math]::abs($secondarycount - $lastsecondaryCount)

  Write-Output ("$(Get-Date -DisplayHint Time)  Current Primary registered VDAs: $primarycount, last  $lastprimaryCount diff :$primarydiff  2ndry :$secondarycount, last $lastsecondaryCount, diff: $secondarydiff  " ) |Tee-Object -file 'c:\logs\vdareg.txt' -Append

 

  Start-Sleep 10

 

}

 

Read-Host 

This article was written by Joe Deller.