The policy infrastructure on the Citrix®
NetScaler® appliance supports the ASCII and UTF-8
character sets. The default character set is ASCII. If the traffic
for which you are configuring an expression consists of only ASCII
characters, you need not specify the character set in the
expression. However, you must specify the character set in every
simple expression that is meant for UTF-8 traffic. To specify the
UTF-8 character set in a simple expression, you must include the
SET_CHAR_SET(<charset>) function, with <charset> specified as UTF_8, as shown in the following examples:
In an expression, the SET_CHAR_SET()
function must be introduced at the point in the expression after
which data processing must be carried out in the specified
character set. For example, in the expression
HTTP.REQ.BODY(1000).AFTER_REGEX(re/following example/).BEFORE_REGEX(re/In the preceding example/).CONTAINS_ANY("Greek_alphabet"),
if the strings stored in the pattern set "Greek_alphabet" are in
UTF-8, you must include the SET_CHAR_SET(UTF_8) function
immediately before the CONTAINS_ANY("<string>") function, as
follows:
HTTP.REQ.BODY(1000).AFTER_REGEX(re/following example/).BEFORE_REGEX(re/In the preceding example/).SET_CHAR_SET(UTF_8).CONTAINS_ANY("Greek_alphabet")
The SET_CHAR_SET() function sets the
character set for all further processing (that is, for all
subsequent functions) in the expression unless it is overridden
later in the expression by another SET_CHAR_SET() function that changes the
character set. Therefore, if all the functions in a given simple
expression are intended for UTF-8, you can include the SET_CHAR_SET(UTF_8) function immediately after
functions that identify text (for example, the HEADER("<name>") or BODY(<int>) functions). In the second example that follows the first paragraph
above, if the ASCII
arguments passed to the AFTER_REGEX() and
BEFORE_REGEX() functions are changed to
UTF-8 strings, you can include the SET_CHAR_SET(UTF_8) function immediately after
the BODY(1000) function, as follows:
HTTP.REQ.BODY(1000).SET_CHAR_SET(UTF_8).AFTER_REGEX(re/<UTF-8 string>/).BEFORE_REGEX(re/<UTF-8 string>/).CONTAINS_ANY("Greek_alphabet")
The UTF-8 character set is a superset of the
ASCII character set, so expressions configured for the ASCII
character set continue to work as expected if you change the
character set to UTF-8.
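The superset relationship can be checked outside the appliance. The following Python sketch is an illustration only (it is not NetScaler expression syntax, and the sample string is an assumption chosen for the example). It shows that ASCII text produces the same bytes under both character sets, which is why reading ASCII data as UTF-8 does not change it:

```python
# Illustration only: plain Python, not NetScaler expression syntax.
sample = "Content-Type: text/html"  # hypothetical ASCII-only text

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")

# Every ASCII character maps to the same single byte in UTF-8.
same_bytes = ascii_bytes == utf8_bytes

# Bytes produced as ASCII therefore decode unchanged as UTF-8.
round_trip = ascii_bytes.decode("utf-8") == sample
```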
Compound Expressions with Different Character Sets
In a compound expression, if one subset of expressions is
configured to work with data in the ASCII character set and the
rest of the expressions are configured to work with data in the
UTF-8 character set, the character set specified for each
individual expression is considered when the expressions are
evaluated individually. However, when processing the compound
expression, just before processing the operators, the appliance
promotes the character set of the returned ASCII values to UTF-8.
For example, in the following compound expression, the first simple
expression evaluates data in the ASCII character set while the
second simple expression evaluates data in the UTF-8 character set.
However, when processing the compound expression, just before
evaluating the "is equal to" Boolean operator, the NetScaler
appliance promotes the character set of the value returned by HTTP.REQ.HEADER("MyHeader") to UTF-8.
The first simple expression in the following example evaluates
data in the ASCII character set. However, when the NetScaler
appliance processes the compound expression, just before
concatenating the results of the two simple expressions, the
appliance promotes the character set of the value returned by HTTP.REQ.BODY(10) to UTF-8.
Consequently, the compound expression returns
data in the UTF-8 character set.
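The promotion step can be pictured with a small Python sketch. It is a conceptual model only, not the appliance's implementation, and the header and body values are made up for illustration. An ASCII result is reinterpreted as UTF-8 before the operator runs, which is always possible because every ASCII byte sequence is also valid UTF-8:

```python
# Conceptual model only; the values below are hypothetical.
header_value = b"Bucher"                # result of an ASCII simple expression
body_value = "Bücher".encode("utf-8")   # result of a UTF-8 simple expression

# "Promote" the ASCII result by decoding its bytes as UTF-8.
promoted_header = header_value.decode("utf-8")
utf8_body = body_value.decode("utf-8")

# Both operands now share one character set, so the operators can be applied.
are_equal = promoted_header == utf8_body   # comparison
combined = utf8_body + promoted_header     # concatenation, result is UTF-8 text
```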
Specifying the Character Set Based on the Character Set of Traffic
You can set the character set to UTF-8 on the basis of traffic
characteristics. If you are not sure whether the character set of
the traffic being evaluated is UTF-8, you can configure a compound
expression in which the first expression checks for UTF-8 traffic
and subsequent expressions set the character set to UTF-8.
Following is an example of a compound expression that first checks
the value of "charset" in the request's Content-Type header for
"UTF-8" before checking whether the first 1000 bytes in the request
contain the UTF-8 string Bücher:
'; ', '"').VALUE("charset").EQ("UTF-8") &&
If you are sure that the character set of the
traffic being evaluated is UTF-8, the second expression in the
example is sufficient.
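The two-step logic described above (check the declared character set first, then search the start of the body as UTF-8) can be sketched in Python. This is a hedged illustration rather than appliance behavior: the function names declared_charset and body_contains are hypothetical, the header parsing relies on Python's standard email package, and the charset comparison here is case-insensitive, which may differ from how a NetScaler expression compares the value.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def body_contains(content_type: str, body: bytes, needle: str,
                  limit: int = 1000) -> bool:
    """Search the first `limit` bytes as UTF-8, but only for UTF-8 traffic."""
    if declared_charset(content_type) != "utf-8":
        return False
    # errors="ignore" skips bytes that do not form valid UTF-8,
    # such as a multi-byte character cut off at the limit.
    text = body[:limit].decode("utf-8", errors="ignore")
    return needle in text
```

For example, body_contains('text/plain; charset="UTF-8"', "Neue Bücher".encode("utf-8"), "Bücher") evaluates to True, while the same body with a Content-Type that declares another character set, or none, evaluates to False.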
Character and String Literals in Expressions
During expression evaluation, character literals and string literals, which are enclosed in single quotation marks ('') and double quotation marks (""), respectively, are considered to be literals in the UTF-8 character set, even if the current character set is ASCII. In a given expression, if a function is operating in the ASCII character set and you include a non-ASCII character in a character or string literal passed to it, an error is returned.
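A quick way to see whether a literal is safe to use with ASCII processing is to test whether it can be encoded as ASCII. The Python helper below is a hypothetical pre-check written for illustration (fits_ascii is not a NetScaler function); it only mirrors the rule that a non-ASCII character in a literal cannot be handled in the ASCII character set:

```python
def fits_ascii(literal: str) -> bool:
    """Return True if every character in the literal is an ASCII character."""
    try:
        literal.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character (for example "ü") has no ASCII encoding.
        return False
```

With this helper, fits_ascii("Books") is True and fits_ascii("Bücher") is False.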
Values in Hexadecimal and Octal Formats
When configuring an expression, you can enter values in octal and hexadecimal formats. However, each hexadecimal or octal byte is considered a UTF-8 byte. Invalid UTF-8 bytes result in errors regardless of whether the value is entered manually or pasted from the clipboard. For example, "\xce\x20" is an invalid UTF-8 character because "ce" cannot be followed by "20" (each byte in a multi-byte UTF-8 character must have the high bit set). Another example of an invalid UTF-8 character is "\xce \xa9", because the two hexadecimal bytes are separated by a white-space character.
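Byte sequences can be checked for UTF-8 validity before they are entered. The following Python sketch is an offline check, not part of the appliance, and is_valid_utf8 is a hypothetical helper name. It applies a strict UTF-8 decoder to the sequences discussed above and to the well-formed two-byte sequence "\xce\xa9":

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xCE starts a two-byte character and 0xA9 is a valid continuation byte.
well_formed = is_valid_utf8(b"\xce\xa9")

# 0x20 does not have the high bit set, so it cannot continue 0xCE.
broken_pair = is_valid_utf8(b"\xce\x20")

# A space between the two bytes is itself the byte 0x20, so this also fails.
split_pair = is_valid_utf8(b"\xce \xa9")
```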
Terminal Connection Settings for UTF-8
When you connect to the NetScaler appliance through a terminal connection (by using PuTTY, for example), you must set the character set for transmission of data to UTF-8.