ADC

Specify the character set in expressions

September 21, 2020

Contributed by:

S S

The policy infrastructure on the Citrix® Citrix ADC® appliance supports the ASCII and UTF-8 character sets. The default character set is ASCII. If the traffic for which you are configuring an expression consists of only ASCII characters, you need not specify the character set in the expression. However, you must specify the character set in every simple expression that is meant for UTF-8 traffic. To specify the UTF-8 character set in a simple expression, you must include the SET_CHAR_SET(<charset>) function, with <charset> specified as UTF_8, as shown in the following examples:

HTTP.REQ.BODY(10).SET_CHAR_SET(UTF_8).CONTAINS("ß")

HTTP.RES.BODY(100).SET_CHAR_SET(UTF_8).BEFORE_STR("Bücher").AFTER_STR("Wörterbuch")
<!--NeedCopy-->

In an expression, the SET_CHAR_SET() function must be introduced at the point in the expression after which data processing must be carried out in the specified character set. For example, in the expression HTTP.REQ.BODY(1000).AFTER_REGEX(re/following example/).BEFORE_REGEX(re/In the preceding example/).CONTAINS_ANY(“Greek_ alphabet”), if the strings stored in the pattern set “Greek_alphabet” are in UTF-8, you must include the SET_CHAR_SET(UTF_8) function immediately before the CONTAINS_ANY(“<string>”) function, as follows:

HTTP.REQ.BODY(1000).AFTER_REGEX(re/following example/).BEFORE_REGEX(re/In the preceding example/).SET_CHAR_SET(UTF_8).CONTAINS_ANY("Greek_ alphabet")

The SET_CHAR_SET() function sets the character set for all further processing (that is, for all subsequent functions) in the expression unless it is overridden later in the expression by another SET_CHAR_SET() function that changes the character set. Therefore, if all the functions in a given simple expression are intended for UTF-8, you can include the SET_CHAR_SET(UTF_8) function immediately after functions that identify text (for example, the HEADER(“<name>”) or BODY(<int>) functions). In the second example that follows the first paragraph above, if the ASCII arguments passed to the AFTER_REGEX() and BEFORE_REGEX() functions are changed to UTF-8 strings, you can include the SET_CHAR_SET(UTF_8) function immediately after the BODY(1000) function, as follows:

HTTP.REQ.BODY(1000).SET_CHAR_SET(UTF_8).AFTER_REGEX(re/Bücher/).BEFORE_REGEX(re/Wörterbuch/).CONTAINS_ANY("Greek_alphabet")

The UTF-8 character set is a superset of the ASCII character set, so expressions configured for the ASCII character set continue to work as expected if you change the character set to UTF-8.

Compound expressions with different character sets

In a compound expression, if one subset of expressions is configured to work with data in the ASCII character set and the rest of the expressions are configured to work with data in the UTF-8 character set, the character set specified for each individual expression is considered when the expressions are evaluated individually. However, when processing the compound expression, just before processing the operators, the appliance promotes the character set of the returned ASCII values to UTF-8. For example, in the following compound expression, the first simple expression evaluates data in the ASCII character set while the second simple expression evaluates data in the UTF-8 character set:

HTTP.REQ.HEADER("MyHeader") == HTTP.REQ.BODY(10).SET_CHAR_SET(UTF_8)

However, when processing the compound expression, just before evaluating the “is equal to” Boolean operator, the Citrix ADC appliance promotes the character set of the value returned by HTTP.REQ.HEADER(“MyHeader”) to UTF-8.

The first simple expression in the following example evaluates data in the ASCII character set. However, when the Citrix ADC appliance processes the compound expression, just before concatenating the results of the two simple expressions, the appliance promotes the character set of the value returned by HTTP.REQ.BODY(10) to UTF-8.

HTTP.REQ.BODY(10) + HTTP.REQ.HEADER("MyHeader").SET_CHAR_SET(UTF_8)

Consequently, the compound expression returns data in the UTF-8 character set.

Specify the character set based on the character set of traffic

You can set the character set to UTF-8 on the basis of traffic characteristics. If you are not sure whether the character set of the traffic being evaluated is UTF-8, you can configure a compound expression in which the first expression checks for UTF-8 traffic and subsequent expressions set the character set to UTF-8. Following is an example of a compound expression that first checks the value of “charset” in the request’s Content-Type header for “UTF-8” before checking whether the first 1000 bytes in the request contain the UTF-8 string Bücher:

HTTP.REQ.HEADER("Content-Type").SET_TEXT_MODE(IGNORECASE).TYPECAST_NVLIST_T('=', '; ', '"').VALUE("charset").EQ("UTF-8") && HTTP.REQ.BODY(1000).SET_CHAR_SET(UTF_8).CONTAINS("Bücher")

If you are sure that the character set of the traffic being evaluated is UTF-8, the second expression in the example is sufficient.

Character and string literals in expressions

During expression evaluation, even if the current character set is ASCII, character literals and string literals, which are enclosed in single quotation marks (‘’) and quotation marks (“”), respectively, are considered to be literals in the UTF-8 character set. In a given expression, if a function is operating on character or string literals in the ASCII character set and you include a non-ASCII character in the literal, an error is returned.

Note:

The string literals in advanced policy expressions are now as long as the policy expression. The expression is allowed to be 1499 or 8191 bytes long.

Values in hexadecimal and octal formats

When configuring an expression, you can enter values in octal and hexadecimal formats. However, each hexadecimal or octal byte is considered a UTF-8 byte. Invalid UTF-8 bytes result in errors regardless of whether the value is entered manually or pasted from the clipboard. For example, “\xce\x20” is an invalid UTF-8 character because “c8” cannot be followed by “20” (each byte in a multi-byte UTF-8 string must have the high bit set). Another example of an invalid UTF-8 character is “\xce \xa9,” since the hexadecimal characters are separated by a white-space character.

Functions that return UTF-8 strings

Only the <text>.XPATH and <text>.XPATH_JSON functions always return UTF-8 strings. The following MYSQL routines determine at runtime which character set to return, depending on the data in the protocol:

MYSQL_CLIENT_T.USER
MYSQL_CLIENT_T.DATABASE
MYSQL_REQ_QUERY_T.COMMAND
MYSQL_REQ_QUERY_T.TEXT
MYSQL_REQ_QUERY_T.TEXT(<unsigned int>)
MYSQL_RES_ERROR_T.SQLSTATE
MYSQL_RES_ERROR_T.MESSAGE
MYSQL_RES_FIELD_T.CATALOG
MYSQL_RES_FIELD_T.DB
MYSQL_RES_FIELD_T.TABLE
MYSQL_RES_FIELD_T.ORIGINAL_TABLE
MYSQL_RES_FIELD_T.NAME
MYSQL_RES_FIELD_T.ORIGINAL_NAME
MYSQL_RES_OK_T.MESSAGE
MYSQL_RES_ROW_T.TEXT_ELEM(<unsigned int>)

Terminal connection settings for UTF-8

When you set up a connection to the Citrix ADC appliance by using a terminal connection (by using PuTTY, for example), you must set the character set for transmission of data to UTF-8.

The official version of this content is in English. Some of the Cloud Software Group documentation content is machine translated for your convenience only. Cloud Software Group has no control over machine-translated content, which may contain errors, inaccuracies or unsuitable language. No warranty of any kind, either expressed or implied, is made as to the accuracy, reliability, suitability, or correctness of any translations made from the English original into any other language, or that your Cloud Software Group product or service conforms to any machine translated content, and any warranty provided under the applicable end user license agreement or terms of service, or any other agreement with Cloud Software Group, that the product or service conforms with any documentation shall not apply to the extent that such documentation has been machine translated. Cloud Software Group will not be held responsible for any damage or issues that may arise from using machine-translated content.

Specify the character set in expressions

September 21, 2020

Contributed by:

S S

September 21, 2020

Contributed by:

S S

Specify the character set in expressions

Compound expressions with different character sets

Specify the character set based on the character set of traffic

Character and string literals in expressions

Values in hexadecimal and octal formats

Functions that return UTF-8 strings

Terminal connection settings for UTF-8

In this article