This SAS KB article describes what you need to consider when you decide whether to run your SAS programs with UTF-8 SAS session encoding. UTF-8 is an encoding form of the Unicode standard.
Why do I want to use UTF-8 encoding?
- Using SAS in a UTF-8 session encoding is recommended for multilingual environments.
- SAS clients such as SAS® Studio, SAS® Visual Analytics, and SAS® Viya® typically execute statements in a server environment that runs in UTF-8 encoding. If you prefer to run SAS in a different encoding, support for all regional encodings is available beginning with SAS® Viya® 3.3.
How do I know whether my SAS session already uses UTF-8 encoding?
If you are unsure whether SAS is already running in UTF-8 encoding, look in the SAS log after submitting the following code:
proc options option=encoding;
run;
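As a lighter-weight alternative to the PROC OPTIONS step above, the following sketch writes the session encoding to the log by using the SYSENCODING automatic macro variable and the GETOPTION function. The wording of the messages is illustrative only.

```sas
/* Write the session encoding to the log without running a procedure. */
%put NOTE: SYSENCODING macro variable: &sysencoding;
%put NOTE: ENCODING system option:     %sysfunc(getoption(encoding));
```

In a UTF-8 session, both lines should report a UTF-8 value.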
What issues could I encounter with the use of UTF-8?
- The size of data sets can increase. The first 128 code points in UTF-8 encoding (in the range 0–127) are identical to 7-bit ASCII and occupy one byte each. Characters in the range 128–255 (extended ASCII characters) require multiple bytes in UTF-8. Therefore, the Length attribute that is assigned to the character variables in your SAS data set might need to increase; otherwise, character data can be truncated. For an estimate of the increase, see the appendix in SAS® and UTF-8: Ultimately the Finest. Your Data and Applications Will Thank You!
- SAS data sets store an encoding indicator in the descriptor portion. If you read existing data sets created in another encoding, Cross-Environment Data Access (CEDA) is used to process the file and CEDA restrictions apply. See Restrictions for CEDA. The CONTENTS procedure can be used to view the encoding information of a SAS data set to determine if CEDA processing will occur. The code in SAS Note 55054 can be used to print the encoding and data representation of all the data sets in the library.
- CEDA transcodes the data by default. The following errors or warnings can occur when a value requires more bytes in UTF-8 than the variable length can hold:
ERROR: Some character data was lost during transcoding in the dataset libref.data-set-name.
NOTE: The data step has been abnormally terminated.
Some character data was lost during transcoding in the data set libref.data-set-name. Either the
data contains characters that are not representable in the new encoding or truncation occurred
during transcoding.
- Use the character variable padding (CVP) engine to avoid the warning or error. The read-only CVP engine expands the character variable lengths so that SAS can transcode the data successfully and create a new data set in UTF-8 session encoding. Here is a syntax example that uses the CVP engine:
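/* Assign the source library through the read-only CVP engine. */
libname mylib cvp 'path';
/* Copy the data set; the copy is written in the session (UTF-8) encoding. */
data new;
  set mylib.wlatin1;
run;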
- If you used the CVP engine but still receive the warning or error, try specifying a larger multiplication factor for expansion by using the CVPMULT= option. For other solutions to this issue, see Demystifying and resolving common transcoding problems.
- The CIMPORT procedure, which you use to convert a transport file to a SAS data set, can generate a warning in the SAS log under these conditions:
  - The transport file has a data set that is not encoded as UTF-8 or US-ASCII.
  - The columns in the data set are not long enough to hold the data that was transcoded to UTF-8.
The warning is as follows:
WARNING: The destination buffer size was not sufficient for the transcoded data
To prevent the warning, you can do one of the following:
  - Execute the CIMPORT procedure in the same session encoding in which the transport file was created.
  - Re-create the transport file after validating the character data compatibility. To check the compatibility, you can use the %VALIDCHS macro.
PROC CIMPORT in SAS® Viya® 3.5 has new options that enable you to specify a multiplier for character columns as well as automatically expand the size of formats.
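To anticipate whether CEDA processing will occur before you read an existing data set, you can inspect its encoding as described above. The following is a minimal sketch, not a definitive implementation: the SRC libref and the WLATIN1 data set name are hypothetical, and 'path' is a placeholder for your own location. The first step prints the encoding in the PROC CONTENTS output. The second step reads the same attribute programmatically with the ATTRC function.

```sas
/* Hypothetical libref and placeholder path: replace with your own. */
libname src 'path';

/* The Encoding field of the output shows how the data set was created. */
proc contents data=src.wlatin1;
run;

/* Read the encoding attribute and write it to the log. */
data _null_;
  dsid = open('src.wlatin1');
  if dsid > 0 then do;
    enc = attrc(dsid, 'encoding');
    put 'Data set encoding: ' enc;
    rc = close(dsid);
  end;
  else put 'Could not open src.wlatin1.';
run;
```

If the reported encoding differs from the session encoding, expect CEDA to be used when the data set is read.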
I want my data sets to all be in UTF-8 encoding. How do I do that?
- SAS provides many tools to assist you in migrating your data to UTF-8. For example, reading the source file with the CVP engine in the LIBNAME statement multiplies character column lengths by 1.5 by default, or you can use the CVPMULT= option to specify another factor. The %COPY_TO_NEW_ENCODING macro evaluates all character variables in the source data set and increases the lengths only of the variables that require it, which helps limit the increase in data set size.
- If you need to convert an entire deployment, see Migrating Data to UTF-8 for SAS® Viya® 3.4. (This documentation is also applicable to SAS® 9.4.) Note: The MIGRATE procedure does not currently support the CVP engine. If your data contains non-ASCII characters, data might be lost if the column widths are insufficient to hold the values.
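For a single data set, the CVP approach mentioned above can be sketched as follows. This is an illustration under stated assumptions: the SRC and TGT librefs, the 'target-path' location, and the factor 2.5 are all hypothetical, and 'path' is the same placeholder used earlier. CVPMULTIPLIER= is the full name of the CVPMULT= option.

```sas
/* Read the source through the CVP engine with an illustrative factor of 2.5. */
libname src cvp 'path' cvpmultiplier=2.5;

/* Hypothetical target library that will hold the UTF-8 copies. */
libname tgt 'target-path';

/* In a UTF-8 session, the copy is written in UTF-8 with the expanded lengths. */
data tgt.wlatin1;
  set src.wlatin1;
run;

/* Review the resulting character variable lengths and the encoding. */
proc contents data=tgt.wlatin1;
run;
```

A larger factor lowers the risk of truncation but increases the size of the data set, so it is worth checking the PROC CONTENTS output before you apply one factor to a whole library.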
Do I need to modify existing programs?
- SAS programs that are written for a single-byte environment and rely on column input need to be modified. Column input is difficult in a UTF-8 environment because the column input statement in SAS is byte-oriented, whereas the columns in the input file are typically character-aligned. A single multibyte character shifts every later column in that record by one or more bytes. As a best practice, the input file should have delimiter-aligned or byte-aligned columns.
- Character strings that contain multibyte data require the use of SAS K functions in place of traditional SAS string functions. See Have a Comprehensive Understanding of SAS® K functions.
- The lengths of SAS character formats might need to be increased, or your data values might be truncated when you display them. For formats that are supplied by SAS, the CVPFORMATWIDTH= option is on by default when you use the CVP engine and will expand their widths. Special consideration is needed to migrate user-defined formats if the labels contain non-ASCII characters.
- Transcoding issues can occur in existing SAS programs that contain non-ASCII characters and were saved from a single-byte SAS session such as Wlatin1 or Latin1. If such a program is opened in a UTF-8 Code Editor window, such as the one in SAS Studio, you might see character substitution, data truncation, or invalid-data errors in the log after you submit the program.
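The difference between the traditional string functions and the K functions can be illustrated with a short sketch. It assumes a UTF-8 session. To keep the program itself free of non-ASCII characters, it builds the name "Füßling" from Wlatin1 hexadecimal values with the KCVT function; the variable names are illustrative.

```sas
data _null_;
  length name $ 20;
  /* 'FC'x and 'DF'x are "ü" and "ß" in Wlatin1; KCVT transcodes them to UTF-8. */
  name = cats('F', kcvt('FCDF'x, 'wlatin1', 'utf-8'), 'ling');

  bytes = length(name);    /* LENGTH counts bytes */
  chars = klength(name);   /* KLENGTH counts characters */

  cut_bytes = substr(name, 1, 2);   /* 2 bytes: "F" plus half of "ü" */
  cut_chars = ksubstr(name, 1, 2);  /* 2 characters: "F" and "ü" */

  put bytes= chars=;
  put 'SUBSTR  (hex): ' cut_bytes $hex4.;
  put 'KSUBSTR (hex): ' cut_chars $hex6.;
run;
```

In a UTF-8 session, the byte count should exceed the character count by two, because "ü" and "ß" each occupy two bytes. The SUBSTR result ends in the middle of a character, which is the kind of damage that the K functions avoid.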
Resources
Bales, Elizabeth, and Wei Zheng. 2017. “SAS® and UTF-8: Ultimately the Finest. Your Data and Applications Will Thank You!” Proceedings of the SAS Global Forum 2017 Conference. Cary, NC: SAS Institute Inc. http://support.sas.com/resources/papers/proceedings17/SAS0296-2017.pdf.
Bouedo, Mickaël. 2020. “The SAS® encoding journey: A byte at a time.” Proceedings of the SAS Global Forum 2020 Conference. Cary, NC: SAS Institute Inc. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4561-2020.pdf.
Carlton, Jody. 2018. “A transcoding story (or, How Oliver S. Füßling lost his last name and comes to find it again).” Cary, NC: SAS Institute Inc. https://blogs.sas.com/content/sgf/2018/06/22/a-transcoding-story-or-how-oliver-s-fusling-lost-his-last-name-and-comes-to-find-it-again/.
Carlton, Jody. 2017. “Demystifying and resolving common transcoding problems.” Cary, NC: SAS Institute Inc. https://blogs.sas.com/content/sgf/2017/05/19/demystifying-and-resolving-common-transcoding-problems/.
Lawhorn, Bari. 2014. “Encoding: helping SAS speak your language.” Cary, NC: SAS Institute Inc. https://blogs.sas.com/content/sgf/2014/09/26/encoding-helping-sas-speak-your-language/.
SAS Institute Inc. 2019. Migration Focus Area. Cary, NC: SAS Institute Inc. http://support.sas.com/rnd/migration/index.html.
SAS Institute Inc. 2019. SAS® 9.4 National Language Support (NLS): Reference Guide, Fifth Edition. Cary, NC: SAS Institute Inc. https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=nlsref&docsetTarget=titlepage.htm&locale=en.
SAS Institute Inc. 2018. Migrating Data to UTF-8 for SAS® Viya® 3.4. Cary, NC: SAS Institute Inc. https://go.documentation.sas.com/?docsetId=viyadatamig&docsetTarget=p1e9huvrtpq0upn1jjht4vn2gctb.htm&docsetVersion=3.4&locale=en.
Xie, Edwin (You). 2020. “Your data will go on: Practice for character data migration.” Proceedings of the SAS Global Forum 2020 Conference. Cary, NC: SAS Institute Inc. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4195-2020.pdf.