To be a globally competitive company, you need rapid access to critical data in a wide variety of formats and languages. Learn how Unicode makes this possible in your SAP ERP system, and gain an understanding of its technical requirements and restrictions.
Key Concept
In the past, hundreds of different and sometimes conflicting encoding systems represented the characters required for different languages. Unlike standard SAP code pages that hold multiple language keys, Unicode defines a character set that includes virtually all characters used in the world and provides a consistent, global character encoding. The size and scope of Unicode have made it the default character-encoding scheme of Internet technologies such as XML, Java, and HTML. The Unicode Consortium, a group of leading companies in the IT industry, defines Unicode. See www.unicode.org for additional information.
Before BW 3.5, the first Unicode-compliant version of BW, it was impossible to create BW reports in multiple languages on one system. Unicode gives SAP NetWeaver BI multiple-language capabilities by addressing the problem of multiple, possibly incompatible code pages. It defines more than 98,000 characters, classifying each character once, with room for more than 1 million characters. Unicode assigns each character a unique number regardless of the platform, program, or language. That not only allows an application to expand the computer’s supply of characters, but it also keeps the different language components within a system consistent with one another.
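The idea that every character has exactly one number, independent of platform or language, can be illustrated with a few lines of Python. This is only a sketch for orientation; the helper name `code_point_label` is made up for this example and is not part of any SAP tool.

```python
import sys


def code_point_label(ch: str) -> str:
    """Return the Unicode code point of one character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# The Unicode code space runs from U+0000 to U+10FFFF, which is why there
# is room for more than 1 million characters.
CODE_SPACE_SIZE = sys.maxunicode + 1

# The same character always maps to the same number.
labels = [code_point_label(ch) for ch in ("A", "\u00c4", "\u20ac")]
print(labels)
print(CODE_SPACE_SIZE)
```

The three sample characters are the Latin letter A, the letter Ä, and the Euro sign.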
You can use Unicode for the system code page, front end, and printing. In an SAP system that is Unicode compliant, you can display and maintain character data from any language with any logon language. Unicode provides a widespread, accepted international standard that supports virtually all the world’s scripts. This helps avoid potential conflicts between individual platforms and languages. SAP NetWeaver BI supports the Unicode standard to aid your globalization and localization efforts. This means that SAP NetWeaver BI can:
- Interpret and display Unicode characters
- Extract data from source systems with specific code pages (non-Unicode or Unicode)
- Extract data from an SAP source system running mixed code pages (Multiple Display, Multiple Processing [MDMP], the old technology for mixing incompatible non-Unicode code pages)
- Interface to third-party systems and support correct code page conversion
All SAP software will be Unicode compliant by 2007. SAP has already stopped supporting multiple code page systems. When you upgrade, you may choose a Unicode or non-Unicode system, but only the Unicode system allows you to report in multiple languages that you can use concurrently and in any combination. We will explain the technical requirements, restrictions, and prerequisites of Unicode so that you will be better prepared for this transition.
Technical Requirements
Unicode-based SAP components deploy the SAP Web Application Server (Web AS) 6.20 or newer. Unicode-enabled Web AS installations support the Unicode standard, including essential new functionality, syntax improvements, and extended semantics. Only one source code exists for Unicode-based and non-Unicode-based systems so that the systems can exchange new upgrades in Unicode technology.
Extended SAP interfaces (e.g., Remote Function Call [RFC]) enhance communication between other Unicode-based systems or non-Unicode-based systems. SAP provides standard tools for the installation of and conversion to Unicode-based systems that you can also use for checking and Unicode-enabling of customer developments. In addition, SAP supports Unicode on several platforms (Table 1).
| Database | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SQL Server | X | — | — | — | — | — | — | — |
| Oracle (4) | X | X | X | X | X | X | — | — |
| DB/2 | X | X | — | X | — | X | — | — (2) |
| Max DB | X | X | X | X | X | X | — | — |

(1) 64-bit versions only. Note that support of HP Tru64 UNIX OS as an SAP-certified operating system ends with SAP NetWeaver ’04. This does not affect Tru64 for SAP Web AS 6.40 as part of SAP NetWeaver ’04. For existing installations, SAP continues its support under the respective maintenance and support agreements. SAP and HP provide migration support from Tru64 to HP-UX on the Itanium Processor Family (IPF) server.
(2) OS/390 support is released as of Web AS 6.40 for 64-bit application servers.
(3) Web AS 6.20 64-bit kernel and Web AS 6.40 64-bit kernel are released.
(4) If you are using Oracle, SAP BW/Unicode requires Release 9.2 or higher. Note that Informix support is not planned. Reliant Unix support is not planned.

Table 1: Platform-specific information
The ISO/IEC 10646 series of standards, published by the International Organization for Standardization and the International Electrotechnical Commission, and the Unicode Consortium define the character set that Unicode supports. Several encoding methods exist for the current set of supported characters and scripts. There are 8-, 16-, and 32-bit encodings for Unicode characters:
- Unicode Transformation Format (UTF)-8: UTF based on 8-bit representation (the way the information is stored in the database)
- CESU-8: Compatibility Encoding Scheme of UTF-16 on an 8-bit base
- UTF-16: Unicode Transformation Format based on 16-bit representation
- UTF-32: Unicode Transformation Format based on 32-bit representation
- UCS-2: Universal Character Set 2-byte variation
- UCS-4: Universal Character Set 4-byte variation
Note
CESU-8 is intended for internal use within systems processing Unicode to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation.
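To make the difference between UTF-8 and CESU-8 concrete, here is a minimal sketch of a CESU-8 encoder. Python has no built-in CESU-8 codec, so the function `cesu8_encode` is written by hand for illustration and assumes the input contains no unpaired surrogates. Characters up to U+FFFF are encoded exactly as in UTF-8; characters above that are first split into a UTF-16 surrogate pair, and each half is then written as a 3-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, not an SAP routine)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # Same bytes as UTF-8 for everything up to U+FFFF.
            out += ch.encode("utf-8")
        else:
            # Split into a UTF-16 surrogate pair first.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for unit in (high, low):
                # Write each 16-bit unit with the 3-byte UTF-8 bit pattern.
                out.append(0xE0 | (unit >> 12))
                out.append(0x80 | ((unit >> 6) & 0x3F))
                out.append(0x80 | (unit & 0x3F))
    return bytes(out)


sample = "\U00010400"  # a character outside the 16-bit range
print(len(sample.encode("utf-8")))  # 4 bytes in UTF-8
print(len(cesu8_encode(sample)))    # 6 bytes in CESU-8
```

Because every 16-bit unit is encoded separately, sorting CESU-8 byte strings gives the same order as sorting the UTF-16 code units, which is the collation property the note refers to.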
Each encoding uses a different base length and the length of a character in a Unicode encoding can be either variable or fixed. The Unicode encoding determines the length of a character. A character in one of the Unicode encodings can be greater than one byte, and therefore Unicode characters can be longer than characters defined in other standard code pages. This leads to larger hardware demands.
Each type of encoding offers advantages and disadvantages. The 8-bit encodings are well suited for data transfer, because all 7-bit US American Standard Code for Information Interchange (ASCII) characters retain the same code points. This improves communication with legacy, non-Unicode systems. The downside is the variable character length. In the 32-bit encodings, UTF-32/UCS-4, all characters have a fixed length. The extensive memory requirements outweigh this programming advantage of fixed-length characters.
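The trade-off between variable and fixed character length can be checked directly. The sketch below uses Python's standard codecs; the function name `encoded_lengths` is chosen only for this example. The little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    codecs_to_check = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(ch.encode(name)) for name in codecs_to_check}


# 7-bit ASCII stays one byte in UTF-8 but always costs four in UTF-32.
print(encoded_lengths("A"))
# A character outside the 16-bit range needs four bytes in all three.
print(encoded_lengths("\U00010400"))
```

UTF-8 ranges from one to four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four.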
The 16-bit encodings offer a compromise because they do not require as much memory as UTF-32 but offer quasi-fixed character length. UCS-2 has a fixed character length, but it cannot define more than 65,536 characters (2^16). UTF-16, on the other hand, can access all of the characters in version 4.0 of the Unicode standard by using the surrogate area. Both UTF-32 and UTF-16 are byte order, or “endian,” dependent.
The first customer conversions to Unicode indicate that their database actually shrinks because database reorganization (export and import of the database) more than offsets the database size increase. The extra database space requirements depend on the script and the Unicode encoding you use. To see what additional hardware you may require, see Table 2.
| Resource | Additional requirement |
| --- | --- |
| CPU size | 30-35% |
| Memory size | 50% |
| DB size | 36% (UTF-8/CESU-8); 60-70% (UTF-16) |
| Network load | Almost no change due to efficient compression |

Table 2: Additional hardware requirements
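As a rough planning aid, the percentages in Table 2 can be applied to current sizing figures. The following sketch treats the table values as simple uplifts on today's capacity, which is an assumption made for this example; real sizing depends on the script, the encoding, and the effect of database reorganization described above. The names `UPLIFT` and `estimate_range` are hypothetical.

```python
# Uplift fractions taken from Table 2, stored as (low, high).
UPLIFT = {
    "cpu": (0.30, 0.35),
    "memory": (0.50, 0.50),
    "db_utf8_cesu8": (0.36, 0.36),
    "db_utf16": (0.60, 0.70),
}


def estimate_range(current: float, resource: str) -> tuple:
    """Return the (low, high) capacity estimate after a Unicode conversion."""
    low, high = UPLIFT[resource]
    return current * (1 + low), current * (1 + high)


# Example: a server with 64 GB of memory today.
print(estimate_range(64, "memory"))
```

The result is a range rather than a single number, because Table 2 itself gives ranges for CPU and UTF-16 database size.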
Note
Processor architecture differs in the way Unicode text is serialized into bytes (either big-endian or little-endian). Depending on the architecture, the serialization can result in the bytes going in either order. This is significant for the byte order UTF encodings as it can potentially cause problems in interchange between different systems if the byte order is unmarked. Special characters within the file that indicate the correct byte order usually mitigate these problems.
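The byte order problem and its usual remedy, the byte order mark (BOM), can be shown with Python's standard `codecs` module. This is a sketch of the general mechanism, not of how any SAP component handles it; `decode_with_bom` is a name invented for the example.

```python
import codecs

text = "\u00c4"  # the letter A with diaeresis, code point U+00C4

# The same character serialized in both byte orders.
big_endian = text.encode("utf-16-be")     # bytes 00 C4
little_endian = text.encode("utf-16-le")  # bytes C4 00

# Reading big-endian bytes as little-endian yields a different character.
misread = big_endian.decode("utf-16-le")


def decode_with_bom(data: bytes) -> str:
    """Decode UTF-16 data by inspecting the leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark; the byte order is ambiguous")


print(decode_with_bom(codecs.BOM_UTF16_BE + big_endian) == text)
print(decode_with_bom(codecs.BOM_UTF16_LE + little_endian) == text)
```

Without the mark, the receiver has to know the sender's byte order by some other agreement.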
Table 3 shows the length of A and Ä in four different code pages: 1100, the SAP code page corresponding to ISO 8859-1; 8000, the SAP code page corresponding to SJIS; UTF-8; and UTF-16. In 1100, 8000, and CESU-8/UTF-8, all 7-bit ASCII characters are one byte long in both non-Unicode and Unicode systems. Other characters from single-byte code pages are twice as long; for example, Ä.
| Character | 1100 (ISO 8859-1) | 8000 (SJIS) | UTF-8 | UTF-16 |
| --- | --- | --- | --- | --- |
| A | 1 | 1 | 1 | 2 |
| Ä | 1 | Character not represented | 2 | 2 |

Table 3: The length of A and Ä shown in four different code pages
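The values in Table 3 can be reproduced with standard Python codecs. The mapping of SAP code page 1100 to the `latin-1` codec and of 8000 to `shift_jis` follows the correspondence stated in the text, but the Python codecs are stand-ins chosen for this sketch and may differ from the SAP code pages in details. `byte_length` returns `None` where the character cannot be represented.

```python
# Python codecs used as approximations of the four columns of Table 3.
CODE_PAGES = {
    "1100": "latin-1",
    "8000": "shift_jis",
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-le",  # little-endian variant, so no BOM is counted
}


def byte_length(ch: str, codec: str):
    """Return the encoded length in bytes, or None if not representable."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


def table_row(ch: str) -> list:
    """Return the lengths of one character across all four code pages."""
    return [byte_length(ch, codec) for codec in CODE_PAGES.values()]


print(table_row("A"))
print(table_row("\u00c4"))
```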
In the next example, the first character is a single-byte Japanese Katakana character, and the second is a double-byte Korean Hangul character. The length of a single-byte Katakana character is either doubled or tripled. The size of all double-byte characters either remains unchanged or increases slightly (Table 4).
| Character | | | | |
| --- | --- | --- | --- | --- |
| Japanese Katakana (single-byte) | Character not represented | 1 | 3 | 2 |
| Korean Hangul (double-byte) | Character not represented | 2 | 3 | 2 |

Table 4: The length of a Japanese Katakana character (row 1) and a Korean Hangul character (row 2) shown in four different code pages
If the database contains only characters from a single-byte code page, then the length of all characters can double if the database uses UTF-16. In all other cases, the increase depends on the encoding and the code pages used.
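The growth described here can be expressed as a ratio between the size of a text in its legacy code page and its size in a Unicode encoding. The sketch below is illustrative only: `growth_factor` is a hypothetical helper, and the Python codecs `shift_jis` and `euc_kr` are assumed stand-ins for the Japanese and Korean code pages.

```python
def growth_factor(text: str, legacy_codec: str, unicode_codec: str) -> float:
    """Return size in the Unicode encoding divided by size in the legacy one."""
    legacy_size = len(text.encode(legacy_codec))
    unicode_size = len(text.encode(unicode_codec))
    return unicode_size / legacy_size


# Single-byte Western European text doubles in UTF-16.
print(growth_factor("M\u00fcller", "latin-1", "utf-16-le"))

# A single-byte (half-width) Katakana character triples in UTF-8.
print(growth_factor("\uff71", "shift_jis", "utf-8"))

# A double-byte Hangul character keeps its size in UTF-16.
print(growth_factor("\ud55c", "euc_kr", "utf-16-le"))
```

Running the same function over a sample of real text fields gives a first impression of how much a given database might grow.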
SAP supports the most recent release, 4.0, of the current Unicode standard. All SAP systems that are Unicode-based support all Unicode encodings. SAP’s application server uses UTF-16 and its database uses UTF-8, CESU-8, or UTF-16. UTF-16 includes the support for surrogates.
Surrogates are characters represented by a pair of code points where the first code point is located within the hex interval (0xD800, 0xDBFF) and the second code point is located within the interval (0xDC00, 0xDFFF). The conversion occurs algorithmically. Note that hex is short for hexadecimal. A byte can be represented as two consecutive hexadecimal digits.
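The algorithmic conversion between a code point above U+FFFF and its surrogate pair is short enough to write out. The sketch below follows the standard UTF-16 rule; the function names are chosen for this example.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    v = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (v >> 10)         # top 10 bits, range 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits, range 0xDC00..0xDFFF
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10400)
print(hex(high), hex(low))
```

Each surrogate carries 10 bits, so a pair addresses 2^20 additional code points beyond the first 65,536.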
SAP can technically support all two-character language keys in the ISO 639 standard. ISO 639 is a set of codes that represent the names of languages (e.g., “ar” for Arabic and “en” for English). Technical support means that these languages can be used as language keys. When no translation exists for a language, you can fill its language data with menus in English or another language. All data is encoded as Unicode characters, and a Unicode system can use all Unicode characters regardless of the language key.
All SAPGUIs (HTML, Java, Windows) support Unicode alongside all the non-Unicode code pages already supported. Because SAPGUI is backward compatible, a single SAPGUI can access both Unicode and non-Unicode systems. Therefore, only one GUI is needed per front end.
For full support of languages with multi-byte system locales (Japanese, Traditional Chinese, Simplified Chinese, and Korean), SAPGUI 6.40 is required. SAP recommends that Unicode customers use Release 6.40 of the SAPGUI (see SAP note 710720). BW 3.1 Content/Unicode and BW 3.5 require at least Windows 2000 or Windows XP on the front-end client if you are using multi-byte languages.
Known Restrictions
Currently most SAP products are Unicode enabled, and SAP intends to make all of its software, including future editions of the mySAP Business Suite, available in a Unicode version. For specific release information, go to https://service.sap.com/globalization. Going forward, Unicode is the default for new installations of SAP software and is also the explicitly recommended system type.
When using a release in which the software component is also available in a Unicode version and the non-Unicode version is installed, note the following drawbacks and shortcomings of the non-Unicode version:
- Adding new languages other than those included in the non-Unicode code page installed requires a Unicode conversion of the system. This includes a complete database conversion.
- The Unicode conversion is mandatory for all software components based on the successor releases of SAP NetWeaver ’04, as the support for MDMP ends with SAP NetWeaver ’04.
- Only Unicode systems can achieve full integration between ABAP and Java components because the J2EE standard is based on Unicode. In non-Unicode systems, there is always danger of data loss during text data transfer from J2EE to ABAP.
- Only for ISO-1 (Western European) installations: You cannot use characters that are not part of the standard non-Unicode code page for Western Europe (ISO 8859-1, SAP code page number 1100) for plain text input in ISO-1 non-Unicode systems. In particular, the following characters are not available in non-Unicode ISO-1 systems: U+20AC Euro sign and the U+2122 trademark sign.
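The last restriction is easy to verify. The sketch below lists the characters of a text that a given code page cannot represent, using Python's `latin-1` codec as a stand-in for ISO 8859-1 (SAP code page 1100). The function `unsupported_chars` is a hypothetical helper for this example.

```python
def unsupported_chars(text: str, codec: str = "latin-1") -> list:
    """Return the characters in text that the given codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# U+20AC is the Euro sign and U+2122 is the trademark sign.
problem = unsupported_chars("Price: 10 \u20ac for Acme\u2122")
print([f"U+{ord(ch):04X}" for ch in problem])
```

A check like this can be run over free-text fields before loading them into a non-Unicode ISO-1 system.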
Data Transfer in a Unicode/Non-Unicode System
The communication between two systems is problematic when:
- The sender and receiver systems deploy different code pages
- The sender or receiver system is Java-based
- A Unicode system communicates with an Asian non-Unicode system (double-byte code page)
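The first case in the list can be simulated by encoding text in one code page and decoding the same bytes in another. The sketch below uses Python's `latin-1` and `iso8859_5` codecs as examples of two incompatible single-byte code pages; `transfer` is a hypothetical function that models a transfer without any code page conversion.

```python
def transfer(text: str, sender_codec: str, receiver_codec: str) -> str:
    """Model a raw byte transfer between systems with different code pages."""
    raw = text.encode(sender_codec)
    # The receiver interprets the unchanged bytes in its own code page.
    return raw.decode(receiver_codec, errors="replace")


# 7-bit ASCII text survives, because both code pages share those code points.
print(transfer("Order 42", "latin-1", "iso8859_5") == "Order 42")

# Text with characters above 7-bit ASCII does not arrive intact.
print(transfer("Se\u00f1or", "latin-1", "iso8859_5") == "Se\u00f1or")
```

This is why 7-bit ASCII is the only content that transfers reliably between unrelated code pages unless a conversion step is added.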
Table 5 displays the meaning of the colors used in Tables 6 and 7. Table 6 displays the data transfer compatibility among several types of systems. Table 7 is a chart of the various code pages and the support of data transfer between the code pages.
| Color | Meaning |
| --- | --- |
| | 100% data transfer |
| | 7-bit ASCII data transfer; solution for 100% data transfer implemented in some applications |
| | 7-bit ASCII and some additional 100% data transfer |
| | 7-bit ASCII data transfer only |
| | No data transfer solution yet (under investigation) |

Table 5: Key for Tables 6 and 7
[Table 6: System communication compatibility; color-coded matrix covering single code page, MDMP, Java application, and Unicode systems]
[Table 7: Communication between single code page SAP systems; color-coded matrix covering ISO-1, ISO-2, ISO-3, ISO-5, ISO-6, ISO-7, ISO-9, ISO-11, SJIS, Big 5, KSC 5601, and GB 1324, where ISO-X = ISO 8859-X]
Prerequisites
The installation option Multiple Components in One Database (MCOD) is released for Unicode installations, provided that the MCOD system contains Unicode instances only. Mixed solutions are not supported.
Unicode generally applies only to texts. SAP recommends that you define InfoObject keys in 7-bit US-ASCII (that is, without special characters). If the keys in the source system contain special characters, you must ensure in your customer project (especially in the case of an MDMP source system) that the keys are converted consistently when they are loaded into SAP NetWeaver BI. Otherwise, the key references may be incorrect and the data may be corrupted as a result.
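The recommendation to keep keys within 7-bit ASCII lends itself to a simple pre-load check. The sketch below is not an SAP tool; `non_ascii_keys` and the sample key values are invented for illustration, and the check relies on Python's built-in `str.isascii()` (available from Python 3.7).

```python
def non_ascii_keys(keys: list) -> list:
    """Return the keys that contain characters outside 7-bit ASCII."""
    return [key for key in keys if not key.isascii()]


# Hypothetical key values extracted from a source system.
sample_keys = ["MAT001", "M\u00dcLLER", "0CUSTOMER", "CAF\u00c9-01"]
flagged = non_ascii_keys(sample_keys)
print(len(flagged))
```

Keys flagged this way are the ones that need a consistent conversion rule before they are loaded.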
The connection of an SAP NetWeaver BI/Unicode system to an SAP source system with several MDMP code pages has been available only on a case-by-case basis. Although there is an automatic conversion for text tables with a language key, check whether all extractors adhere to this condition. You may need to adjust the extraction process. You can easily connect single code page and Unicode source systems.
The SAP NetWeaver BI BEx front end (BEx Analyzer, BEx Query Designer, and Web Application Designer [Web AD]) can operate only in a dedicated non-Unicode code page (for example, 1100 for Latin-1). This restriction does not apply to Web browser-based Web applications, which can be displayed in Unicode. The BEx front end also does not support any right-to-left (RTL) languages such as Hebrew and Arabic. Web applications support these languages.
Saving queries and Web templates in a language that is not part of this dedicated front-end code page (for example, Polish logon language in a Latin-1 front end) can result in corrupt texts. Therefore, make sure that users of the BEx Analyzer, the BEx Query Designer, or the Web AD log on only in a language that is contained in the installed front-end code page.
Changing or saving queries or Web templates in a code page other than the original one can lead to corrupt texts on the database. That is, a query with Japanese text elements should always be changed in the same (Japanese) code page. Changing the language to Chinese, for example, results in incorrect characters in the texts on the database.
Matt Kangas
Matt Kangas works for SAP Labs as a US product manager for SAP NetWeaver Application Server. He specializes in systems topics including architecture, software lifecycle management, platforms, ITS, high availability, installations, upgrades, and monitoring. Matt has lent his multiple skills and talents to SAP for more than seven years, and spent more than five years in the field as a Basis consultant.
You may contact the author at editor@BWExpertOnline.com.
If you have comments about this article or publication, or would like to submit an article idea, please contact the editor.
Anthony Andreacchio
Anthony Andreacchio has more than eight years of SAP experience. He has been a BW product manager for more than four years.
You may contact the author at Anthony.Andreacchio@SAP.com.