Original title: UTF-8 and Unicode FAQ for Unix/Linux
Original link: http://www.cl.cam.ac.uk/~mgk25/unicode.html
Original author: Markus Kuhn
Translation title: Frequently Asked Questions about UTF-8 and Unicode on Unix/Linux
Translator: aXqd
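This article offers comprehensive information on using Unicode/UTF-8 on POSIX systems. Here you will find not only introductory material suitable for every user, but also detailed reference material for more experienced users.
Unicode has begun to replace ASCII, ISO 8859 and EUC at all levels. It not only enables users to handle practically any script and language used on this planet, it also supports a large and comprehensive set of mathematical and technical symbols to simplify the exchange of scientific information.
With the UTF-8 encoding, Unicode can be used conveniently and in a backwards-compatible way in environments that were designed entirely around ASCII, such as Unix. UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems. It is now time to become familiar with Unicode and to make sure that your software supports UTF-8 smoothly.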
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. This means simply that no information is lost if you convert any text string to UCS and then back to its original encoding.
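The round-trip guarantee can be tried out with a small sketch. It assumes Python 3, whose `str` type holds Unicode code points, and uses the standard codecs `iso8859-1` and `koi8-r` as stand-ins for "older character sets". The helper name `round_trips` is made up for this illustration.

```python
def round_trips(raw: bytes, encoding: str) -> bool:
    """Return True if decoding raw to Unicode and re-encoding gives back the same bytes."""
    text = raw.decode(encoding)          # legacy bytes -> Unicode string
    return text.encode(encoding) == raw  # Unicode string -> legacy bytes


# Every byte value of ISO 8859-1 survives the trip through Unicode.
latin1_bytes = bytes(range(256))
print(round_trips(latin1_bytes, "iso8859-1"))

# The same holds for a Cyrillic string stored in KOI8-R.
koi8_bytes = "Привет".encode("koi8-r")
print(round_trips(koi8_bytes, "koi8-r"))
```

This checks only two encodings and a handful of byte strings, so it illustrates the property rather than proving it.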
UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only historic scripts such as Cuneiform, Hieroglyphs and various Indo-European notations, but even some selected artistic scripts such as Tolkien’s Tengwar and Cirth. UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems. The standard continues to be maintained and updated. Ever more exotic and specialized symbols and characters will be added for many years to come.
ISO 10646 originally defined a 31-bit character set. The subsets of 2^16 (65,536) characters whose elements differ (in a 32-bit integer representation) only in the 16 least-significant bits are called the planes of UCS.
The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation. Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, was added in 2001 and defines characters encoded outside the BMP. In the 2003 edition, the two parts were combined into a single ISO 10646 standard. New characters are still being added on a continuous basis, but the existing characters will not be changed any more and are stable.
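Because a plane is determined by everything above the 16 least-significant bits, the plane number of a code point is simply the value shifted right by 16 bits. The following is a minimal sketch in Python. The names `plane`, `is_bmp` and `MAX_CODE_POINT` are chosen for this illustration and do not come from the standard.

```python
MAX_CODE_POINT = 0x10FFFF  # upper end of the 21-bit code space described above


def plane(code_point: int) -> int:
    """Return the UCS plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the UCS code space: {code_point:#x}")
    return code_point >> 16  # drop the 16 least-significant bits


def is_bmp(code_point: int) -> bool:
    """Return True if code_point lies in the Basic Multilingual Plane (Plane 0)."""
    return plane(code_point) == 0


print(plane(0x0041))          # 0, Latin capital letter A is in the BMP
print(plane(0x1F600))         # 1, a code point outside the BMP
print(plane(MAX_CODE_POINT))  # 16, the last plane
print(MAX_CODE_POINT + 1)     # 1114112 code positions, i.e. 17 planes of 2**16
```

The final line shows where "a bit over one million" comes from: 17 planes of 65,536 positions each.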
UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by “U+” as in U+0041 for the character “Latin capital letter A”. The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such as UTF-8 and UTF-16.
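These properties are easy to inspect with Python's standard `unicodedata` module. The sketch below is illustrative only, and the helper name `describe` is invented here. It prints the U+ notation and official name of a character, and checks that the first 128 and 256 code points coincide with US-ASCII and ISO 8859-1 respectively.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX OFFICIAL NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00C4"))  # U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS

# Decoding each byte value gives the code point with the same number.
ascii_matches = bytes(range(0x80)).decode("ascii") == "".join(map(chr, range(0x80)))
latin1_matches = bytes(range(0x100)).decode("iso8859-1") == "".join(map(chr, range(0x100)))
print(ascii_matches, latin1_matches)

# Private-use code points carry the general category "Co" in the Unicode database.
print(unicodedata.category("\uE000"))

# The same character encoded as a byte sequence in two of the defined encodings.
print("A".encode("utf-8"), "A".encode("utf-16-be"))
```

Python reports the names in upper case, which is how the Unicode character database spells them.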
The full reference for the UCS standard is
International Standard ISO/IEC 10646, Information technology — Universal Multiple-Octet Coded Character Set (UCS). Third edition, International Organization for Standardization, Geneva, 2003.
The standard can be ordered online from ISO as a set of PDF files on CD-ROM for 112 CHF.
In September 2006, ISO released a free online PDF copy of ISO 10646:2003 on its Freely Available Standards web page. The ZIP file is 82 MB long.
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent or other diacritical mark on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS. They are known as precomposed characters and exist to ensure backwards compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.
Combining characters follow the character which they modify. For example, the German umlaut character Ä (“Latin capital letter A with diaeresis”) can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal “Latin capital letter A” followed by a “combining diaeresis”: U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters on a single base character.
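The two representations of Ä can be compared directly in Python. The sketch below uses `unicodedata.normalize` with the normalization forms NFC and NFD. These forms are defined by the Unicode standard and are not discussed in the text above; they are used here only as a convenient way to convert between the precomposed and the combining form.

```python
import unicodedata

precomposed = "\u00C4"  # Latin capital letter A with diaeresis
decomposed = "A\u0308"  # Latin capital letter A followed by combining diaeresis

# The strings look the same when rendered but differ as code point sequences.
print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False

# NFC composes base + combining mark; NFD splits a precomposed character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# A non-zero combining class marks U+0308 as a combining character.
print(unicodedata.name("\u0308"), unicodedata.combining("\u0308") != 0)
```

Software that compares or searches text usually has to pick one of the two forms first, otherwise the comparison on the third line fails for strings that a reader would consider identical.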
Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels: