Chinese XML Forum -- Unicode 5.0 Released - Supports the Latest Version of GB 18030

For details, see: http://www.unicode.org/versions/Unicode5.0.0/

Unicode 5.0.0

Unicode 5.0.0 is a [URL=http://www.unicode.org/versions/]major version[/URL] of the Unicode Standard and supersedes all previous versions. The publication of the book, The Unicode Standard, Version 5.0, is pending and is expected in the fourth quarter of 2006.

However, all of the [URL=http://www.unicode.org/Public/5.0.0/ucd/]online data files[/URL] for version 5.0 of the [URL=http://www.unicode.org/ucd/]Unicode Character Database[/URL] are stable and final. To give developers an opportunity to implement Unicode 5.0 as soon as possible, these data files have been released ahead of the publication of the text of the standard.

The text of the Unicode Standard Annexes for Version 5.0 is currently in copy edit; online versions of these will also be available in the fourth quarter of 2006. The Unicode Standard Annexes will also be published in the book.

Version 5.0.0 of the Unicode Standard consists of the publication The Unicode Standard, Version 5.0 plus the Unicode Character Database, Version 5.0.0. The book gives the general principles, requirements for conformance, and guidelines for implementers, followed by character code charts and names and the text of all of the Unicode Standard Annexes.

To order The Unicode Standard, Version 5.0, see the [URL=http://www.unicode.org/book/bookform.html]online order form[/URL].

A complete specification of the contributory files for Unicode 5.0.0 is found on [URL=http://www.unicode.org/versions/components-5.0.0.html]the Components page[/URL]. Version 5.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)

Online Edition
The text of The Unicode Standard, Version 5.0 will be available online via the navigation links on this page, starting in the first quarter of 2007. Those PDF files may be viewed but may not be printed. The Unicode 5.0 Web Bookmarks page will have links to all sections of the online text.

Final character code charts for Version 5.0 will be available online soon.

What's New in Version 5.0
For the first time, the book provides the complete text of the standard, including all the Unicode Standard Annexes. The book will also be printed in a smaller, lighter, easier-to-use format.

For stability of protocols on the Internet and elsewhere, Unicode 5.0 also makes changes to guarantee case-folding stability. Unicode 5.0 incorporates all the changes introduced in Unicode 4.1, including full interoperability with the most recent versions of GB 18030, JIS X 0213, and HKSCS, and support for stable identifiers and pattern syntax characters.
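As a rough illustration of what interoperability with GB 18030 looks like from a program's point of view, the sketch below round-trips a few strings through a GB 18030 codec. It assumes Python 3 and its built-in `gb18030` codec, neither of which is mentioned in the announcement; the sample strings and the `roundtrip` helper are made up for this example.

```python
# Round-trip Unicode text through GB 18030.
# Assumption: Python 3 with its built-in "gb18030" codec.
samples = [
    "Unicode 5.0",   # ASCII: one byte per character in GB 18030
    "中文",           # BMP ideographs: two bytes each
    "\U00020000",    # supplementary-plane ideograph: four bytes
]


def roundtrip(text: str) -> bytes:
    """Encode to GB 18030 and check that decoding restores the original text."""
    encoded = text.encode("gb18030")
    if encoded.decode("gb18030") != text:
        raise ValueError(f"round trip failed for {text!r}")
    return encoded


for s in samples:
    print(repr(s), len(roundtrip(s)), "bytes")
```

Because GB 18030 can represent every Unicode code point, the round trip should be lossless for any well-formed string, including characters outside the BMP.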

Unicode 5.0 revises and improves property values and behavioral specifications in areas such as character, word, line, and sentence segmentation, and tightens conformance requirements on Bidi implementations (used for Arabic and Hebrew). The text is significantly revised for clarity and completeness, especially for Unicode conformance.

Unicode 5.0 covers the full repertoire of ISO/IEC 10646:2003, including Amendments 1 and 2, which add characters required for some languages of India, for mathematicians, for minority languages, and for academic use.

The Unicode Standard is closely connected with other Unicode software globalization standards in such key areas as collation (used for sorting, searching, and matching), character set conversion, regular expressions, and the interchange and registration of locale data for the world's languages and local cultural conventions [[URL=http://www.unicode.org/cldr/]CLDR[/URL]]. It has been further significantly augmented by several new Unicode Technical Standards that provide recommendations and data to assist in secure implementation of Unicode, and to establish the registration mechanism for Ideographic Variation Sequences needed by the publishing industry for Chinese and Japanese.

Other major additions to Version 5.0 since Version 4.0 are discussed in the sections below.

New Characters
1,369 new character assignments were made to the Unicode Standard, Version 5.0 (over and above what was in Unicode 4.1.0). These additions include new characters for Cyrillic, Greek, Hebrew, Kannada, Latin, math, phonetic extensions, symbols, and five new scripts: Balinese, N’Ko, Phags-pa, Phoenician, and Sumero-Akkadian Cuneiform.

The new character additions were to both the BMP and the SMP (Plane 1). The following table shows the allocation of code points in Unicode 5.0.0. For more information on the specific characters, see the file [URL=http://www.unicode.org/Public/UNIDATA/DerivedAge.txt]DerivedAge.txt[/URL] in the [URL=http://www.unicode.org/ucd/]Unicode Character Database[/URL].

Code point type    Count
Graphic            98,884
Format             140
Control            65
Private Use        137,468
Surrogate          2,048
Noncharacter       66
Reserved           875,441
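The counts above can be sanity-checked with a little arithmetic: together they should account for every code point in the 17-plane codespace. The sketch below does that check in Python, and also re-derives the four categories whose sizes are fixed by the architecture. The code point ranges used for those derivations come from general knowledge of the Unicode codespace, not from this post, so treat them as assumptions.

```python
# Counts copied from the allocation table for Unicode 5.0.0.
allocation = {
    "Graphic": 98_884,
    "Format": 140,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

PLANES = 17
CODESPACE = PLANES * 0x10000  # U+0000..U+10FFFF = 1,114,112 code points

# Categories with architecturally fixed sizes (ranges are assumptions,
# taken from general knowledge of the codespace layout).
controls = 32 + 1 + 32                           # C0, DEL, C1
surrogates = 0xDFFF - 0xD800 + 1                 # U+D800..U+DFFF
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES   # U+FDD0..U+FDEF plus xFFFE/xFFFF per plane
private_use = (0xF8FF - 0xE000 + 1) + 2 * 0xFFFE     # BMP PUA plus planes 15 and 16

total = sum(allocation.values())
print(total, total == CODESPACE)
print(controls, surrogates, noncharacters, private_use)
```

The table sums to exactly 1,114,112, so the "Reserved" row is simply whatever remains after the other six categories are subtracted from the full codespace.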

The character repertoire corresponds to ISO/IEC 10646:2003 plus Amendment 1, Amendment 2, and four Sindhi characters from Amendment 3. For more details of character counts, see Appendix D, Changes from Unicode Version 4.0.

Unicode Character Database
The Unicode Character Database (UCD) was extended to cover the character repertoire additions, and new block definitions and script values were added. A number of other updates were made, as listed here:

Scripts. Unassigned code points were given a new Script property value of "Zzzz": this may require some change in code using this property. Three Mongolian punctuation marks and two archaic letters changed script value.
Case-Related Properties. To allow for the new policy on case-folding stability, lowercase variants of several characters were added, and the mappings for the uppercase variants changed.
Bidirectional Behavior. The list of characters with the Bidi_Mirrored property was made consistent for brackets and quotation marks, in preparation for new constraints on bidi mirroring. The Bidi_Class property for five archaic characters was changed to L.
Line Break. The Line_Break property of seven punctuation characters and two bracket characters was changed to Alphabetic (AL) to better match their expected behavior. Numerous characters for Southeast Asian scripts, which require complex contextual linebreaking, were changed to Complex_Context (SA).
New Properties. Normative_Name_Alias and the metaproperty, Deprecated, were added. The Jamo_Short_Name property was documented as a contributory property.
General Category. Seven archaic characters plus U+0294 LATIN LETTER GLOTTAL STOP changed categories.
Numeric Properties. The archaic character U+10341 GOTHIC LETTER NINETY was given the numeric value 90.
Unihan. The kIICore field was made a normative property, and three new provisional properties were added: kCheungBauer, kCheungBauerIndex, and kFourCornerCoverage. There were numerous additions to the kCangjie property.
Text Breaking. Grapheme_Link was deprecated as a property.

For more information, see the file [URL=http://www.unicode.org/Public/5.0.0/ucd/UCD.html]UCD.html[/URL] in the [URL=http://www.unicode.org/Public/5.0.0/ucd/]Unicode Character Database[/URL].
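Some of the properties mentioned in this list can be inspected from code. The following is a minimal sketch using Python's standard `unicodedata` module, which is an assumption of this example and not something the announcement refers to. The module ships a much newer version of the Unicode Character Database than 5.0.0 and does not expose every property (Script and Line_Break are missing, for instance), so it only illustrates the kind of lookup involved. The `describe` helper is hypothetical.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
        # numeric() raises ValueError without a default, so pass None.
        "numeric_value": unicodedata.numeric(ch, None),
    }


# UCD version bundled with this Python build (newer than 5.0.0).
print(unicodedata.unidata_version)

# U+10341 GOTHIC LETTER NINETY, U+0294 LATIN LETTER GLOTTAL STOP, and a bracket.
for ch in ("\U00010341", "\u0294", "("):
    print(describe(ch))
```

To check the values exactly as they stood in Version 5.0.0, the data files under the 5.0.0 directory of the Unicode Character Database would have to be parsed directly.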

Conformance
Details regarding the conformance changes to the standard for Version 5.0 are specified in the text of the standard itself, including the Unicode Standard Annexes. As noted above, the book and the Unicode Standard Annexes will be available in the fourth quarter of 2006.

Chapter 3, Conformance, was substantially improved by incorporating much of the Unicode Property Model, enhancing the treatment of combining characters, and further clarifying canonical ordering behavior through the addition of clearly defined principles. Additionally, conformance clauses and definitions were renumbered for overall readability and clarity of the text. Significant clarifications or modifications to character behavior include those listed below:

Stability of Cased Letters. If uppercase characters are added in cased scripts, the corresponding lowercase characters will be added as well, so that case folding is stable.
Stability of Named Character Sequences. An initial provisional phase was incorporated into the process for defining Named Character Sequences, so that approved Named Character Sequences will be immutable.
Disunification of Diacritics. Criteria for disunifying diacritics were established.
Indic Scripts. Zero width joiner and zero width non-joiner can now be used to encourage or discourage ligation in Bengali; the sequence for Gurmukhi double vowels was determined, and the shaping of ra in Tamil was updated.
Combining Marks. The use of combining grapheme joiner with Latin script diacritics was clarified.
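The case-folding stability guarantee matters most to code that stores or compares folded strings, such as identifiers or lookup keys in a protocol. The sketch below shows a caseless comparison built on Python's `str.casefold()` (available from Python 3.3). The language choice and the `caseless_equal` helper are assumptions made for this example.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


# Full case folding maps sharp s to "ss" ...
print(caseless_equal("Straße", "STRASSE"))
# ... and maps both sigma forms (final and non-final) to the same letter.
print(caseless_equal("ΣΟΦΟΣ", "σοφος"))
# Plain lower() is not enough for either comparison.
print("Straße".lower() == "STRASSE".lower())
```

With a stable folding, a key that was folded and stored under one version of the standard is intended to still match the same input when it is folded again under a later version.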

Unicode Standard Annexes
In UAX #9, "Bidirectional Algorithm," for better interoperability, the algorithm was modified to tighten up the conformance requirements for using mirrored glyphs for characters. Higher level protocols are discouraged, due to interoperability and security considerations. The definition of directional run was changed to be the same as level run, and the use of soft-hyphen with bidi text was clarified.
In UAX #14, "Line Breaking Properties," a number of rules were modified, the use of soft hyphen in cursive scripts was documented, the conformance clauses were restated and the algorithm was reorganized into tailorable and non-tailorable sections, and the normative status was made consistent with Chapter 3, Conformance. As a result of the restatement of conformance, the Line_Break property became normative.
In UAX #15, "Unicode Normalization Forms," the new Stream-Safe Text Format was added, allowing the use of normalization in protocols designed for streaming. The stability guarantees are described in more detail, with guidelines provided for guaranteeing process stability, and a new appendix listing precisely those character sequences that require special handling. Additional figures clarify the effects of normalization, and the types of characters affected.
In UAX #29, "Text Boundaries," the format of the rules was changed to make them much easier to implement -- without changing the results. The guidelines for how to use regex-style rules were revamped completely. A number of edge cases are also now handled properly, and information was added on the relation to identifiers, use of normalization, tailoring, application to spelling checkers, and how to use the supplied test data. Tailorings for text boundaries can now also be entered into the Unicode Common Locale Data Repository [[URL=http://www.unicode.org/cldr/]CLDR[/URL]].
UAX #31, "Identifier and Pattern Syntax," introduced profiles, and added notes on profiles of identifiers for natural languages and the use of spaces in identifiers.
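To make the UAX #15 item above more concrete, here is a minimal sketch of a Stream-Safe check in Python. It relies on two assumptions that are not stated in this post: that UAX #15 defines Stream-Safe text as text whose NFKD form contains no run of more than 30 non-starters, and that Python's standard `unicodedata` module is an acceptable stand-in for the normalization data. The function name is hypothetical, and the sketch only tests text; it does not insert combining grapheme joiners to repair it.

```python
import unicodedata

MAX_NONSTARTERS = 30  # limit assumed from UAX #15, not from the announcement


def is_stream_safe(text: str) -> bool:
    """Return True if NFKD(text) has no run of more than 30 non-starters."""
    run = 0
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch) != 0:  # non-zero combining class: non-starter
            run += 1
            if run > MAX_NONSTARTERS:
                return False
        else:
            run = 0  # a starter ends the current run
    return True


acute = "\u0301"  # COMBINING ACUTE ACCENT
print(is_stream_safe("e" + acute * 30))
print(is_stream_safe("e" + acute * 31))

# Ordinary normalization for comparison: NFC composes, NFD decomposes.
print(unicodedata.normalize("NFC", "e" + acute) == "\u00e9")
print(unicodedata.normalize("NFD", "\u00e9") == "e" + acute)
```

Bounding the run length lets a streaming normalizer work with a small fixed-size buffer, since it never has to hold an arbitrarily long sequence of combining marks before it can reorder them.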

