Edit D:\app\Administrator\product\11.2.0\dbhome_1\perl\html\lib\Encode\Unicode.html
<?xml version="1.0" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>Encode::Unicode -- Various Unicode Transformation Formats</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> <link rev="made" href="mailto:" /> </head> <body style="background-color: white"> <table border="0" width="100%" cellspacing="0" cellpadding="3"> <tr><td class="block" style="background-color: #cccccc" valign="middle"> <big><strong><span class="block"> Encode::Unicode -- Various Unicode Transformation Formats</span></strong></big> </td></tr> </table> <!-- INDEX BEGIN --> <div name="index"> <p><a name="__index__"></a></p> <ul> <li><a href="#name">NAME</a></li> <li><a href="#synopsis">SYNOPSIS</a></li> <li><a href="#abstract">ABSTRACT</a></li> <li><a href="#size__endianness__and_bom">Size, Endianness, and BOM</a></li> <ul> <li><a href="#by_size">by size</a></li> <li><a href="#by_endianness">by endianness</a></li> </ul> <li><a href="#surrogate_pairs">Surrogate Pairs</a></li> <li><a href="#error_checking">Error Checking</a></li> <li><a href="#see_also">SEE ALSO</a></li> </ul> <hr name="index" /> </div> <!-- INDEX END --> <p> </p> <h1><a name="name">NAME</a></h1> <p>Encode::Unicode -- Various Unicode Transformation Formats</p> <p> </p> <hr /> <h1><a name="synopsis">SYNOPSIS</a></h1> <pre> use Encode qw/encode decode/; $ucs2 = encode("UCS-2BE", $utf8); $utf8 = decode("UCS-2BE", $ucs2);</pre> <p> </p> <hr /> <h1><a name="abstract">ABSTRACT</a></h1> <p>This module implements all Character Encoding Schemes of Unicode that are officially documented by Unicode Consortium (except, of course, for UTF-8, which is a native format in perl).</p> <dl> <dt><strong><a name="http_www_unicode_org_glossary_says" class="item"><a href="http://www.unicode.org/glossary/">http://www.unicode.org/glossary/</a> says:</a></strong> <dd> <p><em>Character Encoding Scheme</em> A character encoding form plus byte serialization. There are Seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 (UCS-4), UTF-32BE (UCS-4BE) and UTF-32LE (UCS-4LE), and UTF-7.</p> </dd> <dd> <p>Since UTF-7 is a 7-bit (re)encoded version of UTF-16BE, It is not part of Unicode's Character Encoding Scheme. It is separately implemented in Encode::Unicode::UTF7. For details see <a href="file://C|\ADE\aime_smenon_perl_090715\perl\html/lib/Encode/Unicode/UTF7.html">the Encode::Unicode::UTF7 manpage</a>.</p> </dd> </li> <dt><strong><a name="quick_reference" class="item">Quick Reference</a></strong> <dd> <pre> Decodes from ord(N) Encodes chr(N) to... octet/char BOM S.P d800-dfff ord > 0xffff \x{1abcd} == ---------------+-----------------+------------------------------ UCS-2BE 2 N N is bogus Not Available UCS-2LE 2 N N bogus Not Available UTF-16 2/4 Y Y is S.P S.P BE/LE UTF-16BE 2/4 N Y S.P S.P 0xd82a,0xdfcd UTF-16LE 2/4 N Y S.P S.P 0x2ad8,0xcddf UTF-32 4 Y - is bogus As is BE/LE UTF-32BE 4 N - bogus As is 0x0001abcd UTF-32LE 4 N - bogus As is 0xcdab0100 UTF-8 1-4 - - bogus >= 4 octets \xf0\x9a\af\8d ---------------+-----------------+------------------------------</pre> </dd> </dl> <p> </p> <hr /> <h1><a name="size__endianness__and_bom">Size, Endianness, and BOM</a></h1> <p>You can categorize these CES by 3 criteria: size of each character, endianness, and Byte Order Mark.</p> <p> </p> <h2><a name="by_size">by size</a></h2> <p>UCS-2 is a fixed-length encoding with each character taking 16 bits. It <strong>does not</strong> support <em>surrogate pairs</em>. When a surrogate pair is encountered during <code>decode()</code>, its place is filled with \x{FFFD} if <em>CHECK</em> is 0, or the routine croaks if <em>CHECK</em> is 1. When a character whose ord value is larger than 0xFFFF is encountered, its place is filled with \x{FFFD} if <em>CHECK</em> is 0, or the routine croaks if <em>CHECK</em> is 1.</p> <p>UTF-16 is almost the same as UCS-2 but it supports <em>surrogate pairs</em>. When it encounters a high surrogate (0xD800-0xDBFF), it fetches the following low surrogate (0xDC00-0xDFFF) and <code>desurrogate</code>s them to form a character. Bogus surrogates result in death. When \x{10000} or above is encountered during <code>encode()</code>, it <code>ensurrogate</code>s them and pushes the surrogate pair to the output stream.</p> <p>UTF-32 (UCS-4) is a fixed-length encoding with each character taking 32 bits. Since it is 32-bit, there is no need for <em>surrogate pairs</em>.</p> <p> </p> <h2><a name="by_endianness">by endianness</a></h2> <p>The first (and now failed) goal of Unicode was to map all character repertoires into a fixed-length integer so that programmers are happy. Since each character is either a <em>short</em> or <em>long</em> in C, you have to pay attention to the endianness of each platform when you pass data to one another.</p> <p>Anything marked as BE is Big Endian (or network byte order) and LE is Little Endian (aka VAX byte order). For anything not marked either BE or LE, a character called Byte Order Mark (BOM) indicating the endianness is prepended to the string.</p> <p>CAVEAT: Though BOM in utf8 (\xEF\xBB\xBF) is valid, it is meaningless and as of this writing Encode suite just leave it as is (\x{FeFF}).</p> <dl> <dt><strong><a name="bom_as_integer_when_fetched_in_network_byte_order" class="item">BOM as integer when fetched in network byte order</a></strong> <dd> <pre> 16 32 bits/char ------------------------- BE 0xFeFF 0x0000FeFF LE 0xFFFe 0xFFFe0000 -------------------------</pre> </dd> </dl> <p>This modules handles the BOM as follows.</p> <ul> <li> <p>When BE or LE is explicitly stated as the name of encoding, BOM is simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).</p> </li> <li> <p>When BE or LE is omitted during <code>decode()</code>, it checks if BOM is at the beginning of the string; if one is found, the endianness is set to what the BOM says. If no BOM is found, the routine dies.</p> </li> <li> <p>When BE or LE is omitted during <code>encode()</code>, it returns a BE-encoded string with BOM prepended. So when you want to encode a whole text file, make sure you <code>encode()</code> the whole text at once, not line by line or each line, not file, will have a BOM prepended.</p> </li> <li> <p><code>UCS-2</code> is an exception. Unlike others, this is an alias of UCS-2BE. UCS-2 is already registered by IANA and others that way.</p> </li> </ul> <p> </p> <hr /> <h1><a name="surrogate_pairs">Surrogate Pairs</a></h1> <p>To say the least, surrogate pairs were the biggest mistake of the Unicode Consortium. But according to the late Douglas Adams in <em>The Hitchhiker's Guide to the Galaxy</em> Trilogy, <code>In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move</code>. Their mistake was not of this magnitude so let's forgive them.</p> <p>(I don't dare make any comparison with Unicode Consortium and the Vogons here ;) Or, comparing Encode to Babel Fish is completely appropriate -- if you can only stick this into your ear :)</p> <p>Surrogate pairs were born when the Unicode Consortium finally admitted that 16 bits were not big enough to hold all the world's character repertoires. But they already made UCS-2 16-bit. What do we do?</p> <p>Back then, the range 0xD800-0xDFFF was not allocated. Let's split that range in half and use the first half to represent the <code>upper half of a character</code> and the second half to represent the <code>lower half of a character</code>. That way, you can represent 1024 * 1024 = 1048576 more characters. Now we can store character ranges up to \x{10ffff} even with 16-bit encodings. This pair of half-character is now called a <em>surrogate pair</em> and UTF-16 is the name of the encoding that embraces them.</p> <p>Here is a formula to ensurrogate a Unicode character \x{10000} and above;</p> <pre> $hi = ($uni - 0x10000) / 0x400 + 0xD800; $lo = ($uni - 0x10000) % 0x400 + 0xDC00;</pre> <p>And to desurrogate;</p> <pre> $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);</pre> <p>Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but perl does not prohibit the use of characters within this range. To perl, every one of \x{0000_0000} up to \x{ffff_ffff} (*) is <em>a character</em>.</p> <pre> (*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit integer support!</pre> <p> </p> <hr /> <h1><a name="error_checking">Error Checking</a></h1> <p>Unlike most encodings which accept various ways to handle errors, Unicode encodings simply croaks.</p> <pre> % perl -MEncode -e '$_ = "\xfe\xff\xd8\xd9\xda\xdb\0\n"' \ -e 'Encode::from_to($_, "utf16","shift_jis", 0); print' UTF-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184. % perl -MEncode -e '$a = "BOM missing"' \ -e ' Encode::from_to($a, "utf16", "shift_jis", 0); print' UTF-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.</pre> <p>Unlike other encodings where mappings are not one-to-one against Unicode, UTFs are supposed to map 100% against one another. So Encode is more strict on UTFs.</p> <p>Consider that "division by zero" of Encode :)</p> <p> </p> <hr /> <h1><a name="see_also">SEE ALSO</a></h1> <p><a href="file://C|\ADE\aime_smenon_perl_090715\perl\html/lib/Encode.html">the Encode manpage</a>, <a href="file://C|\ADE\aime_smenon_perl_090715\perl\html/lib/Encode/Unicode/UTF7.html">the Encode::Unicode::UTF7 manpage</a>, <a href="http://www.unicode.org/glossary/">http://www.unicode.org/glossary/</a>, <a href="http://www.unicode.org/unicode/faq/utf_bom.html">http://www.unicode.org/unicode/faq/utf_bom.html</a>,</p> <p>RFC 2781 <a href="http://rfc.net/rfc2781.html">http://rfc.net/rfc2781.html</a>,</p> <p>The whole Unicode standard <a href="http://www.unicode.org/unicode/uni2book/u2.html">http://www.unicode.org/unicode/uni2book/u2.html</a></p> <p>Ch. 15, pp. 403 of <code>Programming Perl (3rd Edition)</code> by Larry Wall, Tom Christiansen, Jon Orwant; O'Reilly & Associates; ISBN 0-596-00027-8</p> <table border="0" width="100%" cellspacing="0" cellpadding="3"> <tr><td class="block" style="background-color: #cccccc" valign="middle"> <big><strong><span class="block"> Encode::Unicode -- Various Unicode Transformation Formats</span></strong></big> </td></tr> </table> </body> </html>
Ms-Dos/Windows
Unix
Write backup
jsp File Browser version 1.2 by
www.vonloesch.de