Unified Hangul Code
Layout of the Unified Hangul Code | |
Alias(es) | Windows Code Page 949, IBM Code Page 1363 |
---|---|
Language(s) | Korean |
Standard | WHATWG Encoding Standard (as "EUC-KR")[1] |
Classification | Extended ISO 646,[lower-alpha 1] variable-width encoding, CJK encoding |
Extends | EUC-KR |
Other related encoding(s) | KPS 9566-2003, KPS 9566-2011 |
| |
Unified Hangul Code (UHC),[2][lower-alpha 1] or Extended Wansung,[4][lower-alpha 2] also known under Microsoft Windows as Code Page 949 (Windows-949, MS949 or ambiguously CP949), is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code (KS C 5601:1987, encoded as EUC-KR) to include all 11172 Hangul syllables present in Johab (KS C 5601:1992 annex 3).[4][2] This corresponds to the pre-composed syllables available in Unicode 2.0 and later.
Wansung Code has the drawback that it only assigns codes for the 2350 precomposed Hangul syllables which have their own KS X 1001 (KS C 5601) codepoints (out of 11172 in total, not counting those using obsolete jamo), and requires others to use eight-byte composition sequences, which are not supported by some partial implementations of the standard.[5] UHC resolves this by assigning single codes for all possible syllables constructed using modern jamo, by making assignments outside of the encoding space used for KS X 1001.
The lead byte range is extended to 0x81–FE, and the trail byte range is extended to 0x41–5A, 0x61–7A and 0x81–FE (in EUC-KR, both ranges are 0xA1–FE). The codes outside the EUC-KR ranges are used for the additional hangul.[6]
Terminology
Unified Hangul Code is not registered with IANA as a standard to communicate information over the Internet.[7] Alternatives include UTF-8. However, the W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of "EUC-KR".[1]
Microsoft assigns Windows-949 the label "ks_c_5601-1987",[8][9] which properly applies to KS X 1001 itself (KS C 5601 being the original name of KS X 1001).[10] The WHATWG treat the label "ks_c_5601-1987" interchangeably with "EUC-KR" with the intent of being "compatible with deployed content".[11] The Unicode Consortium's "OBSOLETE/EASTASIA" collection of withdrawn mappings included mappings for Unified Hangul Code as "KSC5601.TXT", with the automatically derived mappings for 7-bit KS X 1001 being included as "KSX1001.TXT".[12]
IBM's code page 949 is another, otherwise unrelated, extension of EUC-KR. International Components for Unicode (ICU) uses "cp949", "949" or "ibm-949" to refer to that IBM code page,[13] and "ms949" or "windows-949" (or several variants of "ks_c_5601-1987") to refer to the Windows mapping of UHC.[14] Python, by contrast, recognises "cp949", "949", "ms949" and "uhc" as labels for UHC, and does not include an IBM-949 codec.[15] Out of the labels incorporating the code page number, the WHATWG recognise only "windows-949".[11]
IBM's code page for Unified Hangul Code is called Code page 1363 (IBM-1363), or "Korean MS-Win". It is a combination of SBCS Code page 1126 and DBCS Code page 1362.[16][17][18][19][20] It differs in having a single byte mapping of 0x5C to the Won sign (U+20A9);[21][22][23] Windows maps 0x5C to U+005C (the Unicode code point for the backslash) as in ASCII,[14] although fonts often still render it as a Won sign.[24] Unicode mapping of the wave dash (0xA1AD) also differs, with the IBM mapping favouring U+301C,[25] while the Microsoft mapping favours U+223C (Tilde Operator).[26] The IBM mapping for UHC is available as "ibm-1363" in ICU,[21] whereas the ICU "windows-949" codec is referred to as IBM-1261 in some ICU source code comments.[27]
Footnotes
References
- ↑ 1.0 1.1 van Kesteren, Anne, "5. Indexes (§ index EUC-KR)", Encoding Standard (WHATWG), https://encoding.spec.whatwg.org/#index-euc-kr
- ↑ 2.0 2.1 "INFO: Hangul (Korean) Character Sets", Microsoft Support (Microsoft), https://support.microsoft.com/en-gb/help/170557/info-hangul-korean-character-sets
- ↑ "한글 코드에 대하여" (in ko). W3C. http://www.w3c.or.kr/i18n/hangul-i18n/ko-code.html.
- ↑ 4.0 4.1 Zsigri, Gyula (2002-06-18). "KSC and UHC". http://zsigri.tripod.com/fontboard/cjk/ksc.html.
- ↑ Shin, Jungshik. "What are KS X 1001(KS C 5601) and other Hangul codes?". Hangul & Internet in Korea FAQ. http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html.
- ↑ Lunde, Ken (13 January 2009), "Appendix F: Vendor encoding Methods", CJKV Information Processing (O'Reilly Media), ISBN 978-0-596-51447-1, https://resources.oreilly.com/examples/9780596514471/blob/master/cjkvip2e-appF.pdf
- ↑ "Character Sets". https://www.iana.org/assignments/character-sets.
- ↑ "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft. https://msdn.microsoft.com/en-us/library/system.text.encoding.windowscodepage(v=vs.110).aspx.
- ↑ "Code Page Identifiers", Windows Dev Center (Microsoft), https://docs.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers
- ↑ "convrtrs.txt". International Components for Unicode. https://opensource.apple.com/source/ICU/ICU-59180.0.1/icuSources/data/mappings/convrtrs.txt.auto.html. "<quote from="Jungshik Shin"> [...] using KS C 5601 or related names to denote EUC-KR or windows-949 is very much misleading [...] It's just the name of a 94 x 94 Korean coded character set standard which can be invoked on either GL (with MSB reset) or GR (with MSB set)."
- ↑ 11.0 11.1 van Kesteren, Anne. "4.2. Names and labels". Encoding Standard. WHATWG. https://encoding.spec.whatwg.org/#names-and-labels.
- ↑ Jungshik Shin. "KSX1001.TXT: KS X 1001 to Unicode table". Unicode, Inc. https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT.
- ↑ "ibm-949_P110-1999 (alias cp949)", Converter Explorer (International Components for Unicode), https://ssl.icu-project.org/icu-bin/convexp?conv=cp949
- ↑ 14.0 14.1 "windows-949-2000", Converter Explorer (International Components for Unicode), http://demo.icu-project.org/icu-bin/convexp?conv=windows-949-2000
- ↑ "codecs — Codec registry and base classes § Standard Encodings". Python 3.7.2 documentation. Python Software Foundation. https://docs.python.org/3.7/library/codecs.html#standard-encodings.
- ↑ "Coded character set identifiers - CCSID 1363", IBM Globalization (IBM), https://www-01.ibm.com/software/globalization/ccsid/ccsid1363.html
- ↑ "Code page 1126 information document". https://www-01.ibm.com/software/globalization/cp/cp01126.html.
- ↑ "CCSID 1126 information document". http://www-01.ibm.com/software/globalization/ccsid/ccsid1126.html.
- ↑ "Code page 1362 information document". https://www-01.ibm.com/software/globalization/cp/cp01362.html.
- ↑ "CCSID 1362 information document". http://www-01.ibm.com/software/globalization/ccsid/ccsid1362.html.
- ↑ 21.0 21.1 "ibm-1363", Converter Explorer (International Components for Unicode), http://demo.icu-project.org/icu-bin/convexp?conv=ibm-1363
- ↑ Code Page CPGID 01126 (pdf), IBM, ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP01126.pdf
- ↑ Code Page CPGID 01126 (txt), IBM, ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP01126.txt
- ↑ Kaplan, Michael S. (2005-09-17), "When is a backslash not a backslash?", Sorting it all out, http://archives.miloush.net/michkap/archive/2005/09/17/469941.html
- ↑ "ibm-1363_P110-1997 (lead byte A1)". ICU Demonstration - Converter Explorer. International Components for Unicode. https://ssl.icu-project.org/icu-bin/convexp?conv=ibm-1363_P110-1997&b=A1&s=ALL#layout.
- ↑ "windows-949-2000 (lead byte A1)". ICU Demonstration - Converter Explorer. International Components for Unicode. https://ssl.icu-project.org/icu-bin/convexp?conv=windows-949-2000&b=A1&s=ALL#layout.
- ↑ See, for reference, ucnv_lmb.cpp (Brendan Murray, Jim Snyder-Grant), where the lead byte 0x11 is commented as referring to "Korean: ibm-1261" after the definition of
ULMBCS_GRP_KO
, but it is mapped to the"windows-949"
ICU codec in theOptGroupByteToCPName
array later in the file.
External links
- Microsoft's Reference for Windows-949
- IBM's documentation for IBM-1363
- Mapping of Windows-949 to Unicode
- International Components for Unicode (ICU) mapping files: ibm-1363_P110-1997.ucm, ibm-1363_P11B-1998.ucm, and windows-949-2000.ucm
- ICU demonstration for Windows-949 (with ASCII mappings)
- ICU demonstration for IBM-1363 (with 0x5C as Won sign)