Character encoding

Short description: Using numbers to represent text characters

Punched tape with the word "Wikipedia" encoded in ASCII. Presence and absence of a hole represents 1 and 0, respectively; for example, W is encoded as `1010111`.

Character encoding is a convention of using a numeric value to represent each character of a writing script. Not only can a character set include natural language symbols, but it can also include codes that have meanings or functions outside of language, such as control characters and whitespace. Character encodings have also been defined for some constructed languages. When encoded, character data can be stored, transmitted, and transformed by a computer.^[1] The numerical values that make up a character encoding are known as code points and collectively comprise a code space or a code page.
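As a small illustration of the mapping between characters and numeric values, the following sketch prints the ASCII value of each letter of the word "Wikipedia" together with its seven-bit pattern. The variable names are chosen for this example only.

```python
# Map each character of a word to its numeric value and 7-bit pattern.
word = "Wikipedia"
for ch in word:
    code = ord(ch)               # numeric value assigned to the character
    bits = format(code, "07b")   # the same value written as seven binary digits
    print(ch, code, bits)

# The first letter, W, has the value 87, which is 1010111 in binary.
w_bits = format(ord("W"), "07b")
```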

Early character encodings that originated with optical or electrical telegraphy and in early computers could only represent a subset of the characters used in languages, sometimes restricted to upper case letters, numerals and limited punctuation. Over time, encodings capable of representing more characters were created, such as ASCII, ISO/IEC 8859, and Unicode encodings such as UTF-8 and UTF-16.

The most popular character encoding on the World Wide Web is UTF-8, which is used in 98.9% of surveyed web sites, as of January 2026.^[2] In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options.^[3]

History

The history of character codes illustrates the evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means. The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, international maritime signal flags, and the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869). With the adoption of electrical and electro-mechanical techniques these earliest codes were adapted to the new capabilities and limitations of the early machines. The earliest well-known electrically transmitted character code, Morse code, introduced in the 1840s, used a system of four "symbols" (short signal, long signal, short space, long space) to generate codes of variable length. Though some commercial use of Morse code was via machinery, it was often used as a manual code, generated by hand on a telegraph key and decipherable by ear, and persists in amateur radio and aeronautical use. Most codes are of fixed per-character length or variable-length sequences of fixed-length codes (e.g. Unicode).^[4]

Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode, a well-defined and extensible encoding system, has replaced most earlier character encodings, but the path of code development to the present is fairly well known.

The Baudot code, a five-bit encoding, was created by Émile Baudot in 1870, patented in 1874, modified by Donald Murray in 1901, and standardized by CCITT as International Telegraph Alphabet No. 2 (ITA2) in 1930. The name baudot has been erroneously applied to ITA2 and its many variants. ITA2 suffered from many shortcomings and was often improved by many equipment manufacturers, sometimes creating compatibility issues.

Hollerith 80-column punch card with EBCDIC character set

Herman Hollerith invented punch card data encoding in the late 19th century to analyze census data. Initially, each hole position represented a different data element, but later, numeric information was encoded by numbering the lower rows 0 to 9, with a punch in a column representing its row number. Later, alphabetic data was encoded by allowing more than one punch per column. Electromechanical tabulating machines represented data internally by the timing of pulses relative to the motion of the cards through the machine.

When IBM went to electronic processing, starting with the IBM 603 Electronic Multiplier, it used a variety of binary encoding schemes that were tied to the punch card code. IBM used several binary-coded decimal (BCD) six-bit character encoding schemes, starting as early as 1953 in its 702^[5] and 704 computers, and in its later 7000 Series and 1400 series, as well as in associated peripherals. Since the punched card code then in use was limited to digits, upper-case English letters and a few special characters, six bits were sufficient. These BCD encodings extended existing simple four-bit numeric encoding to include alphabetic and special characters, mapping them easily to punch-card encoding which was already in widespread use. IBM's codes were used primarily with IBM equipment. Other computer vendors of the era had their own character codes, often six-bit, such as the encoding used by the UNIVAC I.^[6] They usually had the ability to read tapes produced on IBM equipment. IBM's BCD encodings were the precursors of their Extended Binary-Coded Decimal Interchange Code (usually abbreviated as EBCDIC), an eight-bit encoding scheme developed in 1963 for the IBM System/360 that featured a larger character set, including lower case letters.

In 1959, the U.S. military defined its Fieldata code, a six- or seven-bit code, introduced by the U.S. Army Signal Corps. While Fieldata addressed many of the then-modern issues (e.g. letter and digit codes arranged for machine collation), it fell short of its goals and was short-lived. In 1963 the first ASCII code was released (X3.4-1963) by the ASCII committee (which contained at least one member of the Fieldata committee, W. F. Leubbert), which addressed most of the shortcomings of Fieldata, using a simpler seven-bit code. Many of the changes were subtle, such as collatable character sets within certain numeric ranges. ASCII63 was a success, widely adopted by industry, and with the follow-up issue of the 1967 ASCII code (which added lower-case letters and fixed some "control code" issues) ASCII67 was adopted fairly widely. ASCII67's American-centric nature was somewhat addressed in the European ECMA-6 standard.^[7] Eight-bit extended ASCII encodings, such as various vendor extensions and the ISO/IEC 8859 series, supported all ASCII characters as well as additional non-ASCII characters.

While trying to develop universally interchangeable character encodings, researchers in the 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources (as they would always be zeroed out for such users). In 1985, the average personal computer user's hard disk drive could store only about 10 megabytes, and it cost approximately US$250 on the wholesale market (and much higher if purchased separately at retail),^[8] so it was very important at the time to make every bit count.

The compromise solution that was eventually found was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to a particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points. Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points too large to fit in a single code unit, such as those above 255 for eight-bit units, the solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point.

Terminology

The various terms related to character encoding are often used inconsistently or incorrectly.^[9] Historically, the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units – usually with a single character per code unit. However, due to the emergence of more sophisticated character encodings, the distinction between terms has become important.

Character

A character is the smallest unit of text that has semantic value.^[9]^[10] In linguistics, this is called a grapheme, and each of the various ways it may be written is called a glyph. (For example, the serif form g and the sans-serif form g are each a glyph of the grapheme ⟨g⟩, U+0067 LATIN SMALL LETTER G.)

What constitutes a character varies between character encodings. For example, for letters with diacritics, there are two distinct approaches that can be taken to encode them. They can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into a single glyph. The former simplifies the text handling system, but the latter allows any letter/diacritic combination to be used in text. Ligatures pose similar problems. Some writing systems, such as Arabic and Hebrew, have graphemes whose shape and joining depend on context.
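The two approaches to letters with diacritics can be sketched with Unicode normalization, using Python's standard unicodedata module. The letter é is used here purely as an illustration; it exists both as a precomposed character and as a base letter followed by a combining accent.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (LATIN SMALL LETTER E WITH ACUTE)
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The two strings render alike but are different code point sequences.
same_before = precomposed == combining

# NFC composes to the precomposed form; NFD decomposes into base + combining mark.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)

print(same_before, nfc == precomposed, nfd == combining)
```

Comparing text only after normalizing both sides to the same form is one common way to treat the two spellings as equal.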

Character set

A character set is a collection of characters used to represent text.^[9]^[10] For example, the Latin alphabet and Greek alphabet are character sets.

Coded character set

A coded character set is a character set with each item uniquely mapped to a numeric value.^[10]

This is also known as a code page,^[9] although that term is generally antiquated. Originally, code page referred to a page number in an IBM manual that defined a particular character encoding.^[11] Other vendors, including Microsoft, SAP, and Oracle Corporation, also published their own code pages, including the notable Windows code pages and code page 437. Despite no longer referring to specific pages in a manual, many character encodings are still identified by the same number. Likewise, the term code page is still used to refer to a character encoding.

In Unix and Unix-like systems, the term charmap is commonly used; usually in the larger context of locales.

IBM's Character Data Representation Architecture (CDRA) designates each entity with a coded character set identifier (CCSID), which is variously called a charset, character set, code page, or CHARMAP.^[12]

Character repertoire

A character repertoire is a set of characters that can be represented by a particular coded character set.^[10]^[13] The repertoire may be closed, meaning that no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series); or it may be open, allowing additions (as is the case with Unicode and to a limited extent Windows code pages).^[13]

Code point

A code point is the value or position of a character in a coded character set.^[10] A code point is represented by a sequence of code units. The mapping is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding:

UTF-8: code points map to a sequence of one, two, three or four code units.
UTF-16: code units are twice as long as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value U+10000 or higher require two code units each. These pairs of code units have a unique term in UTF-16: "Unicode surrogate pairs".
UTF-32: the 32-bit code unit is large enough that every code point is represented as a single code unit.
GB 18030: multiple code units per code point are common, because of the small code units. Code points are mapped to one, two, or four code units.^[14]
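The dependence of the code unit count on the encoding can be checked with a short sketch. The helper name code_units and the three sample characters are assumptions made for this illustration; the codec names are those of Python's standard library.

```python
def code_units(ch: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed to encode a single code point."""
    return len(ch.encode(encoding)) // unit_bytes

# (codec name, size of one code unit in bytes)
ENCODINGS = [("utf-8", 1), ("utf-16-le", 2), ("utf-32-le", 4), ("gb18030", 1)]

# U+0041 (A), U+4E2D (a BMP Han character), U+10400 (a supplementary character)
SAMPLES = ["A", "\u4e2d", "\U00010400"]

counts = {
    (f"U+{ord(ch):04X}", enc): code_units(ch, enc, size)
    for ch in SAMPLES
    for enc, size in ENCODINGS
}

for (code_point, enc), n in counts.items():
    print(f"{code_point:8} {enc:10} {n} code unit(s)")
```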

Code space

Code space is the range of numerical values spanned by a coded character set.^[10]^[12]

Code unit

A code unit is the minimum bit combination that can represent a character in a character encoding (in computer science terms, it is the word size of the character encoding).^[10]^[12] Common code units include 7-bit, 8-bit, 16-bit, and 32-bit. In some encodings, some characters are encoded as multiple code units.

For example:

ASCII: 7 bits
UTF-8, EBCDIC and GB 18030: 8 bits
UTF-16: 16 bits
UTF-32: 32 bits

Unicode encoding

Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes, Unicode separately defines a coded character set that maps characters to unique natural numbers (code points), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe the model precisely, Unicode uses existing terms and defines new terms.^[12]

Abstract character repertoire

An abstract character repertoire (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to the repertoire over time.

Coded character set

A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map them to different code points.
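As a rough sketch of how coded character sets can share a repertoire while assigning different values, the following compares ISO/IEC 8859-1 with IBM code pages 037 and 500 using the codecs bundled with Python ("latin-1", "cp037", "cp500"). The sample characters are arbitrary.

```python
# The same characters, looked up in three coded character sets.
CODECS = ["latin-1", "cp037", "cp500"]

for ch in ["A", "\u00e9"]:           # A and é are in all three repertoires
    for codec in CODECS:
        value = ch.encode(codec)[0]  # each of these codecs uses one byte per character
        print(f"{ch!r} in {codec}: 0x{value:02X}")

latin1_a = "A".encode("latin-1")     # 0x41, the value 65 mentioned above
cp037_a = "A".encode("cp037")        # 0xC1, the EBCDIC position of A
```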

Character encoding form

A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF.
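For 16-bit units, the encoding form used by UTF-16 splits a large code point across two units (a surrogate pair). The arithmetic can be sketched as follows; the function name to_utf16_units is hypothetical, and in practice a library codec would normally be used instead.

```python
from typing import List

def to_utf16_units(code_point: int) -> List[int]:
    """Map one Unicode code point to its UTF-16 code units (a sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]              # fits in a single 16-bit unit
    offset = code_point - 0x10000        # 20-bit value to split in two halves
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return [high, low]

print([hex(u) for u in to_utf16_units(0x10400)])
```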

Character encoding scheme

A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE, and UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit (such as SCSU and BOCU).
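The difference between the simple and compound UTF-16 schemes can be seen in the bytes they produce. This sketch relies on Python's standard codecs; the "utf-16" codec writes a byte order mark followed by code units in the platform's native byte order, so only its first two bytes are inspected here.

```python
import codecs

text = "a\U00010400"                  # one BMP character and one supplementary character

big_endian = text.encode("utf-16-be")     # most significant byte of each unit first
little_endian = text.encode("utf-16-le")  # least significant byte of each unit first
with_bom = text.encode("utf-16")          # byte order mark, then native-order units

print(big_endian.hex(" "))
print(little_endian.hex(" "))
print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
```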

Although UTF-32BE and UTF-32LE are simpler CESes, most systems working with Unicode use either UTF-8, which is backward compatible with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or UTF-16BE,^[citation needed] which is backward compatible with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion.

Higher-level protocol

There may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as the same character. An example is the XML attribute xml:lang.

The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers.^[12]

Code point documentation

A character is commonly documented as 'U+' followed by its code point value in hexadecimal. The range of valid code points (the code space) for the Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 planes, identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters.

The following table includes examples of code points:

Character	Code point	Grapheme
Latin A	U+0041	A
Latin sharp S	U+00DF	ß
Han for East	U+6771	東
Ampersand	U+0026	&
Inverted exclamation mark	U+00A1	¡
Section sign	U+00A7	§
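The 'U+' notation and the plane of a character can be derived from its code point, as in this sketch. The function name describe is an assumption for the example; the plane number is the code point divided by 0x10000 (65,536), discarding the remainder.

```python
def describe(ch: str) -> str:
    """Format a character's code point in U+ notation with its plane number."""
    cp = ord(ch)
    plane = cp >> 16                  # 0 for the BMP, 1 to 16 for supplementary planes
    return f"U+{cp:04X} plane {plane}"

for ch in ["A", "\u00df", "\u6771", "\U00010400"]:
    print(ch, describe(ch))
```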

Example

Consider "ab̲c𐐀" – a string containing a Unicode combining character (U+0332 ̲ to underline the ⟨b⟩) as well as a supplementary character (U+10400 𐐀 ). This string has several Unicode representations which are logically equivalent, while each is suited to a different set of circumstances or range of requirements:

Four composed characters:
a, b̲, c, 𐐀
Five graphemes:
a, b, _, c, 𐐀
Five Unicode code points:
U+0061, U+0062, U+0332, U+0063, U+10400
Five UTF-32 code units (32-bit integer values):
0x00000061, 0x00000062, 0x00000332, 0x00000063, 0x00010400
Six UTF-16 code units (16-bit integers):
0x0061, 0x0062, 0x0332, 0x0063, 0xD801, 0xDC00
Nine UTF-8 code units (8-bit values, or bytes):
0x61, 0x62, 0xCC, 0xB2, 0x63, 0xF0, 0x90, 0x90, 0x80

Note in particular that 𐐀 is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the grapheme, it is not obvious how the actual numeric byte values are related.
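The representations listed above can be reproduced with a few lines of Python, assuming the standard codecs. The helper units is hypothetical; it simply slices the encoded bytes into fixed-size big-endian code units.

```python
text = "ab\u0332c\U00010400"

def units(encoded: bytes, size: int) -> list:
    """Split encoded bytes into integers of `size` bytes each (big-endian)."""
    return [int.from_bytes(encoded[i:i + size], "big")
            for i in range(0, len(encoded), size)]

code_points = [f"U+{ord(ch):04X}" for ch in text]   # Python strings iterate by code point
utf32 = units(text.encode("utf-32-be"), 4)
utf16 = units(text.encode("utf-16-be"), 2)
utf8 = units(text.encode("utf-8"), 1)

print(code_points)
print(len(utf32), len(utf16), len(utf8))            # number of code units in each form
```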

Transcoding

To support environments using multiple character encodings, software has been developed to translate text between character encoding schemes, a process known as transcoding. Notable software includes:

Web browser – Modern browsers feature automatic character encoding detection
iconv – Program and standardized API to convert encodings
luit – Program that converts encoding of input and output to programs running interactively
International Components for Unicode – A set of C and Java libraries for charset conversion
Encoding.Convert – .NET API^[15]
MultiByteToWideChar/WideCharToMultiByte – Windows API functions for converting between ANSI and Unicode^[16]^[17]
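In its simplest form, transcoding amounts to decoding bytes from the source encoding into code points and encoding those code points again in the target encoding. The following is a minimal sketch using Python's built-in codecs rather than any of the tools listed above; the function name transcode and the sample word are assumptions for illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one character encoding to another via code points."""
    return data.decode(source).encode(target)

latin1_bytes = b"Stra\xdfe"           # "Straße" in ISO 8859-1: ß is the single byte 0xDF
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")

print(utf8_bytes)                     # ß becomes the two bytes 0xC3 0x9F in UTF-8
```

A conversion like this raises an error when the target encoding cannot represent a character, so real converters usually let the caller choose a replacement or fallback policy.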

Common character encodings

The most used character encoding on the web is UTF-8, used in 98.9% of surveyed web sites, as of January 2026.^[2] In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options.^[3]^[18]

ISO 646
- ASCII
EBCDIC
ISO 8859:
- ISO 8859-1 Western Europe
- ISO 8859-2 Western and Central Europe
- ISO 8859-3 Western Europe and South European (Turkish, Maltese plus Esperanto)
- ISO 8859-4 Western Europe and Baltic countries (Lithuania, Estonia, Latvia and Lapp)
- ISO 8859-5 Cyrillic alphabet
- ISO 8859-6 Arabic
- ISO 8859-7 Greek
- ISO 8859-8 Hebrew
- ISO 8859-9 Western Europe with amended Turkish character set
- ISO 8859-10 Western Europe with rationalised character set for Nordic languages, including complete Icelandic set
- ISO 8859-11 Thai
- ISO 8859-13 Baltic languages plus Polish
- ISO 8859-14 Celtic languages (Irish Gaelic, Scottish, Welsh)
- ISO 8859-15 Added the Euro sign and other rationalisations to ISO 8859-1
- ISO 8859-16 Central, Eastern and Southern European languages (Albanian, Bosnian, Croatian, Hungarian, Polish, Romanian, Serbian and Slovenian, but also French, German, Italian and Irish Gaelic)
CP437, CP720, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866, CP869, CP872
MS-Windows character sets:
- Windows-1250 for Central European languages that use Latin script, (Polish, Czech, Slovak, Hungarian, Slovene, Serbian, Croatian, Bosnian, Romanian and Albanian)
- Windows-1251 for Cyrillic alphabets
- Windows-1252 for Western languages
- Windows-1253 for Greek
- Windows-1254 for Turkish
- Windows-1255 for Hebrew
- Windows-1256 for Arabic
- Windows-1257 for Baltic languages
- Windows-1258 for Vietnamese
Mac OS Roman
KOI8-R, KOI8-U, KOI-7
MIK
ISCII
TSCII
VISCII
JIS X 0208 is a widely deployed standard for Japanese character encoding that has several encoding forms.
- Shift JIS (Microsoft Code page 932 is a dialect of Shift_JIS)
- EUC-JP
- ISO-2022-JP
JIS X 0213 is an extended version of JIS X 0208.
- Shift JIS-2004
- EUC-JIS-2004
- ISO-2022-JP-2004
Chinese Guobiao
- GB 2312
- GBK (Microsoft Code page 936)
- GB 18030
Taiwan Big5 (a more famous variant is Microsoft Code page 950)
- Hong Kong HKSCS
Korean
- KS X 1001 is a Korean double-byte character encoding standard
- EUC-KR
- ISO-2022-KR
Unicode (and subsets thereof, such as the 16-bit 'Basic Multilingual Plane')
- UTF-8
- UTF-16
- UTF-32
ANSEL or ISO/IEC 6937

References

[1] "Character Encoding Definition". September 24, 2010. http://techterms.com/definition/characterencoding.
[2] "Usage Survey of Character Encodings broken down by Ranking". https://w3techs.com/technologies/cross/character_encoding/ranking.
[3] "Charset". https://developer.android.com/reference/java/nio/charset/Charset. "Android note: The Android platform default is always UTF-8."
[4] Tom Henderson (17 April 2014). "Ancient Computer Character Code Tables – and Why They're Still Relevant". Smartbear. https://blog.smartbear.com/development/ancient-computer-character-code-tables-and-why-theyre-still-relevant/.
[5] "IBM Electronic Data-Processing Machines Type 702 Preliminary Manual of Information". 1954. p. 80. https://www.bitsavers.org/pdf/ibm/702/22-6173-1_702prelim_Feb56.pdf.
[6] "UNIVAC System". http://www.bitsavers.org/pdf/univac/univac1/UnivacI_RefCard.pdf.
[7] Tom Jennings (20 April 2016). "An annotated history of some character codes". https://www.sr-ix.com/Archive/CharCodeHist/index.html.
[8] Strelho, Kevin (April 15, 1985). "IBM Drives Hard Disks to New Standards". InfoWorld (Popular Computing Inc.): pp. 29–33. https://books.google.com/books?id=zC4EAAAAMBAJ&pg=PA29.
[9] Shawn Steele (15 March 2005). "What's the difference between an Encoding, Code Page, Character Set and Unicode?". https://learn.microsoft.com/en-us/archive/blogs/shawnste/whats-the-difference-between-an-encoding-code-page-character-set-and-unicode.
[10] "Glossary of Unicode Terms". Unicode Consortium. https://unicode.org/glossary/.
[11] "VT510 Video Terminal Programmer Information". Digital Equipment Corporation (DEC). 7.1. Character Sets - Overview. http://www.vt100.net/docs/vt510-rm/chapter7.html#S7.1. "In addition to traditional DEC and ISO character sets, which conform to the structure and rules of ISO 2022, the VT510 supports a number of IBM PC code pages (page numbers in IBM's standard character set manual) in PCTerm mode to emulate the console terminal of industry-standard PCs."
[12] Whistler, Ken; Freytag, Asmus (2022-11-11). "UTR#17: Unicode Character Encoding Model". Unicode Consortium. https://www.unicode.org/reports/tr17/.
[13] "Chapter 3: Conformance". The Unicode Standard Version 15.0 – Core Specification. Unicode Consortium. September 2022. ISBN 978-1-936213-32-0. https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf.
[14] "Terminology (The Java Tutorials)". Oracle. https://docs.oracle.com/javase/tutorial/i18n/text/terminology.html.
[15] "Encoding.Convert Method". Microsoft .NET Framework Class Library. https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.convert?redirectedfrom=MSDN&view=net-6.0#overloads.
[16] "MultiByteToWideChar function (stringapiset.h)". 13 October 2021. https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar.
[17] "WideCharToMultiByte function (stringapiset.h)". 9 August 2022. https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte.
[18] Galloway, Matt (9 October 2012). "Character encoding for iOS developers. Or UTF-8 what now?" (in en). https://www.galloway.me.uk/2012/10/character-encoding-for-ios-developers-utf8/. "in reality, you usually just assume UTF-8 since that is by far the most common encoding."

Original source: https://en.wikipedia.org/wiki/Character encoding.

