What is the strangest unicode character

Character set / character sets

Computers and digital circuits can only store and process the values 0 and 1. Each character is therefore stored as a bit code, and the character set defines which character corresponds to which bit code. A character set is a collection of characters used to represent states and values. These include digits, letters, umlauts, punctuation marks, symbols, special characters, control characters and formula characters. There are around 100 important characters, for which 7 bits (128 code values) are sufficient.

  • Upper and lower case letters A-Z and a-z (52)
  • Digits from 0 to 9 (10)
  • Control and special characters, including the space (66)
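The mapping between characters and bit codes can be tried out directly. The following is a minimal Python sketch, not part of the original article; the helper name `bit_code` is chosen here for illustration only. It counts the three groups listed above across all 128 values of a 7-bit code (the last group comes out as 66 when the space character is counted among the special characters).

```python
import string

# All 128 code values that fit into 7 bits.
codes = range(128)

letters = [c for c in codes if chr(c) in string.ascii_letters]
digits = [c for c in codes if chr(c) in string.digits]
others = [c for c in codes if c not in letters and c not in digits]


def bit_code(char: str) -> str:
    """Return the 7-bit code of an ASCII character as a string of 0s and 1s."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} does not fit into 7 bits")
    return format(code, "07b")


print(len(letters), len(digits), len(others))  # 52 10 66
print(bit_code("A"))                           # 1000001
```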

US-ASCII

ASCII, the American Standard Code for Information Interchange, is the mother of all character sets. It was first published in 1963 and was designed for teleprinters. ASCII is a 7-bit character encoding that contains printable characters and control characters. It corresponds to the US variant of ISO 646 and serves as the basis for later character encodings that use more bits (ISO 8859 and Unicode).

yx      000      001      010      011      100      101      110      111
0000  NUL   0  DLE  16  SP   32  0    48  @    64  P    80  `    96  p   112
0001  SOH   1  DC1  17  !    33  1    49  A    65  Q    81  a    97  q   113
0010  STX   2  DC2  18  "    34  2    50  B    66  R    82  b    98  r   114
0011  ETX   3  DC3  19  #    35  3    51  C    67  S    83  c    99  s   115
0100  EOT   4  DC4  20  $    36  4    52  D    68  T    84  d   100  t   116
0101  ENQ   5  NAK  21  %    37  5    53  E    69  U    85  e   101  u   117
0110  ACK   6  SYN  22  &    38  6    54  F    70  V    86  f   102  v   118
0111  BEL   7  ETB  23  '    39  7    55  G    71  W    87  g   103  w   119
1000  BS    8  CAN  24  (    40  8    56  H    72  X    88  h   104  x   120
1001  HT    9  EM   25  )    41  9    57  I    73  Y    89  i   105  y   121
1010  LF   10  SUB  26  *    42  :    58  J    74  Z    90  j   106  z   122
1011  VT   11  ESC  27  +    43  ;    59  K    75  [    91  k   107  {   123
1100  FF   12  FS   28  ,    44  <    60  L    76  \    92  l   108  |   124
1101  CR   13  GS   29  -    45  =    61  M    77  ]    93  m   109  }   125
1110  SO   14  RS   30  .    46  >    62  N    78  ^    94  n   110  ~   126
1111  SI   15  US   31  /    47  ?    63  O    79  _    95  o   111  DEL 127

Columns 000 and 001: control characters. Columns 010 to 111: characters.

.   0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
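The position of a character in the first table follows directly from its code: the upper 3 bits select the column and the lower 4 bits select the row. A small Python sketch of this split, with the hypothetical helper name `table_position`:

```python
def table_position(char: str) -> tuple[str, str]:
    """Return (column bits, row bits) of an ASCII character in the table."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} is not an ASCII character")
    column = code >> 4    # upper 3 bits of the 7-bit code
    row = code & 0x0F     # lower 4 bits of the 7-bit code
    return format(column, "03b"), format(row, "04b")


print(table_position("A"))  # ('100', '0001')
print(table_position("z"))  # ('111', '1010')
```

The same split explains the second table: the row is the high hexadecimal digit of the code and the column is the low hexadecimal digit.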

ASCII's printable characters include the upper and lower case Latin alphabet, the ten Arabic numerals, and some punctuation marks. The set of characters largely corresponds to that of a keyboard or typewriter for the English language. Computers and other electronic devices that display text usually store it either as ASCII or in a backwards-compatible encoding (for example ISO 8859-1, also known as Latin-1, or Unicode).

Note: Computers typically process data in units of 8 bits, not 7. ASCII uses only 7 bits because it was originally designed for data transmission. An 8th bit was added to every ASCII code as a check bit (parity bit), which provided simple error detection during transmission.
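How such a check bit detects an error can be sketched in a few lines of Python. The article does not say which parity scheme was used, so this example makes two assumptions: even parity, and the check bit stored as the most significant bit of the byte. The function names are illustrative only.

```python
def add_even_parity(code: int) -> int:
    """Extend a 7-bit code to 8 bits so that the number of 1 bits is even."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit value")
    parity = bin(code).count("1") % 2   # 1 if the count of 1 bits is odd
    return code | (parity << 7)         # assumption: check bit is bit 7


def parity_ok(byte: int) -> bool:
    """Return True if the 8-bit value contains an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))   # 'C' = 1000011 has three 1 bits
print(format(sent, "08b"))         # 11000011
print(parity_ok(sent))             # True
print(parity_ok(sent ^ 0b100))     # False: one flipped bit is detected
```

A single check bit only detects an odd number of flipped bits; two flipped bits cancel out and go unnoticed.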

Example: "Hallo Welt" (German for "Hello world")

Text                   H        a        l        l        o        SP       W        e        l        t
Bit sequence (7 bit)   1001000  1100001  1101100  1101100  1101111  0100000  1010111  1100101  1101100  1110100
ASCII code (decimal)   072      097      108      108      111      032      087      101      108      116
ASCII code (hex)       48       61       6C       6C       6F       20       57       65       6C       74
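The three rows of codes in this example can be reproduced with a short Python sketch. The codes in the table spell the German text "Hallo Welt", so that string is used here; the variable names are illustrative.

```python
text = "Hallo Welt"
data = text.encode("ascii")   # one byte per character, each value below 128

bits = " ".join(format(b, "07b") for b in data)
decimal = " ".join(format(b, "03d") for b in data)
hexadecimal = " ".join(format(b, "02X") for b in data)

print(bits)
print(decimal)      # 072 097 108 108 111 032 087 101 108 116
print(hexadecimal)  # 48 61 6C 6C 6F 20 57 65 6C 74
```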

Over time, ASCII was extended, for example with umlauts or box-drawing characters (256 characters with 8 bits). Unfortunately, no uniform standard ever emerged for these extensions. This is why special characters can be displayed incorrectly when such extended ASCII files are exchanged.
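The effect can be demonstrated with two 8-bit extensions that ship with Python as codecs: `cp1252` (Windows) and `cp437` (IBM PC). This is only an illustration with an arbitrarily chosen sample word; the same bytes are written with one extension and read with the other.

```python
original = "Grüße"

# Write the text with one 8-bit extension of ASCII ...
raw = original.encode("cp1252")
print(raw)                     # b'Gr\xfc\xdfe'

# ... and read the same bytes back with the matching and a different one.
print(raw.decode("cp1252"))    # Grüße
print(raw.decode("cp437"))     # the ASCII letters survive, the others do not
```

Only the bytes above 127 are affected, because both extensions agree with ASCII in the lower half.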

ISO 8859

ISO 8859 comprises 15 different 8-bit character sets that are based on ASCII and have been expanded to include additional country-specific characters for different language areas. The first 128 characters (0 to 127) match ASCII. The remaining 128 characters (128 to 255) are assigned differently.
ISO/IEC no longer actively develops the ISO 8859 standards. 256 characters are simply not enough to represent all characters in international use. ISO 8859 is being replaced by the Unicode standard, which is becoming more and more widespread, especially in the form of the UTF-8 encoding.

  • ISO 8859-1: Latin-1, Western European
  • ISO 8859-2: Latin-2, Central European
  • ISO 8859-3: Latin-3, South European
  • ISO 8859-4: Latin-4, Northern European
  • ISO 8859-5: Cyrillic
  • ISO 8859-6: Arabic
  • ISO 8859-7: Greek
  • ISO 8859-8: Hebrew
  • ISO 8859-9: Latin-5, Turkish
  • ISO 8859-10: Latin-6, Nordic
  • ISO 8859-11: Thai
  • ISO 8859-12: was planned but never published
  • ISO 8859-13: Latin-7, Baltic
  • ISO 8859-14: Latin-8, Celtic
  • ISO 8859-15: Latin-9, Western European
  • ISO 8859-16: Latin-10, South-Eastern European

The character sets registered under the names ISO-8859-x (with a hyphen) are closely related to the ISO 8859-x standards. They differ only in that they additionally assign non-displayable control characters to the positions that ISO 8859 leaves free.
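That the lower half is shared while the upper half differs between the parts can be checked with Python's built-in codecs. The sketch below uses four of the parts listed above; the helper name `decode_in_parts` and the choice of sample bytes are illustrative.

```python
ISO_PARTS = ("iso8859_1", "iso8859_5", "iso8859_7", "iso8859_15")


def decode_in_parts(value: int) -> dict[str, str]:
    """Decode one byte value with several parts of ISO 8859."""
    return {part: bytes([value]).decode(part) for part in ISO_PARTS}


# 0x41 lies in the ASCII range and is 'A' in every part.
print(decode_in_parts(0x41))

# 0xE4 lies in the upper half and depends on the part.
print(decode_in_parts(0xE4))
# {'iso8859_1': 'ä', 'iso8859_5': 'ф', 'iso8859_7': 'δ', 'iso8859_15': 'ä'}

# Latin-1 and Latin-9 differ in only a few positions, for example 0xA4.
print(bytes([0xA4]).decode("iso8859_1"), bytes([0xA4]).decode("iso8859_15"))  # ¤ €
```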

Unicode

Unicode is an international standard that defines a digital code for every character or text element of all known writing cultures and character systems. It is intended to eliminate the use of different, incompatible codes and character sets in different countries and cultures. The goal is one character set for all languages and all characters.
ISO/IEC 10646 is the ISO designation for the same character set. ISO calls it the Universal Character Set (UCS).
Every Unicode character has a stable code (e.g. U+00DF), a stable, valid character name (e.g. LATIN SMALL LETTER SHARP S), a sample representation (e.g. "ß") and documented character properties (e.g. letter, lower case).
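These properties can be looked up with Python's standard `unicodedata` module, which exposes the Unicode character database. A short sketch for the character "ß":

```python
import unicodedata

char = "ß"

print(f"U+{ord(char):04X}")          # U+00DF
print(unicodedata.name(char))        # LATIN SMALL LETTER SHARP S
print(unicodedata.category(char))    # Ll (letter, lower case)
```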

Unicode is constantly being supplemented with additional characters. The character set breaks the old 8-bit limit of ISO 8859, so Unicode text is stored in encodings based on 8-, 16- or 32-bit units. The first 256 Unicode characters correspond to ISO 8859-1, and the first 128 of these in turn correspond to ASCII. UTF-8 is byte-compatible with ASCII only: every ASCII file is also a valid UTF-8 file, whereas the Latin-1 characters from 128 to 255 occupy two bytes in UTF-8.

  • UTF-8 (8-bit Unicode): 1 to 4 bytes per character
  • UTF-16 (16-bit Unicode): 2 or 4 bytes per character; UCS-2 is the older fixed 2-byte form
  • UTF-32 (32-bit Unicode) / UCS-4: always 4 bytes per character
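How much memory a character needs in each of the three encodings can be measured directly. In this Python sketch the little-endian codec variants are used so that no byte order mark is counted; the helper name `encoded_lengths` and the sample characters are illustrative.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(char: str) -> dict[str, int]:
    """Return the number of bytes one character occupies per encoding."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}


print(encoded_lengths("A"))           # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_lengths("ß"))           # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_lengths("€"))           # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_lengths("\U0001F600"))  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only UTF-32 uses the same number of bytes for every character; UTF-8 and UTF-16 are variable-length encodings.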

In UTF-8, the leading bits of every byte follow fixed bit patterns. They show how many bytes belong to a character, and they can also be used to recognize whether a text file is UTF-8-coded.

Bit pattern                              Character (hex)     Memory per character   Number of characters
0xxx xxxx                                0 ... 7F            7 bit (ASCII)          128
110x xxxx 10xx xxxx                      80 ... 7FF          2 bytes                2,048
1110 xxxx 10xx xxxx 10xx xxxx            800 ... FFFF        3 bytes                65,536
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx  10000 ... 10FFFF    4 bytes                2,097,152
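The bit patterns in the table translate almost line by line into code. The following Python sketch builds the UTF-8 bytes for a code point by hand and is meant only to illustrate the patterns; in practice `str.encode("utf-8")` does the same job. The function name `utf8_encode` is chosen for this example.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 using the bit patterns."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:       # 0xxx xxxx
        return bytes([code_point])
    if code_point <= 0x7FF:      # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:     # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x41).hex())     # 41
print(utf8_encode(0xDF).hex())     # c39f
print(utf8_encode(0x20AC).hex())   # e282ac
```

Every continuation byte starts with the bits 10, so a decoder can always tell the first byte of a character from the bytes that follow it.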

Other character sets

Before Unicode was standardized, there were several other character sets.

  • various IBM code pages, e.g. CP437 (8 bit)
  • various Windows code pages, e.g. Windows-1252 (8 bit)
  • Mac, e.g. Mac OS Roman
  • EBCDIC (IBM mainframe)
  • Big5 (Traditional Chinese), GB 2312 (Simplified Chinese)
  • JIS 0208, JIS 0212 (Japanese)
  • Korean and Vietnamese character sets
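Several of these character sets are still available as codecs in Python's standard library, which makes the incompatibility easy to see. The sketch below encodes the same character with different codecs and prints the resulting bytes; the codec names `cp437`, `cp1252`, `mac_roman` and `cp500` (one of several EBCDIC code pages) are Python's names and are used here as stand-ins for the entries in the list. The helper name `hex_in` is illustrative.

```python
def hex_in(char: str, codec: str) -> str:
    """Return the bytes of one character in the given codec as hex digits."""
    return char.encode(codec).hex()


# The same letter gets a different code in each 8-bit character set.
for codec in ("cp437", "cp1252", "mac_roman"):
    print(codec, hex_in("ä", codec))

# EBCDIC is not even compatible with ASCII for plain letters.
print("ascii", hex_in("A", "ascii"))   # ascii 41
print("cp500", hex_in("A", "cp500"))   # cp500 c1

# East Asian character sets use two bytes for most characters.
for codec in ("big5", "gb2312"):
    print(codec, hex_in("中", codec))
```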
