Unicode

Unicode: a universal character encoding standard

In the same way that the metric system standardized measurement, Unicode has set the standard for how computers represent characters.

This encoding is universal, reaching across web pages, text files, and every other kind of document. Unicode is not the only standard of its kind, however; another is ASCII.

ASCII, however, only supports English characters, capping out at 128 characters. Unicode, on the other hand, defines a code space of over 1,000,000 code points (1,114,112, to be exact), covering characters from languages across the world.

Another major difference is that ASCII only uses one byte per character, whereas Unicode encodings can use up to four bytes.
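A quick check makes the difference concrete. This is a minimal Python sketch; any language with Unicode strings would show the same thing:

    print(ord("A"))     # 65      -- inside ASCII's 128-character range
    print(ord("😀"))    # 128512  -- far beyond anything ASCII can encode

    print(len("A".encode("utf-8")))    # 1 byte
    print(len("😀".encode("utf-8")))   # 4 bytes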

Unicode itself can be encoded in several different forms. Two of the most common, however, are UTF-8 and UTF-16. Of those two, UTF-8 is the more common, and it is used on virtually all modern web pages and software platforms.

UTF-8 breaks characters into four groups, separated by how many bytes represent them: common English (ASCII) characters take one byte; accented Latin letters, along with Greek, Hebrew, and Arabic, take two bytes; most Chinese, Japanese, and Korean characters take three bytes; and everything else, including emoji and historic scripts, takes four bytes.
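One sample character from each group shows the byte counts directly. This is a Python sketch, and the sample characters are illustrative picks rather than anything special:

    # One character per UTF-8 byte-length group:
    for ch in ("A", "א", "中", "𐍈"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {encoded.hex(' ')}")

    # U+0041   1 byte(s)   41             (English letter A)
    # U+05D0   2 byte(s)   d7 90          (Hebrew alef)
    # U+4E2D   3 byte(s)   e4 b8 ad       (CJK ideograph)
    # U+10348  4 byte(s)   f0 90 8d 88    (Gothic hwair)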

UTF-16 grew out of an earlier attempt at a universal encoding: Unicode's original fixed-width, 16-bit design (UCS-2). That design nearly fell through once it became clear that 2^16 = 65,536 possibilities weren't enough to cover all the world's characters, and the Unicode Consortium would not move to the full 31-bit code space proposed for ISO 10646.

Thus UTF-16 was a compromise, but one with a hole in it: the surrogate pairs it uses for characters beyond U+FFFF permanently reserve the range U+D800 through U+DFFF, and unlike the later UTF-8, it is not backward-compatible with plain ASCII text.
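The compromise is easy to observe. In this Python sketch, a character beyond U+FFFF comes out of UTF-16 as two 16-bit units drawn from that reserved surrogate range:

    ch = "😀"                      # U+1F600, beyond the 16-bit limit
    units = ch.encode("utf-16-be")
    print(units.hex(" ", 2))       # d83d de00 -- a high/low surrogate pair,
                                   # both values inside U+D800-U+DFFF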
