
From Morse Code to Unicode: A Brief History of Character Encoding

Published March 15, 2025

Before computers: telegraph codes

The need to encode text as signals predates computers by over a century. In the 1840s, Samuel Morse and Alfred Vail developed Morse code, which represented letters and digits as sequences of dots and dashes. Morse code was variable-length — common letters like E (a single dot) were shorter than rare ones like Q (dash-dash-dot-dash).

In 1870, Émile Baudot introduced the Baudot code, a fixed-length 5-bit encoding that could represent 32 symbols. A “shift” mechanism toggled between letters and figures, effectively doubling the character set. The later ITA2 standard (1930) refined Baudot code and was used in teleprinters well into the 20th century.

EBCDIC and the mainframe era

When IBM launched the System/360 in 1964, it introduced EBCDIC (Extended Binary Coded Decimal Interchange Code), an 8-bit encoding with 256 possible values. EBCDIC descended from punch card encodings, and its character layout reflected that heritage — letters were not contiguous, making string operations more complex than they needed to be.

Different EBCDIC code pages supported different languages, but they were mutually incompatible. A file created on one mainframe could display garbled text on another if the code pages didn't match. Despite its quirks, EBCDIC remained the standard for IBM mainframes for decades and is still used in some legacy systems today.

ASCII: the first universal standard

In 1963, the American Standards Association published ASCII (American Standard Code for Information Interchange). At just 7 bits, ASCII defined 128 characters: 33 control characters (like newline and tab) and 95 printable characters (letters, digits, punctuation, and space).

ASCII's design was deliberate. Uppercase and lowercase letters differ by a single bit, making case conversion trivial. Digits 0–9 map to values 0x30–0x39, so converting a digit character to its numeric value is a simple subtraction.
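Both properties are easy to verify directly. A minimal sketch in Python (the helper names are illustrative, not from any standard library):

```python
# ASCII layout tricks: upper- and lowercase letters differ only in
# bit 0x20, and the digits are contiguous starting at 0x30.

def to_lower(c: str) -> str:
    """Lowercase an ASCII letter by setting bit 0x20."""
    return chr(ord(c) | 0x20)

def digit_value(c: str) -> int:
    """Convert an ASCII digit character to its numeric value."""
    return ord(c) - 0x30

print(to_lower("A"))     # "a"  (0x41 | 0x20 == 0x61)
print(digit_value("7"))  # 7    (0x37 - 0x30)
```

Note that the bit trick only works for actual letters; real-world case conversion still has to check the input range first.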

As minicomputers and personal computers spread in the 1970s and 1980s, ASCII became the de facto standard. But it had a fundamental limitation: 128 characters are nowhere near enough for the world's writing systems.

The code page chaos

To support characters beyond ASCII, vendors and standards bodies created 8-bit extended ASCII encodings. The upper 128 values (128–255) were assigned differently depending on the encoding:

  • ISO 8859-1 (Latin-1) covered Western European languages with characters like é, ü, and ñ
  • Windows-1252 extended Latin-1 with smart quotes, em dashes, and the euro sign in the C1 control range
  • ISO 8859-5 covered Cyrillic, ISO 8859-6 covered Arabic, and so on — 15 parts in total
  • East Asian languages like Chinese, Japanese, and Korean required multi-byte encodings like Shift_JIS, EUC-JP, Big5, and GB2312

This fragmentation led to mojibake — garbled text caused by interpreting bytes in the wrong encoding. A document saved in Windows-1252 and opened as ISO 8859-1 would display curly quotes as garbage characters. It was a mess, and it was clear the world needed a single universal encoding.
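The curly-quote failure mode described above can be reproduced in a few lines. A sketch in Python: Windows-1252 puts curly quotes at bytes 0x93/0x94, a range ISO 8859-1 reserves for invisible C1 control characters.

```python
# Mojibake demo: bytes written as Windows-1252, read back as ISO 8859-1.
original = "\u201chello\u201d"       # "hello" wrapped in curly quotes
raw = original.encode("cp1252")      # curly quotes become bytes 0x93/0x94
garbled = raw.decode("latin-1")      # 0x93/0x94 decode to C1 control chars

print(raw)                  # b'\x93hello\x94'
print(garbled == original)  # False -- the quotes came back as controls
```

The bytes themselves are unchanged; only the interpretation differs, which is exactly why mojibake was so hard to diagnose.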

The birth of Unicode

In 1991, the Unicode Consortium published Unicode 1.0 with the ambitious goal of assigning a unique number (a code point) to every character in every writing system, living or dead.

The initial design assumed 16 bits (65,536 code points) would be enough. It wasn't. Unicode was later expanded to 21 bits, supporting over 1.1 million code points organized into 17 planes. As of Unicode 16.0, over 154,000 characters have been assigned, covering 168 scripts.
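Since each plane holds 2^16 code points, a character's plane is just its code point divided by 0x10000. A quick check in Python:

```python
# Which plane does a character live in?
# Each of the 17 planes spans 0x10000 code points.

def plane(ch: str) -> int:
    return ord(ch) >> 16  # equivalently, ord(ch) // 0x10000

print(plane("A"))           # 0 -- Basic Multilingual Plane
print(plane("\U0001F600"))  # 1 -- Supplementary Multilingual Plane (emoji)
```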

Unicode also standardized encoding forms: UTF-32 (a fixed 4 bytes per code point), UTF-16 (2 or 4 bytes), and UTF-8 (1 to 4 bytes). Each trades off simplicity against space efficiency.
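The trade-off shows up immediately when you encode the same string in each form. A sketch in Python (note that Python's `utf-16` and `utf-32` codecs prepend a byte-order mark):

```python
# The same 5-code-point string in each Unicode encoding form.
text = "héllo"

print(len(text.encode("utf-8")))   # 6  (1 byte per letter, 2 for é)
print(len(text.encode("utf-16")))  # 12 (2 bytes each, plus a 2-byte BOM)
print(len(text.encode("utf-32")))  # 24 (4 bytes each, plus a 4-byte BOM)
```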

UTF-8 wins the web

UTF-8 was designed by Ken Thompson and Rob Pike in 1992 at a New Jersey diner. Its killer feature is ASCII compatibility: any valid ASCII text is also valid UTF-8. This meant existing systems didn't need to change to adopt it.

UTF-8 uses 1 byte for ASCII characters, 2 bytes for most Latin and Cyrillic characters, 3 bytes for most CJK characters, and 4 bytes for emoji and historic scripts. This variable-width design makes it space-efficient for the text most commonly found on the web.
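These byte widths, and the ASCII-compatibility property, can be confirmed with a short Python check:

```python
# UTF-8 byte lengths by character class.
samples = {
    "A": 1,           # ASCII
    "é": 2,           # Latin with diacritic
    "д": 2,           # Cyrillic
    "漢": 3,          # CJK
    "\U0001F600": 4,  # emoji, outside the BMP
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected

# ASCII compatibility: pure-ASCII text encodes to identical bytes.
assert "hello".encode("ascii") == "hello".encode("utf-8")
print("all widths match")
```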

The rise was gradual but decisive. In 2008, UTF-8 overtook ASCII and Latin-1 as the most common encoding on the web. Today, over 98% of all web pages use UTF-8, and it is the default encoding for HTML5, JSON, TOML, YAML, and most modern programming languages and protocols.

The story of character encoding is one of hard-won convergence. After decades of incompatible systems, fragmented code pages, and mojibake, the world has largely settled on Unicode encoded as UTF-8 — a single standard that can represent every character humanity has ever written.