UTF-8 vs. UTF-16: Differences, Trade-offs, and When to Use Each
Published March 15, 2025
Two ways to encode Unicode
UTF-8 and UTF-16 are both Unicode Transformation Formats — they encode the same set of Unicode code points into sequences of bytes. The difference is how they lay out those bytes. Neither is inherently “better”; each has trade-offs that make it more suitable for certain contexts.
UTF-8 byte structure
UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. The leading bits of the first byte indicate how many bytes follow:
| Bytes | Bit pattern | Code point range |
|---|---|---|
| 1 | 0xxxxxxx | U+0000 – U+007F |
| 2 | 110xxxxx 10xxxxxx | U+0080 – U+07FF |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800 – U+FFFF |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 – U+10FFFF |
Because the 1-byte form is identical to ASCII, any valid ASCII document is also valid UTF-8. This backward compatibility is UTF-8's greatest advantage.
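You can see the bit patterns from the table directly in Python; a small sketch that prints each character's UTF-8 bytes in binary (the characters are illustrative picks, one per width):

```python
# Print the UTF-8 byte layout for characters of each encoded width.
# The leading bits of the first byte match the table above.
for ch in ("A", "é", "€", "😀"):
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {bits}")
```

The 1-byte case prints `01000001` for "A" — exactly its ASCII encoding, which is the backward-compatibility property in action.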
UTF-16 byte structure
UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) are encoded directly as a single 16-bit code unit. Characters outside the BMP (U+10000–U+10FFFF) are encoded using a surrogate pair:
- High surrogate: a value in the range U+D800–U+DBFF
- Low surrogate: a value in the range U+DC00–U+DFFF
Together the pair encodes the supplementary code point. This means code that assumes “one character = one 16-bit unit” will break on emoji and many historic scripts.
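The surrogate-pair arithmetic is simple enough to write out; a sketch of the standard algorithm (the function name is mine, not from any library):

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (v >> 10)     # top 10 bits
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)   # 😀
print(hex(high), hex(low))            # 0xd83d 0xde00
```

Decoding reverses the arithmetic: subtract the surrogate bases, shift the high value back up by 10 bits, and add 0x10000.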
Byte-level example
Here's how three representative characters are encoded in each format (big-endian for UTF-16):
| Char | Code point | UTF-8 bytes | UTF-16 bytes |
|---|---|---|---|
| A | U+0041 | 41 (1 byte) | 00 41 (2 bytes) |
| € | U+20AC | E2 82 AC (3 bytes) | 20 AC (2 bytes) |
| 😀 | U+1F600 | F0 9F 98 80 (4 bytes) | D8 3D DE 00 (4 bytes) |
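The table is easy to reproduce with Python's codecs, which is a handy way to check byte-level claims like these ("utf-16-be" is the big-endian codec without a BOM):

```python
# Reproduce the byte-level table: encode each character both ways.
for ch in ("A", "€", "😀"):
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")   # big-endian, no BOM prepended
    print(f"{ch} U+{ord(ch):04X}: UTF-8 {u8.hex(' ')} | UTF-16 {u16.hex(' ')}")
```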
Size trade-offs
The right encoding depends on the text content:
- ASCII-heavy text (English, source code, markup): UTF-8 is ~50% smaller because ASCII characters take 1 byte instead of 2
- CJK-heavy text (Chinese, Japanese, Korean): UTF-16 can be smaller because most CJK characters take 2 bytes in UTF-16 but 3 in UTF-8
- Mixed text: depends on the ratio. Characters in U+0800–U+FFFF cost 3 bytes in UTF-8 but only 2 in UTF-16, so UTF-16 pulls ahead only when such characters outweigh the ASCII content
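The trade-off is easy to measure empirically; a quick comparison on two sample strings (the samples are arbitrary, chosen to be all-ASCII and all-CJK respectively):

```python
# Compare encoded sizes for ASCII-heavy vs CJK-heavy text.
samples = {
    "ascii": "hello world" * 10,       # all code points below U+0080
    "cjk": "こんにちは世界" * 10,        # all code points in U+0800–U+FFFF
}
for name, text in samples.items():
    n8 = len(text.encode("utf-8"))
    n16 = len(text.encode("utf-16-be"))
    print(f"{name}: UTF-8 {n8} bytes, UTF-16 {n16} bytes")
```

For the ASCII sample UTF-8 is half the size (1 byte per character vs 2); for the CJK sample the ratio flips to 3:2 in UTF-16's favor.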
BOM (Byte Order Mark)
UTF-16 comes in two byte orders: big-endian (UTF-16BE) and little-endian (UTF-16LE). A Byte Order Mark (U+FEFF) at the start of a file indicates which order is used. If the first two bytes are FE FF, the file is big-endian; if FF FE, it's little-endian.
UTF-8 can also have a BOM (EF BB BF), but it's unnecessary since UTF-8 has no byte-order ambiguity. The Unicode Standard discourages its use, and it can cause issues with tools that don't expect it (like shell scripts with a shebang line).
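BOM sniffing along these lines is a common first step when reading files of unknown encoding; a simplified sketch (the function name is mine, and real detectors also check for UTF-32 BOMs, which this version ignores):

```python
def sniff_encoding(data: bytes) -> str:
    """Guess a codec name from a leading BOM; default to UTF-8.
    Simplified: does not distinguish UTF-32LE (FF FE 00 00) from UTF-16LE."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # Python codec that strips the UTF-8 BOM
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    return "utf-8"

print(sniff_encoding(b"\xfe\xff\x00A"))   # utf-16-be
print(sniff_encoding(b"plain ascii"))     # utf-8
```

Note that Python's "utf-8-sig" codec exists precisely because BOM-prefixed UTF-8 files show up in the wild (Windows Notepad historically wrote them) even though the standard discourages them.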
Where each is used
- UTF-8: the web (HTML5 default), Linux/macOS file systems, JSON, TOML, YAML, Go, Rust, Python 3 (default source encoding), modern APIs
- UTF-16: JavaScript strings, Java char, .NET string, Windows APIs (Win32 wide-char functions), macOS NSString internals, ICU library
The trend is clear: new protocols and systems almost universally choose UTF-8. UTF-16 persists mainly in runtimes and operating systems that adopted Unicode before UTF-8 became dominant (JavaScript was designed in 1995, Java in 1996, Windows NT in 1993).