UTF-8 vs. UTF-16: Differences, Trade-offs, and When to Use Each
Published March 15, 2025
Two ways to encode Unicode
UTF-8 and UTF-16 are both Unicode Transformation Formats — they encode the same set of Unicode code points into sequences of bytes. The difference is how they lay out those bytes. Neither is inherently “better”; each has trade-offs that make it more suitable for certain contexts.
UTF-8 byte structure
UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. The leading bits of the first byte indicate how many bytes follow:
| Bytes | Bit pattern | Code point range |
|---|---|---|
| 1 | 0xxxxxxx | U+0000 – U+007F |
| 2 | 110xxxxx 10xxxxxx | U+0080 – U+07FF |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800 – U+FFFF |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 – U+10FFFF |
Because the 1-byte form is identical to ASCII, any valid ASCII document is also valid UTF-8. This backward compatibility is UTF-8's greatest advantage.
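You can see the bit patterns from the table directly in Python; a small sketch that prints each character's UTF-8 bytes in binary (the characters are illustrative picks, one per width):

```python
# Print the UTF-8 byte layout for characters of each encoded width.
# The leading bits of the first byte match the table above.
for ch in ("A", "é", "€", "😀"):
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {bits}")
```

The 1-byte case prints `01000001` for "A" — exactly its ASCII encoding, which is the backward-compatibility property in action.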
UTF-16 byte structure
UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) are encoded directly as a single 16-bit code unit. Characters outside the BMP (U+10000–U+10FFFF) are encoded using a surrogate pair:
- High surrogate: a value in the range U+D800–U+DBFF
- Low surrogate: a value in the range U+DC00–U+DFFF
Together the pair encodes the supplementary code point. This means code that assumes “one character = one 16-bit unit” will break on emoji and many historic scripts.
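The surrogate-pair arithmetic is simple enough to write out; a sketch of the standard algorithm (the function name is mine, not from any library):

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (v >> 10)     # top 10 bits
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)   # 😀
print(hex(high), hex(low))            # 0xd83d 0xde00
```

Decoding reverses the arithmetic: subtract the surrogate bases, shift the high value back up by 10 bits, and add 0x10000.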
Byte-level example
Here's how three representative characters are encoded in each format (big-endian for UTF-16):
| Char | Code point | UTF-8 bytes | UTF-16 bytes |
|---|---|---|---|
| A | U+0041 | 41 (1 byte) | 00 41 (2 bytes) |
| € | U+20AC | E2 82 AC (3 bytes) | 20 AC (2 bytes) |
| 😀 | U+1F600 | F0 9F 98 80 (4 bytes) | D8 3D DE 00 (4 bytes) |
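The table is easy to reproduce with Python's codecs, which is a handy way to check byte-level claims like these ("utf-16-be" is the big-endian codec without a BOM):

```python
# Reproduce the byte-level table: encode each character both ways.
for ch in ("A", "€", "😀"):
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")   # big-endian, no BOM prepended
    print(f"{ch} U+{ord(ch):04X}: UTF-8 {u8.hex(' ')} | UTF-16 {u16.hex(' ')}")
```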
Size trade-offs
The right encoding depends on the text content:
- ASCII-heavy text (English, source code, markup): UTF-8 is ~50% smaller because ASCII characters take 1 byte instead of 2
- CJK-heavy text (Chinese, Japanese, Korean): UTF-16 can be smaller because most CJK characters take 2 bytes in UTF-16 but 3 in UTF-8
- Mixed text: depends on the ratio. Characters in U+0800–U+FFFF cost 3 bytes in UTF-8 but only 2 in UTF-16, so UTF-16 pulls ahead only when such characters outweigh the ASCII content
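The trade-off is easy to measure empirically; a quick comparison on two sample strings (the samples are arbitrary, chosen to be all-ASCII and all-CJK respectively):

```python
# Compare encoded sizes for ASCII-heavy vs CJK-heavy text.
samples = {
    "ascii": "hello world" * 10,       # all code points below U+0080
    "cjk": "こんにちは世界" * 10,        # all code points in U+0800–U+FFFF
}
for name, text in samples.items():
    n8 = len(text.encode("utf-8"))
    n16 = len(text.encode("utf-16-be"))
    print(f"{name}: UTF-8 {n8} bytes, UTF-16 {n16} bytes")
```

For the ASCII sample UTF-8 is half the size (1 byte per character vs 2); for the CJK sample the ratio flips to 3:2 in UTF-16's favor.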
BOM (Byte Order Mark)
UTF-16 comes in two byte orders: big-endian (UTF-16BE) and little-endian (UTF-16LE). A Byte Order Mark (U+FEFF) at the start of a file indicates which order is used. If the first two bytes are FE FF, the file is big-endian; if FF FE, it's little-endian.
UTF-8 can also have a BOM (EF BB BF), but it's unnecessary since UTF-8 has no byte-order ambiguity. The Unicode Standard discourages its use, and it can cause issues with tools that don't expect it (like shell scripts with a shebang line).
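BOM sniffing along these lines is a common first step when reading files of unknown encoding; a simplified sketch (the function name is mine, and real detectors also check for UTF-32 BOMs, which this version ignores):

```python
def sniff_encoding(data: bytes) -> str:
    """Guess a codec name from a leading BOM; default to UTF-8.
    Simplified: does not distinguish UTF-32LE (FF FE 00 00) from UTF-16LE."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # Python codec that strips the UTF-8 BOM
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    return "utf-8"

print(sniff_encoding(b"\xfe\xff\x00A"))   # utf-16-be
print(sniff_encoding(b"plain ascii"))     # utf-8
```

Note that Python's "utf-8-sig" codec exists precisely because BOM-prefixed UTF-8 files show up in the wild (Windows Notepad historically wrote them) even though the standard discourages them.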
Where each is used
- UTF-8: the web (HTML5 default), Linux/macOS file systems, JSON, TOML, YAML, Go, Rust, Python 3 (default source encoding), modern APIs
- UTF-16: JavaScript strings, Java char, .NET string, Windows APIs (Win32 wide-char functions), macOS NSString internals, ICU library
The trend is clear: new protocols and systems almost universally choose UTF-8. UTF-16 persists mainly in runtimes and operating systems that adopted Unicode before UTF-8 became dominant (JavaScript was designed in 1995, Java in 1996, Windows NT in 1993).