character.codes

UTF-8 vs. UTF-16: Differences, Trade-offs, and When to Use Each

Published March 15, 2025

Two ways to encode Unicode

UTF-8 and UTF-16 are both Unicode Transformation Formats — they encode the same set of Unicode code points into sequences of bytes. The difference is how they lay out those bytes. Neither is inherently “better”; each has trade-offs that make it more suitable for certain contexts.

UTF-8 byte structure

UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. The leading bits of the first byte indicate how many bytes follow:

Bytes   Bit pattern                           Code point range
1       0xxxxxxx                              U+0000 – U+007F
2       110xxxxx 10xxxxxx                     U+0080 – U+07FF
3       1110xxxx 10xxxxxx 10xxxxxx            U+0800 – U+FFFF
4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   U+10000 – U+10FFFF

Because the 1-byte form is identical to ASCII, any valid ASCII document is also valid UTF-8. This backward compatibility is UTF-8's greatest advantage.
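This is easy to verify with Python's built-in codecs. The sketch below shows that pure ASCII text produces identical bytes under both codecs, and that a character just above the ASCII range falls into the 2-byte form:

```python
# ASCII bytes are valid UTF-8 as-is: encoding pure ASCII text
# with either codec yields the same byte sequence.
ascii_text = "Hello, world!"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character uses the multi-byte forms: U+00E9 (é)
# falls in the 2-byte range U+0080–U+07FF.
print("é".encode("utf-8").hex(" "))  # c3 a9
```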

UTF-16 byte structure

UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) are encoded directly as a single 16-bit code unit. Characters outside the BMP (U+10000–U+10FFFF) are encoded using a surrogate pair:

  • High surrogate: a value in the range U+D800 – U+DBFF
  • Low surrogate: a value in the range U+DC00 – U+DFFF

Together the pair encodes the supplementary code point. This means code that assumes “one character = one 16-bit unit” will break on emoji and many historic scripts.
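The arithmetic behind a surrogate pair is straightforward: subtract 0x10000 to get a 20-bit value, then split it into two 10-bit halves. A minimal sketch (the function name `to_surrogate_pair` is ours, not a standard API):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000–U+10FFFF)
    into a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low

# U+1F600 (the grinning-face emoji) splits into D83D DE00
print(tuple(hex(u) for u in to_surrogate_pair(0x1F600)))  # ('0xd83d', '0xde00')
```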

Byte-level example

Here's how three representative characters are encoded in each format (big-endian for UTF-16):

Char   Code point   UTF-8 bytes             UTF-16 bytes
A      U+0041       41 (1 byte)             00 41 (2 bytes)
€      U+20AC       E2 82 AC (3 bytes)      20 AC (2 bytes)
😀     U+1F600      F0 9F 98 80 (4 bytes)   D8 3D DE 00 (4 bytes)
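You can reproduce this table from Python, where the `utf-16-be` codec gives big-endian output without a BOM:

```python
# Print each character's code point and its bytes in both encodings.
for ch in ["A", "€", "😀"]:
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")  # big-endian, no BOM prepended
    print(f"U+{ord(ch):04X}  utf-8: {u8.hex(' ')}  utf-16: {u16.hex(' ')}")
```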

Size trade-offs

The right encoding depends on the text content:

  • ASCII-heavy text (English, source code, markup): UTF-8 is ~50% smaller because ASCII characters take 1 byte instead of 2
  • CJK-heavy text (Chinese, Japanese, Korean): UTF-16 can be smaller because most CJK characters take 2 bytes in UTF-16 but 3 in UTF-8
  • Mixed text: depends on the ratio. Characters below U+0800 are never larger in UTF-8, characters in U+0800–U+FFFF cost 3 bytes in UTF-8 versus 2 in UTF-16, and supplementary characters cost 4 bytes in both, so UTF-16 wins only when characters in that middle range dominate
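The trade-off is easy to measure directly. This sketch compares the encoded sizes of an English sentence and a Japanese one (the sample strings are arbitrary; `utf-16-le` is used so no BOM inflates the count):

```python
english = "The quick brown fox jumps over the lazy dog"
japanese = "素早い茶色の狐がのんびりした犬を飛び越える"

for label, text in [("English", english), ("Japanese", japanese)]:
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # -le/-be variants omit the BOM
    print(f"{label}: utf-8={u8} bytes, utf-16={u16} bytes")
```

The English line comes out half the size in UTF-8, while the Japanese line (all characters in U+0800–U+FFFF) is smaller in UTF-16.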

BOM (Byte Order Mark)

UTF-16 comes in two byte orders: big-endian (UTF-16BE) and little-endian (UTF-16LE). A Byte Order Mark (U+FEFF) at the start of a file indicates which order is used. If the first two bytes are FE FF, the file is big-endian; if FF FE, it's little-endian.

UTF-8 can also have a BOM (EF BB BF), but it's unnecessary since UTF-8 has no byte-order ambiguity. The Unicode Standard discourages its use, and it can cause issues with tools that don't expect it (like shell scripts with a shebang line).

Where each is used

  • UTF-8: the web (HTML5 default), Linux/macOS file systems, JSON, TOML, YAML, Go, Rust, Python 3 (default source encoding), modern APIs
  • UTF-16: JavaScript strings, Java char, .NET string, Windows APIs (Win32 wide-char functions), macOS NSString internals, ICU library

The trend is clear: new protocols and systems almost universally choose UTF-8. UTF-16 persists mainly in runtimes and operating systems that adopted Unicode before UTF-8 became dominant (JavaScript was designed in 1995, Java in 1996, Windows NT in 1993).