Sorting out Windows text encodings and line endings - Shift_JIS / UTF-8 / UTF-16, mojibake, CRLF / LF, and why it gets confusing

Tags: Windows, Text Encodings, Mojibake, Line Endings, UTF-8, CP932, PowerShell, Unicode

What to get straight first

A text file is not the string itself; it is bytes + encoding + line endings. A BOM may also be attached.

Most trouble falls into one of three buckets:

  1. The same byte sequence was read as a different encoding -> mojibake
  2. The two sides disagree on what marks a line break -> line-ending trouble (even when the encoding is correct)
  3. The misread content was saved as-is -> data corruption (generally not recoverable)

Just separating these three speeds up root-cause analysis dramatically.

Sorting out the vocabulary – fix the wording first

The same character is a different byte sequence in each encoding

Character: あ

UTF-8     : E3 81 82
CP932     : 82 A0
UTF-16LE  : 42 30

A character and a byte sequence are different things. Accidents happen at the boundary where one is converted to the other.
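
The table above can be reproduced in a few lines. This is a minimal Python sketch; the variable names are only for illustration.

```python
# Encode the same character with three encodings and show the bytes as hex.
text = "あ"
encodings = ["utf-8", "cp932", "utf-16-le"]

hex_by_encoding = {enc: text.encode(enc).hex(" ").upper() for enc in encodings}

for enc, hex_bytes in hex_by_encoding.items():
    print(f"{enc:10}: {hex_bytes}")
```

The string is the same in every case; only the serialization step differs.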

Common terminology traps on Windows

| Term | What it actually means | Watch out |
|---|---|---|
| ANSI | The active code page on that machine (CP932 in a Japanese environment) | It is not ASCII. It is environment-dependent |
| Unicode (as shown in menus) | Often refers to UTF-16LE | “Save as Unicode” may mean UTF-16LE. It is not necessarily UTF-8 |
| Shift_JIS | In casual usage, often means CP932 | Strictly speaking, CP932 is Microsoft's extended variant. Characters such as ① exist in CP932 but not in strict Shift_JIS, so it can differ from shift_jis on the Linux side |
| UTF-8N | UTF-8 no BOM (an editor-specific label) | It is not an official name |
| text | Different people assume different things | A spec should be specific, e.g. “UTF-8 no BOM, LF” |

Mismatched labels create confusion: a conversation can sound like it agrees while the actual bytes do not.
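
One way to see that "Shift_JIS" and CP932 are not interchangeable labels is to encode a character that only the Microsoft variant defines. The sketch below uses Python's `cp932` and `shift_jis` codecs as stand-ins for "the Windows side" and "a strict Shift_JIS implementation"; other tools may draw the line slightly differently.

```python
# "①" (circled digit one) is a CP932 extension, not part of strict Shift_JIS.
text = "①"

cp932_bytes = text.encode("cp932")

try:
    text.encode("shift_jis")
    in_shift_jis = True
except UnicodeEncodeError:
    # Python's shift_jis codec does not define this character.
    in_shift_jis = False

print(cp932_bytes.hex(" ").upper(), in_shift_jis)
```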

What mojibake really is – a simple mechanism

original string -> serialized as bytes with encoding A -> read back as a string with encoding B -> assumptions disagree -> mojibake
  • Reading the UTF-8 byte sequence for あ (E3 81 82) as CP932 produces something like 縺 plus a stray byte instead of あ
  • At this point the bytes are not damaged yet (only the way of reading them was wrong)
  • What is dangerous is saving the broken-looking content as-is -> the original byte sequence is lost

Display glitch vs data corruption

| Stage | State | Recoverable? |
|---|---|---|
| A UTF-8 file was just opened as CP932 | Display is broken but bytes are intact | Recoverable (reopen with the correct encoding) |
| The broken-looking content was saved | The bytes have been overwritten with different ones | Generally not recoverable |
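
The two stages can be simulated in memory. The following is a rough Python sketch, assuming the reader substitutes a replacement character for bytes it cannot decode (`errors="replace"`), which is what many editors effectively do.

```python
original = "あいう"
raw = original.encode("utf-8")  # the bytes "on disk"

# Stage 1: the bytes are only misread. Nothing on disk has changed,
# so decoding again with the right encoding recovers the text.
misread = raw.decode("cp932", errors="replace")
recovered = raw.decode("utf-8")

# Stage 2: the misread text is saved. The file now holds different bytes.
saved = misread.encode("utf-8")

# Trying to undo the save by reversing the wrong decode does not
# give the original bytes back, because an undecodable byte was replaced.
undo_attempt = saved.decode("utf-8").encode("cp932", errors="replace")

print(recovered == original)  # stage 1 is recoverable
print(undo_attempt == raw)    # stage 2 is not, in this case
```

Whether a reverse conversion works depends on whether every byte happened to survive the misread, which is why "generally not recoverable" is the safe assumption.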

Line endings – a separate problem from the encoding

Line breaks are also bytes.

| Line ending | Bytes | Where it lives |
|---|---|---|
| CRLF | 0D 0A | Traditional Windows text files, legacy tools |
| LF | 0A | Linux / macOS / modern dev tooling |
| CR | 0D | Classic Mac (rarely seen today) |

The key point is that the encoding and the line ending are independent.

UTF-8 + LF       : 41 0A 42          <- "A\nB"
UTF-8 + CRLF     : 41 0D 0A 42       <- "A\r\nB"
CP932 + LF       : possible
UTF-16LE + CRLF  : possible

“I switched to UTF-8 and it still does not match” is often a case where the encoding is right but only the line ending is off.
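
Because line breaks are just bytes, they can be counted without decoding anything. A minimal Python sketch follows; `count_line_endings` is a hypothetical helper, and it assumes an ASCII-compatible encoding such as UTF-8 or CP932 (it would miscount UTF-16 data, where each line-break character occupies two bytes).

```python
def count_line_endings(data: bytes) -> dict[str, int]:
    """Count CRLF, bare LF, and bare CR in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR not followed by LF
    }


print(count_line_endings("A\nB".encode("utf-8")))
print(count_line_endings("A\r\nB".encode("utf-8")))
print(count_line_endings("あ\r\nい\n".encode("cp932")))  # mixed endings
```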

Why Windows is especially confusing

1. Unicode and code pages coexist

Multiple text cultures live side by side on Windows:

  • UTF-8 (modern, web, cross-platform)
  • CP932 (legacy CSV, TXT, logs, business-system integrations)
  • UTF-16LE (the wide-character Win32 APIs, in-memory strings of many apps)

A single Windows machine carries three different assumptions at once.

2. Tools quietly change the assumptions

Even when you do not specify anything, the layers below add their own assumptions:

  • Editor auto-detection
  • Automatic BOM addition / removal on save
  • Git’s CRLF <-> LF conversion (core.autocrlf)
  • The console code page (chcp)
  • PowerShell’s default encoding, which differs by version
  • An unexpected code page on CSV export

“It broke and I did not change anything” usually means a tool’s default changed.
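
Library defaults are one of those hidden layers. As an illustration, this Python sketch prints the encodings the interpreter will use when nothing is specified; the values depend on the machine and the Python version, so no particular output is assumed.

```python
import locale
import sys

# The encoding open() uses when no encoding= argument is given.
default_file_encoding = locale.getpreferredencoding(False)

# The encoding used when printing to the console or to a pipe.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("open() default :", default_file_encoding)
print("stdout         :", stdout_encoding)
```

Running this on two machines that "behave differently" is a quick way to check whether a default, rather than the code, is what differs.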

3. Problems hide in the ASCII range

UTF-8 is ASCII-compatible, so a file that contains only ASCII alphanumerics will “happen to work”. The trouble shows up the moment a single Japanese line is added.
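
A small Python sketch of this effect, using made-up CSV content: ASCII-only bytes decode identically under both assumptions, and the difference only appears once non-ASCII text is present.

```python
ascii_only = b"id,name\n1,alice\n"
with_japanese = "id,name\n1,山田\n".encode("utf-8")

# ASCII-only data reads the same whichever encoding the reader assumes.
ascii_reads_the_same = ascii_only.decode("utf-8") == ascii_only.decode("cp932")

# With Japanese text, the wrong assumption produces a different string.
japanese_reads_the_same = (
    with_japanese.decode("utf-8")
    == with_japanese.decode("cp932", errors="replace")
)

print(ascii_reads_the_same, japanese_reads_the_same)
```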

Common failure patterns

| Situation | What is mismatched | Symptom |
|---|---|---|
| Opening a UTF-8 no BOM file in a legacy tool | Reader misdetects it as CP932 | Only the Japanese is mojibake |
| Feeding a CP932 CSV into a UTF-8-only pipeline | Reader’s assumption is wrong | Replacement characters (U+FFFD), decode errors |
| Passing a UTF-16LE log to Unix-style tools | Encoding mismatch | NUL bytes appear, treated as binary |
| An LF source file is converted to CRLF in another environment | Line-ending assumption differs | Huge end-of-line diffs, broken shell scripts |
| Saving after seeing mojibake | Bytes have been replaced | Data corruption, generally not recoverable |
| A spec that just says “CSV” | Interface is undefined | Excel can read it, other tools cannot |

Six rules to reduce accidents in practice

1. Define the convention for new files

“UTF-8” alone is not enough. At minimum, decide the following:

  • BOM yes / no
  • Line ending CRLF / LF
  • Who reads it (a legacy Windows tool? Linux / CI as well?)

| Use case | Recommendation |
|---|---|
| Cross-platform source code | UTF-8 no BOM, LF |
| Integration with legacy Windows tools | UTF-8 with BOM, CRLF, or CP932 |
| CSV opened in Excel | UTF-8 with BOM, CRLF |

2. Do not silently convert existing files

  • Do not sneak in a UTF-8 conversion alongside a small everyday fix
  • Treat encoding conversion as a separate task
  • Check downstream consumers before converting

3. Write specs concretely

Bad:  "output as CSV"
Good: "UTF-8 with BOM, CRLF, comma-separated, with header row"
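
A concrete spec translates directly into code. As a sketch, the "Good" spec above could be implemented in Python roughly as follows; `write_spec_csv` is a hypothetical helper name, and the sample row is invented.

```python
import csv


def write_spec_csv(path, header, rows):
    """Write a CSV as: UTF-8 with BOM, CRLF, comma-separated, with header row."""
    # encoding="utf-8-sig" writes the BOM at the start of the file.
    # newline="" stops Python from translating line endings itself,
    # so the csv module's lineterminator is what ends up on disk.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\r\n")
        writer.writerow(header)
        writer.writerows(rows)


write_spec_csv("example.csv", ["id", "name"], [[1, "山田"]])
```

Every word of the spec maps to one argument, which is a reasonable test of whether a spec is concrete enough.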

4. Make the encoding explicit in code

  • Specify the encoding when reading and writing files (do not rely on the implicit default)
  • Be aware of encoding when text crosses process boundaries
  • Do not put quick shell redirections on a production path
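
In Python, for example, "explicit" means passing both `encoding=` and `newline=` on every open. The helpers below are a minimal sketch with hypothetical names, not a recommended library API.

```python
def read_text(path, encoding):
    """Read a file with a stated encoding, keeping line endings as stored."""
    # newline="" disables newline translation, so "\r\n" stays "\r\n".
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


def write_text(path, text, encoding, line_ending):
    """Write text with a stated encoding and a stated line ending."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(normalized.replace("\n", line_ending))


# Usage: the caller has to say what it means, every time.
write_text("legacy.txt", "あ\nい\n", encoding="cp932", line_ending="\r\n")
print(repr(read_text("legacy.txt", encoding="cp932")))
```

Making both parameters mandatory removes the implicit default as a source of surprises, at the cost of slightly noisier call sites.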

5. Share Git and editor settings

  • Pin the line-ending behavior with .gitattributes
  • Share line-ending and encoding settings across the team’s editors
  • Git can normalize line endings, but it will not save you from encoding accidents

6. Change how you report problems

| Bad report | Good report |
|---|---|
| “It is mojibake” | “A UTF-8 no BOM file appears to be opened as CP932” |
| “The line endings look wrong” | “An LF file was converted to CRLF and the diff exploded” |

Just being able to say “what is mismatched” massively speeds up the investigation.

Five questions to drive an investigation

When you are stuck on mojibake or a line-ending issue, answer these five:

  1. What is this file’s byte sequence? (UTF-8 / UTF-8 with BOM / CP932 / UTF-16LE)
  2. Who wrote it under what assumption? (an editor / a legacy app / Excel export / a script)
  3. Who is reading it under what assumption? (editor auto-detection / console code page / library default)
  4. BOM yes or no, and is the line ending CRLF or LF?
  5. Has the misread content already been saved? (just a display issue, or are the bytes already gone?)

Once these five are filled in, the cause is usually visible.
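
Questions 1 and 4 can be partly answered by looking at the raw bytes. The following Python sketch is a rough diagnostic, not a reliable detector: `describe_bytes` is a hypothetical helper, it only tries UTF-8 and CP932, and a byte sequence that decodes cleanly under an encoding is not proof that it was written in that encoding.

```python
import codecs

# BOM signatures to look for at the start of the data.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def describe_bytes(data: bytes) -> dict:
    """Summarize BOM, plausible encodings, NUL bytes, and line endings."""
    bom = next((label for mark, label in BOMS if data.startswith(mark)), None)

    decodes_as = []
    for encoding in ("utf-8", "cp932"):
        try:
            data.decode(encoding)  # strict: raises on invalid sequences
            decodes_as.append(encoding)
        except UnicodeDecodeError:
            pass

    crlf = data.count(b"\r\n")
    return {
        "bom": bom,
        "decodes_as": decodes_as,
        "has_nul": b"\x00" in data,  # a hint that the data may be UTF-16
        "crlf": crlf,
        "lf_only": data.count(b"\n") - crlf,
    }


print(describe_bytes("あ\n".encode("utf-8")))
print(describe_bytes("あ\r\n".encode("cp932")))
```

In practice this would be run on `open(path, "rb").read()`. The remaining questions (who wrote it, who reads it, whether it was already re-saved) still have to be answered by a person.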

Summary

Windows text encoding looks tangled, but not because Japanese is hard.

It is tangled because bytes + encoding + BOM + line ending + tool defaults all live independently, and the old and new text cultures coexist on top of that.

| What to remember | Detail |
|---|---|
| Cause of mojibake | The same bytes were read with a different encoding |
| Line endings are separate | Even with the right encoding, CRLF / LF can still mismatch |
| Do not over-trust the words | “ANSI”, “Unicode”, and “Shift_JIS” mean different things in different tools |
| UTF-8 alone is not enough | A real convention also pins down the BOM and the line ending |
| Display glitch != data corruption | If you reopen before saving, you may be able to recover |
| Specs should be concrete | Write “UTF-8 no BOM, LF”, not just “text” |
