Sorting out Windows text encodings and line endings - Shift_JIS / UTF-8 / UTF-16, mojibake, CRLF / LF, and why it gets confusing
What to get straight first
A text file is not the string itself; it is bytes + encoding + line endings. A BOM may also be attached.
Most trouble falls into one of three buckets:
- The same byte sequence was read as a different encoding -> mojibake
- The two sides disagree on what marks a line break -> line-ending trouble (even when the encoding is correct)
- The misread content was saved as-is -> data corruption (not recoverable)
Just separating these three speeds up root-cause analysis dramatically.
Sorting out the vocabulary – fix the wording first
The same character あ is a different byte sequence in each encoding
Character: あ
UTF-8 : E3 81 82
CP932 : 82 A0
UTF-16LE : 42 30
A character and a byte sequence are different things. Accidents happen at the boundary where one is converted to the other.
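The byte listing above can be reproduced with a few lines of Python. This is a minimal sketch: `hex_bytes` is a hypothetical helper name, and it assumes Python's `cp932` and `utf-16-le` codecs correspond to what this article calls CP932 and UTF-16LE.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded form of `text` as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


# Same character, three encodings, three different byte sequences.
for encoding in ("utf-8", "cp932", "utf-16-le"):
    print(f"{encoding:10}: {hex_bytes('あ', encoding)}")
```

The string only becomes bytes at the `encode` call, which is the boundary the paragraph above refers to.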
Common terminology traps on Windows
| Term | What it actually means | Watch out |
|---|---|---|
| ANSI | The active code page on that machine (CP932 in a Japanese environment) | It is not ASCII. It is environment-dependent |
| Unicode (as shown in menus) | Often refers to UTF-16LE | “Save as Unicode” may mean UTF-16LE. It is not necessarily UTF-8 |
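| Shift_JIS | In casual usage, often means CP932 | Strictly speaking, CP932 is Microsoft's extended variant of Shift_JIS. Tools that treat shift_jis and CP932 as separate encodings (common on the Linux side) can reject some characters |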
| UTF-8N | UTF-8 no BOM (an editor-specific label) | It is not an official name |
| text | Different people assume different things | A spec should be specific, e.g. “UTF-8 no BOM, LF” |
Mismatched labels create confusion: a conversation can sound like it agrees while the actual bytes do not.
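One way to see that "Shift_JIS" and CP932 are not interchangeable labels is to try encoding a character that only exists in the extended set. The sketch below assumes Python's `shift_jis` and `cp932` codecs as stand-ins for the two meanings; `can_encode` is a hypothetical helper, and ① is used as an example of a vendor extension character.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "①"  # circled digit one: present in CP932, absent from plain Shift_JIS
for encoding in ("cp932", "shift_jis"):
    print(encoding, can_encode(sample, encoding))
```

Two people can both say "Shift_JIS" and still hand each other files that one side cannot read, which is why the spec should name the exact encoding.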
What mojibake really is – a simple mechanism
original string -> serialized as bytes with encoding A -> read back as a string with encoding B -> assumptions disagree -> mojibake
- Reading the UTF-8 byte sequence for あ (E3 81 82) as CP932 produces something like 縺 instead
- At this point the bytes are not damaged yet (only the way of reading them was wrong)
- What is dangerous is saving the broken-looking content as-is -> the original byte sequence is lost
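The mechanism can be sketched in Python as follows. The sample string is arbitrary, and `errors="replace"` is an assumption that imitates a viewer which shows a placeholder for bytes it cannot decode instead of failing.

```python
original = "あいう"
raw = original.encode("utf-8")  # the bytes as they sit on disk

# Wrong assumption: the reader treats the UTF-8 bytes as CP932.
misread = raw.decode("cp932", errors="replace")

# Right assumption: the very same bytes decode cleanly.
reread = raw.decode("utf-8")

print(misread)             # mojibake
print(reread == original)  # the bytes themselves were never damaged
```

Nothing in `raw` changes between the two reads; only the decoding assumption differs.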
Display glitch vs data corruption
| Stage | State | Recoverable? |
|---|---|---|
| A UTF-8 file was just opened as CP932 | Display is broken but bytes are intact | Recoverable (reopen with the correct encoding) |
| The broken-looking content was saved | The bytes have been overwritten with different ones | Generally not recoverable |
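The two stages in the table can be told apart in a short Python sketch. It assumes a hypothetical editor that opens the file as CP932 and then saves what it displays as UTF-8; other save encodings lose the original bytes in a similar way.

```python
raw = "あ".encode("utf-8")  # original bytes: E3 81 82

# Stage 1: opened as CP932. Only the in-memory text is wrong; `raw` is untouched.
misread = raw.decode("cp932", errors="replace")

# Stage 2: the editor writes back what it shows (assumed here: saved as UTF-8).
saved = misread.encode("utf-8")

print(saved == raw)                   # the original byte sequence has been replaced
print(saved.decode("utf-8") == "あ")  # reopening with the right encoding no longer helps
```

Both comparisons come out false: after the save, the file contains the mojibake itself, correctly encoded.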
Line endings – a separate problem from the encoding
Line breaks are also bytes.
| Line ending | Bytes | Where it lives |
|---|---|---|
| CRLF | 0D 0A | Traditional Windows text files, legacy tools |
| LF | 0A | Linux / macOS / modern dev tooling |
| CR | 0D | Classic Mac (rarely seen today) |
The key point is that the encoding and the line ending are independent.
UTF-8 + LF : 41 0A 42 <- "A\nB"
UTF-8 + CRLF : 41 0D 0A 42 <- "A\r\nB"
CP932 + LF : possible
UTF-16LE + CRLF : possible
“I switched to UTF-8 and it still does not match” is often a case where the encoding is right but only the line ending is off.
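A small Python sketch of that situation: two files with the same encoding and the same content that still fail a byte-level comparison. `normalize_newlines` is a hypothetical helper, not part of any library.

```python
def normalize_newlines(text: str) -> str:
    """Map CRLF and lone CR to LF so that only the content is compared."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


lf_bytes = "A\nB".encode("utf-8")      # 41 0A 42
crlf_bytes = "A\r\nB".encode("utf-8")  # 41 0D 0A 42

# Byte-level comparison: differs, although both files are valid UTF-8.
print(lf_bytes == crlf_bytes)

# Content comparison after normalizing the line endings: identical.
same_content = (
    normalize_newlines(lf_bytes.decode("utf-8"))
    == normalize_newlines(crlf_bytes.decode("utf-8"))
)
print(same_content)
```

If the bytes differ but the normalized text matches, the encoding is fine and only the line ending needs attention.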
Why Windows is especially confusing
1. Unicode and code pages coexist
Multiple text cultures live side by side on Windows:
- UTF-8 (modern, web, cross-platform)
- CP932 (legacy CSV, TXT, logs, business-system integrations)
- UTF-16LE (some APIs, in-memory representations of certain apps)
A single Windows machine carries three different assumptions at once.
2. Tools quietly change the assumptions
Even when you do not specify anything, the layers below add their own assumptions:
- Editor auto-detection
- Automatic BOM addition / removal on save
- Git’s CRLF <-> LF conversion (core.autocrlf)
- The console code page (chcp)
- PowerShell’s default encoding, which differs by version
- An unexpected code page on CSV export
“It broke and I did not change anything” usually means a tool’s default changed.
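Language runtimes are one more layer with their own defaults. As an illustration only, the following sketch lists what a Python process would assume when no encoding is specified. `runtime_text_defaults` is a hypothetical helper, and the values it reports depend on the machine, the console, and the Python version.

```python
import locale
import sys


def runtime_text_defaults() -> dict:
    """Collect the encodings this Python process uses when none is given explicitly."""
    return {
        "locale_preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in runtime_text_defaults().items():
    print(f"{name:17}: {value}")
```

Running this on two machines that "have the same setup" is a quick way to find out whether they really share the same assumptions.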
3. Problems hide in the ASCII range
UTF-8 is ASCII-compatible, so a file that contains only ASCII alphanumerics will “happen to work”. The trouble shows up the moment a single Japanese line is added.
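A minimal Python check makes the effect visible. The sample strings are made up for illustration.

```python
ascii_only = "status=OK"
with_japanese = "status=成功"

# ASCII-only text: UTF-8 and CP932 produce identical bytes, so a wrong
# assumption on either side goes unnoticed.
print(ascii_only.encode("utf-8") == ascii_only.encode("cp932"))

# One non-ASCII word: the byte sequences diverge and the mismatch surfaces.
print(with_japanese.encode("utf-8") == with_japanese.encode("cp932"))
```

This is why a test suite that only uses ASCII fixtures tends to pass right up until real data arrives.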
Common failure patterns
| Situation | What is mismatched | Symptom |
|---|---|---|
| Opening a UTF-8 no BOM file in a legacy tool | Reader misdetects it as CP932 | Only the Japanese is mojibake |
| Feeding a CP932 CSV into a UTF-8-only pipeline | Reader’s assumption is wrong | �, decode errors |
| Passing a UTF-16LE log to Unix-style tools | Encoding mismatch | NUL bytes appear, treated as binary |
| An LF source file is converted to CRLF in another environment | Line-ending assumption differs | Huge end-of-line diffs, broken shell scripts |
| Saving after seeing mojibake | Bytes have been replaced | Data corruption, not recoverable |
| A spec that just says “CSV” | Interface is undefined | Excel can read it, other tools cannot |
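Two of the rows above (CP932 fed into a UTF-8-only reader, and UTF-16LE passed to byte-oriented tools) can be reproduced in Python. This is a sketch with made-up sample data; `try_utf8` is a hypothetical helper.

```python
def try_utf8(data: bytes) -> str:
    """Decode strictly as UTF-8 and report the failure instead of hiding it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"decode error at byte {exc.start}: {exc.reason}"


cp932_csv = "名前,値\r\n".encode("cp932")
utf16_log = "ERROR\r\n".encode("utf-16-le")

print(try_utf8(cp932_csv))                          # strict reader: decode error
print(cp932_csv.decode("utf-8", errors="replace"))  # lenient reader: U+FFFD characters
print(utf16_log.count(b"\x00"))                     # NUL bytes that byte-oriented tools trip over
```

A strict reader fails loudly, while a lenient one silently produces replacement characters, so the same mismatch can look like two different bugs.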
Six rules to reduce accidents in practice
1. Define the convention for new files
“UTF-8” alone is not enough. At minimum, decide the following:
- BOM yes / no
- Line ending CRLF / LF
- Who reads it (a legacy Windows tool? Linux / CI as well?)
| Use case | Recommendation |
|---|---|
| Cross-platform source code | UTF-8 no BOM, LF |
| Integration with legacy Windows tools | UTF-8 with BOM and CRLF, or CP932 (whichever the tool actually reads) |
| CSV opened in Excel | UTF-8 with BOM, CRLF |
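As an example of pinning one row of this table down in code, the sketch below writes a CSV as UTF-8 with BOM and CRLF using Python's standard library. `write_excel_csv`, the file name, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile


def write_excel_csv(path: str, rows: list[list[str]]) -> None:
    """Write rows as UTF-8 with BOM, CRLF line endings, comma-separated."""
    # "utf-8-sig" emits the BOM; newline="" stops Python from translating
    # the CRLF that csv.writer produces.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f, lineterminator="\r\n").writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "sample.csv")  # throwaway location
write_excel_csv(path, [["名前", "値"], ["東京", "1"]])

with open(path, "rb") as f:
    data = f.read()
print(data[:3].hex(" ").upper())  # the BOM bytes at the start of the file
```

Every property in the convention (BOM, line ending, delimiter) appears as an explicit argument, so none of it depends on the machine's defaults.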
2. Do not silently convert existing files
- Do not sneak in a UTF-8 conversion alongside a small everyday fix
- Treat encoding conversion as a separate task
- Check downstream consumers before converting
3. Write specs concretely
Bad: "output as CSV"
Good: "UTF-8 with BOM, CRLF, comma-separated, with header row"
4. Make the encoding explicit in code
- Specify the encoding when reading and writing files (do not rely on the implicit default)
- Be aware of encoding when text crosses process boundaries
- Do not put quick shell redirections on a production path
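In Python, for instance, making the encoding explicit could look like the sketch below. The helper names `write_text_exact` and `read_text_exact` are hypothetical; the point is that both the encoding and the line ending are arguments rather than defaults.

```python
import tempfile
from pathlib import Path


def write_text_exact(path: Path, text: str, *, encoding: str, newline: str) -> None:
    """Write `text` (LF-separated in memory) with an explicit encoding and line ending."""
    with open(path, "w", encoding=encoding, newline=newline) as f:
        f.write(text)


def read_text_exact(path: Path, *, encoding: str) -> str:
    """Read with an explicit encoding; newline='' keeps CRLF / LF exactly as stored."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


target = Path(tempfile.mkdtemp()) / "out.txt"  # throwaway location
write_text_exact(target, "一行目\n二行目\n", encoding="cp932", newline="\r\n")

print(target.read_bytes().count(b"\r\n"))          # CRLF pairs actually on disk
print(repr(read_text_exact(target, encoding="cp932")))
```

Because nothing is left to the platform default, the same call produces the same bytes on Windows, Linux, and CI.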
5. Share Git and editor settings
- Pin the line-ending behavior with .gitattributes
- Share line-ending and encoding settings across the team’s editors
- Git can normalize line endings, but it will not save you from encoding accidents
6. Change how you report problems
| Bad report | Good report |
|---|---|
| “It is mojibake” | “A UTF-8 no BOM file appears to be opened as CP932” |
| “The line endings look wrong” | “An LF file was converted to CRLF and the diff exploded” |
Just being able to say “what is mismatched” massively speeds up the investigation.
Five questions to drive an investigation
When you are stuck on mojibake or a line-ending issue, answer these five:
- What is this file’s byte sequence? (UTF-8 / UTF-8 with BOM / CP932 / UTF-16LE)
- Who wrote it under what assumption? (an editor / a legacy app / Excel export / a script)
- Who is reading it under what assumption? (editor auto-detection / console code page / library default)
- BOM yes or no, and is the line ending CRLF or LF?
- Has the misread content already been saved? (just a display issue, or are the bytes already gone?)
Once these five are filled in, the cause is usually visible.
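Part of this checklist (BOM, a plausible encoding, line-ending counts) can be answered mechanically from the raw bytes. The following Python sketch is a heuristic, not a detector: `describe_bytes` is a hypothetical helper, it only tries UTF-8 and then CP932 when no BOM is present, and a successful decode does not prove that the guess is what the author intended.

```python
import codecs

# BOM signature -> codec that understands (and strips) that BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def describe_bytes(data: bytes) -> dict:
    """Report BOM presence, a guessed encoding, NUL bytes, and line-ending counts."""
    bom_codec = next((enc for mark, enc in BOMS if data.startswith(mark)), None)
    # Assumption: without a BOM, try strict UTF-8 first, then CP932.
    candidates = [bom_codec] if bom_codec else ["utf-8", "cp932"]

    text, guess = None, None
    for encoding in candidates:
        try:
            text, guess = data.decode(encoding), encoding
            break
        except UnicodeDecodeError:
            continue

    report = {"bom": bom_codec is not None, "guess": guess,
              "nul_bytes": data.count(b"\x00")}
    if text is not None:
        crlf = text.count("\r\n")
        report.update(crlf=crlf,
                      lf=text.count("\n") - crlf,   # LF not preceded by CR
                      cr=text.count("\r") - crlf)   # CR not followed by LF
    return report


print(describe_bytes("あ\r\nい\n".encode("utf-8")))
print(describe_bytes("あ\r\n".encode("cp932")))
```

Who wrote the file, who reads it, and whether a misread version was already saved still have to be answered by people, since the bytes alone cannot tell you.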
Summary
Windows text encoding looks tangled not because Japanese is hard.
It is tangled because bytes + encoding + BOM + line ending + tool defaults all live independently, and the old and new text cultures coexist on top of that.
| What to remember | Detail |
|---|---|
| Cause of mojibake | The same bytes were read with a different encoding |
| Line endings are separate | Even with the right encoding, CRLF / LF can still mismatch |
| Do not over-trust the words | “ANSI”, “Unicode”, and “Shift_JIS” mean different things in different tools |
| UTF-8 alone is not enough | A real convention also pins down the BOM and the line ending |
| Display glitch != data corruption | If you reopen before saving, you may be able to recover |
| Specs should be concrete | Write “UTF-8 no BOM, LF”, not just “text” |