Sorting Out Text Encodings on Windows - Why Mojibake Happens, Especially When Linux Is in the Mix
What to keep in mind first
Mojibake does not happen because “Japanese is hard.” It happens because the same byte sequence was read as a different encoding, or because the misread result was saved again under another encoding.
Six points that matter:
- Mojibake is not a problem of “characters” but of “how bytes are interpreted”
- Windows still carries both a Unicode world and a legacy code page world side by side
- Linux strongly assumes UTF-8, so a stray CP932 or UTF-16 file causes accidents
- A broken display and broken content saved to disk are different problems
- UTF-8 should be the first choice for new text; existing legacy files are safer left as-is
- The file’s encoding, the editor’s encoding, the console’s code page, and the application’s in-memory string format are all separate things
What mojibake really is
The mechanism is simple:
- A string is encoded with some encoding into a byte sequence
- That byte sequence is decoded with another encoding back into a string
- If the encoding step and the decoding step assume different encodings, you get a different string
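Example: save あ as UTF-8 and you get the bytes E3 81 82. Read those bytes as CP932 and you see something like 縺 followed by a broken or placeholder character: E3 81 decodes to 縺, and the leftover 82 is a CP932 lead byte with no trail byte.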
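The example above can be reproduced in a few lines of Python. This is a minimal sketch; the variable names are illustrative, and errors="replace" is only an assumption about how a viewer might render bytes it cannot decode.

```python
original = "あ"
utf8_bytes = original.encode("utf-8")

# Decode the same bytes under the wrong assumption (CP932).
# errors="replace" stands in for a viewer showing a placeholder
# for the byte it cannot decode.
garbled = utf8_bytes.decode("cp932", errors="replace")

# Decode under the right assumption: the text is intact.
recovered = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(" "))   # e3 81 82
print(recovered == original) # True
print(garbled == original)   # False
```

The bytes never changed between the two decode calls. Only the assumption used to read them did.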
A broken display can still be recovered
As long as the original byte sequence is intact, reopening with the correct encoding brings the text back. The dangerous case is saving the “visible” garbled string and losing the original bytes.
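The difference between a display problem and a saved problem can be sketched as follows. The "editor" here is hypothetical: it is assumed to decode the file as CP932, substitute a placeholder for what it cannot decode, and write the result back as CP932.

```python
raw = "あ".encode("utf-8")   # original bytes on disk: E3 81 82

# Case 1: display-only breakage. The bytes were never rewritten,
# so decoding them correctly later still works.
still_recoverable = raw.decode("utf-8") == "あ"

# Case 2: the misread text is saved back (hypothetical editor behavior).
misread = raw.decode("cp932", errors="replace")
resaved = misread.encode("cp932", errors="replace")

# The original third byte (0x82) was replaced along the way,
# so the saved file no longer contains the original byte sequence.
original_bytes_lost = resaved != raw
```

In case 2 no choice of decoder brings あ back, because the information was discarded at save time.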
Even more dangerous: dropping unrepresentable characters into a narrow code page
When a Unicode character that does not exist in CP932 gets replaced with ? or �, it cannot be recovered later.
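A short Python sketch of that lossy step, assuming a writer that encodes to CP932 with replacement enabled. The sample text is illustrative.

```python
text = "テスト😀"   # the emoji has no CP932 representation

# Strict encoding refuses to lose data and raises instead.
try:
    text.encode("cp932")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" silently substitutes "?" for the emoji.
lossy_bytes = text.encode("cp932", errors="replace")
round_tripped = lossy_bytes.decode("cp932")   # "テスト?"
```

Once the "?" is on disk, nothing in the file records which character it used to be. Failing loudly with strict encoding is usually the safer default.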
Why this gets messy on Windows
Windows has a Unicode world and a legacy code page world living together.
The Windows API has two families:
- W family: wide character. Unicode, handled as UTF-16
- A family: ANSI (the code page family)
The four encodings that commonly get mixed up in Japanese Windows:
- CP932: legacy text on Japanese Windows
- UTF-8: newer text, web, cross-platform tooling
- UTF-16LE: used by Windows-side tools and APIs
- The console code page: a separate layer that affects cmd.exe I/O
Note: running chcp 65001 to change the console code page does not change the encoding of existing files.
File names and file contents are different problems
- The layer handling paths and file names
- The layer reading file contents
- The layer rendering to the console
These three are independent. Even if Japanese paths work fine, content in CP932 will still break on the Linux side.
Mind the PowerShell version gap
- Windows PowerShell 5.1: the default encoding is inconsistent and depends on the cmdlet (UTF-16LE in some places, the ANSI code page in others)
- PowerShell 7 and later: UTF-8 without BOM by default
Typical accidents when Linux gets involved
1. Linux reads a Windows CP932 file as UTF-8
The most common case. A legacy app writes a CSV/TXT in CP932; a Linux tool reads it assuming UTF-8 and the Japanese turns into mojibake.
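As a rough illustration, here is what a strict UTF-8 reader does with CP932 bytes in Python. The CSV content is made up for the example.

```python
# What a legacy Windows app might write (illustrative content).
cp932_bytes = "日本語,テスト\n".encode("cp932")

# A strict UTF-8 reader rejects these bytes outright.
try:
    cp932_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Stating the real encoding explicitly reads the text correctly.
text = cp932_bytes.decode("cp932")
```

Tools that decode leniently instead of raising will not stop here. They produce mojibake and carry on, which is harder to notice.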
2. Windows treats a UTF-8 no-BOM file as ANSI
A UTF-8 no-BOM file produced on Linux or in VS Code gets read as ANSI by PowerShell 5.1 or older tools, and only the Japanese lines come out broken.
3. Reading UTF-16LE on Linux
Some PowerShell 5.1 output and older tools write UTF-16LE; on Linux it looks like a binary file with NUL bytes scattered through it.
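The NUL bytes are easy to see from Python. This sketch assumes a short ASCII-only line; the content is illustrative.

```python
utf16_bytes = "id=1\n".encode("utf-16-le")

# Every ASCII character is followed by a NUL byte in UTF-16LE.
nul_count = utf16_bytes.count(b"\x00")   # 5

# Tools that prepend a BOM write FF FE first.
# The "utf-16" codec reads the BOM and consumes it.
with_bom = b"\xff\xfe" + utf16_bytes
decoded = with_bom.decode("utf-16")
```

This is why byte-oriented Linux tools tend to classify such files as binary: half of the bytes in mostly-ASCII text are 00.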
4. Friction over BOM presence
Windows tools are often happier with a BOM, while Linux tools treat the BOM as garbage at the start of the file.
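In Python the two attitudes toward the BOM map onto two codecs. A minimal sketch, with illustrative content:

```python
import codecs

with_bom = codecs.BOM_UTF8 + "名前".encode("utf-8")   # EF BB BF + text

# Plain "utf-8" keeps the BOM as an invisible U+FEFF at the start.
plain = with_bom.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present and is harmless if absent.
stripped = with_bom.decode("utf-8-sig")
no_bom = "名前".encode("utf-8").decode("utf-8-sig")
```

The leftover U+FEFF in the first case is what typically breaks a first-column header comparison or a shebang line.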
5. Trusting how things look in the console will mislead you
“It rendered fine in the console, so the file must be OK” or “It looked broken in the console, so the file is corrupted” are both dangerous. The display and the actual bytes need to be checked separately.
Four questions for investigating mojibake
- What are the original bytes? - UTF-8, CP932, or UTF-16LE
- Who wrote the file under what assumption? - Windows legacy, PowerShell, Linux, VS Code, etc.
- Who is reading it under what assumption? - autodetect or explicit
- Has the misread content already been saved? - just a display issue, or actual data corruption
Once these four are answered, the cause is usually clear.
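For the first question, looking at the raw bytes is more reliable than looking at a screen. The helper below is a hypothetical best-effort sketch, not a real detector: the function name and labels are made up, and a CP932 result only means the bytes happen to decode as CP932.

```python
import codecs

def describe_bytes(data: bytes) -> str:
    """Best-effort guess at how `data` is encoded (heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le-bom"
    if b"\x00" in data:
        # Possibly UTF-16 without a BOM, possibly a binary file.
        return "nul-bytes"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        return "ascii" if data.isascii() else "utf-8"
    try:
        data.decode("cp932")
    except UnicodeDecodeError:
        return "unknown"
    return "cp932?"   # decodes as CP932, which is a guess and not proof
```

For example, describe_bytes("あ".encode("cp932")) returns "cp932?" and describe_bytes("あ".encode("utf-8")) returns "utf-8". The remaining three questions (who wrote it, who reads it, whether it was resaved) still need a human answer.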
Operational rules that reduce accidents
- Default to UTF-8 for new files - and decide whether or not you attach a BOM
- Leave existing legacy files alone until there is an explicit migration task - do not silently UTF-8-ify them along the way
- Treat encoding as part of the interface - “we exchange it as text” is not a real spec
- Specify encoding explicitly when writing - do not rely on defaults
- Verify the console and the file separately - rendering in the console is not the same check as reopening the file
- Git will not fix encoding for you - broken bytes go straight into history
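The "specify encoding explicitly" and "verify the file separately" rules might look like this in Python. The file name and content are hypothetical, and the newline setting is included only so the bytes are the same on every platform.

```python
import tempfile
from pathlib import Path

text = "設定=有効\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.txt"   # hypothetical file name

    # State the encoding (and newline handling) instead of relying on
    # the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

    # Verify the file by its bytes, not by how a console renders it.
    on_disk = path.read_bytes()
    read_back = path.read_text(encoding="utf-8")
```

Comparing on_disk against text.encode("utf-8") checks the file itself, independent of any console code page.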
Wrap-up
- Mojibake is a mismatch in how bytes are interpreted
- Display breakage and data corruption are different things
- On Windows, think of file / editor / console / API as separate layers
- For text exchanged with Linux, start from UTF-8
- Treat conversion of existing legacy files as its own task, separate from regular code changes
Text encoding looks like a boring detail, but between Windows and Linux it is the I/O contract itself.