Sorting Out Text Encodings on Windows - Why Mojibake Happens, Especially When Linux Is in the Mix

Tags: Windows, Mojibake, UTF-8, CP932, Linux, PowerShell, Unicode

What to keep in mind first

Mojibake does not happen because “Japanese is hard.” It happens because the same byte sequence was read as a different encoding, or because the misread result was saved again under another encoding.

Six points that matter:

  1. Mojibake is not a problem of “characters” but of “how bytes are interpreted”
  2. Windows still carries both a Unicode world and a legacy code page world side by side
  3. Linux strongly assumes UTF-8, so a stray CP932 or UTF-16 file causes accidents
  4. A broken display and broken content saved to disk are different problems
  5. UTF-8 should be the first choice for new text; existing legacy files are safer left as-is
  6. The file’s encoding, the editor’s encoding, the console’s code page, and the application’s in-memory string format are all separate things

What mojibake really is

The mechanism is simple:

  1. A string is encoded with some encoding into a byte sequence
  2. That byte sequence is decoded with another encoding back into a string
  3. If encode and decode disagree on the assumption, you get a different string

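Example: save あ as UTF-8 and you get the bytes E3 81 82. Read those bytes as CP932 and you see something like 縺� (E3 81 decodes as 縺, and the trailing 82 is an incomplete CP932 lead byte with no valid reading).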
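
The same mismatch can be reproduced in a few lines of Python. This is a minimal sketch: the sample character and the variable names are illustrative, and errors="replace" is used only so the misread can be printed instead of raising.

```python
# Encode one character as UTF-8, then decode the same bytes as CP932.
original = "あ"
raw = original.encode("utf-8")
print(raw.hex(" "))  # e3 81 82

# errors="replace" marks undecodable bytes with U+FFFD instead of raising.
misread = raw.decode("cp932", errors="replace")
print(misread)

# Decoding with the encoding that was actually used gives the text back.
restored = raw.decode("utf-8")
print(restored == original)  # True
```

Nothing in raw changed between the two decodes. Only the assumption about how to read it changed.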

A broken display can still be recovered

As long as the original byte sequence is intact, reopening with the correct encoding brings the text back. The dangerous case is saving the “visible” garbled string and losing the original bytes.
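
The difference between a display problem and real damage can be sketched as follows. This is an illustrative example, not a description of any specific editor: it assumes the misreading tool substitutes U+FFFD for bytes it cannot decode, and the variable names are made up.

```python
raw = "あ".encode("utf-8")  # the bytes as they sit on disk

# Display-only breakage: the bytes are untouched, so a correct decode still works.
still_ok = raw.decode("utf-8")
print(still_ok)

# Data corruption: the misread string is written back out as if it were the text.
garbled = raw.decode("cp932", errors="replace")
resaved = garbled.encode("utf-8")
print(resaved == raw)  # False
```

Once resaved overwrites the file, the bytes E3 81 82 no longer exist anywhere, so there is nothing left to decode correctly.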

Even more dangerous: forcing unrepresentable characters into a narrow code page

When a Unicode character that does not exist in CP932 gets replaced with ? or �, it cannot be recovered later.
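
A short Python sketch of this kind of loss, under the assumption that the writing side encodes with a "replace" policy (many tools do something similar, but the exact behavior varies). The sample text is illustrative.

```python
text = "テスト😀"  # the emoji has no CP932 representation

# errors="replace" substitutes "?" for anything the target encoding lacks.
narrowed = text.encode("cp932", errors="replace")
round_tripped = narrowed.decode("cp932")
print(round_tripped)  # テスト?

# A strict encode refuses to write instead of silently losing data.
try:
    text.encode("cp932")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```

The "?" in round_tripped is an ordinary question mark. Nothing in the bytes records which character used to be there, which is why a strict encode that fails loudly is usually the safer default.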

Why this gets messy on Windows

Windows has a Unicode world and a legacy code page world living together.

The Windows API has two families:

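  • W family (for example CreateFileW): wide-character functions that take Unicode strings as UTF-16
  • A family (for example CreateFileA): “ANSI” functions that take strings in the active legacy code page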

The four encodings that commonly get mixed up on Japanese Windows:

  • CP932: legacy text on Japanese Windows
  • UTF-8: newer text, web, cross-platform tooling
  • UTF-16LE: used by Windows-side tools and APIs
  • The console code page: a separate layer that affects console I/O in cmd.exe and other console programs

Note: running chcp 65001 to change the console code page does not change the encoding of existing files.

File names and file contents are different problems

  • The layer handling paths and file names
  • The layer reading file contents
  • The layer rendering to the console

These three are independent. Even if Japanese paths work fine, content in CP932 will still break on the Linux side.

Mind the PowerShell version gap

  • Windows PowerShell 5.1: the default encoding depends on the cmdlet (for example, Out-File and > write UTF-16LE, while Set-Content writes the ANSI code page)
  • PowerShell 7 and later: UTF-8 without BOM by default

Typical accidents when Linux gets involved

1. Linux reads a Windows CP932 file as UTF-8

The most common case. A legacy app writes a CSV/TXT in CP932; a Linux tool reads it assuming UTF-8 and the Japanese turns into mojibake.

2. Windows treats a UTF-8 no-BOM file as ANSI

A UTF-8 no-BOM file produced on Linux or in VS Code gets read as ANSI by PowerShell 5.1 or older tools, and only the Japanese lines come out broken.

3. Reading UTF-16LE on Linux

Some PowerShell 5.1 output and older tools write UTF-16LE; on Linux it looks like a binary file with NUL bytes scattered through it.
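
A small Python sketch of why UTF-16LE looks like binary to byte-oriented tools, and one way to convert it. The sample text is illustrative, and the BOM is prepended by hand here so the example does not depend on the platform's byte order.

```python
text = "ab\n"
utf16 = text.encode("utf-16-le")
print(utf16)  # b'a\x00b\x00\n\x00'

# Every ASCII character is followed by a NUL byte, which is what trips up
# tools that expect UTF-8 or plain ASCII.
has_nul = b"\x00" in utf16

# With a BOM in front, the "utf-16" codec picks the byte order and strips the BOM.
with_bom = b"\xff\xfe" + utf16
as_utf8 = with_bom.decode("utf-16").encode("utf-8")
print(as_utf8)  # b'ab\n'
```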

4. Friction over BOM presence

Windows tools are often happier with a BOM, while many Linux tools treat the BOM as stray bytes at the start of the file.
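
In Python, the two views of a BOM can be seen side by side. This is a minimal sketch with an illustrative CSV header. "utf-8-sig" is Python's codec name for UTF-8 with an optional BOM.

```python
import codecs

payload = "名前,値\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload  # EF BB BF followed by the text

# Plain "utf-8" keeps the BOM as a leading U+FEFF character in the string.
plain = with_bom.decode("utf-8")
print(plain.startswith("\ufeff"))  # True

# "utf-8-sig" strips a BOM if present and changes nothing if it is absent.
from_bom = with_bom.decode("utf-8-sig")
from_no_bom = payload.decode("utf-8-sig")
print(from_bom == from_no_bom)  # True
```

A reader that has to accept files from both sides can decode with "utf-8-sig". What to write is still a decision to make per consumer.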

5. Trusting how things look in the console will mislead you

Both “It rendered fine in the console, so the file must be OK” and “It looked broken in the console, so the file is corrupted” are dangerous conclusions. The display and the actual bytes need to be checked separately.

Four questions for investigating mojibake

  1. What are the original bytes? - UTF-8, CP932, or UTF-16LE
  2. Who wrote the file under what assumption? - Windows legacy, PowerShell, Linux, VS Code, etc.
  3. Who is reading it under what assumption? - autodetect or explicit
  4. Has the misread content already been saved? - just a display issue, or actual data corruption

Once these four are answered, the cause is usually clear.
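
For the first question, looking at the raw bytes is more reliable than looking at rendered text. The helper below is a hypothetical sketch, not a real detector: guess_encoding is a made-up name, it only distinguishes the three encodings discussed here, and some CP932 byte sequences can also happen to be valid UTF-8, so its answer is a hint to confirm against what the writer actually used.

```python
import codecs

def guess_encoding(raw: bytes) -> str:
    """Hypothetical helper: a rough guess from the bytes alone."""
    # A BOM is the strongest signal when one is present.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # Strict UTF-8 decoding rejects most CP932 text, so try it first.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("cp932")
        return "cp932"
    except UnicodeDecodeError:
        return "unknown"

sample = "あ".encode("cp932")
print(sample.hex(" "), guess_encoding(sample))  # 82 a0 cp932
```

The other three questions cannot be answered from bytes alone. They need knowledge of which tool wrote the file and which tool is reading it.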

Operational rules that reduce accidents

  1. Default to UTF-8 for new files - and decide whether or not you attach a BOM
  2. Leave existing legacy files alone until there is an explicit migration task - do not silently UTF-8-ify them along the way
  3. Treat encoding as part of the interface - “we exchange it as text” is not a real spec
  4. Specify encoding explicitly when writing - do not rely on defaults
  5. Verify the console and the file separately - rendering in the console is not the same check as reopening the file
  6. Git will not fix encoding for you - broken bytes go straight into history
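
Rule 4 can look like this in Python. This is a sketch under stated assumptions: the temporary directory and the file name sample.csv are placeholders, and UTF-8 without BOM is chosen only as an example of a decision made explicitly.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.csv")  # hypothetical file

# State the encoding at the write boundary instead of inheriting a platform default.
# newline="" turns off newline translation so "\n" is written as-is.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("名前,値\n")

# Verify the file by its bytes, not by how it renders in a console.
with open(path, "rb") as f:
    on_disk = f.read()
print(on_disk.hex(" "))

# State the encoding on the read side as well.
with open(path, encoding="utf-8") as f:
    read_back = f.read()
print(read_back == "名前,値\n")  # True
```

Without the encoding argument, the default Python picks depends on the platform and Python version, which is exactly the kind of hidden assumption the rules above try to remove.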

Wrap-up

  • Mojibake is a mismatch in how bytes are interpreted
  • Display breakage and data corruption are different things
  • On Windows, think of file / editor / console / API as separate layers
  • For text exchanged with Linux, start from UTF-8
  • Treat conversion of existing legacy files as its own task, separate from regular code changes

Text encoding looks like a boring detail, but between Windows and Linux it is the I/O contract itself.
