Best Practices for Avoiding Mojibake with Codex on Windows - Decide How to Prompt Before Tuning Your Environment

Codex, Windows, Mojibake, UTF-8, CP932, AI Coding

Short version

When you ask Codex (an AI coding tool) to work with Japanese files on Windows, the first thing that pays off is telling Codex exactly how to read, how to write, and when to stop. Aligning your editor and shell settings comes second.

The rules that matter most:

  • before reading an existing file with Japanese in it, make Codex inspect encoding candidates, BOM presence, and line endings
  • if mojibake is suspected, do not let Codex save until the file's encoding has been confirmed
  • for existing files, preserve the original encoding, BOM, and line endings
  • for new files, default to UTF-8 (with or without a BOM, per the repo’s conventions)
  • only allow writes through methods where encoding can be specified explicitly
  • after saving, reload and verify representative Japanese lines

Why mojibake happens so easily on Windows

The problem is not that Codex is bad at Japanese. It is that a typical Windows repo has multiple encodings and multiple write paths coexisting side by side.

In practice, this kind of mix is normal:

  • new source code and Markdown are UTF-8
  • older CSV, TXT, log, and config files are CP932
  • some tool output and generated artifacts are UTF-16
  • editors, shells, and Excel-derived output all save through different paths

In that environment, one wrong interpretation by Codex followed by a save is enough to bake the corruption into the file itself.

Rules to give Codex

1. Inspect encoding, BOM, and line endings before reading

Replace “just read the text” with “first check the encoding, BOM, and line endings, then read.”
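As a rough illustration of what that check can look like, here is a minimal Python sketch. The function name `inspect_bytes` and the candidate list (only UTF-8 and CP932) are assumptions made for this example, not something Codex provides. The line-ending counts are only meaningful for byte-oriented encodings such as UTF-8 and CP932; a UTF-16 file would need to be decoded first.

```python
import codecs

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def inspect_bytes(data: bytes) -> dict:
    """Report BOM, strictly decodable candidate encodings, and line endings."""
    bom = next((name for mark, name in BOMS if data.startswith(mark)), None)

    candidates = []
    for enc in ("utf-8", "cp932"):  # assumed candidate list for this sketch
        try:
            data.decode(enc)  # strict mode: any invalid byte raises
            candidates.append(enc)
        except UnicodeDecodeError:
            pass

    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # bare LF only
    cr = data.count(b"\r") - crlf  # bare CR only
    return {"bom": bom, "candidates": candidates, "crlf": crlf, "lf": lf, "cr": cr}
```

A report like this is what you want the agent to produce before it reads a file as text. More than one candidate means the encoding is still undecided.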

2. Do not let it save a file it could not read with confidence

A file you cannot read is a file you must not save. During investigation, treat the file as read-only and forbid overwrites until the interpretation is solid.
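One way to express this rule in code is a decode step that refuses to guess. The sketch below is an assumption-laden example: `decode_or_refuse` is a hypothetical helper, and it only considers UTF-8 and CP932. It returns text only when exactly one candidate decodes cleanly, or when every candidate yields the same text (as with pure ASCII).

```python
def decode_or_refuse(data: bytes, candidates=("utf-8", "cp932")) -> tuple[str, str]:
    """Return (text, encoding) only when the interpretation is unambiguous."""
    decoded = []
    for enc in candidates:
        try:
            decoded.append((data.decode(enc), enc))  # strict decode
        except UnicodeDecodeError:
            continue

    if len(decoded) == 1:
        return decoded[0]
    if len(decoded) > 1 and all(text == decoded[0][0] for text, _ in decoded):
        return decoded[0]  # e.g. pure ASCII: every candidate agrees

    names = [enc for _, enc in decoded] or ["none"]
    raise ValueError(f"encoding not confirmed (decodes as: {', '.join(names)}); do not save")
```

Short UTF-8 Japanese text can also decode as CP932 without error, just as the wrong characters, so a clean decode alone is not proof. When the helper refuses, the right move is to report and ask, for example by comparing against known representative strings.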

3. Preserve existing files; default new files to UTF-8

“Convert everything to UTF-8” is dangerous. Carve any encoding conversion out as its own task.
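To make "preserve" concrete, here is a small sketch of an edit that keeps the original encoding, BOM, and line-ending style. `edit_preserving_format` is a hypothetical name, the encoding is assumed to be already confirmed, and only a UTF-8 BOM is handled. Note one limitation: a file with mixed line endings would come out normalized to CRLF, so such files should be reported rather than edited this way.

```python
import codecs


def edit_preserving_format(data: bytes, encoding: str, edit) -> bytes:
    """Apply edit(text) -> text while keeping BOM and line-ending style.

    `encoding` is the already confirmed encoding ("utf-8" or "cp932" here).
    """
    has_bom = encoding == "utf-8" and data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data

    text = body.decode(encoding)  # strict decode
    newline = "\r\n" if "\r\n" in text else "\n"

    edited = edit(text.replace("\r\n", "\n"))  # edit on LF-normalized text
    out = edited.replace("\n", newline).encode(encoding)  # strict encode
    return (codecs.BOM_UTF8 if has_bom else b"") + out
```

For example, `edit_preserving_format(src, "cp932", lambda t: t.replace("山田", "佐藤"))` changes only the bytes for the replaced name and leaves the CRLF line endings and CP932 encoding alone.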

4. Forbid ambiguous write paths by default

No casual saves through redirects or convenience commands. Only use methods where encoding can be specified explicitly.
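What "explicit" means in practice: the write call should have no default encoding to fall back on. The following is a sketch under that assumption. `write_text_explicit` is a hypothetical helper whose `encoding` argument is keyword-only and required, and it writes bytes so that no newline translation happens on Windows.

```python
import os
import tempfile
from pathlib import Path


def write_text_explicit(path, text: str, *, encoding: str) -> bytes:
    """Encode strictly, then write raw bytes; returns the bytes written."""
    data = text.encode(encoding)  # raises before touching the file if a char does not fit
    Path(path).write_bytes(data)  # bytes mode: "\n" is never rewritten to "\r\n"
    return data


# Usage: write a small CP932 CSV into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample.csv")
    written = write_text_explicit(target, "品名,数量\r\nりんご,3\r\n", encoding="cp932")
    on_disk = Path(target).read_bytes()
```

Because encoding happens before the file is opened, a character that cannot be represented in the target encoding raises `UnicodeEncodeError` and leaves the existing file untouched.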

5. Reload and verify representative Japanese lines after saving

“It saved” and “it is not broken” are different things. Check for U+FFFD (the replacement character), an increase in literal ? characters (a sign that unencodable characters were replaced), and oversized diffs that are only BOM or line-ending changes.
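A reload check can be very small. In this sketch, `verify_reloaded` is a hypothetical helper that takes the bytes re-read from disk, the expected encoding, and the representative strings you handed to the agent. It returns a list of problems, where an empty list means nothing suspicious was found.

```python
def verify_reloaded(data: bytes, encoding: str, representative: list[str]) -> list[str]:
    """Check re-read file bytes; return a list of problems (empty means OK)."""
    try:
        text = data.decode(encoding)  # strict decode with the expected encoding
    except UnicodeDecodeError as exc:
        return [f"cannot decode as {encoding}: {exc}"]

    problems = []
    replaced = text.count("\ufffd")
    if replaced:
        problems.append(f"U+FFFD found {replaced} time(s)")
    for sample in representative:
        if sample not in text:
            problems.append(f"representative string missing: {sample!r}")
    return problems
```

Typical use is `verify_reloaded(Path(p).read_bytes(), "cp932", ["設定", "有効"])` right after the save, with any non-empty result treated as a reason to stop.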

6. On any anomaly, report before fixing

If you see more U+FFFD, more ?, an unexpected BOM change, or a huge line-ending-only diff, treat it as an anomaly and stop.
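These signals can be checked mechanically by comparing the bytes before and after a save. The sketch below is one possible shape, with `find_anomalies` as a hypothetical name and only the UTF-8 BOM considered. A rise in ? can also come from a legitimate edit, so the result is a prompt to report, not a verdict.

```python
import codecs


def find_anomalies(before: bytes, after: bytes, encoding: str) -> list[str]:
    """Compare file bytes before and after a save and list warning signs."""
    anomalies = []
    if before.startswith(codecs.BOM_UTF8) != after.startswith(codecs.BOM_UTF8):
        anomalies.append("UTF-8 BOM added or removed")

    # errors="replace" so that broken bytes show up as U+FFFD instead of raising
    old = before.decode(encoding, errors="replace")
    new = after.decode(encoding, errors="replace")

    for label, ch in (("U+FFFD", "\ufffd"), ("'?'", "?")):
        if new.count(ch) > old.count(ch):
            anomalies.append(f"{label} count rose from {old.count(ch)} to {new.count(ch)}")

    same_without_newlines = old.replace("\r\n", "\n") == new.replace("\r\n", "\n")
    if old != new and same_without_newlines:
        anomalies.append("diff consists only of line-ending changes")
    return anomalies
```

An empty list does not prove the file is fine; it only means none of these particular signals fired.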

Short prompt template

For this task, treat encoding accidents as the top priority to avoid.

- before reading an existing file containing Japanese, check encoding candidates, BOM presence, and line endings
- if mojibake is suspected, do not save based on a guess
- preserve the original encoding / BOM / line endings of existing files
- create new files in UTF-8 (with or without a BOM, per repo conventions)
- only write through methods where encoding can be specified explicitly
- after saving, reload and confirm representative Japanese lines are intact
- if U+FFFD appears, ? increases, BOM/line-ending accidents occur, or diffs are massive, stop and report it as an anomaly

If you already know which files are in scope, adding this one line stabilizes things further:

Target files: <paths> / Representative strings: "<examples>"

Handing over representative strings is surprisingly effective. It gives Codex a concrete checkpoint of “this Japanese must not break.”

A template you can leave in AGENTS.md

Rather than repeating the same warnings every session, put rules like these in AGENTS.md:

  • check encoding/BOM/line endings before reading a Japanese file
  • forbid saving when mojibake is suspected
  • preserve existing files’ encoding
  • treat “convert to UTF-8” as a separate task
  • after saving, reload and verify representative lines
  • on anomalies (replacement chars, BOM changes, massive diffs), stop and report

Bad prompts vs. good prompts

| Bad prompt | Good prompt |
| --- | --- |
| fix the mojibake first | separate display issues from data corruption, and do not save based on a guess |
| convert everything to UTF-8 | preserve existing, UTF-8 only for new. Existing-file conversion is a separate task |
| just produce a CSV | match the encoding of the existing pipeline, and reload to verify after writing |
| match it however you like | do not silently change BOM, line endings, or encoding; keep the diff to actual business changes |

Review checklist

  • is encoding/BOM/line-ending handling reported per changed file?
  • do lines containing Japanese make up an unnaturally large share of the diff?
  • are there a lot of line-ending-only diffs?
  • has the number of U+FFFD or ? characters increased?
  • have CSV or log columns shifted?

Wrap-up

  • inspect encoding/BOM/line endings before reading
  • do not save on a guess when mojibake is suspected
  • preserve existing files; default new files to UTF-8 (with or without a BOM, per repo conventions)
  • forbid ambiguous write paths
  • reload after saving and verify representative Japanese lines

Dealing with mojibake is not a matter of asking “please handle Japanese properly.” It is writing down the conditions under which saving is allowed and the conditions under which the agent must stop. If you are going to say it every time, put it in AGENTS.md instead.
