Best Practices for Avoiding Mojibake with Codex on Windows - Decide How to Prompt Before Tuning Your Environment
Short version
When you ask Codex (an AI coding tool) to work with Japanese files on Windows, the first thing that pays off is not aligning your editor and shell settings. It is telling Codex exactly how to read, how to write, and when to stop.
The rules that matter most:
- before reading an existing file with Japanese in it, make Codex inspect encoding candidates, BOM presence, and line endings
- if mojibake is suspected, do not let Codex save until it is confident
- for existing files, preserve the original encoding, BOM, and line endings
- for new files, default to UTF-8 (with or without BOM, per the repo’s conventions)
- only allow writes through methods where encoding can be specified explicitly
- after saving, reload and verify representative Japanese lines
Why mojibake happens so easily on Windows
The problem is not that Codex is bad at Japanese. It is that on Windows, multiple encodings and multiple write paths routinely coexist in the same repo.
In practice, this kind of mix is normal:
- new source code and Markdown are UTF-8
- older CSV, TXT, log, and config files are CP932
- some tool output and generated artifacts are UTF-16
- editors, shells, and Excel-derived output all save through different paths
In that environment, one wrong interpretation by Codex followed by a save is enough to bake the corruption into the file itself.
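To make the failure mode concrete, here is a minimal Python sketch of one misread followed by a save. It is only an illustration: the sample string is made up, and the "tool" is simulated by a lenient decode with `errors="replace"`, which is an assumption about how a careless read path might behave rather than a description of what Codex does.

```python
# Hypothetical sample text, stored on disk as CP932 bytes.
original = "日本語のテスト"
cp932_bytes = original.encode("cp932")

# A reader that wrongly assumes UTF-8 and replaces undecodable bytes.
misread = cp932_bytes.decode("utf-8", errors="replace")

# Saving the misread text writes the replacement characters to disk.
saved_bytes = misread.encode("utf-8")

# The original bytes are gone; decoding the saved file cannot restore them.
recovered = saved_bytes.decode("utf-8")

print("U+FFFD present after misread:", "\ufffd" in misread)
print("saved bytes differ from original:", saved_bytes != cp932_bytes)
print("text recovered intact:", recovered == original)
```

Before the save, the damage exists only in memory and the file is still correct. After the save, the file itself holds U+FFFD where the Japanese used to be.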
Rules to give Codex
1. Inspect encoding, BOM, and line endings before reading
Replace “just read the text” with “first check the file’s assumptions, then read.”
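As a rough idea of what "check the file’s assumptions" can mean, the following Python sketch inspects raw bytes for a BOM, tries a short list of candidate encodings with strict decoding, and counts line endings. The function name `inspect_bytes`, the default candidate list, and the result keys are hypothetical choices for this example, not part of any tool.

```python
import codecs

# BOM signatures. UTF-32 comes before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def inspect_bytes(data, candidates=("utf-8", "cp932")):
    """Report BOM, encodings that decode strictly, and line-ending counts."""
    bom_encoding = next((enc for bom, enc in BOMS if data.startswith(bom)), None)
    to_try = [bom_encoding] if bom_encoding else list(candidates)

    viable = []
    text = None
    for enc in to_try:
        try:
            decoded = data.decode(enc)  # strict: any invalid byte raises
        except UnicodeDecodeError:
            continue
        viable.append(enc)
        if text is None:
            text = decoded

    if text is None:
        crlf = lf_only = None  # nothing decoded, so do not guess
    else:
        crlf = text.count("\r\n")
        lf_only = text.count("\n") - crlf
    return {"bom": bom_encoding, "candidates": viable,
            "crlf": crlf, "lf_only": lf_only}


# Made-up sample: a small CP932 CSV with CRLF line endings.
sample = "名前,値\r\n太郎,1\r\n".encode("cp932")
report = inspect_bytes(sample)
print(report)
```

An empty `candidates` list, or more than one entry, is a signal that the interpretation is not yet solid. Plain ASCII decodes under both defaults, and some UTF-8 byte sequences also happen to decode as CP932, so a strict decode succeeding is evidence, not proof.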
2. Do not let it save a file it could not read with confidence
A file you cannot read is a file you must not save. During investigation, treat the file as read-only and forbid overwrites until the interpretation is solid.
3. Preserve existing files; default new files to UTF-8
“Convert everything to UTF-8” is dangerous. Carve any encoding conversion out as its own task.
4. Forbid ambiguous write paths by default
No casual saves through redirects or convenience commands. Only use methods where encoding can be specified explicitly.
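One way to picture an explicit write path is the Python sketch below. The helper name `write_text_explicit` and the sample data are hypothetical. The idea is that the caller must state both the encoding and the line ending, and that the text is encoded strictly before the file is touched, so an unencodable character raises an error instead of being silently written as ?.

```python
import tempfile
from pathlib import Path


def write_text_explicit(path, text, encoding, newline):
    """Write text with a caller-specified encoding and line ending.

    encoding: for example "cp932", "utf-8", or "utf-8-sig" (UTF-8 with BOM).
    newline: "\r\n" or "\n".
    """
    # Normalize first so existing CRLF pairs are not doubled.
    normalized = text.replace("\r\n", "\n").replace("\n", newline)
    # Encode before opening the file: if this raises UnicodeEncodeError,
    # the existing file on disk is left untouched.
    data = normalized.encode(encoding, errors="strict")
    Path(path).write_bytes(data)
    return data


# Made-up example: keep an existing CSV in CP932 with CRLF line endings.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.csv"
    write_text_explicit(target, "名前,値\n太郎,1\n",
                        encoding="cp932", newline="\r\n")
    written = target.read_bytes()

print(written == "名前,値\r\n太郎,1\r\n".encode("cp932"))
```

Writing bytes rather than opening the file in text mode is a design choice for this sketch: it keeps platform newline translation and locale defaults out of the picture entirely.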
5. Reload and verify representative Japanese lines after saving
“It saved” and “it is not broken” are different things. Check for U+FFFD, an increase in ?, and oversized diffs that are only BOM or line-ending changes.
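The reload check can be sketched in Python as follows. The function name `check_saved_bytes`, the problem messages, and the sample strings are hypothetical. It decodes the saved bytes strictly and compares a few counts against the text as it was before the change.

```python
def check_saved_bytes(saved, encoding, before_text, representative):
    """Return a list of problems found when reloading saved bytes."""
    try:
        after = saved.decode(encoding)  # strict decode of what is on disk
    except UnicodeDecodeError as exc:
        return [f"cannot decode as {encoding}: {exc}"]

    problems = []
    if after.count("\ufffd") > before_text.count("\ufffd"):
        problems.append("U+FFFD increased")
    if after.count("?") > before_text.count("?"):
        problems.append("'?' increased")
    if after.count("\r\n") != before_text.count("\r\n"):
        problems.append("CRLF count changed")
    for s in representative:
        if s not in after:
            problems.append(f"missing representative string: {s}")
    return problems


before = "名前: 太郎\r\n"

# A correct save in the file's own encoding.
good = before.encode("cp932")
problems_good = check_saved_bytes(good, "cp932", before, ["太郎"])

# A simulated bad save where unencodable characters were replaced with '?'.
bad = before.encode("ascii", errors="replace")
problems_bad = check_saved_bytes(bad, "cp932", before, ["太郎"])

print(problems_good)
print(problems_bad)
```

Note that the bad save still decodes without error. That is why the check looks at the ? count and the representative strings rather than relying on the decode alone.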
6. On any anomaly, report before fixing
If you see more U+FFFD, more ?, an unexpected BOM change, or a huge line-ending-only diff, treat it as an anomaly and stop.
Short prompt template
For this task, treat encoding accidents as the top priority to avoid.
- before reading an existing file containing Japanese, check encoding candidates, BOM presence, and line endings
- if mojibake is suspected, do not save based on a guess
- preserve the original encoding / BOM / line endings of existing files
- create new files in UTF-8 (with or without BOM, per repo conventions)
- only write through methods where encoding can be specified explicitly
- after saving, reload and confirm representative Japanese lines are intact
- if U+FFFD appears, ? increases, BOM/line-ending accidents occur, or diffs are massive, stop and report it as an anomaly
If you already know which files are in scope, adding this one line stabilizes things further:
Target files: <paths> / Representative strings: "<examples>"
Handing over representative strings is surprisingly effective. It gives Codex a concrete checkpoint of “this Japanese must not break.”
A template you can leave in AGENTS.md
Rather than repeating the same warnings every session, put rules like these in AGENTS.md:
- check encoding/BOM/line endings before reading a Japanese file
- forbid saving when mojibake is suspected
- preserve existing files’ encoding
- treat “convert to UTF-8” as a separate task
- after saving, reload and verify representative lines
- on anomalies (replacement chars, BOM changes, massive diffs), stop and report
Bad prompts vs. good prompts
| Bad prompt | Good prompt |
|---|---|
| fix the mojibake | first separate display issues from data corruption, and do not save based on a guess |
| convert everything to UTF-8 | preserve the encoding of existing files and use UTF-8 only for new files; converting existing files is a separate task |
| just produce a CSV | match the encoding of the existing pipeline, and reload to verify after writing |
| match it however you like | do not silently change BOM, line endings, or encoding — keep the diff to actual business changes |
Review checklist
- is encoding/BOM/line-ending handling reported per changed file?
- do lines containing Japanese account for an unnaturally large share of the diff?
- are there a lot of line-ending-only diffs?
- has U+FFFD or ? increased?
- have CSV or log columns shifted?
Wrap-up
- inspect encoding/BOM/line endings before reading
- do not save on a guess when mojibake is suspected
- preserve existing files; default new files to UTF-8 (with or without BOM, per repo conventions)
- forbid ambiguous write paths
- reload after saving and verify representative Japanese lines
Dealing with mojibake is not a matter of asking “please handle Japanese properly.” It is writing down the conditions under which saving is allowed and the conditions under which the agent must stop. If you are going to say it every time, put it in AGENTS.md instead.