Windows Text Encodings and Line Endings Explained - Shift_JIS, UTF-8, UTF-16, Mojibake, CRLF and LF
In discussions about text handling on Windows, the following topics get mixed together very quickly:
- what is different between `Shift_JIS` and `UTF-8`
- why mojibake happens
- what is different between `CRLF` and `LF`
- why something can still fail even after switching to UTF-8
- why the same file can look different in an editor, a console, Excel, and Git
This does not happen because Japanese is inherently difficult. In most cases, the cause is that the same byte sequence was read under a different assumption or that misread content was saved as-is.
Windows makes the story look more complicated because the Unicode world and the code-page world still coexist there. On top of that, BOM, line endings, editor auto-detection, console code pages, and Git line-ending conversion all pile on.
This article organizes the topics that commonly get mixed together on Windows: Shift_JIS / UTF-8 / UTF-16, why mojibake happens, the difference between CRLF and LF, and why the whole topic becomes confusing so easily in practice.
The content is based on Microsoft Learn, PowerShell, Git, and W3C / Unicode-related public documentation available as of April 2026. See the references at the end for details.
Contents
- 1. What to keep in mind first
- 2. Break the terminology apart
- 2.1 What is different between Unicode / UTF-8 / UTF-16 / CP932
- 2.2 How to think about Shift_JIS and CP932
- 2.3 The traps behind the words ANSI, Unicode, and UTF-8N
- 3. Why mojibake happens
- 3.1 What mojibake actually is
- 3.2 Broken display and data corruption are different things
- 3.3 Once you fall back to an encoding that cannot represent the characters, you do not get them back
- 4. What the difference in line endings means
- 4.1 CRLF / LF / CR
- 4.2 Line endings are a different issue from text encoding
- 4.3 \n is not always the same thing as the newline bytes stored in a file
- 5. Why Windows is especially easy to get wrong
- 5.1 Unicode and legacy code pages coexist
- 5.2 The labels are inconsistent
- 5.3 ASCII-only content hides the problem
- 5.4 File content, file names, the console, and source files are different layers
- 5.5 BOM and line endings matter on separate axes
- 5.6 Tools silently change things
- 6. Common failure patterns
- 7. Rules that reduce incidents in practice
- 7.1 Decide the baseline for new text files
- 7.2 Do not silently change existing legacy files
- 7.3 Treat encoding and line endings as part of the interface
- 7.4 Be explicit at read / write boundaries
- 7.5 Share Git and editor rules too
- 7.6 Do not stop at saying “it became mojibake”; say what drifted
- 8. Investigating mojibake and line-ending diffs with these five questions
- 9. Summary
- 10. Related Articles
- 11. References
1. What to keep in mind first
To state the conclusions up front, the seven important points are these:
- a text file is not just a string itself; it is built from bytes + an encoding + a line-ending convention. Depending on the case, BOM may also be involved.
- mojibake happens when the same bytes are decoded under a different encoding assumption
- line-ending trouble happens when the encoding is correct but the assumption about line separators is not
- `Unicode` and `UTF-8` do not mean the same thing. `Unicode` is about the character set side, while `UTF-8` and `UTF-16` are encodings.
- on Windows, what people call `Shift_JIS` is usually better treated in practice as CP932 / the Windows Japanese code page
- saying only "we switched to UTF-8" is still not enough. You also need to define whether a BOM exists and which line-ending style is used
- the real source of confusion is not Japanese itself, but the fact that multiple historical assumptions still remain together on the same Windows machine
In practice, the right starting point is to separate these four questions:
- What are the bytes in this file?
- Under which encoding was it written?
- Under which encoding is it being read?
- Are the line endings `CRLF` or `LF`?
Once those are separated, the investigation usually becomes much easier.
2. Break the terminology apart
2.1 What is different between Unicode / UTF-8 / UTF-16 / CP932
The quickest way forward is to break the words apart once.
| Term | What it refers to | Example | Common confusion |
|---|---|---|---|
| Unicode | a framework for assigning numbers to characters | U+3042 (あ) | people think it is the same thing as UTF-8 |
| UTF-8 | an encoding that turns Unicode into bytes | E3 81 82 | people think it is Unicode itself |
| UTF-16LE | an encoding that turns Unicode into bytes | 42 30 | people mix it up with a menu label like "Unicode" |
| CP932 | the Windows Japanese legacy code page | 82 A0 | people assume it is perfectly identical to Shift_JIS |
| CRLF / LF | byte sequences that separate lines | 0D 0A / 0A | people think they are a kind of encoding |
| BOM | identifying bytes at the start of a file | EF BB BF and so on | people think it is the encoding name itself |
Even for the single character あ, the bytes differ by encoding:
```
Character: あ
UTF-8    : E3 81 82
CP932    : 82 A0
UTF-16LE : 42 30
```
The important point is that characters and bytes are not the same thing. Applications show you something that looks like characters on screen, but when saving or transmitting data they ultimately exchange bytes. Incidents usually happen at that conversion boundary.
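The table above can be verified directly. A minimal sketch in Python, using the standard library codecs, showing that the same character becomes different bytes under each encoding:

```python
# The same character "あ" encoded three ways yields three different byte sequences.
text = "あ"

print(text.encode("utf-8").hex(" "))      # e3 81 82
print(text.encode("cp932").hex(" "))      # 82 a0
print(text.encode("utf-16-le").hex(" "))  # 42 30
```

The string on screen is one thing; the bytes on disk depend entirely on which encoding was chosen at the moment of saving.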
2.2 How to think about Shift_JIS and CP932
In real projects, Japanese text files on Windows are very often called Shift_JIS as a catch-all term. That works conversationally, but it is a bit too rough for implementation work.
If you want to be more precise about legacy Japanese text on Windows, it is safer to think in terms of CP932, or more broadly the Windows Japanese code page.
If you stay vague here, conversations start to drift like this:
- someone says "save it as Shift_JIS", but the other side actually assumes Windows-style CP932
- Linux or macOS handles the file as `shift_jis`, but some Windows-originated files do not reproduce exactly as expected
- someone says "save as ANSI", but which code page that means depends on the environment
So in specifications and investigation notes, it is safer to write things more explicitly:
- write `CP932` instead of `Shift_JIS`
- write "ACP (active code page) / usually CP932 on Japanese Windows" instead of `ANSI`
- write something concrete like "UTF-8 no BOM, LF" instead of just "text"
2.3 The traps behind the words ANSI, Unicode, and UTF-8N
Around Windows, the labels themselves are part of the problem.
The common traps are these:
- `ANSI`: appears in old Windows UIs and old explanations, but it does not mean ASCII. In many cases it means the machine's active code page.
- `Unicode`: in some editors and tools, the menu label `Unicode` actually means UTF-16LE. If someone says "I saved it as Unicode", that does not automatically mean UTF-8.
- `UTF-8N`: you may see this label in Japanese-market editors. It is usually just a UI label used to distinguish UTF-8 without BOM. It is not a formal encoding name.
In other words, on Windows the same word can mean slightly different things depending on the tool. That is one of the first major sources of confusion.
3. Why mojibake happens
3.1 What mojibake actually is
Mojibake itself is conceptually simple:
- turn a string into bytes using one encoding
- turn those bytes back into a string using a different encoding
- if the assumptions do not match, you get a different string
For example, if あ is saved in UTF-8, the bytes are:
```
E3 81 82
```
If those bytes are read as UTF-8, you get あ. If they are read under a CP932-oriented assumption, they may look like 縺� or some other broken text.
What is broken there is not the Japanese text itself, but the decoding assumption.
If you want to summarize mojibake in one sentence, it is this:
The same bytes were read as if they belonged to a different encoding.
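That one-sentence summary can be reproduced in a few lines. A sketch in Python, where `errors="replace"` stands in for whatever a misreading tool would display:

```python
# Encode under one assumption, decode under another: that is mojibake.
original = "あ"
data = original.encode("utf-8")  # bytes E3 81 82

# Reading the same bytes as CP932 produces different, broken-looking text.
garbled = data.decode("cp932", errors="replace")

print(garbled == original)  # False: the bytes were fine, the assumption was not
```

Note that `data` itself is untouched here; only the interpretation differs.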
3.2 Broken display and data corruption are different things
It is important to separate the stage where recovery is still possible from the stage where it is much harder.
For example, the following can still be recoverable:
- open a UTF-8 file as if it were CP932
- it looks like `縺�` or similar broken text on screen
- nothing has been saved yet
At that point the original bytes are still in UTF-8. If the file is reopened with the correct encoding, the text may recover.
The dangerous path is this:
- misread a UTF-8 file as CP932
- save the already-broken display as-is
- lose the original UTF-8 byte sequence
At that point it is no longer just a display issue. It is data corruption.
In practice, instead of flattening everything into the sentence "it became mojibake", you should at least separate these two questions:
- are the bytes themselves still correct?
- or has the misread content already been saved again?
3.3 Once you fall back to an encoding that cannot represent the characters, you do not get them back
Another dangerous case is when a Unicode string is pushed into a narrower code page such as CP932.
If the target code page cannot represent some of the characters, one of these usually happens:
- they are replaced with `?`
- replacement characters are inserted
- conversion errors occur
- they are forced into different, similar-looking characters
For example, some emoji and some extended ideographs cannot be round-tripped into CP932.
This should not be judged only by whether the text is still readable, but by whether a round trip preserves the original text.
Once the information has been lost, knowing the correct encoding later does not reconstruct it.
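A sketch of that failed round trip in Python. The emoji is one example of a character CP932 cannot represent:

```python
# Pushing Unicode text through CP932 loses what CP932 cannot represent.
original = "A😀あ"

down = original.encode("cp932", errors="replace")  # the emoji becomes b"?"
back = down.decode("cp932")

print(back)              # "A?あ" — still readable, but not the same text
print(back == original)  # False: the round trip does not preserve the original
```

This is why "it still looks readable" is the wrong test; "does a round trip return the original text?" is the right one.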
4. What the difference in line endings means
4.1 CRLF / LF / CR
Line endings are also bytes.
- `CR` = carriage return = `0D`
- `LF` = line feed = `0A`
- in traditional Windows text files, `CRLF` (`0D 0A`) is the classic form
- on Linux / Unix-like systems, `LF` (`0A`) is the common form
- standalone `CR` appears mostly in older historical contexts such as classic Mac data
Put in table form:
| Line ending | Bytes | Typical context |
|---|---|---|
| CRLF | 0D 0A | traditional Windows text files, legacy tools |
| LF | 0A | Linux / macOS / many development tools |
| CR | 0D | very old legacy data |
4.2 Line endings are a different issue from text encoding
This point matters a lot.
Line-ending style is a different issue from encoding.
Even two UTF-8 files can differ only in line endings.
For example, if the content is A, newline, B, the bytes become:
```
UTF-8 + LF   : 41 0A 42
UTF-8 + CRLF : 41 0D 0A 42
```
So all of the following are perfectly possible:
- a file is `UTF-8`, but only the line endings differ
- a file is `CP932`, but the line endings are `LF`
- a file is `UTF-16LE`, but the line endings are `CRLF`
That is why, when someone says we already changed it to UTF-8 but something is still off, the actual difference may be the line endings and nothing else.
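The byte-level difference above can be reproduced directly. A sketch in Python:

```python
# Same encoding (UTF-8), same text content, different newline bytes.
lf   = "A\nB".encode("utf-8")
crlf = "A\r\nB".encode("utf-8")

print(lf.hex(" "))    # 41 0a 42
print(crlf.hex(" "))  # 41 0d 0a 42
```

Both files are valid UTF-8; a diff tool that compares bytes will still report every line as changed.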
4.3 \n is not always the same thing as the newline bytes stored in a file
This is an easy source of confusion from the programmer’s point of view.
Just because you wrote \n in source code does not guarantee that only 0A is stored in the file.
In text mode, depending on the language, runtime, or I/O API, \n may be converted to CRLF on Windows.
That means these can drift apart:
- the newline representation in source code
- the string at runtime
- the bytes saved into the file
- the line breaks shown in the editor
That is why you sometimes end up with I thought I wrote LF, but the file became CRLF.
Modern editors can handle plain LF just fine, but surrounding tools, legacy applications, and business workflows still often assume CRLF.
So line-ending trouble is not just an old story. It still appears in normal day-to-day work.
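In Python, for example, the `newline` parameter of `open()` controls what `\n` becomes on disk, independent of the platform default. A sketch using a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Force CRLF on write: every "\n" in the string is translated to "\r\n".
with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write("A\nB")

# Reading the raw bytes shows what was actually stored.
with open(path, "rb") as f:
    print(f.read().hex(" "))  # 41 0d 0a 42
```

The source code said `\n`, the runtime string contained `\n`, but the file contains `0D 0A`: exactly the drift described above.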
5. Why Windows is especially easy to get wrong
5.1 Unicode and legacy code pages coexist
This is the single biggest reason Windows gets complicated.
Windows still has both:
- a path that uses Unicode
- a path that uses code pages
Newer applications, web-related assets, and cross-platform systems tend to lean toward UTF-8, while old CSV, TXT, logs, Excel-adjacent workflows, and business-system integration often still carry CP932.
On top of that, some outputs and some APIs still naturally emit UTF-16LE.
In other words, multiple text cultures coexist on a single Windows machine.
5.2 The labels are inconsistent
What increases confusion is often not the technology itself, but the labels:
- people say `Shift_JIS`, but the reality is CP932
- people say `ANSI`, but the reality is the active code page
- people say `Unicode`, but the reality is UTF-16LE
- people say `UTF-8`, but whether a BOM exists is still unspecified
- editor-specific labels such as `UTF-8N` appear
If that stays vague, the conversation sounds aligned while the actual technical assumptions are not.
5.3 ASCII-only content hides the problem
This also matters a lot.
Because UTF-8 is compatible with the ASCII range, files that contain only letters, digits, and symbols may still look kind of fine even under the wrong assumption.
On the CP932 side, the ASCII-equivalent range also often survives visually, so the issue does not show up.
As a result, you get situations like:
- an English-only config file looks fine
- the moment one line of Japanese is added, it breaks
- a latent problem surfaces for the first time during operations
That is why encoding incidents often look like it worked until yesterday, and suddenly broke today.
In reality, the trap was there already, and it only became visible when non-ASCII text entered the file.
5.4 File content, file names, the console, and source files are different layers
On Windows, if you call all of these encoding, you get lost:
- file names / paths
- file contents
- console display
- the encoding of the source code file itself
- the string representation at runtime
- clipboard or GUI-component display
For example, Japanese file names may display normally while the file contents are still saved in CP932.
Or the file itself may be UTF-8 while the display is broken only because the console code page is wrong.
An operation such as `chcp 65001` basically changes the console-side assumption. It does not rewrite the bytes of an existing file.
Likewise, even if the source code file is UTF-8, that does not mean the log file written at runtime is also UTF-8.
You need to separate which layer’s encoding you are talking about every time.
On Japanese Windows, \ may also appear as a yen sign. That often gets mixed into the encoding discussion too.
But in many cases that is a font or glyph-display issue, not a change in the path separator or escape semantics.
5.5 BOM and line endings matter on separate axes
The label UTF-8 still leaves only half the story defined.
In practice, these also matter:
- is BOM present or absent?
- are the line endings `CRLF` or `LF`?
For example, even with the same UTF-8:
- some Windows tools read it only when BOM is present
- some Unix-like processing treats BOM as an unwanted extra at the start of the first field
- some legacy-side tools are awkward if the file is `LF` only
- `CRLF` can make shell scripts or diffs noisier
So you can still have incidents even after the encoding itself matches.
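The BOM axis is easy to see at the byte level. A sketch using Python's `utf-8-sig` codec, which writes and strips the UTF-8 BOM (`EF BB BF`):

```python
# "utf-8-sig" prepends the BOM on encode; plain "utf-8" does not.
with_bom = "あ".encode("utf-8-sig")
print(with_bom.hex(" "))             # ef bb bf e3 81 82

# "utf-8-sig" strips the BOM on decode...
print(with_bom.decode("utf-8-sig"))  # あ

# ...but plain "utf-8" keeps it as an extra U+FEFF character at the start,
# which is exactly the "unwanted extra in the first field" failure mode.
print(with_bom.decode("utf-8") == "\ufeffあ")  # True
```

Two files can both be "UTF-8" and still differ in their first three bytes.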
5.6 Tools silently change things
What makes this even trickier is that local tools can add implicit behavior:
- an editor auto-detects the encoding
- save operations add or remove BOM
- Git converts `CRLF` / `LF`
- a shell or command writes using its default encoding
- CSV export uses an unexpected code page
- a different PowerShell or tool version changes the defaults
So even when the person operating the system did not explicitly change anything, some layer may still have inserted its own assumption.
That is the real meaning of it broke even though I changed nothing.
Quite often, it was not a person that changed it, but the tool’s default behavior.
6. Common failure patterns
The typical incidents look like this:
| Situation | What is actually misaligned | Typical symptom |
|---|---|---|
| a legacy Windows tool treats a UTF-8-no-BOM settings file as ANSI / CP932 | decoding assumption | only Japanese text becomes mojibake |
| a CP932 CSV is passed into a UTF-8-oriented processing path | decoding assumption | �, decode errors, or meaningless Japanese text |
| a UTF-16LE log is passed into Unix-style text tools | encoding assumption | NUL bytes appear and it looks like binary data |
| an LF source file is converted into CRLF in another environment | newline assumption | massive line-ending diffs or script trouble |
| misread content is saved as-is | the bytes themselves become something else | irreversible data corruption |
| the specification says only "export CSV" | the interface is undefined | Excel can read it, but another tool breaks |
| the team decides only "standardize on UTF-8" | BOM / line endings remain undefined | some tools still fail |
The especially dangerous path is seeing broken display and then saving it, which turns a recoverable misunderstanding into a confirmed incident.
7. Rules that reduce incidents in practice
From here on, the question is what rules reduce incidents operationally.
7.1 Decide the baseline for new text files
For new files, UTF-8 is a reasonable first choice.
But that alone is still not enough.
At minimum, decide these too:
- whether it is `UTF-8 with BOM` or `UTF-8 no BOM`
- whether line endings are `CRLF` or `LF`
- who will read the file
- whether legacy Windows tool compatibility is required
- whether Linux / macOS / CI / containers also consume it
For example, if it is cross-platform source code or configuration, UTF-8 no BOM + LF is often the strongest first candidate.
On the other hand, if it must fit old Windows tools or an existing operational process, UTF-8 with BOM or CP932 + CRLF may still be the correct choice.
The important thing is not some universal answer to what is objectively correct, but who this file needs to interoperate with.
7.2 Do not silently change existing legacy files
If an existing file is CP932, it is usually safer not to convert it to UTF-8 as a side effect of a small day-to-day edit.
The safer operating model is:
- preserve the original encoding / BOM / line endings of existing files
- treat encoding conversion as a separate migration task
- confirm the target set and downstream consumers before doing batch conversion
A large share of mojibake incidents comes from a well-meaning while I am here, I will modernize it.
7.3 Treat encoding and line endings as part of the interface
CSV, TXT, logs, settings files, and lightweight protocols are not only about content. The text format itself is part of the interface.
At minimum, the specification should state:
- the encoding
- whether BOM exists
- the newline style
- whether a header exists
- quote / delimiter rules
- which tools were used to verify the format
For example, the three letters CSV are not enough.
Only when you write something like UTF-8 with BOM, CRLF, comma delimiter, header present does the conversation stop drifting.
7.4 Be explicit at read / write boundaries
On the code side as well, it is safer not to rely on hidden defaults.
- specify the encoding on file reads and writes
- stay conscious of encoding when passing text across process boundaries
- fix line endings as part of the specification for export / import paths
- do not make ad-hoc shell redirects part of the real production path
Especially on Windows, it saved successfully and it saved the correct bytes are not the same statement.
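As a sketch of "be explicit at the boundary" in Python: pin both the encoding and the newline handling instead of relying on platform defaults. The file name `settings.txt` is hypothetical, chosen only for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "settings.txt")

# Write: UTF-8 no BOM, LF only. newline="" disables "\n" -> "\r\n"
# translation even on Windows.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("key=値\n")

# Read: state the same assumption explicitly instead of trusting
# auto-detection or the locale default.
with open(path, "r", encoding="utf-8", newline="") as f:
    print(f.read())
```

With both parameters pinned, "it saved successfully" and "it saved the intended bytes" become the same statement.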
7.5 Share Git and editor rules too
Git does not automatically fix encoding for you.
At the same time, it may transform line endings.
So at the repository level it is safer to decide:
- whether source code is `LF` by default
- whether Windows-only text can use `CRLF`
- how `.gitattributes` should pin the behavior
- how editor settings should be shared
The important point is to think about encoding and line endings separately.
Even if Git normalizes line endings, an encoding incident still remains an encoding incident.
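As one possible shape for those repository-level rules, a `.gitattributes` sketch using the standard `text` and `eol` attributes (the exact file patterns are examples, not a recommendation for every project):

```
# Normalize text files to LF in the repository by default
* text=auto

# Cross-platform source and scripts: always LF in the working tree
*.sh text eol=lf
*.py text eol=lf

# Windows-only text that legacy tools consume: keep CRLF
*.bat text eol=crlf
```

Note that `gitattributes` controls line endings only; it does not convert encodings, so a CP932 file stays CP932 regardless of these rules.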
7.6 Do not stop at saying “it became mojibake”; say what drifted
In real teams, these paraphrases are surprisingly effective:
- bad wording: "it became mojibake"
- better wording: "it looks like a UTF-8 no BOM file is being opened under a CP932 assumption"
- bad wording: "the line endings are weird"
- better wording: "an LF file is being converted into CRLF, which is increasing diffs"
Once you can state what actually drifted, the speed of investigation changes a lot.
8. Investigating mojibake and line-ending diffs with these five questions
When investigation gets stuck, the fastest path is to go back to these five questions:
- What are the bytes of this file right now?
- is it UTF-8?
- UTF-8 with BOM?
- CP932?
- UTF-16LE?
- Who wrote it first, and under what assumption?
- editor
- legacy app
- Excel export
- shell / script
- batch / middleware
- Who is reading it now, and under what assumption?
- editor auto-detect
- console code page
- library default encoding
- import-side specification
- What about BOM and line endings?
- BOM present or absent
  - `CRLF` or `LF`
- Has the misread content already been saved?
- is it still only a display problem?
- or were the bytes already rewritten and information lost?
Once those five are answered, the cause is usually visible.
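For question 1, the fastest check is to look at the raw bytes before arguing about labels. A minimal sketch that inspects only the well-known BOM signatures; the function name `sniff_bom` is hypothetical, and the absence of a BOM proves nothing by itself:

```python
def sniff_bom(data: bytes) -> str:
    """Report a BOM signature if one is present at the start of the bytes."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with BOM"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE (BOM)"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE (BOM)"
    return "no BOM (could be UTF-8, CP932, ...)"

print(sniff_bom(b"\xef\xbb\xbfhello"))  # UTF-8 with BOM
print(sniff_bom(b"\x82\xa0"))           # no BOM (could be UTF-8, CP932, ...)
```

A BOM-less file still has to be identified by context or by trial decoding, which is exactly why questions 2 and 3 (who wrote it, who reads it) follow.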
9. Summary
Windows text encodings and line endings look messy not because Japanese itself is difficult, but because bytes, encodings, BOM, newline conventions, and tool defaults exist as separate layers while old and new text conventions still coexist on Windows.
The especially important points to remember are these:
- mojibake is the result of reading the same bytes under a different encoding
- newline issues live on a different axis from encoding
- do not trust words such as `Shift_JIS`, `CP932`, `ANSI`, and `Unicode` too casually
- saying only "we switched to UTF-8" is not enough; BOM and line endings also matter
- broken display and already-resaved data corruption must be treated separately
- in specifications, write "UTF-8 no BOM, LF" rather than just "text"
In other words, when handling text on Windows, the practical way to think is not this is a string problem, but how do we align the contract around bytes?
10. Related Articles
- Understanding Text Encodings on Windows - Why Mojibake Happens and What Breaks When Linux Gets Involved
- Best Practices for Reducing Codex Mojibake Incidents on Windows - Decide “How to Instruct It” Before Tweaking the Environment
11. References
- Microsoft Learn, Code Page Identifiers - Win32 apps
  https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
- Microsoft Learn, about_Character_Encoding - PowerShell
  https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.6
- Microsoft Learn, Understanding file encoding in VS Code and PowerShell
  https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/understanding-file-encoding?view=powershell-7.6
- W3C Internationalization, Character encodings: Essential concepts
  https://www.w3.org/International/articles/definitions-characters/
- Git documentation, gitattributes
  https://git-scm.com/docs/gitattributes
- Git documentation, git-config
  https://git-scm.com/docs/git-config