Windows Text Encodings and Line Endings Explained - Shift_JIS, UTF-8, UTF-16, Mojibake, CRLF and LF

Tags: Windows, Text Encodings, Mojibake, Line Endings, UTF-8, CP932, PowerShell, Unicode

In discussions about text handling on Windows, the following topics get mixed together very quickly:

  • what is different between Shift_JIS and UTF-8
  • why mojibake happens
  • what is different between CRLF and LF
  • why something can still fail even after switching to UTF-8
  • why the same file can look different in an editor, a console, Excel, and Git

This does not happen because Japanese is inherently difficult. In most cases, the cause is that the same byte sequence was read under a different assumption, or that misread content was saved as-is.

Windows makes the story look more complicated because the Unicode world and the code-page world still coexist there. On top of that, BOM, line endings, editor auto-detection, console code pages, and Git line-ending conversion all pile on.

This article organizes the topics that commonly get mixed together on Windows: Shift_JIS / UTF-8 / UTF-16, why mojibake happens, the difference between CRLF and LF, and why the whole topic becomes confusing so easily in practice.

The content is based on Microsoft Learn, PowerShell, Git, and W3C / Unicode-related public documentation available as of April 2026. See the references at the end for details.


Table of Contents

  1. What to keep in mind first
  2. Break the terminology apart
    • 2.1 What is different between Unicode / UTF-8 / UTF-16 / CP932
    • 2.2 How to think about Shift_JIS and CP932
    • 2.3 The traps behind the words ANSI, Unicode, and UTF-8N
  3. Why mojibake happens
    • 3.1 What mojibake actually is
    • 3.2 Broken display and data corruption are different things
    • 3.3 Once you fall back to an encoding that cannot represent the characters, you do not get them back
  4. What the difference in line endings means
    • 4.1 CRLF / LF / CR
    • 4.2 Line endings are a different issue from text encoding
    • 4.3 \n is not always the same thing as the newline bytes stored in a file
  5. Why Windows is especially easy to get wrong
    • 5.1 Unicode and legacy code pages coexist
    • 5.2 The labels are inconsistent
    • 5.3 ASCII-only content hides the problem
    • 5.4 File content, file names, the console, and source files are different layers
    • 5.5 BOM and line endings matter on separate axes
    • 5.6 Tools silently change things
  6. Common failure patterns
  7. Rules that reduce incidents in practice
  8. Investigating mojibake and line-ending diffs with these five questions
  9. Summary
  10. References

1. What to keep in mind first

To start with the conclusions, these are the seven key points:

  • a text file is not just a string; it is built from bytes + an encoding + a line-ending convention. Depending on the case, a BOM may also be involved.
  • mojibake happens when the same bytes are decoded under a different encoding assumption
  • line-ending trouble happens when the encoding is correct but the assumption about line separators is not
  • Unicode and UTF-8 do not mean the same thing. Unicode is about the character set side, while UTF-8 and UTF-16 are encodings.
  • on Windows, what people call Shift_JIS is usually better treated in practice as CP932 / the Windows Japanese code page
  • saying only “we switched to UTF-8” is still not enough. You also need to define whether a BOM is present and which line-ending style is used
  • the real source of confusion is not Japanese itself, but the fact that multiple historical assumptions still remain together on the same Windows machine

In practice, the right starting point is to separate these four questions:

  1. What are the bytes in this file?
  2. Under which encoding was it written?
  3. Under which encoding is it being read?
  4. Are the line endings CRLF or LF?

Once those are separated, the investigation usually becomes much easier.

2. Break the terminology apart

2.1 What is different between Unicode / UTF-8 / UTF-16 / CP932

The quickest way forward is to break the words apart once.

Term | What it refers to | Example | Common confusion
Unicode | a framework for assigning numbers to characters | U+3042 (あ) | mistaken for UTF-8 itself
UTF-8 | an encoding that turns Unicode into bytes | E3 81 82 | mistaken for Unicode itself
UTF-16LE | an encoding that turns Unicode into bytes | 42 30 | mixed up with a menu label like “Unicode”
CP932 | the Windows Japanese legacy code page | 82 A0 | assumed to be perfectly identical to Shift_JIS
CRLF / LF | byte sequences that separate lines | 0D 0A / 0A | mistaken for a kind of encoding
BOM | identifying bytes at the start of a file | EF BB BF and so on | mistaken for the encoding name itself

Even for the single character あ, the bytes differ by encoding:

Character: あ

UTF-8    : E3 81 82
CP932    : 82 A0
UTF-16LE : 42 30
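
This difference is easy to verify directly, for example with Python's standard codecs (a minimal sketch; any language with explicit encoding APIs works the same way):

```python
# One character, three different byte sequences depending on the encoding.
text = "あ"  # U+3042 HIRAGANA LETTER A

for name in ("utf-8", "cp932", "utf-16-le"):
    print(f"{name:9}: {text.encode(name).hex(' ').upper()}")
# utf-8    : E3 81 82
# cp932    : 82 A0
# utf-16-le: 42 30
```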

The important point is that characters and bytes are not the same thing. Applications show you something that looks like characters on screen, but when saving or transmitting data they ultimately exchange bytes. Incidents usually happen at that conversion boundary.

2.2 How to think about Shift_JIS and CP932

In real projects, Japanese text files on Windows are very often called Shift_JIS as a catch-all term. That works conversationally, but it is a bit too rough for implementation work.

If you want to be more precise about legacy Japanese text on Windows, it is safer to think in terms of CP932, or more broadly the Windows Japanese code page.

If you stay vague here, conversations start to drift like this:

  • someone says “save it as Shift_JIS”, but the other side actually assumes Windows-style CP932
  • Linux or macOS handles the file as shift_jis, but some Windows-originated files do not round-trip exactly as expected
  • someone says “save as ANSI”, but which code page that means depends on the environment

So in specifications and investigation notes, it is safer to write things more explicitly:

  • write “CP932” instead of “Shift_JIS”
  • write “ACP (active code page) / usually CP932 on Japanese Windows” instead of “ANSI”
  • write something concrete like “UTF-8 no BOM, LF” instead of just “text”

2.3 The traps behind the words ANSI, Unicode, and UTF-8N

Around Windows, the labels themselves are part of the problem.

The common traps are these:

  • ANSI
    This appears in old Windows UIs and old explanations, but it does not mean ASCII. In many cases it means the machine’s active code page.
  • Unicode
    In some editors and tools, the menu label “Unicode” actually means UTF-16LE. If someone says “I saved it as Unicode”, that does not automatically mean UTF-8.
  • UTF-8N
    You may see this label in Japanese-market editors. It is usually just a UI label used to distinguish UTF-8 without BOM. It is not a formal encoding name.

In other words, on Windows the same word can mean slightly different things depending on the tool. That is one of the first major sources of confusion.

3. Why mojibake happens

3.1 What mojibake actually is

Mojibake itself is conceptually simple:

  1. turn a string into bytes using one encoding
  2. turn those bytes back into a string using a different encoding
  3. if the assumptions do not match, you get a different string

For example, if あ is saved in UTF-8, the bytes are:

E3 81 82

If those bytes are read as UTF-8, you get あ. If they are read under a CP932-oriented assumption, they may look like 縺� or some other broken text.
What is broken there is not the Japanese text itself, but the decoding assumption.

If you want to summarize mojibake in one sentence, it is this:

The same bytes were read as if they belonged to a different encoding.
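
The three-step recipe above can be reproduced in a few lines (a Python sketch; the codec names are the standard library's):

```python
# Step 1: turn a string into bytes using one encoding.
data = "あ".encode("utf-8")                    # E3 81 82

# Step 2: turn the same bytes back using a different encoding.
wrong = data.decode("cp932", errors="replace")
print(wrong)                                   # 縺� -- mojibake

# Step 3: the bytes were never wrong; only the assumption was.
print(data.decode("utf-8"))                    # あ
```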

3.2 Broken display and data corruption are different things

It is important to separate the stage where recovery is still possible from the stage where it is much harder.

For example, the following can still be recoverable:

  1. open a UTF-8 file as if it were CP932
  2. it looks like 縺� on screen
  3. nothing has been saved yet

At that point the original bytes are still in UTF-8. If the file is reopened with the correct encoding, the text may recover.

The dangerous path is this:

  1. misread a UTF-8 file as CP932
  2. save the already-broken display as-is
  3. lose the original UTF-8 byte sequence

At that point it is no longer just a display issue. It is data corruption.

In practice, instead of flattening everything into the sentence “it became mojibake”, you should at least separate these two questions:

  • are the bytes themselves still correct?
  • or has the misread content already been saved again?

3.3 Once you fall back to an encoding that cannot represent the characters, you do not get them back

Another dangerous case is when a Unicode string is pushed into a narrower code page such as CP932.

If the target code page cannot represent some of the characters, one of these usually happens:

  • they are replaced with ?
  • replacement characters are inserted
  • conversion errors occur
  • they are forced into different, similar-looking characters

For example, some emoji and some extended ideographs cannot be round-tripped into CP932.
This should not be judged only by whether the text is still readable, but by whether a round trip preserves the original text.

Once the information has been lost, knowing the correct encoding later does not reconstruct it.
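
A round-trip check makes the loss visible (a Python sketch; 📄 is one example of a character with no CP932 mapping):

```python
text = "売上レポート📄"

# Force the string into CP932; the emoji has no mapping there.
narrowed = text.encode("cp932", errors="replace")
restored = narrowed.decode("cp932")

print(restored)           # the emoji has become "?"
print(restored == text)   # False -- readable, but no longer the same text
```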

4. What the difference in line endings means

4.1 CRLF / LF / CR

Line endings are also bytes.

  • CR = carriage return = 0D
  • LF = line feed = 0A
  • in traditional Windows text files, CRLF (0D 0A) is the classic form
  • on Linux / Unix-like systems, LF (0A) is the common form
  • standalone CR appears mostly in older historical contexts such as classic Mac data

Put in table form:

Line ending | Bytes | Typical context
CRLF | 0D 0A | traditional Windows text files, legacy tools
LF | 0A | Linux / macOS / many development tools
CR | 0D | very old legacy data

4.2 Line endings are a different issue from text encoding

This point matters a lot.

Line-ending style is a different issue from encoding.

Even two UTF-8 files can differ only in line endings.
For example, if the content is A, newline, B, the bytes become:

UTF-8 + LF   : 41 0A 42
UTF-8 + CRLF : 41 0D 0A 42
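
The same pair of contents, checked from Python (a quick sketch):

```python
lf = "A\nB".encode("utf-8")
crlf = "A\r\nB".encode("utf-8")

print(lf.hex(" ").upper())     # 41 0A 42
print(crlf.hex(" ").upper())   # 41 0D 0A 42
```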

So all of the following are perfectly possible:

  • a file is UTF-8, but only the line endings differ
  • a file is CP932, but the line endings are LF
  • a file is UTF-16LE, but the line endings are CRLF

That is why, when someone says “we already changed it to UTF-8, but something is still off”, the actual difference may be the line endings and nothing else.

4.3 \n is not always the same thing as the newline bytes stored in a file

This is an easy source of confusion from the programmer’s point of view.

Just because you wrote \n in source code does not guarantee that only 0A is stored in the file.
In text mode, depending on the language, runtime, or I/O API, \n may be converted to CRLF on Windows.

That means these can drift apart:

  • the newline representation in source code
  • the string at runtime
  • the bytes saved into the file
  • the line breaks shown in the editor

That is why you sometimes end up with “I thought I wrote LF, but the file became CRLF.”

Modern editors can handle plain LF just fine, but surrounding tools, legacy applications, and business workflows still often assume CRLF.
So line-ending trouble is not just an old story. It still appears in normal day-to-day work.
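
Python makes this translation explicit through open()'s newline parameter, which is a convenient way to watch the drift happen (a sketch using a temporary file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# The source code says "\n", but the text layer is told to emit CRLF.
with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write("A\nB")

# Reading the raw bytes shows what actually landed on disk.
with open(path, "rb") as f:
    print(f.read())    # b'A\r\nB'
```

With the default newline=None, the same write produces CRLF on Windows and LF elsewhere, which is exactly the drift described above.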

5. Why Windows is especially easy to get wrong

5.1 Unicode and legacy code pages coexist

This is the single biggest reason Windows gets complicated.

Windows still has both:

  • a path that uses Unicode
  • a path that uses code pages

Newer applications, web-related assets, and cross-platform systems tend to lean toward UTF-8, while old CSV, TXT, logs, Excel-adjacent workflows, and business-system integration often still carry CP932.
On top of that, some outputs and some APIs still naturally emit UTF-16LE.

In other words, multiple text cultures coexist on a single Windows machine.

5.2 The labels are inconsistent

What increases confusion is often not the technology itself, but the labels:

  • people say “Shift_JIS”, but the reality is CP932
  • people say “ANSI”, but the reality is the active code page
  • people say “Unicode”, but the reality is UTF-16LE
  • people say “UTF-8”, but whether a BOM exists is still unspecified
  • editor-specific labels such as “UTF-8N” appear

If that stays vague, the conversation sounds aligned while the actual technical assumptions are not.

5.3 ASCII-only content hides the problem

This also matters a lot.

Because UTF-8 is compatible with ASCII in the single-byte range, files that contain only ASCII letters, digits, and symbols can look fine even under the wrong assumption.
On the CP932 side, the ASCII-compatible range also usually survives visually, so the issue stays hidden.

As a result, you get situations like:

  • an English-only config file looks fine
  • the moment one line of Japanese is added, it breaks
  • a latent problem surfaces for the first time during operations

That is why encoding incidents often look like it worked until yesterday, and suddenly broke today.
In reality, the trap was there already, and it only became visible when non-ASCII text entered the file.

5.4 File content, file names, the console, and source files are different layers

On Windows, if you call all of these encoding, you get lost:

  • file names / paths
  • file contents
  • console display
  • the encoding of the source code file itself
  • the string representation at runtime
  • clipboard or GUI-component display

For example, Japanese file names may display normally while the file contents are still saved in CP932.
Or the file itself may be UTF-8 while the display is broken only because the console code page is wrong.

An operation such as chcp 65001 changes only the console-side assumption. It does not rewrite the bytes of an existing file.

Likewise, even if the source code file is UTF-8, that does not mean the log file written at runtime is also UTF-8.
You need to separate which layer’s encoding you are talking about every time.

On Japanese Windows, \ may also appear as a yen sign. That often gets mixed into the encoding discussion too.
But in many cases that is a font or glyph-display issue, not a change in the path separator or escape semantics.

5.5 BOM and line endings matter on separate axes

The label UTF-8 still leaves only half the story defined.

In practice, these also matter:

  • is BOM present or absent?
  • are the line endings CRLF or LF?

For example, even with the same UTF-8:

  • some Windows tools read it only when BOM is present
  • some Unix-like processing treats BOM as an unwanted extra at the start of the first field
  • some legacy-side tools are awkward if the file is LF only
  • CRLF can make shell scripts or diffs noisier

So you can still have incidents even after the encoding itself matches.
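
Because BOM and line endings sit on separate axes, it helps to check both in one pass. A minimal sketch (describe is a hypothetical helper, not a library function):

```python
def describe(data: bytes) -> str:
    """Report the UTF-8 BOM and newline style of raw file bytes."""
    bom = "with BOM" if data.startswith(b"\xef\xbb\xbf") else "no BOM"
    if b"\r\n" in data:
        nl = "CRLF"
    elif b"\n" in data:
        nl = "LF"
    else:
        nl = "no newline"
    return f"{bom}, {nl}"

print(describe(b"\xef\xbb\xbfA\r\nB"))   # with BOM, CRLF
print(describe(b"A\nB"))                 # no BOM, LF
```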

5.6 Tools silently change things

What makes this even trickier is that local tools can add implicit behavior:

  • an editor auto-detects the encoding
  • save operations add or remove BOM
  • Git converts CRLF / LF
  • a shell or command writes using its default encoding
  • CSV export uses an unexpected code page
  • a different PowerShell or tool version changes the defaults

So even when the person operating the system did not explicitly change anything, some layer may still have inserted its own assumption.

That is the real meaning of “it broke even though I changed nothing.”
Quite often, it was not a person that changed it, but the tool’s default behavior.

6. Common failure patterns

The typical incidents look like this:

Situation | What is actually misaligned | Typical symptom
a legacy Windows tool treats a UTF-8-no-BOM settings file as ANSI / CP932 | decoding assumption | only Japanese text becomes mojibake
a CP932 CSV is passed into a UTF-8-oriented processing path | decoding assumption | replacement characters (�), decode errors, or meaningless Japanese text
a UTF-16LE log is passed into Unix-style text tools | encoding assumption | NUL bytes appear and it looks like binary data
an LF source file is converted into CRLF in another environment | newline assumption | massive line-ending diffs or script trouble
misread content is saved as-is | the bytes themselves become something else | irreversible data corruption
the specification says only “export CSV” | the interface is undefined | Excel can read it, but another tool breaks
the team decides only “standardize on UTF-8” | BOM / line endings remain undefined | some tools still fail

The especially dangerous path is seeing broken display and then saving it, which turns a recoverable misunderstanding into a confirmed incident.

7. Rules that reduce incidents in practice

From here on, the question is what rules reduce incidents operationally.

7.1 Decide the baseline for new text files

For new files, UTF-8 is a reasonable first choice.
But that alone is still not enough.

At minimum, decide these too:

  • whether it is UTF-8 with BOM or UTF-8 no BOM
  • whether line endings are CRLF or LF
  • who will read the file
  • whether legacy Windows tool compatibility is required
  • whether Linux / macOS / CI / containers also consume it

For example, if it is cross-platform source code or configuration, UTF-8 no BOM + LF is often the strongest first candidate.
On the other hand, if it must fit old Windows tools or an existing operational process, UTF-8 with BOM or CP932 + CRLF may still be the correct choice.

The important thing is not some universal answer to what is objectively correct, but who this file needs to interoperate with.

7.2 Do not silently change existing legacy files

If an existing file is CP932, it is usually safer not to convert it to UTF-8 as a side effect of a small day-to-day edit.

The safer operating model is:

  • preserve the original encoding / BOM / line endings of existing files
  • treat encoding conversion as a separate migration task
  • confirm the target set and downstream consumers before doing batch conversion

A large share of mojibake incidents comes from a well-meaning “while I am here, I will modernize it.”

7.3 Treat encoding and line endings as part of the interface

CSV, TXT, logs, settings files, and lightweight protocols are not only about content. The text format itself is part of the interface.

At minimum, the specification should state:

  • the encoding
  • whether BOM exists
  • the newline style
  • whether a header exists
  • quote / delimiter rules
  • which tools were used to verify the format

For example, the three letters “CSV” are not enough.
Only when you write something like “UTF-8 with BOM, CRLF, comma delimiter, header present” does the conversation stop drifting.

7.4 Be explicit at read / write boundaries

On the code side as well, it is safer not to rely on hidden defaults.

  • specify the encoding on file reads and writes
  • stay conscious of encoding when passing text across process boundaries
  • fix line endings as part of the specification for export / import paths
  • do not make ad-hoc shell redirects part of the real production path

Especially on Windows, “it saved successfully” and “it saved the correct bytes” are not the same statement.
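
As one concrete shape of this rule, a pair of boundary functions that never rely on the platform default (a Python sketch; the function names are illustrative):

```python
def write_text(path: str, text: str) -> None:
    # The contract: UTF-8 without BOM, LF only -- enforced, not assumed.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

def read_text(path: str) -> str:
    # "utf-8-sig" tolerates an optional BOM produced by other tools.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()
```

Here newline="\n" suppresses the CRLF translation even on Windows, and utf-8-sig accepts both BOM and no-BOM input, so the read side survives files written by other tools.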

7.5 Share Git and editor rules too

Git does not automatically fix encoding for you.
At the same time, it may transform line endings.

So at the repository level it is safer to decide:

  • whether source code is LF by default
  • whether Windows-only text can use CRLF
  • how .gitattributes should pin the behavior
  • how editor settings should be shared

The important point is to think about encoding and line endings separately.
Even if Git normalizes line endings, an encoding incident still remains an encoding incident.
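
A minimal .gitattributes sketch along those lines (the file patterns are illustrative; adjust them to the repository):

```
# Normalize text files; store LF in the repository
*       text=auto

# Cross-platform scripts: LF in the working tree too
*.sh    text eol=lf

# Windows-only scripts that legacy tools expect as CRLF
*.bat   text eol=crlf
*.ps1   text eol=crlf
```

Note that this pins line endings only; it does not convert or validate encodings.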

7.6 Do not stop at saying “it became mojibake”; say what drifted

In real teams, these paraphrases are surprisingly effective:

  • bad wording: “it became mojibake”
  • better wording: “a UTF-8 no BOM file seems to be opened under a CP932 assumption”
  • bad wording: “the line endings are weird”
  • better wording: “an LF file is being converted into CRLF, which is inflating diffs”

Once you can state what actually drifted, the speed of investigation changes a lot.

8. Investigating mojibake and line-ending diffs with these five questions

When investigation gets stuck, the fastest path is to go back to these five questions:

  1. What are the bytes of this file right now?
    • is it UTF-8?
    • UTF-8 with BOM?
    • CP932?
    • UTF-16LE?
  2. Who wrote it first, and under what assumption?
    • editor
    • legacy app
    • Excel export
    • shell / script
    • batch / middleware
  3. Who is reading it now, and under what assumption?
    • editor auto-detect
    • console code page
    • library default encoding
    • import-side specification
  4. What about BOM and line endings?
    • BOM present or absent
    • CRLF or LF
  5. Has the misread content already been saved?
    • is it still only a display problem?
    • or were the bytes already rewritten and information lost?

Once those five are answered, the cause is usually visible.
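
Question 1 is the cheapest to answer mechanically: look at the raw bytes before arguing about encodings. A tiny sketch (head_hex is a hypothetical helper):

```python
def head_hex(path: str, n: int = 16) -> str:
    """Return the first n bytes of a file as spaced hex."""
    with open(path, "rb") as f:
        return f.read(n).hex(" ").upper()

# A file starting with "EF BB BF" is UTF-8 with BOM,
# "FF FE" suggests UTF-16LE, and "0D 0A" at a line break means CRLF.
```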

9. Summary

Windows text encodings and line endings look messy not because Japanese itself is difficult, but because bytes, encodings, BOM, newline conventions, and tool defaults exist as separate layers while old and new text conventions still coexist on Windows.

The especially important points to remember are these:

  • mojibake is the result of reading the same bytes under a different encoding
  • newline issues live on a different axis from encoding
  • do not trust words such as Shift_JIS, CP932, ANSI, and Unicode too casually
  • saying only “we switched to UTF-8” is not enough; BOM and line endings also matter
  • broken display and already-resaved data corruption must be treated separately
  • in specifications, write “UTF-8 no BOM, LF” rather than just “text”

In other words, when handling text on Windows, the practical way to think is not this is a string problem, but how do we align the contract around bytes?

10. References

  1. Microsoft Learn, Code Page Identifiers - Win32 apps
    https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
  2. Microsoft Learn, about_Character_Encoding - PowerShell
    https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.6
  3. Microsoft Learn, Understanding file encoding in VS Code and PowerShell
    https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/understanding-file-encoding?view=powershell-7.6
  4. W3C Internationalization, Character encodings: Essential concepts
    https://www.w3.org/International/articles/definitions-characters/
  5. Git documentation, gitattributes
    https://git-scm.com/docs/gitattributes
  6. Git documentation, git-config
    https://git-scm.com/docs/git-config
