Why do my benchmark results vary so much between runs on Windows?

Because many things besides your code affect the numbers: power mode and power plan, AC versus battery, thermals and turbo behavior, background updates, Defender scans, search indexing, cloud sync, notifications, scheduling and affinity, and cache state. Even the same Windows machine is effectively a different experiment if these conditions are not aligned. The core disciplines are to compare on AC power with a pinned power mode, reboot and wait a few minutes before measuring, alternate A/B runs instead of running all of one version first, and record the conditions alongside the results.

What is the difference between the Windows power mode and the power plan, and does it matter for benchmarking?

They are separate things that look similar. The power mode is the Settings-app slider (Best power efficiency / Balanced / Best performance), which also affects processor power management behavior such as core parking and performance scaling. The power plan is the traditional power scheme visible via powercfg, like Balanced or High performance. Handle them sloppily and your comparison becomes a comparison of the OS's power-saving policies, so record at minimum: AC or battery, which power mode, and which active power plan. Note that on Modern Standby devices, only Balanced-derived plans are allowed, so a missing High performance option is by design.

Which timing metrics should I measure when comparing program versions?

Look at three: wall-clock time (via QueryPerformanceCounter or Stopwatch) for what the user actually waits; CPU time (user plus kernel, via GetProcessTimes) for computational efficiency; and CPU cycle count (via QueryProcessCycleTime) for whether the computation itself got lighter. The combinations are what tell the story — if wall-clock improved but CPU time did not, the gain likely came from I/O, waits, caches, or scheduling rather than the implementation. Also prefer the median, p95, and the distribution over the mean, which a single Defender scan or notification can drag away.

Should I run my benchmark with high priority or pinned CPU affinity?

Not at first. Measure in the default state, because a difference that shows up there has value, and starting with /high or /affinity imports conditions that do not occur on real Windows. Use them only with a clear purpose: /high to reduce disturbance from other processes, /affinity to pin CPU placement for the comparison, NUMA control to align memory locality on large machines. Skip /realtime entirely — it tends to generate new accidents rather than remove noise. When a small difference needs explaining, capture an ETW trace with Windows Performance Recorder instead.

How to Benchmark Programs Reliably on Windows

You want to compare version A and version B of a program on Windows. The single worst thing you can do is run each once on the same machine and declare “B seems about 8% faster.”

That 8% might genuinely be the code difference. Just as often, though, it turns out to come from power mode, power plan, thermals, background updates, search indexing, virus scans, affinity, execution order, or cache state. That is the classic Windows benchmarking story, and it is quite a muddy world.

This article summarizes how to compare the execution speed of different versions of a program on Windows in a form as close to the code difference as possible. The main target is Windows 11, but most of it — powercfg, start, and so on — works the same on Windows 10.

The Conclusion First

The tricks for improving reproducibility boil down to these six.

1. Decide first what you want to compare: whether you want to see the code difference or the real user experience changes which environment factors you should align.
2. Record power mode and power plan as separate things: handle this sloppily on Windows, and your comparison tends to become a comparison of the OS’s power-saving policies.
3. Separate the cold first run from the warmed-up steady state: “only the first run is fast” or “only the later runs are slow” is not unusual.
4. Alternate runs, A→B→A→B: run all of A first and then all of B, and you eat the skew of thermals and background state.
5. Look at the median and the spread, not just the mean: one outlier wrecks the whole picture, and the mean is more fragile than you think.
6. If the difference is small, dig down to the cause with ETW / WPR: argue from gut feel, and you mostly end up brawling in the fog.

Decide First What You Want to Compare

“Speed comparison” sounds like one thing, but there are actually two kinds.

1. A comparison to see the code difference

You want to know whether the implementation itself got faster due to an algorithm change, data structure change, compiler optimization, runtime update, and so on.

In this case, cut environmental noise as much as possible: use a dedicated benchmarking session, fix the power mode, turn notifications off, suppress search indexing and sync, and if necessary go as far as a clean boot.

2. A comparison to see the real user experience

You want to know the speed users will actually feel on their everyday Windows after release.

In this case, you must not erase all the noise that exists in reality. Comparing in a “plausible everyday environment” — including OneDrive sync, Defender, notifications, and normal power settings — gives results closer to reality.

Mix these two, and your conclusions get twisted. Things like “12% faster in the lab but within noise in the real world” or “faster in the real world but unchanged in CPU time” happen routinely.

The Main Causes of Variance on Windows

First, a rough inventory of what makes results wobble.

Layer	Variance factor	Typical example
Hardware	CPU / GPU, memory, SSD, cooling	Thinness of a laptop, presence of a cooling pad
Firmware	BIOS / UEFI, OEM controls	Power-saving policies, fan control
OS	Windows build, drivers, update state	The same PC behaves differently after an update
Power	AC / DC, power mode, power plan	On battery, it is a different world
Thermals	Room temperature, fans, prior load	Turbo on the first run only, fading later
Background	Update, Defender, sync, notifications	A scan or sync runs mid-execution
Scheduling	Priority, affinity, NUMA	CPU placement varies by machine
Data / cache	OS cache, app cache	Slow only the first time, fast only from the second run
Build conditions	Debug / Release, PGO, logging on/off	You are comparing different things to begin with

In short: even “the same Windows machine” is a different experiment if the conditions are not aligned.

Treat Power Mode and Power Plan Separately

This part matters a lot.

Windows has the Power mode in the Settings app and the traditional Power plan (the power schemes visible via powercfg). They look similar and tend to get lumped together, but handle them sloppily and the comparison turns to mush.

In the Windows Settings app, you can choose the Power mode from Settings > System > Power & battery. Microsoft’s documentation states you can switch between Best power efficiency, Balanced, and Best performance separately for Plugged in / On Battery. Furthermore, changing the Power mode also affects the underlying power-related settings and PPM (Processor Power Management) behavior. In other words, this alone can change core parking and performance scaling policy.

The Power plan, on the other hand, is the traditional power scheme: Balanced, High performance, and so on. You can check it with powercfg /list and powercfg /getactivescheme.

The confusing part is that Windows has both the power mode overlay and the power plan. So record at least the following with your benchmark results:

AC or battery
Which power mode
Which active power plan

Benchmark results missing these three are quite painful to look at later.

Power conditions to pin down first

Always compare laptops on AC power: battery operation easily introduces unintended limits.
Pin the power mode: for benchmarking, try Best performance first.
Record the active power plan: save the current value with powercfg.

powercfg /list
powercfg /getactivescheme

Switch to High performance if needed

# Balanced
powercfg /setactive 381b4222-f694-41f0-9685-ff5bb260df2e

# High performance
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c
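To store the active plan alongside the results, the output of powercfg /getactivescheme can be parsed by a small script. The following is a minimal sketch, not a definitive implementation. It assumes the output contains the plan GUID followed by the plan name in parentheses, as in the sample string below; the label text before the GUID may differ on localized Windows, so only the GUID and the parenthesized name are matched. The function names are hypothetical.

```python
import re
import subprocess

# Assumed output shape (English Windows), for example:
#   Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced)
# Only the GUID and the optional "(Name)" part are matched.
_SCHEME_RE = re.compile(
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*(?:\((.*)\))?"
)


def parse_active_scheme(output: str) -> dict:
    """Extract the plan GUID and display name from powercfg output."""
    match = _SCHEME_RE.search(output)
    if match is None:
        raise ValueError(f"no power scheme GUID found in: {output!r}")
    return {
        "guid": match.group(1).lower(),
        "name": (match.group(2) or "").strip(),
    }


def read_active_scheme() -> dict:
    """Run powercfg (Windows only) and parse what it prints."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_active_scheme(result.stdout)
```

The parsed GUID is the more reliable value to log, because the display name can be renamed or localized.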

“High performance does not show up” is completely normal

This is another stumbling point. Microsoft’s documentation states that on devices supporting Modern Standby, only Balanced, or plans derived from Balanced, are allowed. So when you find yourself asking “High performance is missing, is it broken?”, the answer may simply be that this is how the machine is designed.

Microsoft also advises that if the Power mode cannot be changed, a custom power plan may be selected, so try selecting Balanced first. When the Power mode UI is unresponsive, this is the quickest thing to suspect.

Kill the Background Noise

Windows is a hard worker. Even when you want a quiet benchmark, it does all sorts of things in the background for you.

First, reboot and wait for things to settle

After changing settings, reboot once, and do not run immediately after login; wait a few minutes. Right after startup, updates, indexing, sync, Defender, and assorted resident processes are still thrashing around.

For serious comparisons, use a clean boot

Microsoft documents a procedure for reducing to a minimal startup configuration via clean boot: stop non-Microsoft services in msconfig and disable Startup apps in Task Manager.

This is powerful for reducing noise. However, it diverges from the everyday environment, so it is suited to “lab comparisons aimed at seeing the code difference.”

Silence notifications

Windows notification banners look light but are surprisingly disruptive. Beyond the visual nuisance, they can change execution timing, focus, and background app activity.

Enable Do not disturb manually, or at minimum turn notifications off during the benchmark.

Suppress search indexing and sync

If the benchmark target reads lots of files, writes lots of artifacts, or rebuilds source trees repeatedly, search indexing and cloud sync quietly sting.

Exclude the benchmark directory from search indexing
Pause OneDrive / Dropbox / Google Drive sync
Close browsers, Teams, Discord, Slack

Nothing flashy here, but when it matters, it matters a lot.

A Comparison That Does Not Align Thermals Is Mostly Comparing Thermals

A CPU or GPU is a different creature when cold versus warmed up. Laptops, thin mini PCs, and small desktops show this most clearly.

Rules to follow

Keep room temperature as consistent as possible
Fix how the laptop is positioned
Fix the AC adapter, dock, and external display configuration
Do no heavy work right before the benchmark
Measure the first run and the steady state separately

Alternate the execution order

Avoid running A 10 times and then B 10 times. The skew of thermals, caches, and background activity piles on.

Recommended patterns:

A B A B A B ...
A B B A A B B A ...
Pre-generate a random order and run in that order
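The three patterns above can be generated up front so that the run order is fixed before measuring starts. This is a small sketch; the function names and the idea of storing the seed with the results are illustrative assumptions, not something the article prescribes.

```python
import random


def alternating(pairs: int) -> list[str]:
    """A B A B ... for the given number of A/B pairs."""
    return ["A", "B"] * pairs


def mirrored(pairs: int) -> list[str]:
    """A B B A A B B A ...: every second pair runs in reversed order."""
    order: list[str] = []
    for i in range(pairs):
        order.extend(["A", "B"] if i % 2 == 0 else ["B", "A"])
    return order


def shuffled(pairs: int, seed: int) -> list[str]:
    """Balanced random order; the seed makes the sequence reproducible."""
    order = ["A", "B"] * pairs
    random.Random(seed).shuffle(order)
    return order


if __name__ == "__main__":
    print(" ".join(alternating(3)))
    print(" ".join(mirrored(4)))
    print(" ".join(shuffled(4, seed=2024)))
```

Recording the seed with the results lets the same random order be replayed later.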

What You Measure Changes What “Fast” Means

Squash “fast” into a single number and you will usually misread the result. The three representative metrics to look at on Windows:

1. Wall-clock time

The time the user waits. It is closest to the end-to-end experience, so this is the first value to look at.

On Windows, QueryPerformanceCounter (QPC) is available for high-resolution timing. In managed code, the Stopwatch family is the standard. Measuring milliseconds with DateTime.Now is risky: it reads the system clock, which has coarser resolution and can jump when the clock is adjusted.

2. CPU time (user + kernel time)

The time the process actually used the CPU, obtainable via GetProcessTimes.

This is useful for looking at computational efficiency. For example, if wall-clock improved but CPU time did not change, caches, I/O, wait time, or scheduling may be the active ingredient.

3. Cycle count (CPU cycles)

QueryProcessCycleTime gives you the CPU cycle count for the whole process.

This is also a CPU-work metric, but it shows a different face than wall-clock. It is particularly useful for asking “the wait time is the same, but did the computation itself get lighter?”
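As an illustration of reading wall-clock and CPU time side by side, here is a minimal Python sketch. It relies on CPython's documented behavior that, on Windows, time.perf_counter is backed by QueryPerformanceCounter and time.process_time by GetProcessTimes. It measures a callable inside the current process, which is a simplification; the Timing type and measure function are hypothetical names. Cycle counts are not covered, since the standard library has no wrapper for QueryProcessCycleTime.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Timing:
    wall_ms: float  # what the user waits for
    cpu_ms: float   # user + kernel CPU time of this process


def measure(workload: Callable[[], object]) -> Timing:
    """Run workload once and return wall-clock and CPU time in milliseconds."""
    wall_start = time.perf_counter_ns()
    cpu_start = time.process_time_ns()
    workload()
    cpu_end = time.process_time_ns()
    wall_end = time.perf_counter_ns()
    return Timing(
        wall_ms=(wall_end - wall_start) / 1e6,
        cpu_ms=(cpu_end - cpu_start) / 1e6,
    )


if __name__ == "__main__":
    # Waiting costs wall-clock time but almost no CPU time.
    print("sleep:", measure(lambda: time.sleep(0.05)))
    # Computation costs both.
    print("busy: ", measure(lambda: sum(range(2_000_000))))
```

The sleep case shows the pattern described above: wall-clock time moves while CPU time stays nearly flat, which points at waiting rather than computation.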

Priority, Affinity, and NUMA Are Last Resorts

These can have an effect. But reaching for them from the start, just because they change the numbers, easily means you end up measuring a different phenomenon.

First, measure normally

If a difference shows up in the default state, that difference itself has value. Throwing in /high or /affinity from the start imports “conditions that do not occur on real Windows.”

If you use them, be clear about the purpose

/high: you want fewer disturbances from other processes
/affinity: you want to pin CPU placement for the comparison
NUMA control: you want to align memory locality on large machines

The Windows start command can launch with a priority class and affinity mask.

start "" /high /wait myapp.exe --bench case1.json
start "" /affinity F /high /wait myapp.exe --bench case1.json
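The value after /affinity is a hexadecimal bit mask in which bit n selects logical processor n, so F means processors 0 through 3. A small helper can compute the mask from a list of processor indexes. This is an illustrative sketch and the function name is hypothetical.

```python
def affinity_mask(cpus: list[int]) -> str:
    """Return the hex mask for `start /affinity` from logical CPU indexes."""
    if not cpus:
        raise ValueError("at least one CPU index is required")
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"invalid CPU index: {cpu}")
        mask |= 1 << cpu  # bit n selects logical processor n
    return format(mask, "X")


if __name__ == "__main__":
    print(affinity_mask([0, 1, 2, 3]))  # F
    print(affinity_mask([2, 3]))        # C
    print(affinity_mask([0, 2, 4, 6]))  # 55
```

Which logical processors map to which physical cores differs by machine, so the chosen indexes belong in the recorded conditions too.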

But skip /realtime

/realtime is available, but you should not use it. A realtime-priority process can starve the system threads that handle input and disk flushing, so it tends to work less as noise removal and more as a generator of new accidents.

A Recommended Measurement Procedure

Putting it all together, here is a procedure that is easy to run in practice.

Lab-leaning comparison procedure

1. Fix the comparison targets
   - commit hash / build number
   - compiler / runtime version
   - Debug / Release
   - logging, asserts, tracing on/off
2. Fix the machine conditions
   - Windows build
   - BIOS / UEFI version
   - driver version
   - AC power
   - room temperature, physical placement
3. Fix the power conditions
   - Decide the power mode
   - Record the active power plan
4. Reboot
5. Wait a few minutes before benchmarking
6. Clean boot if necessary
7. Include a warm-up
8. Alternate A / B runs
9. Get enough repetitions
10. Keep median, min, max, p95
11. Save the raw data
12. If the difference is small, capture ETW / WPR

Items Worth Recording That Save You Later

In the benchmark CSV or JSON, keeping at least the following pays off.

timestamp,version,scenario,elapsed_ms,user_ms,kernel_ms,cycles,power_mode,power_plan,ac_or_dc,room_temp_c,notes

If possible, these are handy as well.

cpu_package_temp_start_c,cpu_package_temp_end_c,affinity_mask,priority_class,windows_build,driver_version

With benchmarks, being interpretable later often matters more than the measuring itself.
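One way to keep such records is to write one CSV row per run, with the conditions stored next to the timings. The sketch below uses the first set of column names listed above. The append_record helper is hypothetical, and the values in the example row are purely illustrative, not measured results.

```python
import csv
import io

FIELDS = [
    "timestamp", "version", "scenario", "elapsed_ms", "user_ms", "kernel_ms",
    "cycles", "power_mode", "power_plan", "ac_or_dc", "room_temp_c", "notes",
]


def append_record(stream, record: dict, write_header: bool = False) -> None:
    """Write one run as a CSV row; unknown keys raise, missing ones stay empty."""
    writer = csv.DictWriter(
        stream, fieldnames=FIELDS, restval="", extrasaction="raise"
    )
    if write_header:
        writer.writeheader()
    writer.writerow(record)


if __name__ == "__main__":
    out = io.StringIO()
    # Illustrative values only.
    append_record(
        out,
        {
            "timestamp": "2024-01-01T12:00:00",
            "version": "A",
            "scenario": "case1",
            "elapsed_ms": 123.4,
            "power_mode": "Best performance",
            "power_plan": "Balanced",
            "ac_or_dc": "AC",
        },
        write_header=True,
    )
    print(out.getvalue())
```

Rejecting unknown keys catches typos in field names at write time instead of during later analysis.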

Look at the Median and the Distribution, Not Just the Mean

The mean is convenient, but it breaks easily in Windows benchmarks. Defender kicking in just once, a notification popping, another process hammering the SSD — any of these can drag the mean away.

The recommended combination:

Median: look at this first
p95 / p99: check whether the tail has gotten worse
min / max: see how things stray
Box plots or scatter plots: useful when the difference is small
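To make the fragility of the mean concrete, here is a short sketch that summarizes a list of elapsed times. The percentile uses the nearest-rank definition, which is one of several common conventions and is an assumption of this example. The sample data is made up: 19 runs at 100 ms plus one run disturbed to 900 ms.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict:
    """Median first, then the tail and the extremes; mean kept for contrast."""
    return {
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "p95": percentile(samples, 95),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Made-up data: one disturbed run among twenty.
    runs = [100.0] * 19 + [900.0]
    print(summarize(runs))
```

With this data the median stays at 100 ms while the mean moves to 140 ms, and the single outlier is visible only in max. With 20 samples, p95 cannot see one bad run, which is a reason to keep min / max and the raw data as well.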

How to Read a Difference When You See One

Interpreting results is easiest when you look at combinations.

Only wall-clock is faster

Possibly improvements in I/O, wait time, caches, or scheduling.

CPU time and cycles both dropped

There is a good chance the implementation itself got lighter.

Only the first run is slow / fast

That is the cold / warm difference. Suspect startup, initialization, cache generation, JIT.

Gets slower the more runs you do

Suspect thermals, throttling, memory pressure, background activity.
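The last two patterns can be checked mechanically from the elapsed times of one version in execution order. The sketch below treats the first run as cold and compares the early and late halves of the remaining runs. The function name, the minimum of five runs, and the choice of ratios are assumptions made for illustration; what counts as a suspicious ratio depends on the workload.

```python
import statistics


def describe_sequence(elapsed_ms: list[float]) -> dict:
    """Split one version's runs (in execution order) into cold and warm parts."""
    if len(elapsed_ms) < 5:
        raise ValueError("need at least 5 runs: 1 cold + 4 warm")
    cold, warm = elapsed_ms[0], elapsed_ms[1:]
    half = len(warm) // 2
    early = statistics.median(warm[:half])   # first half of the warm runs
    late = statistics.median(warm[-half:])   # last half of the warm runs
    warm_median = statistics.median(warm)
    return {
        "cold_ms": cold,
        "warm_median_ms": warm_median,
        # Far from 1.0: startup, initialization, cache generation, or JIT.
        "cold_to_warm_ratio": cold / warm_median,
        # Clearly above 1.0: thermals, throttling, memory pressure, background.
        "late_to_early_ratio": late / early,
    }


if __name__ == "__main__":
    # Made-up sequence: slow first run, then a gradual slowdown.
    print(describe_sequence([300, 100, 100, 102, 110, 120, 125]))
```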

Dig Down to “Why It Is Faster” with ETW / WPR

When the difference is small, or the reason is unreadable, moving on to Windows’s ETW (Event Tracing for Windows) tooling is the classic route.

Microsoft’s Windows Performance Recorder (WPR) is an ETW-based recording tool included in the Windows ADK. It can capture CPU, I/O, context switches, page faults, and more in one go.

At a minimum, it looks like this.

wpr -start CPU -filemode

REM Run the benchmark here

wpr -stop trace.etl

Once you reach this stage, instead of “B is 3% faster,” you can speak with reasons: “B has less lock contention and lower ready time.” “A opens more files and has a slower cold start.”
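When traces are captured repeatedly, the start / run / stop sequence can be wrapped so that the trace is always stopped, even if the benchmark fails. This is a minimal sketch that uses only the wpr options shown above. The trace_benchmark function and its injectable run parameter are hypothetical, and it assumes the script runs from an elevated prompt on Windows, since WPR requires administrator rights.

```python
import subprocess
from typing import Callable, Sequence


def trace_benchmark(
    bench_cmd: Sequence[str],
    etl_path: str,
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Start a WPR CPU trace, run the benchmark, and always stop the trace."""
    run(["wpr", "-start", "CPU", "-filemode"], check=True)
    try:
        run(list(bench_cmd), check=True)
    finally:
        # Stop and save even when the benchmark fails, so that no trace
        # session is left recording in the background.
        run(["wpr", "-stop", etl_path], check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    trace_benchmark(
        ["myapp.exe", "--bench", "case1.json"],
        "trace.etl",
        run=lambda cmd, check=True: print(" ".join(cmd)),
    )
```

Passing the runner as a parameter keeps the sequence testable on a machine without WPR installed.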

Summary

When comparing different versions of a program on Windows, what really works is not flashy tricks. What matters is the unglamorous discipline that pays off in reproducibility:

Pin and record AC / power mode / power plan
Separate cold and warm
Alternate A / B runs
Look at the median and the distribution
Clean boot if necessary
If the difference is small, dig to the reason with ETW / WPR

And most important of all: write down, alongside the results, what you pinned and what you did not. A benchmark is a comparison of speed, and at the same time a record of experimental conditions.

A speedup report without conditions is about as entertaining as fortune-telling that occasionally hits — but in terms of reproducibility, it is quite unreliable. Conversely, if the conditions are properly written down, the result has real value even when the difference is small.


Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
