How to Compare Program Versions on Windows: From Power Mode Setup to the Limits of Repeatability

Windows, Benchmark, Performance, Profiling, Power Management

Bottom line first

Six points for taking repeatable benchmarks:

  1. Decide what you want to compare before you start: the code difference, or the actual user experience?
  2. Treat power mode and power plan as separate things, and record both
  3. Separate the cold first run from the steady state after warm-up
  4. Alternate runs in A->B->A->B order
  5. Look at the median and the spread, not just the average
  6. If the difference is small, dig into the cause with ETW/WPR

Decide the type of comparison up front

Comparing the code itself

Choose this when you want to know how an algorithm change or compiler optimization affects the implementation itself. Strip out as much environmental noise as you can (clean boot, fixed power mode, notifications off, and so on).

Comparing the real user experience

Choose this when you want to know how fast the program actually feels to users after release. Do not strip out the noise. Compare in an everyday environment, including OneDrive sync, Defender, notifications, and the rest.

Mixing these two leads to contradictory conclusions (“12% faster in the lab, but indistinguishable in real life”).

Main sources of variance

| Layer | Source of variance | Typical example |
|---|---|---|
| Hardware | CPU/GPU, memory, SSD, cooling | Thin laptops, with or without a cooling pad |
| Firmware | BIOS/UEFI, OEM controls | Power-saving policies, fan control |
| OS | Windows build, drivers, update state | Behavior changes after an update |
| Power | AC/DC, power mode, power plan | Battery operation is a different world |
| Thermals | Room temperature, fans, recent load | Turbo on the first run, then drops off |
| Background | Update, Defender, sync, notifications | A scan or sync kicks in mid-run |
| Data/cache | OS cache, app cache | Slow on the first run, fast afterward |

Pin down the power mode and power plan

These are two different things

  • Power mode: chosen from the Settings app under “System > Power & battery” (Best power efficiency / Balanced / Best performance)
  • Power plan: the classic power scheme (Balanced / High performance, etc.). Check it with powercfg

How to pin them

  1. Always run the comparison on AC power for laptops
  2. Pin the power mode (use Best performance for benchmarking)
  3. Record the active power plan:
powercfg /list
powercfg /getactivescheme
  4. Switch to High performance if needed:
REM High performance
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c
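
To get the active plan into the benchmark log automatically, the output of `powercfg /getactivescheme` can be parsed. The following is a minimal sketch, not a definitive tool: the function names are hypothetical, and it assumes the output contains the plan GUID followed by the plan name in parentheses. The label before the GUID is localized, so the pattern ignores it. Only the power plan is captured here; the power mode from the Settings app would still be recorded separately.

```python
import re
import subprocess

# A GUID followed by an optional "(Name)" suffix, as printed by
# `powercfg /getactivescheme`. The text before the GUID is not matched
# because it varies with the display language (assumption).
_SCHEME_RE = re.compile(
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*(?:\((.*)\))?"
)


def parse_active_scheme(output: str) -> dict:
    """Extract the plan GUID and display name from powercfg output."""
    match = _SCHEME_RE.search(output)
    if match is None:
        raise ValueError(f"no power scheme GUID found in: {output!r}")
    return {
        "guid": match.group(1).lower(),
        "name": (match.group(2) or "").strip(),
    }


def read_active_scheme() -> dict:
    """Run powercfg (Windows only) and parse what it prints."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_active_scheme(result.stdout)
```

Calling `read_active_scheme()` at the start of each benchmark session and storing the result next to the timings avoids relying on memory for which plan was active.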

Caveats

  • High performance may not appear on Modern Standby devices (a design constraint of the model)
  • If you cannot change the power mode, a custom power plan may be selected. Select the Balanced plan first, then set the power mode

Crush background noise

  1. Reboot and wait a few minutes: right after boot, updates, indexing, sync, and Defender are all running wild
  2. For serious comparisons, do a clean boot: stop non-Microsoft services with msconfig. This is for lab comparisons that focus on the code difference
  3. Turn off notifications: enable Do not disturb
  4. Suppress search indexing and sync: exclude your benchmark directory from indexing. Stop OneDrive/Dropbox. Close browsers and Teams

Match thermals

A cold CPU/GPU and a warmed-up one are two different beasts. This is especially obvious on laptops.

  • Keep the room temperature consistent
  • Fix how the laptop is positioned
  • Fix the AC adapter, dock, and external display setup
  • Avoid heavy work right before benchmarking
  • Measure the first run and the steady state separately
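
Separating the first run from the steady state can be done when the results are processed. The sketch below assumes the timings are stored in run order with the cold run first; the function name and the `warmup_runs` parameter are hypothetical, and the right number of warm-up runs to discard depends on the program being measured.

```python
import statistics


def split_cold_and_steady(samples_ms, warmup_runs=2):
    """Separate the first (cold) run from the steady-state runs.

    samples_ms: elapsed times in run order; element 0 is the cold run.
    warmup_runs: runs after the cold run that are discarded as warm-up
                 (an assumed default, tune it per workload).
    """
    if len(samples_ms) < warmup_runs + 2:
        raise ValueError("need at least one steady-state sample")
    steady = samples_ms[1 + warmup_runs:]
    return {
        "cold_ms": samples_ms[0],
        "steady_median_ms": statistics.median(steady),
        "steady_runs": len(steady),
    }
```

Reporting `cold_ms` and `steady_median_ms` as two separate numbers keeps a slow first run from distorting the steady-state comparison, and the other way around.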

Alternate the run order

Running A 10 times and then B 10 times is a bad design: thermal and cache bias gets baked into the comparison.

Recommended:

  • A B A B A B ...
  • A B B A A B B A ...
  • Pre-generate a random order and run it

What to measure

| Metric | How to obtain it | Use |
|---|---|---|
| Wall-clock time | QueryPerformanceCounter / Stopwatch | Closest to user-perceived speed |
| CPU time (user + kernel) | GetProcessTimes | Computational efficiency |
| Cycle count | QueryProcessCycleTime | Computational load excluding wait time |
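
As an illustration of taking the first two metrics together, here is a minimal Python sketch for timing an in-process function. It relies on `time.perf_counter_ns` and `time.process_time_ns`; on Windows, CPython documents these as being backed by QueryPerformanceCounter and GetProcessTimes respectively. The `measure` helper is hypothetical, it reports user and kernel time combined rather than separately, and it does not cover cycle counts, for which the standard library has no equivalent. Timing a separate executable would need a different approach.

```python
import time


def measure(func, *args):
    """Return wall-clock and CPU time, in milliseconds, for one call of func."""
    wall_start = time.perf_counter_ns()
    cpu_start = time.process_time_ns()
    func(*args)
    # process_time covers user + kernel CPU time of the current process only.
    cpu_ms = (time.process_time_ns() - cpu_start) / 1e6
    wall_ms = (time.perf_counter_ns() - wall_start) / 1e6
    return {"elapsed_ms": wall_ms, "cpu_ms": cpu_ms}
```

A call that mostly waits (sleep, I/O) shows a large `elapsed_ms` with a small `cpu_ms`, while a compute-bound call shows the two close together.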

Read the metrics in combination

  • Only wall-clock got faster -> probably an I/O, wait-time, or cache improvement
  • CPU time and cycles both dropped -> the implementation itself got lighter
  • Only the first run is slow/fast -> a cold/warm difference (startup, initialization, JIT)
  • Gets slower as runs accumulate -> thermals, throttling, or memory pressure

Priority and affinity are last resorts

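If there is a difference at default settings, that difference itself is meaningful. Reaching for /high or /affinity from the start introduces conditions you would never see on a real user's Windows machine. If you do need them, set them explicitly and record them:

REM High priority class; wait for the process to exit
start "" /high /wait myapp.exe --bench case1.json
REM Same, pinned to logical processors 0-3 (hex affinity mask F)
start "" /affinity F /high /wait myapp.exe --bench case1.json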

Do not use /realtime. It does not remove noise; a realtime-priority process can starve system threads (input handling, disk cache flushes) and creates a different kind of accident.

A practical measurement procedure

  1. Pin what you are comparing (commit hash, build number, Debug/Release, logging on/off)
  2. Pin the machine conditions (Windows build, BIOS version, AC connected, room temperature)
  3. Pin the power conditions (record power mode and active power plan)
  4. Reboot and wait a few minutes
  5. Do a clean boot if needed
  6. Add a warm-up
  7. Alternate A and B (with enough total runs)
  8. Keep median, min, max, and p95
  9. Save the raw data
  10. If the difference is small, capture an ETW/WPR trace
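
For step 8, the summary can be computed from the raw samples with the standard library. This sketch uses the nearest-rank definition of p95, which is one of several common percentile definitions and is an assumption here, not something the procedure prescribes; the function name is hypothetical.

```python
import statistics


def summarize(samples_ms):
    """Median, min, max, and p95 (nearest-rank) of a list of timings."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Nearest rank: ceil(0.95 * n), computed with integers to avoid
    # floating-point rounding at the boundary.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }
```

Running `summarize` once on the A samples and once on the B samples gives two rows that can be compared side by side. With few runs, p95 sits at or near the maximum, so it only becomes informative once there are enough total runs.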

Fields worth recording

timestamp, version, scenario, elapsed_ms, user_ms, kernel_ms, cycles,
power_mode, power_plan, ac_or_dc, room_temp_c, notes

If possible:

cpu_package_temp_start_c, cpu_package_temp_end_c,
affinity_mask, priority_class, windows_build, driver_version
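
One way to keep these fields consistent across sessions is to write every run as a CSV row with a fixed column list. The sketch below covers only the first group of fields; the helper name and the sample values are illustrative, and an in-memory buffer stands in for a real log file.

```python
import csv
import io

FIELDS = [
    "timestamp", "version", "scenario", "elapsed_ms", "user_ms", "kernel_ms",
    "cycles", "power_mode", "power_plan", "ac_or_dc", "room_temp_c", "notes",
]


def append_row(stream, row, write_header=False):
    """Write one run as a CSV row.

    Keys not listed in FIELDS raise ValueError (DictWriter's default),
    and fields missing from the row are written as empty cells.
    """
    writer = csv.DictWriter(stream, fieldnames=FIELDS, restval="")
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# Example with made-up values; a real run would open a file in append mode.
buffer = io.StringIO()
append_row(
    buffer,
    {"timestamp": "2024-01-01T00:00:00", "version": "A",
     "scenario": "case1", "elapsed_ms": 101.5},
    write_header=True,
)
```

Rejecting unknown keys catches typos in field names early, and leaving missing fields empty makes it visible which conditions were not recorded for a given run.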

Dig into “why is it faster?” with ETW/WPR

When the difference is small, or the reason is hard to read, reach for ETW (Event Tracing for Windows).

REM Run from an elevated (administrator) command prompt
wpr -start CPU -filemode
REM run the benchmark here
wpr -stop trace.etl

From there you can argue with reasons attached, like “B has less lock contention so ready time dropped” or “A has more file opens, so the cold start is slow.”

Wrap-up

What actually moves the needle when comparing versions on Windows is unglamorous discipline that helps repeatability:

  • Pin AC / power mode / power plan, and record them
  • Separate cold and warm
  • Alternate A and B
  • Look at the median and the distribution
  • Do a clean boot if needed
  • If the difference is small, dig into the reason with ETW/WPR

The most important thing is to write down what you fixed and what you did not, alongside the results. Benchmark results without their conditions written down cannot be reproduced, so they cannot be trusted.
