Checklist for Unexpected Exceptions - A Quick Decision Table for Whether to Exit or Keep Running

Tags: Windows Development, Exception Handling, Architecture, C# / .NET, Reliability

The bottom line first

When an unexpected exception happens, the question to ask is not “can it be caught?” but “can the app’s state still be trusted afterward?”

  • Swallowing the exception with catch (Exception) and continuing is almost always dangerous
  • Continuing is acceptable only when three conditions line up: the failure unit can be discarded, shared state can be rolled back, and external side effects can be explained
  • Lean toward exiting if there is inconsistent shared state, a broken parent loop, an anomaly at a native boundary, or a smell of memory corruption
  • Do not assume continuation for exceptions that put the whole process’s health in doubt, such as StackOverflowException or AccessViolationException

Three levels of choice

“Keep the app running” actually has three distinct levels.

| Choice | Meaning |
| --- | --- |
| Fail just this operation and continue | Keep the screen open but treat this save or import as failed |
| Stop just the subsystem and continue | Reinitialize only the connection, screen, or worker |
| Terminate the process | The blast radius of state damage is unreadable, so restart |

Decision table

| Situation | Choice | Reason |
| --- | --- | --- |
| Only one operation, screen, or job failed and the state can be discarded | Lean toward continue | The failure unit can be contained |
| The target objects or connections can be disposed and recreated after the exception | Lean toward subsystem reinit | The damaged area can be localized |
| Shared state was partially updated and the scope of the change is unclear | Lean toward exit | Invariants may be broken |
| External side effects on DB, files, or devices were left half-done | Lean toward exit | External consistency cannot be reasoned about |
| A monitoring loop or parent loop died from an unexpected exception | Lean toward exit | When only some features die, zombification is likely |
| Failure during startup, configuration loading, or DI composition | Fail startup and exit | Booting halfway is more dangerous |
| AccessViolationException, StackOverflowException, severe OutOfMemoryException | Lean toward immediate exit | The whole process's health is in doubt |
| The dangerous work is isolated in a separate process and the parent is intact | Continue the parent, restart the child | The failure domain is properly separated |

Decision flowchart

Unexpected exception occurs
  -> Smell of memory corruption / stack exhaustion / resource exhaustion?
    -> YES -> Exit / FailFast / restart
    -> NO  -> Can the failure unit be discarded?
      -> NO  -> Lean toward exit
      -> YES -> Can shared state be rolled back / reinitialized?
        -> NO  -> Stop the subsystem or exit
        -> YES -> Can external side effects be explained?
          -> NO  -> Stop the subsystem or exit
          -> YES -> Mark just this operation as failed and continue
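
The flow above can be written down as a small pure function. The sketch below is one possible encoding, not a prescribed API: the names `Decision`, `ExceptionTriage`, and the four boolean parameters are hypothetical, and answering each question honestly is still the caller's job.

```csharp
using System;

public enum Decision
{
    ExitProcess,
    StopSubsystemOrExit,
    FailOperationAndContinue
}

public static class ExceptionTriage
{
    // Mirrors the flowchart top to bottom; the first failing check decides.
    public static Decision Decide(
        bool processHealthSuspect,    // memory corruption / stack or resource exhaustion?
        bool failureUnitDiscardable,  // can the failed operation, screen, or job be thrown away?
        bool sharedStateRecoverable,  // can shared state be rolled back or reinitialized?
        bool sideEffectsExplainable)  // do we know what reached the DB, files, or devices?
    {
        if (processHealthSuspect) return Decision.ExitProcess;
        if (!failureUnitDiscardable) return Decision.ExitProcess;
        if (!sharedStateRecoverable) return Decision.StopSubsystemOrExit;
        if (!sideEffectsExplainable) return Decision.StopSubsystemOrExit;
        return Decision.FailOperationAndContinue;
    }
}
```

Keeping the decision in one place makes it reviewable and unit-testable, instead of being spread across many catch blocks.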

Check these before the exception type

  1. Where it happened: A UI event? A parent loop? Startup code? A native boundary?
  2. How far it got: Has memory state, the DB, files, or device state changed midway?
  3. The blast radius: Just that object, the whole screen, or the whole process?
  4. Whether rollback is possible: Can it be discarded and rebuilt, or rolled back via a transaction?
  5. External side effects: Was it sent or not, and is a duplicate execution safe?
  6. Monitoring and restart: Is there an automatic restart or recovery path after the crash?

High-risk exceptions

| Exception | Choice | Reason |
| --- | --- | --- |
| StackOverflowException | Lean toward immediate exit | The stack is exhausted; the runtime normally terminates the process and a catch block cannot intercept it |
| AccessViolationException | Lean toward immediate exit | Possible memory corruption |
| OutOfMemoryException | Lean toward exit | Recovery code that needs more allocations is unreliable |
| NullReferenceException / InvalidOperationException (unexpected) | Depends on context, but lean toward exit | Your assumptions are broken; partial changes may remain |
| Unexpected exception that escaped the parent loop | Lean toward exit | Risk that the core died but the process lingers |
| Anomalies at COM / P/Invoke boundaries | Immediate exit to strong lean toward exit | Hard to judge safety from managed code alone |
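
As a rough illustration, the top rows of this table can be turned into a helper that flags exceptions which put the whole process in doubt. The class and method names are hypothetical, and including `SEHException` as a stand-in for "anomaly at a native boundary" is an assumption of this sketch, not something the table prescribes. In practice a `StackOverflowException` rarely reaches code like this, because the runtime usually ends the process first.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ExceptionSeverity
{
    // Returns true if the exception, or anything in its InnerException chain,
    // is one of the types that cast doubt on the health of the whole process.
    public static bool PutsProcessHealthInDoubt(Exception ex)
    {
        for (Exception e = ex; e != null; e = e.InnerException)
        {
            if (e is StackOverflowException ||
                e is AccessViolationException ||
                e is OutOfMemoryException ||
                e is SEHException) // assumption: treat raw native faults as high risk
            {
                return true;
            }
        }
        return false;
    }
}
```

A `false` result does not mean "safe to continue". It only means the type alone does not force an exit, so the context questions still apply.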

Decisions by scenario

UI events (button click, screen navigation)

Conditions that favor continuing:

  • The failure happened before any data was loaded, and business state has not been touched
  • Only transient state inside a dialog is broken, and closing the screen discards it
  • The ViewModel or connection can be rebuilt after the exception

Conditions that favor exiting:

  • Both screen state and domain state were partially updated
  • Shared state such as static, singleton, or cache was touched

Per-item job processing (messages, files, requests)

This is the easiest place to continue. Mark just that one item as failed and move on. However, you still need a transaction or compensation for partial changes.
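
One way to keep a failed item from leaving partial changes is to stage its updates on a private copy and publish them only when every step has succeeded. The sketch below uses an in-memory dictionary as a stand-in for shared state; `ImportJob` and `TryApply` are hypothetical names, and a real system would more likely rely on a database transaction.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ImportJob
{
    // Applies one item's fields to `shared`. Returns false, leaving `shared`
    // untouched, if any field fails to parse.
    public static bool TryApply(
        Dictionary<string, int> shared,
        IEnumerable<KeyValuePair<string, string>> rawFields)
    {
        // Work on a copy so a failure halfway through changes nothing visible.
        var staged = new Dictionary<string, int>(shared);
        try
        {
            foreach (var field in rawFields)
            {
                staged[field.Key] = int.Parse(field.Value, CultureInfo.InvariantCulture);
            }
        }
        catch (FormatException) { return false; }   // expected: malformed input
        catch (OverflowException) { return false; } // expected: value out of range

        // Commit: every step succeeded, so publish the staged values.
        foreach (var pair in staged)
        {
            shared[pair.Key] = pair.Value;
        }
        return true;
    }
}
```

Only the two expected parse failures are caught here. Anything else propagates to the caller, which keeps this from turning into a blanket catch (Exception).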

Resident loops, monitors, queue workers

This is the most dangerous place. The worst case is the parent loop dying from a single unexpected exception while the process itself lingers.

  • At each item-processing boundary, catch expected exceptions
  • If an unexpected exception escapes the parent loop, lean toward terminating the process
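
A minimal sketch of these two rules follows. `WorkerLoop.Run` and its callbacks are hypothetical names. The fatal handler is injected so the example stays testable; in production it would typically call `Environment.FailFast` so that a supervisor can restart the process.

```csharp
using System;
using System.Collections.Generic;

public static class WorkerLoop
{
    public static void Run(
        IEnumerable<string> items,
        Action<string> handle,
        Func<Exception, bool> isExpected,
        Action<string> onItemFailed,
        Action<Exception> onFatal)
    {
        try
        {
            foreach (var item in items)
            {
                try
                {
                    handle(item);
                }
                catch (Exception ex) when (isExpected(ex))
                {
                    // Expected failure at the item boundary: mark it and move on.
                    onItemFailed(item);
                }
            }
        }
        catch (Exception ex)
        {
            // An unexpected exception escaped the parent loop: do not limp on.
            onFatal(ex);
        }
    }
}
```

A production call might pass `ex => Environment.FailFast("Worker loop died unexpectedly", ex)` as `onFatal`, which ends the process immediately instead of leaving a half-dead worker behind.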

Startup

If required configuration cannot be read, a migration fails, or required folders or certificates are missing, fail startup and exit.
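
The fail-fast startup rule can be sketched as a validation step that runs before anything else is started. The names `Startup.Run` and `requiredKeys` and the dictionary-based configuration are assumptions for illustration; a real app would validate its own configuration source, migrations, folders, and certificates in the same spirit.

```csharp
using System;
using System.Collections.Generic;

public static class Startup
{
    // Returns a process exit code: 0 if the app was started,
    // 1 if a required setting is missing and startup was refused.
    public static int Run(
        IReadOnlyDictionary<string, string> config,
        IEnumerable<string> requiredKeys,
        Action startApp)
    {
        foreach (var key in requiredKeys)
        {
            string value;
            if (!config.TryGetValue(key, out value) || string.IsNullOrWhiteSpace(value))
            {
                Console.Error.WriteLine("Startup failed: missing required setting '" + key + "'.");
                return 1; // refuse to boot halfway
            }
        }

        startApp();
        return 0;
    }
}
```

Returning the code from `Main` (or passing it to `Environment.Exit`) lets a service manager or script see that startup failed, rather than finding a half-initialized process.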

Native boundaries (COM, P/Invoke, unsafe)

Apply stricter rules. AccessViolationException, heap corruption, and handle anomalies should lean toward immediate exit.

Conditions that allow continuation

  • The failure unit is clear (one operation, one screen, one job)
  • State can be discarded (disposable / recreatable)
  • Shared state is protected (no contamination of other features)
  • External side effects can be explained (sent / not sent / safe to resend is known)
  • You can be honest with the user
  • It is observable (traceable via logs, metrics, dumps)
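
To make "external side effects can be explained" concrete, one option is to record the state of each outgoing operation under a stable ID. The sketch below is an in-memory illustration only; `SendLedger`, `SendState`, and `TrySend` are hypothetical names, and a real ledger would have to be persisted to survive a restart.

```csharp
using System;
using System.Collections.Generic;

public enum SendState { NotSent, InFlight, Sent }

public sealed class SendLedger
{
    private readonly Dictionary<string, SendState> _states =
        new Dictionary<string, SendState>();

    public SendState StateOf(string operationId)
    {
        SendState state;
        return _states.TryGetValue(operationId, out state) ? state : SendState.NotSent;
    }

    // Runs `send` only if this operation has never been attempted.
    // If `send` throws, the state stays InFlight, meaning "outcome unknown".
    public bool TrySend(string operationId, Action send)
    {
        if (StateOf(operationId) != SendState.NotSent)
        {
            return false; // already sent, or unknown: do not blindly resend
        }

        _states[operationId] = SendState.InFlight;
        send();
        _states[operationId] = SendState.Sent;
        return true;
    }
}
```

After an exception, an `InFlight` entry tells you the outcome is unknown and needs reconciliation with the external system, which is different from "not sent, safe to retry".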

Common anti-patterns

  1. catch (Exception) that only logs and continues: Hides the root cause and keeps the broken state alive
  2. Trying to recover inside an unhandled exception handler: It is useful as a last-chance record, but not a recovery point
  3. Casual retry despite external side effects: Recipe for double-execution incidents
  4. Leaving the UI alive after the monitoring loop dies: A zombie app that only looks alive
  5. Saying “I don’t want it to crash” without designing for crashing: Automatic restart, session restore, and idempotency come first
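
For the second anti-pattern, a handler that records and does nothing else might look like the sketch below. `AppDomain.CurrentDomain.UnhandledException` is the standard .NET event; `LastChance`, `Describe`, `Install`, and the record format are hypothetical choices for this example.

```csharp
using System;

public static class LastChance
{
    // Builds a single-line record. Performs no recovery and touches no app state.
    public static string Describe(object exceptionObject, bool isTerminating)
    {
        var ex = exceptionObject as Exception;
        var type = ex != null ? ex.GetType().FullName : "non-Exception object";
        var message = ex != null ? ex.Message : Convert.ToString(exceptionObject);
        return "UNHANDLED terminating=" + isTerminating + " type=" + type + " message=" + message;
    }

    // Registers the handler. `writeRecord` should be as simple as possible,
    // for example appending one line to a local file.
    public static void Install(Action<string> writeRecord)
    {
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
            writeRecord(Describe(e.ExceptionObject, e.IsTerminating));
    }
}
```

The handler deliberately does not retry, show recovery UI, or reset state. Recovery belongs to the restart path, not to this callback.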

Summary

Decision order for unexpected exceptions:

  1. Can the failure unit be discarded?
  2. Can shared state be rolled back or rebuilt?
  3. Can external side effects be explained?
  4. Can the health of memory, threads, and native boundaries be trusted?

Confident on all four -> continue. Not confident -> exit. In long-running apps, staying alive while broken is often more dangerous than crashing honestly.
