When an Industrial Camera Control App Suddenly Crashes After a Month (Part 1) - How to Find a Handle Leak and Design Logging for Long-Running Operation
When a Windows application suddenly crashes only after running for a long time, it is very common to suspect a memory leak first.
But in practice, the main culprit is sometimes a handle leak, and what finally appears weeks later is only the secondary failure.
The case here is a Windows application that controlled an industrial camera and suddenly crashed after about one month of continuous operation. After narrowing it down, the real cause turned out to be a handle leak on a failure path around camera reconnect logic.
In this first part, I will organize what a handle leak means, how this case was narrowed down, and what logs are worth adding to prevent the same investigation pain later.
The second part, When an Industrial Camera Control App Suddenly Crashes After a Month (Part 2) - What Application Verifier Is and How to Build Failure-Path Test Infrastructure, covers the failure-path testing side.
Some proper names and a few log items are generalized, but the underlying thinking applies very broadly to Windows-based device-control applications.
Contents
- Short version
- What a handle leak means here
- Case study: the industrial camera control app that crashed after one month
- How we narrowed it down
- What logs are needed for recurrence prevention
- Rough rule-of-thumb guide
- Summary
- References
- Author GitHub
1. Short version
- In control applications that fail only after long-running operation, always look at `Handle Count` in addition to `Private Bytes`
- Handle leaks tend to hide on timeout / reconnect / partial-failure / early-return paths rather than on the normal path
- The line where the app finally crashes is often not the line that originally leaked the resource
- The most important logs are operation / session context, process-level `Handle Count`, resource open / close symmetry, and Win32 / HRESULT / SDK error details
- Instead of waiting another month for reproduction, shorten the path by hammering connect / disconnect / reconnect / failure scenarios in a loop
- Application Verifier, discussed in the second part, is very effective, but your own logs that reveal lifetime symmetry are the foundation
In short, the right first move in this kind of case is not to stare at “it crashed after a long time.”
The right move is to make resource growth and failure-path behavior observable.
Handle leaks often surface disguised as a later secondary failure.
If you only read the final exception, it is very easy to walk in the wrong direction.
2. What a handle leak means here
2.1. What “handle” means in this context
Here, a handle means a Windows process-level identifier used to refer to OS-managed resources.
Typical examples include:
| Category | Examples |
|---|---|
| kernel objects | event, mutex, semaphore, thread, process, waitable timer |
| I/O-related resources | file, pipe, socket, device opens |
| especially common in device-control apps | SDK-internal events, callback registration objects, acquisition-thread-related handles |
What often causes trouble in control applications is creating a resource temporarily for a particular operation and forgetting to close it on a mid-failure path.
For example:
- create one event every time reconnect starts
- callback registration or acquisition startup fails partway through
- the success path closes it, but the failure path does not
- short tests mostly run the success path, so the leak is missed
This type of bug hides very naturally in both code review and real operation.
2.2. Why it often appears only after long-running operation
A handle leak does not necessarily fail spectacularly on the first occurrence.
The dangerous pattern is usually a tiny slope: one resource leaked per failure.
```mermaid
flowchart LR
    A["Normal operation"] --> B["Occasional timeout / reconnect"]
    B --> C["Failure path creates an event handle"]
    C --> D["CloseHandle is not called"]
    D --> E["Handle Count rises slightly"]
    E --> F["The pattern repeats hundreds or thousands of times"]
    F --> G["Later CreateEvent / SDK open starts failing"]
    G --> H["Secondary crash or stop"]
```
If one reconnect leaks only one handle, nothing may happen for days.
But in a 24/7 control application, timeout, reinitialization, and reconnect edges repeat over and over.
That is how a leak can surface only weeks later.
What matters is that the handle leak itself may not be the exact line of the final crash.
Typical later failure shapes include:
- API calls that create a new event / file / thread start failing
- the SDK fails to initialize internal resources and returns only a generic failure code
- the error path is thin, and later code trips on a `null` or invalid handle
- timeouts increase until a watchdog or upper-level control path kills the app
So the final crash point is often only the last victim, not the first offender.
2.3. How it differs from a memory leak
Long-run failures make people suspect memory leaks first, and that is natural.
But handle leaks should often be investigated on a different axis.
| Viewpoint | Memory leak | Handle leak |
|---|---|---|
| First metric to watch | `Private Bytes`, Commit, Working Set | `Handle Count` |
| Typical symptom | memory pressure, paging, slowdown, OOM | `Create*` / `Open*` / SDK initialization failure, secondary failures |
| Where it tends to hide | retained references, caches, forgotten release | asymmetry between create/open and close/dispose |
| Typical trend | memory grows steadily | `Handle Count` rises and does not return |
So when dealing with long-running failures, looking only at memory is often like driving with one eye closed.
At minimum, `Handle Count` and `Thread Count` should be checked together.
3. Case study: the industrial camera control app that crashed after one month
3.1. What the symptoms looked like
The outward symptoms were simple:
- a Windows application controlling an industrial camera runs 24/7
- under normal conditions it works fine
- after about one month, one day it suddenly crashes
- after restarting, it works again for a while
The first problem is simply that the time to failure is too long.
Waiting one month for each reproduction cycle is brutal as an investigation strategy.
What made it more annoying was that the crash location was not exactly the same every time.
Sometimes it happened right after reconnect started, sometimes when acquisition started, and sometimes after an SDK call failed.
With that appearance, all of these look plausible at first:
- instability in the camera SDK
- temporary device or communication failures
- a memory leak
- some thread-related race
- an initialization failure that never made it into the logs
In other words, there were simply too many vaguely suspicious things.
3.2. The first metrics we looked at
So the first move was to look at how process-level resources were growing.
In this case, the observations pointed roughly in this direction:
| Metric | Observed trend | Reading |
|---|---|---|
| `Handle Count` | rose slightly after reconnects or timeouts and did not come back | suspect a handle leak |
| `Private Bytes` | moved up and down, but did not show a strong monotonic slope | heap memory might not be the main culprit |
| `Thread Count` | mostly flat | a thread leak looked less likely |
| crash location | varied slightly each time | a secondary failure looked likely |
At that point the search space narrowed a lot.
It felt more natural to read the case not as “it crashes after one month,” but as “something leaks little by little during operation, and that is why it crashes a month later.”
3.3. The actual leaking point
The final cause was a missing close for an event handle created on an initialization-failure path during camera reconnect.
In simplified form, the flow looked like this:
```mermaid
sequenceDiagram
    participant App as Control app
    participant OS as Windows
    participant SDK as Camera SDK
    App->>OS: CreateEvent
    App->>SDK: Register callback
    SDK-->>App: Partial failure / timeout
    Note over App: failure path returns
    Note over App: CloseHandle is never called
    loop many reconnects
        App->>OS: Handle Count rises little by little
    end
    App->>OS: Next CreateEvent / Open
    OS-->>App: Failure
    App-->>App: Secondary crash
```
Conceptually, the code looked like this kind of leak:
```c
HANDLE handle = CreateEvent(...);

if (!RegisterCallback(handle))
{
    return Error; // CloseHandle(handle) is missing
}

if (!StartAcquisition())
{
    return Error; // close is also missing here
}

...

CloseHandle(handle); // only ever reached on the success path
```
The reason short tests miss this is also fairly easy to see:
- on normal startup -> normal shutdown, the handle is closed
- the failure occurs only during reconnect
- there is no test that repeatedly drives that failure path at high volume
- in production, the leak accumulates slowly over weeks
So the structure was: invisible if you only look at the normal path, but a very ordinary leak on the abnormal path.
The fix itself was not flashy:
- bring `create/open` and `close/dispose` responsibilities closer together
- make sure resources are always released even on partial failure, for example by using `finally`, a destructor, or a session object
- make ownership explicit before and after callback registration or acquisition startup
- express “who closes this” as code responsibility rather than a comment
This is not a special trick so much as lifetime design embedded into the code.
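As one way to make that lifetime design concrete, here is a minimal RAII sketch. Everything in it is illustrative: `CreateEventSim` / `CloseHandleSim` are portable stand-ins for the real Win32 `CreateEvent` / `CloseHandle`, and `HandleGuard` plays the role of the destructor-owned session object mentioned above.

```cpp
// Portable stand-ins for CreateEvent / CloseHandle so the sketch runs anywhere;
// the counter lets us verify open/close symmetry.
static int g_openHandles = 0;
int  CreateEventSim()    { return ++g_openHandles; }
void CloseHandleSim(int) { --g_openHandles; }

// RAII guard: the scope that creates the handle owns the close, on every exit path.
class HandleGuard {
public:
    explicit HandleGuard(int h) : h_(h) {}
    ~HandleGuard() { if (h_ != 0) CloseHandleSim(h_); }
    HandleGuard(const HandleGuard&) = delete;
    HandleGuard& operator=(const HandleGuard&) = delete;
    int get() const { return h_; }
private:
    int h_;
};

// The reconnect flow from the article, with injectable failures.
// Unlike the leaky version, every early return still closes the event.
bool Reconnect(bool registerOk, bool startOk) {
    HandleGuard evt(CreateEventSim());
    if (!registerOk) return false;  // guard still closes the handle here
    if (!startOk)    return false;  // and here
    return true;                    // and on the success path at scope exit
}
```

Whatever mix of success and failure paths runs, `g_openHandles` returns to zero, which is exactly the open/close symmetry the logs should later confirm.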
4. How we narrowed it down
4.1. Compress the reproduction instead of waiting another month
In this kind of investigation, waiting a month each time is simply a bad strategy.
What you want to do is drive the suspicious edge path many times in a short time.
In this case, reproduction was compressed with loops like this:
```mermaid
flowchart LR
    A["Start"] --> B["Open camera"]
    B --> C["Start acquisition"]
    C --> D["Simulated timeout / disconnect"]
    D --> E["Reconnect"]
    E --> F["Resume acquisition"]
    F --> G{"Repeat N times"}
    G -- yes --> D
    G -- no --> H["Check the final delta"]
```
The point is to spend time on lifetime edges, not on normal “it is capturing frames correctly” time.
Scenarios like these help:
- run `open -> start -> stop -> close` at high volume
- intentionally inject timeouts and force reconnects
- fail immediately after callback registration
- include disconnect interruption, reconnect interruption, and shutdown races
You do not need to perfectly replay a full month of production.
It is usually much more effective to step on the lifetime edge you suspect thousands of times.
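A sketch of such a compression harness, with the handle table and the leaky reconnect simulated so it runs anywhere. `HammerReconnects` and `ReconnectOnce` are invented names; in a real harness the count would come from `GetProcessHandleCount` and the failure would be injected at the SDK boundary.

```cpp
// Simulated process handle table; a real harness would read GetProcessHandleCount.
static int g_handleCount = 100;  // arbitrary warmed-up baseline

// Hypothetical leaky reconnect: an injected failure leaks exactly one event handle.
void ReconnectOnce(bool injectFailure) {
    ++g_handleCount;            // CreateEvent on reconnect start
    if (injectFailure) return;  // failure path forgets the close
    --g_handleCount;            // success path closes the handle
}

// Drive the lifetime edge many times in a tight loop and report the delta,
// instead of waiting weeks for production traffic to do it.
int HammerReconnects(int cycles, int failEvery) {
    const int baseline = g_handleCount;
    for (int i = 1; i <= cycles; ++i) {
        const bool fail = (failEvery > 0) && (i % failEvery == 0);
        ReconnectOnce(fail);
    }
    return g_handleCount - baseline;  // leaked handles over the run
}
```

With a failure injected every tenth cycle, 1000 cycles reproduce in seconds roughly the delta that took weeks to accumulate in production.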
4.2. Look at the slope of Handle Count
In handle-leak investigations, absolute values alone can be misleading.
What matters is whether the count comes back after an operation that should release resources, and how much it grows per repeated cycle.
A practical sequence is:
- decide a baseline after warm-up
- record `Handle Count` after reconnect / start-stop / close
- inspect the delta per cycle
- also inspect the slope across several cycles
For example:
```
leakSlope = (currentHandleCount - baselineHandleCount) / reconnectCount
```
Whether an absolute value like 2000 is high or low depends on the application.
But if one reconnect adds about +1 and it never comes back, that is very suspicious.
It also helps not to look at `Handle Count` alone.
At minimum, record it together with:
- `Handle Count`
- `Private Bytes`
- `Thread Count`
- `ReconnectCount`
- the current phase
That quickly tells you whether memory is growing, threads are growing, or reconnects are returning with resources still unbalanced.
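The per-cycle slope and the does-it-come-back check can be sketched as two small helpers. `LeakSlope` mirrors the formula above; `LooksLikeLeak` and its 0.5 threshold are illustrative assumptions, not a universal cutoff.

```cpp
#include <vector>

// Per-reconnect slope of Handle Count relative to a warmed-up baseline.
double LeakSlope(int currentHandleCount, int baselineHandleCount, int reconnectCount) {
    if (reconnectCount <= 0) return 0.0;
    return static_cast<double>(currentHandleCount - baselineHandleCount) / reconnectCount;
}

// Samples taken after each operation that should release resources.
// Suspicious when the count keeps trending up instead of coming back.
bool LooksLikeLeak(const std::vector<int>& samplesAfterClose, double threshold = 0.5) {
    if (samplesAfterClose.size() < 2) return false;
    const double slope =
        static_cast<double>(samplesAfterClose.back() - samplesAfterClose.front())
        / static_cast<double>(samplesAfterClose.size() - 1);
    return slope >= threshold;
}
```

A slope of about +1 per reconnect, as in this case, trips the check immediately; a count that dips and returns to baseline does not.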
4.3. Compare create/open against close/dispose
Even when process-wide Handle Count looks suspicious, that alone does not tell you the leak site.
The next thing you need is logging that shows resource lifetime as matching pairs.
For example, structured logs like this:
```
CameraSession  session=421 cameraId=CAM01 phase=ReconnectStart reason=FrameTimeout handleCount=1824 privateBytesMB=418
CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Create osHandle=0x00000ABC handleCount=1825
CameraResource session=421 resourceId=evt-884 kind=Event name=FrameReady action=Close  osHandle=0x00000ABC handleCount=1824
```
The important point is not to rely only on `osHandle`.
Windows handle values can be reused later, so in logs it is much easier to follow the story if you include at least:
- `sessionId`
- `resourceId`
- `kind`
- `action` (Create / Open / Register / Close / Dispose / Unregister)
- `osHandle`
- `phase`
That makes it much easier to find one-wing flows where Create exists but Close does not.
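Given events parsed out of such logs, finding the one-wing flows can be automated. `ResourceEvent` and `FindUnbalanced` are names invented for this sketch; the key design point is keying on the stable `resourceId`, not on the reusable `osHandle`.

```cpp
#include <map>
#include <string>
#include <vector>

// One lifecycle event, already parsed out of a structured log line.
struct ResourceEvent {
    std::string resourceId;  // e.g. "evt-884" -- stable ID, not the reusable osHandle
    std::string action;      // Create / Open / Register / Close / Dispose / Unregister
};

// Returns the resourceIds whose open side does not match the close side --
// the "one-wing" flows where a Create exists but the Close never happened.
std::vector<std::string> FindUnbalanced(const std::vector<ResourceEvent>& events) {
    std::map<std::string, int> balance;  // +1 on the open side, -1 on the close side
    for (const auto& e : events) {
        if (e.action == "Create" || e.action == "Open" || e.action == "Register")
            ++balance[e.resourceId];
        else if (e.action == "Close" || e.action == "Dispose" || e.action == "Unregister")
            --balance[e.resourceId];
    }
    std::vector<std::string> leaked;
    for (const auto& kv : balance)
        if (kv.second != 0) leaked.push_back(kv.first);
    return leaked;
}
```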
4.4. Search for where the handle leaked, not just where the crash happened
This point is very important.
Handle leaks often look like this:
- crash line: `CreateEvent` failed
- real leak: `CloseHandle` had been missing on a failure path for days
In other words, the API that finally crashes is often the exit point of the damage, not the entry point of the cause.
So the order of investigation is usually better like this:
- see which resource keeps growing
- see at which operation boundary it fails to return
- find the place where the `create/open` and `close/dispose` pair is broken
- only then read the final crash site
That order makes it much less likely that you get lost.
5. What logs are needed for recurrence prevention
5.1. The minimum set worth keeping
What helped in this investigation was not “just add more logs.”
It was adding more of the information that would later let us reach the cause.
At minimum, it helps to keep the following:
| Category | Minimum fields worth keeping | Why |
|---|---|---|
| operation context | `cameraId`, `sessionId`, `operationId`, `reconnectCount`, `phase` | to connect an event to the exact operation and retry cycle |
| process resources | `handleCount`, `privateBytes`, `workingSet`, `threadCount` | to split what is actually growing |
| resource lifecycle | `action`, `resourceId`, `kind`, `osHandle`, `owner` | to follow create/open against close/dispose |
| external call results | `win32Error`, `HRESULT`, `sdkError`, `timeoutMs` | to compare failure types later |
| state transitions | `OpenStart`, `OpenDone`, `ReconnectStart`, `ReconnectDone`, `ShutdownStart`, etc. | to know which phase collapsed |
| runtime environment | `pid`, `tid`, `buildVersion`, `machineName` | to match dumps, symbols, and deployment artifacts |
That is not everything.
But without at least this much, the logs easily degenerate into “we only know that it crashed.”
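One possible C++ shape for such a record, shown here only for the operation-context, process-resource, error, and environment rows; per-resource lifecycle events would get their own record. The field names mirror the table above, but the struct itself is an illustration, not a fixed schema.

```cpp
#include <cstdint>
#include <string>

// Minimum per-event log record; every field name comes from the table above.
struct OperationLogRecord {
    // operation context
    std::string cameraId;
    std::string sessionId;
    std::string operationId;
    int reconnectCount = 0;
    std::string phase;           // e.g. "ReconnectStart"
    // process resources
    int handleCount = 0;
    std::int64_t privateBytes = 0;
    std::int64_t workingSet = 0;
    int threadCount = 0;
    // external call results
    std::uint32_t win32Error = 0;
    std::int32_t hresult = 0;
    int sdkError = 0;
    int timeoutMs = 0;
    // runtime environment
    int pid = 0;
    int tid = 0;
    std::string buildVersion;
    std::string machineName;
};
```

Keeping all of this in one record type is what lets a single log line later be joined against dumps, retry cycles, and resource lifecycle events.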
5.2. The logs we actually strengthened
In this case, the logging was strengthened in four directions:
- periodic heartbeat
  - emit `Handle Count` / `Private Bytes` / `Thread Count` / `ReconnectCount` every 1 to 5 minutes
- camera-session boundary logs
  - `OpenStart`, `CallbackRegistered`, `AcquisitionStart`, `TimeoutDetected`, `ReconnectStart`, `ReconnectDone`, `CloseStart`, `CloseDone`
- resource lifecycle logs
  - `Create` / `Open` / `Register` and `Close` / `Dispose` / `Unregister` for events, threads, files, timers, and SDK registration tokens
- error normalization
  - instead of ending with only the exception message, emit `win32Error`, `HRESULT`, `sdkError`, and `phase` together
One important point is to keep the log shape consistent between success and failure.
If abnormal cases switch to a totally different format, later aggregation becomes harder.
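One way to enforce that consistency is to route every heartbeat through a single formatter, so success and failure phases emit the same shape. `FormatHeartbeat` and its exact field set are assumptions for illustration, not the actual format used in the case.

```cpp
#include <string>

// Builds the periodic heartbeat line. The values would come from
// GetProcessHandleCount and friends in the real app; here they are
// passed in so the sketch stays portable.
std::string FormatHeartbeat(int handleCount, int privateBytesMB,
                            int threadCount, int reconnectCount,
                            const std::string& phase) {
    return "Heartbeat handleCount=" + std::to_string(handleCount)
         + " privateBytesMB=" + std::to_string(privateBytesMB)
         + " threadCount=" + std::to_string(threadCount)
         + " reconnectCount=" + std::to_string(reconnectCount)
         + " phase=" + phase;
}
```

Because the shape never changes, a month of these lines can be grepped and plotted directly to see the `Handle Count` slope.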
5.3. How fine the log granularity should be
The tempting mistake is to “just log everything at INFO.”
But if you do that, you later end up reading a wall of noise.
A realistic split is usually:
- periodic monitoring
  - `Handle Count`, `Private Bytes`, `Thread Count`, `ReconnectCount`
- operation boundaries
  - session start / done / fail
- resource boundaries
  - `create/open/register` and `close/dispose/unregister`
- abnormal-case detail
  - error codes, stack traces, dump-trigger conditions
Detailed per-frame logs are usually unnecessary.
For long-running failures, logs that make it readable which responsibility opened a resource and which responsibility closed it are much more valuable.
6. Rough rule-of-thumb guide
- Crashes happen only after days or weeks
  - first add a heartbeat for `Handle Count` / `Private Bytes` / `Thread Count`
- The system has retry / reconnect / shutdown paths
  - build a harness that drives only those boundaries at high volume
- The app uses a lot of native SDK, P/Invoke, or Win32
  - the Application Verifier techniques in part 2 are especially worth using
- The app also includes a GUI
  - in addition to `Handle Count`, also look at `GDI Objects` and `USER Objects`
- The final exception tells you almost nothing
  - it is usually faster to improve structured logs for operation / session / resource lifecycle first
That last point matters a lot.
In defect investigation, the deciding factor is often not the analysis technique itself but whether the system was made observable in the first place.
7. Summary
The important points to keep are these.
How to read the symptoms:
- if the app crashes only after long-running operation, look at `Handle Count` as well as memory
- handle leaks tend to hide on abnormal failure paths rather than on the normal path
- the crash site is often not the leak site but only the exit point of the secondary damage
Design changes that help prevent recurrence:
- bring `create/open` and `close/dispose` responsibilities closer together
- keep context-rich logs by session / operation
- record both process-level resource metrics and per-resource lifecycle events
Test strategy that helps:
- do not wait for month-scale reproduction; drive timeout / reconnect / shutdown in a short loop
- do not make “it does not break” the only acceptance criterion; also require that “when it breaks, we can trace it”
- in part 2, use Application Verifier to surface hard-to-trigger failures like resource exhaustion and handle anomalies earlier
In control applications, it is important that the normal path works.
But in long-running operation it is also very important to know what happened when it broke.
Handle leaks are exactly the kind of bug where that difference pays off.
Instead of looking only at the crash instant, look at growth, boundaries, and lifetime symmetry. That makes them much easier to trace.
8. References
- GetProcessHandleCount function
- Process.HandleCount Property
- Part 2: What Application Verifier Is and How to Build Failure-Path Test Infrastructure
Author GitHub
The author of this article, Go Komura, uses the GitHub account gomurin0428.
On GitHub, he publishes COM_BLAS and COM_BigDecimal.