How to Burn Images and Text into MP4 Frames with Media Foundation - Source Reader, Drawing, Color Conversion, Sink Writer, and a One-File C++ Sample
Watermarks, inspection results, device IDs, operator names, and timestamps.
Requirements like these, where you need to burn information into every frame of an MP4 and produce a new MP4, are very common in surveillance, inspection, traceability, and analysis tools.
But once you start touching Media Foundation, you quickly run into IMFSourceReader, IMFSample, IMFMediaBuffer, IMFTransform, and IMFSinkWriter, and it suddenly becomes much less obvious where image and text overlay logic should actually live.
This article first organizes the overall shape as Source Reader -> drawing -> color conversion -> Sink Writer, then shows a one-file sample you can paste directly into a Visual Studio C++ console application.
The sample reads a given MP4, draws a specified image and HelloWorld on every frame, and writes a new output MP4.
To keep the sample easy to paste and run, it is intentionally structured to re-encode video only.
You can absolutely extend the approach to include audio remux, but the central topic here is per-frame image and text overlay, so that is where the sample stays focused.
Contents
- 1. The short answer
- 2. Why this is a little more confusing than it first sounds
- 3. A quick comparison table
- 3.1 Processing image
- 4. How to think about the pipeline in pieces
- 4.1 Use IMFSourceReader for input
- 4.2 Think about image and text composition as GDI+ or Direct2D / DirectWrite
- 4.3 RGB32 usually is not the final input your H.264 encoder wants
- 4.4 Use IMFSinkWriter for output
- 4.5 Audio is easier if you separate it conceptually at first
- 5. Assumptions and usage
- 6. What matters when you read the implementation
- 6.1 ReadSample can succeed with a null sample
- 6.2 Preserving input timestamps and durations is usually safer
- 6.3 IMFSample may not contain one simple buffer
- 6.4 Do not hardcode stride as width * 4
- 6.5 Audio is intentionally left out
- 7. Where to grow this for production
- 8. Wrap-up
- 9. Related articles
- 10. References
1. The short answer
- The basic shape for adding images or text to every MP4 frame is: decode with Source Reader -> composite on uncompressed frames -> convert color if needed -> re-encode with Sink Writer.
- The actual act of drawing images and text is not Media Foundation’s job. That part is usually easier to think about with drawing APIs such as `GDI+`, `Direct2D`, `DirectWrite`, or `WIC`.
- If the final target is MP4 (H.264), you often need a bridge between draw-friendly formats such as `RGB32` / `ARGB32` and encoder-friendly formats such as `NV12` / `I420` / `YUY2`.
- If your main goal is simply to get the first version working, `Source Reader -> RGB32 -> draw with GDI+ -> NV12 -> Sink Writer` is a very understandable route.
- If your main goal is performance and long-term extensibility, a design closer to `D3D11 / DXGI surface -> Direct2D / DirectWrite -> Video Processor MFT -> Sink Writer` has more headroom.
2. Why this is a little more confusing than it first sounds
The sentence “put text on a video” actually mixes together four different concerns.
- **Container vs. codec**
  `mp4` is a container, not a frame format. Inside it, the video is usually compressed as `H.264` or `H.265`.
- **Decode and encode**
  You generally cannot take compressed video data and draw text or PNG content onto it directly with an ordinary 2D API. First you need decoded, uncompressed frames.
- **Drawing**
  Text rendering, logo placement, alpha blending, and anti-aliased text are not really core Media Foundation responsibilities. That is where `GDI+` or `Direct2D / DirectWrite / WIC` come in.
- **Color space and pixel format**
  The format that is pleasant to draw on is often not the same format the encoder prefers. This is where many first implementations start to get sticky.
In practice, it is easier to think of the problem not as “add text with Media Foundation” but as “move frames through Media Foundation, draw with a rendering API, then convert as needed before encoding.”
3. A quick comparison table
| Direction | Shape | Good fit | Main caution |
|---|---|---|---|
| Get it working first | Source Reader -> RGB32 -> composite -> NV12 -> Sink Writer | Batch jobs, internal tools, first implementation | CPU-side copies and conversions can add up |
| Push performance | D3D11 / DXGI surface -> Direct2D / DirectWrite -> Video Processor MFT -> Sink Writer | Long videos, high resolution, large volumes | D3D11 and DXGI resource management become part of the job |
| Build a reusable video effect | Implement a custom MFT and insert it into a topology | Shared effects across multiple apps or pipelines | Implementation, registration, and debugging become harder |
This article stays with the first row on purpose: the clearest path to a correct first implementation.
3.1 Processing image
```mermaid
flowchart LR
    A["input.mp4"] --> B["IMFSourceReader"]
    B --> C["Uncompressed frame<br/>RGB32"]
    C --> D["Draw image + HelloWorld with GDI+"]
    D --> E["BGRA -> NV12 conversion"]
    E --> F["IMFSinkWriter"]
    F --> G["output.mp4"]
    B --> H["Audio samples"]
    H --> I["Copy as-is<br/>or re-encode"]
    I --> F
```
The important point is that drawing itself is not really a Media Foundation responsibility.
Media Foundation is the frame transport and encoding side. The overlay step belongs much more naturally to a rendering API.
4. How to think about the pipeline in pieces
4.1 Use IMFSourceReader for input
If the input is a file path, MFCreateSourceReaderFromURL is a straightforward starting point. If the input is video data already in memory, creating an IMFByteStream and using MFCreateSourceReaderFromByteStream is a natural variation.
The first major choice is whether you want to receive frames in a draw-friendly format or an encoder-oriented format.
- If you want simpler overlay logic, use `RGB32` or `ARGB32`
- If you want to optimize around encode efficiency, use a YUV format such as `NV12`
But because text and PNG composition are dramatically easier to reason about on RGB-style frames, starting with RGB32 / ARGB32 is often the calmest first move.
If you enable MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, the Source Reader can help by converting YUV -> RGB32 and handling deinterlacing.
That is convenient when the goal is simply to get usable frames out, though for long or high-resolution video it can become one of the places to revisit later for performance.
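As a minimal sketch, the setup described above might look like the following. This is an illustration, not the article's full sample: HRESULT checks are omitted, and the path is a placeholder.

```cpp
// Sketch: open an MP4 with the Source Reader and ask for RGB32 frames.
// HRESULT checks are omitted for brevity; the path is a placeholder.
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>
#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mfreadwrite.lib")

using Microsoft::WRL::ComPtr;

ComPtr<IMFSourceReader> OpenReaderAsRgb32(const wchar_t* path)
{
    ComPtr<IMFAttributes> attrs;
    MFCreateAttributes(&attrs, 1);
    // Let the Source Reader insert converters (YUV -> RGB32, deinterlacing).
    attrs->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE);

    ComPtr<IMFSourceReader> reader;
    MFCreateSourceReaderFromURL(path, attrs.Get(), &reader);

    // Request uncompressed RGB32 on the first video stream.
    ComPtr<IMFMediaType> type;
    MFCreateMediaType(&type);
    type->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    type->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    reader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM,
                                nullptr, type.Get());
    return reader;
}
```

Note that `MFStartup(MF_VERSION)` must have been called before any of this, and real code should check every HRESULT.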
4.2 Think about image and text composition as GDI+ or Direct2D / DirectWrite
Once you pull the buffer from the IMFSample, you can place a logo image and draw text on top of it.
In this sample, I use GDI+ because the priority is a one-file example that is easy to paste and run.
- it can load images
- it can draw text
- it has a relatively small amount of setup
- it fits comfortably inside one console-app `.cpp`
For long videos or 4K-heavy workloads, D3D11 + Direct2D + DirectWrite has more room to grow.
So a very practical progression is: start with GDI+, then move toward Direct2D / DirectWrite when performance or rendering requirements justify it.
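A hedged sketch of the GDI+ step might look like the following. It assumes GDI+ has already been started with `GdiplusStartup`, that `logo` was loaded elsewhere (for example with `Gdiplus::Bitmap::FromFile`), and that `pixels` and `stride` come from the locked frame buffer, not from `width * 4` (see section 6.4).

```cpp
// Sketch: wrap a locked RGB32 frame in a GDI+ Bitmap and draw on it in place.
#include <windows.h>
#include <gdiplus.h>
#pragma comment(lib, "gdiplus.lib")

void DrawOverlay(BYTE* pixels, int width, int height, int stride,
                 Gdiplus::Image* logo)
{
    using namespace Gdiplus;

    // Wrap the frame memory directly; no copy is made, so drawing
    // modifies the Media Foundation buffer itself.
    Bitmap frame(width, height, stride, PixelFormat32bppRGB, pixels);
    Graphics g(&frame);
    g.SetTextRenderingHint(TextRenderingHintAntiAlias);

    if (logo)
        g.DrawImage(logo, 16, 16);   // logo position is a placeholder

    Font font(L"Segoe UI", 28.0f);
    SolidBrush brush(Color(255, 255, 255, 0)); // opaque yellow (A, R, G, B)
    g.DrawString(L"HelloWorld", -1, &font, PointF(16.0f, 96.0f), &brush);
}
```

Depending on the source, RGB32 frames can arrive bottom-up; if the overlay appears vertically flipped, the stride sign and scanline origin are the first things to check.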
4.3 RGB32 usually is not the final input your H.264 encoder wants
When you write back to MP4 (H.264), the Microsoft H.264 encoder often expects a YUV-family input format such as I420 / IYUV / NV12 / YUY2 / YV12.
So it is not always enough to draw in RGB32 / ARGB32 and then pass that frame straight into IMFSinkWriter.
That means you usually need one of these two routes:
- insert a `Video Processor MFT` and convert `RGB32 / ARGB32 -> NV12`
- add your own `RGB -> NV12` conversion step
This sample chooses the second route, explicit self-contained conversion, because the priority is a one-file sample.
In production, a Video Processor MFT can be very attractive because it can centralize color conversion, resizing, and deinterlacing as well.
4.4 Use IMFSinkWriter for output
The key idea is to configure two different views of the stream:
- output stream type: the format you want written to the file. Example: `MFVideoFormat_H264`
- input stream type: the format your application will feed into the writer. Example: `MFVideoFormat_NV12`
From the writer’s perspective, that means:
- your application provides uncompressed `NV12` frames
- the writer encodes them as H.264 and stores them in MP4
4.5 Audio is easier if you separate it conceptually at first
Very often, the real requirement is “add a logo and text to the video” while leaving the audio unchanged.
In practical work, a very usable shape is:
- process only the video stream through `Source Reader -> composite -> Sink Writer`
- keep the audio stream compressed and remux it
This sample intentionally focuses on burning images and text into video frames, so the output is video-only MP4.
If you want to preserve audio, that is usually best added in the next step rather than mixed into the very first example.
5. Assumptions and usage
The sample assumes:
- Windows 10 or 11
- a Visual Studio 2022 C++ console project
- `x64` build
- no precompiled headers for that `.cpp`
- input width and height are even
- ordinary MP4 input
- video-only output MP4
- overlay images are in a format GDI+ can load, such as PNG / JPEG / BMP / GIF
Because NV12 is a 4:2:0 format, width and height need to be even.
6. What matters when you read the implementation
6.1 ReadSample can succeed with a null sample
ReadSample can return S_OK and still give you sample == nullptr. Typical cases include MF_SOURCE_READERF_STREAMTICK, MF_SOURCE_READERF_ENDOFSTREAM, and other stream events.
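A read loop that tolerates those cases might look like this sketch, assuming `reader` is an already-configured `IMFSourceReader` as in section 4.1:

```cpp
// Sketch: a read loop that tolerates S_OK with a null sample.
// Stream ticks and other events arrive via `flags` without a sample.
DWORD streamIdx = 0, flags = 0;
LONGLONG timestamp = 0;
for (;;) {
    Microsoft::WRL::ComPtr<IMFSample> sample;
    HRESULT hr = reader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM,
                                    0, &streamIdx, &flags, &timestamp, &sample);
    if (FAILED(hr) || (flags & MF_SOURCE_READERF_ENDOFSTREAM))
        break;
    if (!sample)            // e.g. MF_SOURCE_READERF_STREAMTICK
        continue;
    // ... process the sample ...
}
```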
6.2 Preserving input timestamps and durations is usually safer
Media Foundation timestamps are in 100-nanosecond units, and duration is a separate value. If one input frame becomes one output frame, it is usually safer to carry input timing forward as much as possible.
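Carrying the timing forward is a few calls; `inSample` and `outSample` are placeholder names for the decoded sample and the sample being handed to the writer:

```cpp
// Sketch: copy the input sample's timing onto the output sample.
// Values are in 100-ns units; hr checks omitted.
LONGLONG time = 0, duration = 0;
inSample->GetSampleTime(&time);
inSample->GetSampleDuration(&duration);
outSample->SetSampleTime(time);
outSample->SetSampleDuration(duration);
```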
6.3 IMFSample may not contain one simple buffer
That is why ConvertToContiguousBuffer is such a common first step when you want a predictable memory layout for drawing.
6.4 Do not hardcode stride as width * 4
Padding, pitch, and 2D buffer behavior can break that assumption. If IMF2DBuffer::Lock2D is available, it is usually safer to use it and honor the returned pitch.
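A sketch of the safer access pattern, with `sample` as a placeholder for the decoded `IMFSample`:

```cpp
// Sketch: prefer IMF2DBuffer::Lock2D and honor its pitch; fall back to
// ConvertToContiguousBuffer + Lock when the 2D interface is unavailable.
Microsoft::WRL::ComPtr<IMFMediaBuffer> buffer;
sample->ConvertToContiguousBuffer(&buffer);

Microsoft::WRL::ComPtr<IMF2DBuffer> buf2d;
if (SUCCEEDED(buffer.As(&buf2d))) {
    BYTE* scan0 = nullptr;
    LONG pitch = 0;
    buf2d->Lock2D(&scan0, &pitch);   // pitch may exceed width * 4,
                                     // and can be negative for bottom-up
    // ... draw using scan0 and pitch ...
    buf2d->Unlock2D();
} else {
    BYTE* data = nullptr;
    DWORD len = 0;
    buffer->Lock(&data, nullptr, &len);  // contiguous, default stride
    // ... draw using data ...
    buffer->Unlock();
}
```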
6.5 Audio is intentionally left out
The sample omits audio not because audio is impossible, but because the goal is to make the per-frame overlay path easy to understand first. In real projects, a very practical next step is often to re-encode video and remux audio if it can stay as-is.
7. Where to grow this for production
- Add audio remux.
- Replace `GDI+` with `Direct2D / DirectWrite` when rendering quality or throughput matters more.
- Move the RGB-to-NV12 stage toward a `Video Processor MFT` or GPU-oriented processing.
- Graduate to `D3D11 / DXGI` surfaces for higher-throughput pipelines.
- Consider a custom `MFT` if the effect needs to be reused across apps or pipelines.
8. Wrap-up
The practical split is:
- `Source Reader` gets the frames out
- `GDI+` or `Direct2D / DirectWrite` handles overlay drawing
- explicit conversion bridges into `NV12`
- `Sink Writer` writes the new MP4
If you want something you can paste into a .cpp file and run, that is a very workable first architecture. It is not the final architecture for every production system, but it is an honest and practical one.
9. Related articles
- What Media Foundation Actually Is - Why It Feels So Closely Tied to COM and Windows Media APIs
- How to Extract a Still Image from an MP4 at a Specific Time with Media Foundation - a One-File C++ Sample
10. References
- Microsoft Learn: Using the Source Reader to Process Media Data
- Microsoft Learn: MFCreateSourceReaderFromByteStream
- Microsoft Learn: MFCreateMFByteStreamOnStream
- Microsoft Learn: IMFSourceReader::SetCurrentMediaType
- Microsoft Learn: MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING
- Microsoft Learn: MF_SOURCE_READER_ENABLE_ADVANCED_VIDEO_PROCESSING
- Microsoft Learn: IMFSourceReader::ReadSample
- Microsoft Learn: Working with Media Samples
- Microsoft Learn: IMF2DBuffer::Lock2D
- Microsoft Learn: Video Subtype GUIDs
- Microsoft Learn: H.264 Video Encoder
- Microsoft Learn: Video Processor MFT
- Microsoft Learn: Using the Sink Writer
- Microsoft Learn: Tutorial: Using the Sink Writer to Encode Video
- Microsoft Learn: Interoperability Overview (Direct2D)
- Microsoft Learn: Text Rendering with Direct2D and DirectWrite
- Microsoft Learn: Writing a Custom MFT