Burning Images and Text into MP4 Frames with Media Foundation - Source Reader, Drawing, Color Conversion, Sink Writer, and a Single-File C++ Sample

Tags: Media Foundation, C++, Windows Development, GDI+, Direct2D, DirectWrite, H.264

The short version

The basic shape for burning images or text into every frame of an MP4 looks like this:

  1. Decode with Source Reader -> pull out uncompressed frames
  2. Composite with a drawing API -> use GDI+ or Direct2D to lay images and text on top
  3. Color convert if needed -> e.g. RGB32 -> NV12
  4. Re-encode with Sink Writer -> write a new MP4

Drawing images and text is not Media Foundation’s job. That part belongs to drawing APIs like GDI+ or Direct2D/DirectWrite.

Why it feels complicated

“Putting text on a video” actually mixes four separate concerns.

Concern | What it means
--- | ---
Container vs. codec | mp4 is a container; the payload inside is compressed data like H.264
Decode/encode | You can’t draw on compressed data, so you have to bring it back to uncompressed
Drawing | Compositing text and images is the job of GDI+ or Direct2D
Color space and pixel format | The format that’s easy to draw on and the format the encoder wants are different (RGB32 vs. NV12)

The big picture

input.mp4 -> Source Reader -> uncompressed frame (RGB32) -> draw with GDI+ -> BGRA->NV12 conversion -> Sink Writer -> output.mp4

How the pipeline splits up

Input: Source Reader

  • Turning on MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING makes it handle YUV->RGB32 conversion and deinterlacing automatically
  • Receiving frames in RGB32 is the easiest starting point because it’s draw-friendly

Drawing: GDI+ or Direct2D

  • If you just want it working: GDI+ - lightweight to bring in and easy to keep in a single file
  • If you care about speed: Direct2D/DirectWrite - better for long videos and high resolutions
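
To make "compositing" concrete, here is a minimal, portable sketch of the per-pixel work a drawing API takes off your hands when it lays an overlay on a frame. It is not how GDI+ or Direct2D is implemented; it is a plain source-over blend with straight (non-premultiplied) alpha, and the names BlendPixelOver and BlendOverlaySketch are hypothetical. Both buffers are assumed to be tightly packed, top-down BGRA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Blends one straight-alpha BGRA overlay pixel over an opaque BGRA frame pixel, in place.
void BlendPixelOver(std::uint8_t* dst, const std::uint8_t* src) {
    const unsigned alpha = src[3];
    for (int c = 0; c < 3; ++c) {
        // Rounded integer form of: src * a + dst * (1 - a), with a in [0, 1].
        dst[c] = static_cast<std::uint8_t>((src[c] * alpha + dst[c] * (255u - alpha) + 127u) / 255u);
    }
    dst[3] = 255;  // the frame stays opaque
}

// Blends a whole overlay at (left, top), clipping anything that falls outside the frame.
void BlendOverlaySketch(std::uint8_t* frame, int frameW, int frameH,
                        const std::uint8_t* overlay, int overlayW, int overlayH,
                        int left, int top) {
    assert(frame != nullptr && overlay != nullptr);
    for (int y = 0; y < overlayH; ++y) {
        const int fy = top + y;
        if (fy < 0 || fy >= frameH) continue;
        for (int x = 0; x < overlayW; ++x) {
            const int fx = left + x;
            if (fx < 0 || fx >= frameW) continue;
            BlendPixelOver(frame + (static_cast<std::size_t>(fy) * frameW + fx) * 4,
                           overlay + (static_cast<std::size_t>(y) * overlayW + x) * 4);
        }
    }
}
```

Text rendering, antialiasing, scaling, and image decoding are the parts that make a real drawing API worth using; this only shows the final blend.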

Color conversion: RGB32 -> NV12

The H.264 encoder expects YUV formats like NV12. Either use the Video Processor MFT, or convert it yourself.
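
If you go the "convert it yourself" route, the conversion can be sketched roughly as below. This is a minimal CPU version under stated assumptions, not necessarily what the article’s sample does: it uses the common BT.601 limited-range integer approximation (HD content often calls for BT.709 instead), it expects a tightly packed, top-down BGRA buffer with even width and height, and the name BgraToNv12Sketch is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// BT.601 limited-range integer approximations; inputs are 0..255.
static std::uint8_t LumaFromRgb(int r, int g, int b) {
    return static_cast<std::uint8_t>(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
}
static std::uint8_t CbFromRgb(int r, int g, int b) {
    // The +128 chroma offset is folded in as (128 << 8) so the shifted value is never negative.
    return static_cast<std::uint8_t>((-38 * r - 74 * g + 112 * b + 128 + (128 << 8)) >> 8);
}
static std::uint8_t CrFromRgb(int r, int g, int b) {
    return static_cast<std::uint8_t>((112 * r - 94 * g - 18 * b + 128 + (128 << 8)) >> 8);
}

// Converts a tightly packed, top-down BGRA frame to NV12:
// a full-resolution Y plane followed by one interleaved U/V plane at half resolution.
std::vector<std::uint8_t> BgraToNv12Sketch(const std::uint8_t* bgra, int width, int height) {
    assert(bgra != nullptr && width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t h = static_cast<std::size_t>(height);
    std::vector<std::uint8_t> nv12(w * h * 3 / 2);
    std::uint8_t* yPlane = nv12.data();
    std::uint8_t* uvPlane = yPlane + w * h;

    // Luma: one sample per pixel.
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            const std::uint8_t* p = bgra + (y * w + x) * 4;  // bytes are B, G, R, A
            yPlane[y * w + x] = LumaFromRgb(p[2], p[1], p[0]);
        }
    }
    // Chroma: one U/V pair per 2x2 block, computed from the block's average color.
    for (std::size_t y = 0; y < h; y += 2) {
        for (std::size_t x = 0; x < w; x += 2) {
            int r = 0, g = 0, b = 0;
            for (std::size_t dy = 0; dy < 2; ++dy) {
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    const std::uint8_t* p = bgra + ((y + dy) * w + (x + dx)) * 4;
                    b += p[0]; g += p[1]; r += p[2];
                }
            }
            r = (r + 2) / 4; g = (g + 2) / 4; b = (b + 2) / 4;
            std::uint8_t* uv = uvPlane + (y / 2) * w + x;
            uv[0] = CbFromRgb(r, g, b);  // U
            uv[1] = CrFromRgb(r, g, b);  // V
        }
    }
    return nv12;
}
```

A per-pixel loop like this is fine for getting started; the Video Processor MFT is the better fit once performance or color-space correctness starts to matter.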

Output: Sink Writer

  • Output stream type: the format you want written to disk (e.g. MFVideoFormat_H264)
  • Input stream type: the format your app hands over (e.g. MFVideoFormat_NV12)

Audio

In practice it’s much easier to re-encode only the video and pass the audio through still compressed (a remux) than to decode and re-encode the audio as well.

Notes on the sample code

This article includes a single-file sample you can paste straight into a Visual Studio 2022 C++ console app.

OverlayMp4.exe input.mp4 overlay.png output.mp4

What the code assumes

  • Targets Windows 10/11, x64 build, no precompiled headers
  • The input video’s width and height must be even (because NV12 is 4:2:0)
  • The output is a video-only MP4
  • The overlay text is fixed in kOverlayText (defaults to HelloWorld)
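
The even-dimension requirement follows from how NV12 is laid out: every 2x2 block of pixels shares one U/V pair, so an odd width or height would leave a partial block. A small sketch of an up-front check might look like this. It assumes a tightly packed buffer with no row padding, and Nv12Layout and DescribeNv12 are hypothetical names, not part of the sample.

```cpp
#include <cassert>
#include <cstddef>

// Byte layout of a tightly packed NV12 frame.
struct Nv12Layout {
    std::size_t yBytes;      // full-resolution luma plane
    std::size_t uvOffset;    // where the interleaved U/V plane starts
    std::size_t uvBytes;     // half-height plane of U/V pairs
    std::size_t totalBytes;  // 1.5 bytes per pixel overall
};

// Returns false for sizes NV12 cannot represent cleanly (zero or odd dimensions).
bool DescribeNv12(unsigned width, unsigned height, Nv12Layout* out) {
    assert(out != nullptr);
    if (width == 0 || height == 0 || width % 2 != 0 || height % 2 != 0) {
        return false;
    }
    const std::size_t yBytes = static_cast<std::size_t>(width) * height;
    out->yBytes = yBytes;
    out->uvOffset = yBytes;
    out->uvBytes = yBytes / 2;
    out->totalBytes = yBytes + yBytes / 2;
    return true;
}
```

Rejecting unsupported input before creating any reader or writer gives a clearer error than a failure deep inside the encode loop.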

Flow of the implementation

  1. Initialize MF and GDI+ via ScopedMf and ScopedGdiplus
  2. Pull input video info and configure RGB32 reception in ConfigureSourceReader
  3. Create the output file (H.264/NV12) in CreateSinkWriter
  4. Loop: ReadSample -> CopySampleToTopDownBgra -> DrawOverlay -> BgraToNv12 -> WriteSample
  5. Wrap up with Finalize

Things to watch for when reading the code

Normalize stride and orientation early

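Video frames don’t always have a stride equal to width x 4 (rows can carry padding), and the rows can be stored bottom-up. The code normalizes everything into a top-down BGRA buffer before drawing. BGRA here is simply the in-memory byte order of RGB32: blue, green, red, then a fourth unused or alpha byte.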
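
The normalization step can be sketched in portable C++ as follows. It assumes you already have a pointer to the start of the locked buffer and a signed stride in bytes, where a negative stride means the rows are stored bottom-up; how you obtain those in Media Foundation (for example via MF_MT_DEFAULT_STRIDE or IMF2DBuffer) is outside this sketch. CopyToTopDownBgraSketch is a hypothetical name, not the sample’s actual function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies a possibly padded, possibly bottom-up 32-bit frame into a
// tightly packed top-down buffer (row 0 = top of the image, no padding).
std::vector<std::uint8_t> CopyToTopDownBgraSketch(const std::uint8_t* bufferStart,
                                                  int width, int height,
                                                  std::ptrdiff_t strideBytes) {
    assert(bufferStart != nullptr && width > 0 && height > 0);
    const std::size_t rowBytes = static_cast<std::size_t>(width) * 4;
    const std::size_t pitch =
        static_cast<std::size_t>(strideBytes < 0 ? -strideBytes : strideBytes);
    assert(pitch >= rowBytes);  // the pitch may include padding, never less than one row

    const std::size_t rows = static_cast<std::size_t>(height);
    std::vector<std::uint8_t> out(rowBytes * rows);
    for (std::size_t row = 0; row < rows; ++row) {
        // For bottom-up storage, the last row in memory is the top of the image.
        const std::size_t srcRow = (strideBytes < 0) ? (rows - 1 - row) : row;
        std::memcpy(out.data() + row * rowBytes, bufferStart + srcRow * pitch, rowBytes);
    }
    return out;
}
```

Doing this once, right after the read, means the drawing and conversion steps never have to think about padding or orientation.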

Check both the flags and the sample from ReadSample

ReadSample can return S_OK while sample == nullptr, for example when the flags contain MF_SOURCE_READERF_STREAMTICK or MF_SOURCE_READERF_ENDOFSTREAM. Check the HRESULT, the flags, and inputSample together before touching the sample.
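
One way to keep that three-way check readable is to fold it into a single decision function. The sketch below is deliberately portable: the flag constants are local mirrors of the MF_SOURCE_READER_FLAG values from mfreadwrite.h so it compiles without the Windows SDK, and ReadAction and ClassifyRead are hypothetical names. In real code you would pass the HRESULT, the DWORD flags, and the IMFSample pointer that ReadSample returned.

```cpp
#include <cassert>
#include <cstdint>

// Local mirrors of MF_SOURCE_READERF_ERROR, _ENDOFSTREAM, and _STREAMTICK.
constexpr std::uint32_t kReaderFlagError = 0x00000001;
constexpr std::uint32_t kReaderFlagEndOfStream = 0x00000002;
constexpr std::uint32_t kReaderFlagStreamTick = 0x00000100;

enum class ReadAction { Fail, Stop, Skip, Process };

// hr: the HRESULT from ReadSample (negative means failure).
// flags: the stream flags it reported. sample: the returned sample pointer.
ReadAction ClassifyRead(long hr, std::uint32_t flags, const void* sample) {
    if (hr < 0 || (flags & kReaderFlagError) != 0) {
        return ReadAction::Fail;     // the call or the stream itself failed
    }
    if ((flags & kReaderFlagEndOfStream) != 0) {
        return ReadAction::Stop;     // no more frames: leave the loop and finalize
    }
    if ((flags & kReaderFlagStreamTick) != 0 || sample == nullptr) {
        return ReadAction::Skip;     // a gap or an empty read: nothing to draw on
    }
    return ReadAction::Process;      // a real frame: copy, draw, convert, write
}
```

The loop body then becomes a switch over the result, which makes it hard to dereference a null sample by accident.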

Carry timestamp and duration from the input

Carrying the input sample’s timestamp and duration through to the output wherever possible is more robust than recomputing them every iteration from an assumed fixed fps, since that assumption breaks on variable-frame-rate input and accumulates rounding error.
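
A minimal sketch of that policy: keep the input’s values, and fall back to the nominal frame rate only when the input carries no usable duration. Media Foundation sample times are expressed in 100-nanosecond units; the FrameTiming struct and ChooseOutputTiming function are hypothetical names, and in real code the inputs would come from IMFSample::GetSampleTime and IMFSample::GetSampleDuration.

```cpp
#include <cassert>
#include <cstdint>

// Media Foundation timestamps and durations are in 100-nanosecond units.
constexpr std::int64_t kTicksPerSecond = 10000000;

struct FrameTiming {
    std::int64_t time;      // presentation time to set on the output sample
    std::int64_t duration;  // duration to set on the output sample
};

// Prefers the values carried by the input sample. The frame rate
// (fpsNumerator / fpsDenominator) is used only when the duration is missing.
FrameTiming ChooseOutputTiming(std::int64_t inputTime, bool hasInputDuration,
                               std::int64_t inputDuration,
                               std::uint32_t fpsNumerator, std::uint32_t fpsDenominator) {
    assert(fpsNumerator > 0 && fpsDenominator > 0);
    FrameTiming timing{inputTime, inputDuration};
    if (!hasInputDuration || inputDuration <= 0) {
        timing.duration = kTicksPerSecond * fpsDenominator / fpsNumerator;
    }
    return timing;
}
```

Note that the timestamp is never synthesized from a frame counter here; only the duration has a fallback.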

Where to take it for production

  1. Add audio remux: re-encode video only, pass audio through
  2. Use the Video Processor MFT: handles color space conversion, resizing, and deinterlacing in one place
  3. Swap drawing for Direct2D/DirectWrite: better for high resolution and long videos
  4. Move to D3D11 surfaces: when you want to push work onto the GPU path
  5. Factor it out as a custom MFT: when you want to reuse the logic across multiple apps

Wrap-up

When you burn images or text into video frames with Media Foundation, splitting the problem into read, draw, convert, write back keeps things tidy. Get something running first with Source Reader -> RGB32 -> GDI+ -> NV12 -> Sink Writer, then layer in improvements as the use case demands - that’s the practical path.
