I'm Dmitry Popov,
lead developer and director of Infognition.

Known in the interwebs as Dee Mon since 1997. You could see me as thedeemon on reddit or LiveJournal.

Optimizations in a multicore pipeline
November 6, 2014

This is a story from the development of Video Enhancer a few minor versions ago. Video Enhancer is a video processing application that, while doing its work, shows two images, "before" and "after": a part of the original video frame and the same part after processing.

It uses DirectShow and builds a graph whose vertices (called filters) are things like a file reader, an audio/video stream splitter, decoders, encoders, a muxer, a file writer and a number of processing filters; the graph edges are data streams. What usually happens is this: a reader reads the source video file; a splitter separates it into two streams (audio and video) and divides them into frames; a decoder turns compressed frames into raw bitmaps; a part of the bitmap is drawn on screen (the "before"); processing filters turn the bitmaps into a stream of different bitmaps (in this case our Super Resolution filter increases the resolution, making each frame bigger); a part of the processed frame is displayed on screen (the "after"); an encoder compresses the frame and sends it to the AVI writer, which collects frames from both the video and audio streams and writes them to an output AVI file.

Doing all of this sequentially is not very efficient, because these days we usually have multiple CPU cores and it would be better to use them all. To do that, special Parallelizer filters were added to the filter chain. Such a filter receives a frame, puts it into a queue and immediately returns. In another thread it takes frames from this queue and feeds them to the downstream filter. In effect, as soon as the decoder has decompressed a frame and given it to a Parallelizer, it can immediately start decoding the next frame, and the frame just decoded will be processed in parallel. Similarly, as soon as a frame is processed, the processing filter can immediately start working on the next frame, and the frame just processed will be encoded and written in parallel, on another core. A pipeline!
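The Parallelizer idea can be sketched in a few lines of Python. This is only an illustration, not the DirectShow filter itself: the class name, the `receive`/`close` methods and the `max_queued` bound are assumptions made for the example, and the "filters" are plain callables that tag a string.

```python
import queue
import threading


class Parallelizer:
    """Hypothetical stand-in for a Parallelizer filter.

    receive() only enqueues the frame and returns; a worker thread
    takes frames from the queue and feeds them to the downstream callable.
    """

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, downstream, max_queued=1):
        self._downstream = downstream
        # The bound is an assumption of this sketch; put() blocks when it is reached.
        self._queue = queue.Queue(maxsize=max_queued)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def receive(self, frame):
        self._queue.put(frame)

    def _run(self):
        while True:
            frame = self._queue.get()
            if frame is self._STOP:
                break
            self._downstream(frame)

    def close(self):
        # Everything queued before the sentinel is delivered first.
        self._queue.put(self._STOP)
        self._worker.join()


def run_pipeline(frames):
    """Decode in the caller's thread, process in a second, encode in a third."""
    written = []
    encode_stage = Parallelizer(lambda f: written.append(f"encoded({f})"))
    process_stage = Parallelizer(lambda f: encode_stage.receive(f"sr({f})"))
    for frame in frames:
        process_stage.receive(f"decoded({frame})")
    process_stage.close()
    encode_stage.close()
    return written
```

Because every stage is a single thread reading a FIFO queue, frames come out in the order they went in, while different frames are handled by different stages at the same time.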

At some point I noticed that this pipeline didn't work as well on my dual core laptop as on a quad core desktop, so I decided to take a closer look at what happens when, and where any unnecessary delays might be. I added some logging to different parts of the pipeline and, since the logs weren't too clear in text form, made a converter into SVG, which produced some interesting pictures.
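A converter of this kind could look roughly like the sketch below. The record layout (thread index, frame number, label, start and end time in milliseconds), the palette and the function name are all assumptions for illustration; the original log format and tool are not described here. Time runs downward and each frame gets its own color, one column per thread.

```python
from xml.sax.saxutils import escape

# Arbitrary palette; frames cycle through it.
FRAME_COLORS = ["#e6194b", "#3cb44b", "#4363d8", "#f58231"]


def events_to_svg(events, column_width=120, px_per_ms=2):
    """Render (thread, frame, label, start_ms, end_ms) records as SVG rectangles."""
    if not events:
        return '<svg xmlns="http://www.w3.org/2000/svg" width="0" height="0"></svg>'
    width = (max(e[0] for e in events) + 1) * column_width
    height = max(e[4] for e in events) * px_per_ms
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for thread, frame, label, start, end in events:
        x = thread * column_width          # one column per thread
        y = start * px_per_ms              # time goes down
        h = (end - start) * px_per_ms
        color = FRAME_COLORS[frame % len(FRAME_COLORS)]
        parts.append(
            f'<rect x="{x}" y="{y}" width="{column_width - 4}" height="{h}" '
            f'fill="{color}"><title>{escape(label)} frame {frame}</title></rect>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

For example, `events_to_svg([(0, 0, "SR", 0, 40), (1, 0, "XviD", 40, 55)])` yields two rectangles, the second one starting where the first ends but in the neighbouring column.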

Here's what the process looked like on a quad core box with DivX used for encoding. The source was an FLV file, but its splitter and decoder are not shown here because, being third-party components, they don't write to my logs. Time in this picture goes down. Different colors mark different frames. The decoder gives a frame to the first Sample Grabber, which remembers the part of the frame we want to display on screen. It sends the frame to a Parallelizer that puts it into its queue. In another thread (different threads are shown with different background colors here) the frame is processed by the SR (Super Resolution) filter and sent to another Parallelizer. In yet another thread the processed frame from the Parallelizer queue is given to the second Sample Grabber, which takes a part of it for display and sends the frame to an encoder (DivX), which later gives the compressed frame to the AVI writer.

DivX doesn't write anything to my logs, but since it has my filters on both sides, I know exactly when it takes a frame and when it gives it out, compressed. In this picture everything goes smoothly: the speed of the whole pipeline is limited by its slowest part (here it's SR), this part works 100% of the time, and the other parts work in parallel, taking a rest from time to time while waiting for the next portion of data.

And here's how a similar process looked on my dual core laptop. The source file and the processing were the same, but instead of DivX I was using XviD.

None of the pipeline parts works 100% of the time! The processing thread is idle half of the time, waiting for the encoding-and-writing thread. And that thread is acting weird: sometimes taking a part of the bitmap for display happens in a snap and sometimes it takes a lot of time. And while it's working, the encoder does nothing.

The encoder is working weirdly too: compressing some frames takes significant time while others are compressed in no time at all. Knowing that we're dealing with MPEG-4, where there are B-frames (bidirectional ones, based on both previous and following frames), it's clear that XviD just remembers every other frame and, when it receives the next frame, compresses both the remembered one and the new one. After compressing two frames it outputs one of them immediately and the second one on the next call, when it remembers a new frame.

But why does extracting a part of a frame for display take so long sometimes? It is such a simple operation! If we look closer at the picture we'll notice that this extraction works super fast (as it should) when the received frame is still hot, just processed. But if the frame spends some time in the queue, waiting, then extracting a part of it takes much longer. It's the CPU cache! In the first case the frame is still in the cache and everything is quick. In the second case the cache has been filled with other data, and to extract a part of the frame it has to be loaded from memory, which is much slower, hence the delays.
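The encoder behaviour inferred above (remember every other frame, then compress two at once) can be modelled with a toy class. This is not XviD's API: the class and method names are made up for the sketch, and the order in which the two compressed frames come out is a simplifying assumption.

```python
class PairingEncoder:
    """Toy model of an encoder that remembers every other frame."""

    def __init__(self):
        self._held = None      # raw frame remembered for later compression
        self._pending = None   # compressed frame waiting for the next call
        self.work_log = []     # how many frames each call actually compressed

    def encode(self, frame):
        if self._held is None:
            # Cheap call: just remember the frame and hand out
            # whatever is left over from the previous expensive call.
            self._held = frame
            out, self._pending = self._pending, None
            self.work_log.append(0)
            return out
        # Expensive call: compress the remembered frame and the new one.
        out = f"compressed({self._held})"
        self._pending = f"compressed({frame})"  # emitted on the next call
        self._held = None
        self.work_log.append(2)
        return out
```

Feeding it frames 0, 1, 2, 3 gives a work pattern of 0, 2, 0, 2 compressed frames per call, which matches the alternating "no time at all" and "significant time" calls seen in the picture.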

This led to the first solution: swap the second Sample Grabber and the Parallelizer. Then extracting a part of the frame in the grabber happens immediately after the frame is processed, while it is still in the cache. And indeed, swapping these two filters made this extraction invariably fast. The extraction operation became faster without changing a letter in its implementation!
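To see what the swap changes, it helps to write down which filters share a thread. The helper below is a hypothetical illustration, with filter names taken from the description above: every Parallelizer in the chain marks the point where frames are handed to a new thread.

```python
def split_into_threads(chain):
    """Group filter names by thread; each Parallelizer ends the current group."""
    groups, current = [], []
    for name in chain:
        if name == "Parallelizer":
            groups.append(current)
            current = []
        else:
            current.append(name)
    groups.append(current)
    return groups


BEFORE = ["Sample Grabber 1", "Parallelizer", "SR", "Parallelizer",
          "Sample Grabber 2", "XviD", "AVI Writer"]
AFTER = ["Sample Grabber 1", "Parallelizer", "SR", "Sample Grabber 2",
         "Parallelizer", "XviD", "AVI Writer"]

for label, chain in (("before", BEFORE), ("after", AFTER)):
    print(label, split_into_threads(chain))
```

Before the swap the second grabber runs in the encoder's thread, after the frame has sat in the queue. After the swap it runs in the SR thread, right behind the filter that has just written the frame.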

In this picture the encoder's times are not really displayed correctly, because we know exactly only when a frame has finished compressing, not when its encoding started. But even so, another problem is clearly visible here. XviD is still the slowest part of the pipeline, and ideally it should work 100% of the time while the other parts of the pipeline sometimes wait until XviD can take another frame. But here it's not working all the time: part of the time it waits until another frame has finished processing. But why wasn't that frame processed earlier? The processing filter already had this frame and had the time to process it.

Here's what happened. When a splitter, decoder or other filter wants to create another sample of data (like a video frame), it requests a memory buffer for it from an allocator. This allocator is negotiated with the downstream filter before processing starts (in particular, the downstream filter might be a renderer and provide an allocator that allocates samples directly in video memory). The number of buffers that an allocator has is strictly limited. If a filter requests a buffer when all of them are taken, the request blocks until some buffer gets released and becomes free to use.

This is how a Parallelizer limits its queue length: by limiting the number of buffers in its allocator. If we don't limit this length and the producing filter works faster than the consuming one, the queue will quickly eat all the memory. If we set a limit but make it rather big, the queue may take too much RAM for uncompressed video frames. So a long time ago it was decided to limit the Parallelizer's queue to just two samples: one buffer is used by the downstream consumer filter, while the other buffer is being filled by the upstream producing filter in parallel. When the downstream filter finishes its processing, it releases the buffer to the allocator, and the buffer becomes available for the upstream filter to place one more sample in the queue. This way they work in parallel without consuming too much memory, but the queue can never have more than one frame waiting.

When XviD compresses a frame and swallows the next one (which was waiting in the queue), remembering it for compressing later, the queue suddenly becomes empty, and now XviD needs to wait until the upstream processing filter makes a new processed frame. And before this moment the processing filter couldn't do it because both buffers were taken: one was being compressed, the other one waited in the queue.
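The blocking behaviour of a fixed-size allocator is easy to reproduce. The sketch below is a simplified stand-in, not the DirectShow allocator interface: `BufferPool`, `get_buffer` and `release` are names invented for the example, and the timeout exists only so the blocking can be observed without hanging.

```python
import queue


class BufferPool:
    """Fixed number of reusable buffers; requests block when none are free."""

    def __init__(self, count, size):
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def get_buffer(self, timeout=None):
        # Blocks until a buffer is released; with a timeout, raises
        # queue.Empty if no buffer became free in time.
        return self._free.get(timeout=timeout)

    def release(self, buf):
        self._free.put(buf)


pool = BufferPool(count=2, size=16)
in_encoder = pool.get_buffer()   # frame being compressed
in_queue = pool.get_buffer()     # frame waiting in the queue
try:
    pool.get_buffer(timeout=0.01)
    producer_blocked = False
except queue.Empty:
    producer_blocked = True      # the processing filter has to wait here

pool.release(in_encoder)         # the encoder is done with its frame
reused = pool.get_buffer(timeout=0.01)
```

With two buffers, the third request cannot proceed until one of the first two is released, which is exactly the point where the processing filter stalls.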

Hence another solution: taking into account that some codecs behave this way, the queue of frames for the encoder should be made longer. After increasing the number of buffers in the second Parallelizer from 2 to 3 I got the following picture (samples in the encoder are not depicted, as they overlap because of the increased queue length):

The processing thread started to work full time, no more unnecessary waiting! Overall speed increased, although not too much, because XviD still took too long for some frames, but that was already out of my control. Again, the acceleration happened without changing a bit in the processing parts doing the hard work, just by changing one constant in the pipeline organization. That was fun!
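The effect of that one constant can be reproduced with a small timing model. It is a deliberately simplified sketch with assumed numbers, not a measurement: processing a frame takes a fixed `process_ms`, the encoder does nothing on even-numbered frames (it only remembers them) and spends `2 * encode_ms` on odd-numbered ones, and a frame's buffer returns to the pool when the encoder call for that frame ends.

```python
def simulate(num_buffers, num_frames, process_ms, encode_ms):
    """Return (total_ms, producer_idle_ms) for the processing/encoding pair."""
    produced = []   # time each frame leaves the processing filter
    released = []   # time each frame's buffer returns to the pool
    producer_idle = 0
    for i in range(num_frames):
        prev_done = produced[i - 1] if i > 0 else 0
        # Buffers come back in frame order, so frame i reuses
        # the buffer that frame (i - num_buffers) occupied.
        buffer_free = released[i - num_buffers] if i >= num_buffers else 0
        start = max(prev_done, buffer_free)
        producer_idle += start - prev_done
        produced.append(start + process_ms)

        encoder_free = released[i - 1] if i > 0 else 0
        encode_start = max(encoder_free, produced[i])
        cost = 2 * encode_ms if i % 2 == 1 else 0  # compress in pairs
        released.append(encode_start + cost)
    return released[-1], producer_idle


for buffers in (2, 3):
    total, idle = simulate(buffers, num_frames=8, process_ms=10, encode_ms=10)
    print(f"{buffers} buffers: total {total} ms, processing idle {idle} ms")
```

With these assumed timings the model gives 130 ms in total and 30 ms of processing idle time for two buffers, against 100 ms and no idle time for three: the processing filter never has to wait for a free buffer, so the encoder always finds its next pair ready.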