Fast video rendering and encoding using web APIs

Never read pixels into JavaScript memory.
Adam Pietrasiak
TLDR:
  • At all costs, avoid reading pixel data into JavaScript memory. It will take ~80% of the total time and will be your biggest bottleneck (it seems the data is read from GPU memory into much slower CPU memory - I don’t know the details)
  • Use native video encoding APIs - VideoFrame created directly from the canvas, VideoEncoder, or VideoDecoder.
  • If you do it correctly - native encoding will be your bottleneck, not the speed of your JavaScript code.

User experience

In the screen.studio app, export speed is critical for a great user experience. If you have to wait 15 minutes to export a 1-minute recording, it kills the experience.

General flow

A typical screen.studio project consists of the original recording and some background. The original recording is just a video file.
There are 2 main modes of playing the animation in screen.studio:
  • Preview mode (aka. Editor)
  • Export mode

Preview mode

Preview mode is the result you see while you edit your project.
It is OK if some frames are dropped in preview mode.
In essence, this is how the preview mode works:
  • If the user hits ‘Play’ - we play the original video file the same way you’d play any HTMLVideo.
  • Everything else tries its best to stay synced with this playing video, but it is OK if it occasionally falls behind.
  • If the user hits ‘Pause’ and seeks to some frame:
    • the original recording video seeks to that time using video.currentTime
    • Everything else also renders according to that frame.
As seen, preview mode has some limitations but is also “optimistic” in many ways, making the code simple. It is also naturally “fast” as it plays precisely at the speed of your original recording.
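In code, the essence of that preview loop is roughly the sketch below - the playing video element is the source of truth, and renderOverlaysAt is a hypothetical stand-in for “everything else” (background, zooms, cursor):

```ts
// The playing <video> drives the preview; overlays simply follow its clock.
const video = document.querySelector<HTMLVideoElement>("video")!;

function previewLoop() {
  // Read the current playback position and draw everything else for it.
  renderOverlaysAt(video.currentTime); // hypothetical helper
  requestAnimationFrame(previewLoop);  // it is fine if a frame is occasionally skipped
}

requestAnimationFrame(previewLoop);
```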

Export mode

Export mode is entirely different because of one core requirement:
In export mode, every frame needs to be rendered correctly, and no dropped frames are allowed. If you export the same project twice, the result should be identical.
Because of that, the general flow is:
  • Go over every single frame
  • Make sure every asset (video, images) is loaded
  • Seek every video to the correct time and wait for it to actually display the frame you requested
  • Render everything using WebGL
  • Capture frame in some way
  • Send captured frame to video encoder
  • Go to the next frame and repeat
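A minimal sketch of that loop, with every helper name being illustrative rather than the actual screen.studio code:

```ts
// Export loop: deterministic, frame by frame, nothing is allowed to be skipped.
async function exportProject(totalFrames: number) {
  for (let frame = 0; frame < totalFrames; frame++) {
    await ensureAssetsLoaded(frame);   // videos, images, fonts needed for this frame
    await seekVideosToFrame(frame);    // wait until the exact frame is really displayed
    renderFrameWithWebGL(frame);       // draw everything onto the export canvas
    const captured = captureFrame();   // capture the canvas in some way
    await sendToEncoder(captured);     // hand the captured frame to the video encoder
  }
  await finalizeEncoding();            // flush the encoder and mux the final file
}
```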
As you can see, there is way more complexity here, and also a lot of room for performance bottlenecks.
Two core areas can be a bottleneck:
  • Preparing & Rendering frame
  • Capturing & encoding frame
Each of those gave me a lot of trouble.

Preparing and rendering frame

In this stage, I need to ensure my WebGL canvas renders precisely what should be visible on the given frame of the final video. Everything needs to be ready before we capture.
The main problem here is rendering videos in a performant way. To be precise: setting the recording video’s time exactly where we want it to be and waiting for it to let us know, “OK, I’m showing what I should be showing now.”

First attempt - native HTML5 video and waiting for the seeked event after changing video.currentTime

As a web developer, this felt like the easiest and most natural place to start. It is also reliable, as we use the browser’s video element, which is thoroughly battle-tested.
The problem here is its speed.
Setting video.currentTime doesn’t mean the video is instantly showing the given frame. It might, for example, have to load the data first, or the codec may need some time to decode that frame.
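The pattern itself is simple enough - roughly, seek and then wait for the seeked event:

```ts
// Seek the recording to a given time and resolve once the browser has
// actually decoded and displayed that frame (signalled by 'seeked').
function seekTo(video: HTMLVideoElement, time: number): Promise<void> {
  return new Promise((resolve) => {
    video.addEventListener("seeked", () => resolve(), { once: true });
    video.currentTime = time;
  });
}
```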
Turns out the seeked event is usually fired 20-100ms after I change currentTime - the time depends significantly on how unexpected the change is. E.g., if we jump to a totally different part of the video - it will take way more time. It is also quite random; sometimes, it takes 100ms just because it does.
In essence - this is a huge bottleneck. Even being optimistic and saying we wait 20ms on average, it is already setting a hard limit of at most 50fps export speed. And there is much more to do for each frame than to show the original recording video frame.
Note that I still use this method in the editor (preview mode), as it is OK to wait a moment there when you move your playhead.

Second attempt - VideoDecoder and VideoFrame APIs

This attempt is the one I currently use, and it works well.
The flow is quite complex, however.
First, we load the .mp4 file as a raw ArrayBuffer. Rather straightforward.
Then we need to demux this video. What is demuxing, you’ll ask? In general, .mp4 is a container file. It means it can contain a lot of tracks. For example, a typical movie has at least 2 tracks - one video track and one audio track.
Demuxing means ‘picking raw data about one track out of .mp4 container file’.
I used the MP4Box.js project, as I quickly realized it is tough to do in-house. You need to understand all the ins and outs of the .mp4 format, binary flags, data offsets, etc.
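A hedged sketch of that demuxing step, assuming the mp4box npm package; handleSamples is a hypothetical callback for the next stage:

```ts
import MP4Box from "mp4box";

// Demux: pull the raw video-track samples out of the .mp4 container.
const mp4file = MP4Box.createFile();

mp4file.onReady = (info) => {
  const videoTrack = info.videoTracks[0];
  // Ask MP4Box to hand us the samples of the video track in batches.
  mp4file.setExtractionOptions(videoTrack.id, null, { nbSamples: 1000 });
  mp4file.start();
};

mp4file.onSamples = (trackId, user, samples) => {
  handleSamples(samples); // hypothetical: feed them into the decoder (next step)
};

const buffer = await (await fetch("recording.mp4")).arrayBuffer();
// MP4Box needs to know where this buffer sits within the whole file.
(buffer as ArrayBuffer & { fileStart: number }).fileStart = 0;
mp4file.appendBuffer(buffer);
mp4file.flush();
```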
OK, once we have demuxed the mp4 into raw track data, the so-called “samples”, we create a VideoDecoder - a native API that is quite powerful, as it can create video frames directly in GPU memory.
When we have the decoder, we configure it to use the same codec that our video track is encoded with.
Then we pass each sample to it, and it gives us VideoFrame objects.
Those are also quite powerful, as we can create WebGL textures out of those without ever having to read them into JS memory.
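Putting those pieces together, a minimal sketch of the decoding side could look like this - the codec string is illustrative, avccBox (the codec description) and the samples are assumed to come from the demuxer above, and drawFrame is hypothetical:

```ts
// Decode demuxed samples into VideoFrame objects that live in GPU memory.
const decoder = new VideoDecoder({
  output: (frame: VideoFrame) => {
    // The frame can be uploaded as a WebGL texture without reading pixels into JS,
    // e.g. gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, frame);
    drawFrame(frame); // hypothetical
    frame.close();    // release GPU memory so the decoder keeps going
  },
  error: (e) => console.error(e),
});

decoder.configure({
  codec: "avc1.42E01E", // illustrative H.264 codec string
  description: avccBox, // avcC data from the demuxed track (assumed available)
});

for (const sample of samples) {
  decoder.decode(
    new EncodedVideoChunk({
      type: sample.is_sync ? "key" : "delta",
      // MP4 stores time in "timescale" ticks; WebCodecs expects microseconds.
      timestamp: (sample.cts * 1_000_000) / sample.timescale,
      duration: (sample.duration * 1_000_000) / sample.timescale,
      data: sample.data,
    })
  );
}
await decoder.flush(); // wait for the remaining frames to come out
```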
There are some severe limitations here; however, they were not critical in my case:
  • VideoFrame is a GPU texture, so it takes up a lot of memory. This means you cannot hold many of them in memory at once
  • Because of that, VideoDecoder gives you around 10 of them at once and waits till you say, “I’m done with them; you can destroy them” by calling videoFrame.close(); only then it’ll give you the next ones.
  • Because of that, decoding video this way can only move in one direction, i.e., once you close some frame, you cannot move back in time; you can only move forward. This is fine in my case - I only use this method when exporting, and I export frame by frame, always moving forward.
It turns out this method is extremely fast once you load the video data. Creating and displaying a VideoFrame as a WebGL texture takes sub-millisecond time. Perfect!
There are a lot of other downsides to this method, however:
  • It is very manual and error-prone.
  • I need to manually parse all the timing information for each frame, which is very different from a natural timestamp (video codecs use a “timescale” property, which is kinda like saying, “this is how many ticks there are in 1 second; now I’ll tell the time in those custom ticks”. If I set the timescale to 9000, then 9000 means 1 second, and 13500 means 1.5 seconds).
  • The entire decoding code is highly coupled with the file format I handle. If, for example, I stopped expecting recordings to be .mp4 files encoded with the H.264 codec, I’d need to refactor the entire decoding pipeline.
I now use this method to quickly show original recording frames in the final export. It is so fast once the video is loaded, it is not a problem anymore.

Capturing & Encoding frames

Once WebGL renders a given frame, we need to capture it and encode it into the final video file.
This process can be implemented relatively simply and naively, but it’ll be slow.

First round - WebGL readPixels() → save frame into .png file → encode final video out of hundreds of those frame images.

It already sounds terrible.
My first approach was to use the readPixels method on my canvas. It gives me a massive array of numbers, each representing red, green, blue, and alpha values in the range of 0-255. For example, if I capture one frame of HD export - 1920x1080 - it’ll require 1920*1080*4 slots in this array (8294400 numbers for one frame).
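For reference, reading one HD frame back looks roughly like this (gl is assumed to be the WebGL context of the export canvas):

```ts
// Read a single 1920x1080 frame from GPU memory into JavaScript memory.
const width = 1920;
const height = 1080;
const pixels = new Uint8Array(width * height * 4); // 8,294,400 numbers per frame
gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
```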
Then, having that raw pixel data, I create a .png file out of it, which already takes a lot of time.
Then, the final encoder has to read those images back into raw pixel data and create the video.
Exports were painfully slow.
The biggest bottleneck here was surprising - converting raw pixels into a .png file for each frame.

Second round - WebGL readPixels() → pass raw pixel data directly into the encoder

In my second attempt, I created an open ReadableStream while exporting the video and connected it to the encoder, which read those pixels “live” and encoded them ad hoc. I was bombarding this stream with billions of RGBA, RGBA, RGBA values.
This removed the step of creating a .png file and then converting it back to raw pixels.
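A sketch of that idea - renderNextFrameAndReadPixels is a hypothetical stand-in for the rendering code, and the encoder (not shown) consumes the stream:

```ts
// Expose rendered frames as a live stream of raw RGBA bytes.
const frameStream = new ReadableStream<Uint8Array>({
  pull(controller) {
    const pixels = renderNextFrameAndReadPixels(); // width * height * 4 bytes, or null when done
    if (pixels) {
      controller.enqueue(pixels);
    } else {
      controller.close();
    }
  },
});
```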
This was, however, a highly fragile process. For example, if I told the encoder I wanted the video to be 900 pixels wide, but by mistake sent pixels as if it were 901 pixels wide, the resulting video was tilted by 45 degrees in quite a funny way, as each following row of pixels was shifted to the right by 1 pixel.
This method was significantly faster, but it was still painfully slow.

Third round - batching readPixels calls.

I was lost, not knowing what else I could do here. At some point, I experimented with measuring the speed of calling .readPixels() 10 times in a row. I was amazed that it was ~50% faster.
I don’t know precisely why - I bet the GPU was switching to ‘send pixels’ mode once, and then sending pixels many times in a row, without anything else happening in between, was faster.
The flow was:
  • I was rendering 60 frames in a row
  • Each time one frame was ready - I copied it to another canvas so I did not lose it
  • Copying from canvas to canvas happens on the GPU level, so I’m not reading pixel data
  • When I had 60 canvases like this, each with a new frame, I iterated over them and only then read the pixels from each one
  • Then I sent the pixel data from each of them to the encoder
This was another incremental improvement, but it was still not enough.
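A hedged sketch of that batching flow, where glCanvas stands for the main WebGL canvas and renderFrame/sendToEncoder are hypothetical helpers:

```ts
// Render a batch of frames, park each one on its own offscreen canvas
// (a GPU-side copy), and only then read all of them back in one tight loop.
const BATCH_SIZE = 60;
const copies: OffscreenCanvas[] = [];

for (let i = 0; i < BATCH_SIZE; i++) {
  renderFrame(i); // hypothetical: draws frame i onto glCanvas
  const copy = new OffscreenCanvas(glCanvas.width, glCanvas.height);
  copy.getContext("2d")!.drawImage(glCanvas, 0, 0); // canvas-to-canvas copy, no readback
  copies.push(copy);
}

// Only now read the pixels, back to back, and ship them off.
for (const copy of copies) {
  const { data } = copy.getContext("2d")!.getImageData(0, 0, copy.width, copy.height);
  sendToEncoder(data); // hypothetical
}
```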

Fourth round - VideoEncoder and VideoFrame captured out of the canvas.

When debugging the previous method, I quickly realized the most significant bottleneck was not rendering the frame on canvas, but reading pixels out of it. It took 70% of the total time to handle 1 frame. To be precise, reading it and sending it to the encoder was painfully slow.
Only then did I realize there is one fantastic thing possible in modern JS APIs - new VideoFrame(canvas) - it creates a VideoFrame out of the canvas at the GPU level, never reading pixel data.
Then you can pass this frame directly to the native VideoEncoder using the encoder.encode(frame) method.
This way, we’re encoding the video (or, to be precise - the video track of a video file) without reading pixel data into JS memory.
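A minimal sketch of this capture-and-encode path - the codec, bitrate, keyframe interval, and the muxing callback are illustrative, not the exact screen.studio setup:

```ts
// Native encoder; its output chunks still need to be muxed into an .mp4 later.
const encoder = new VideoEncoder({
  output: (chunk, metadata) => muxVideoChunk(chunk, metadata), // hypothetical muxing step
  error: (e) => console.error(e),
});

encoder.configure({
  codec: "avc1.42E01E", // illustrative H.264 codec string
  width: 1920,
  height: 1080,
  framerate: 60,
  bitrate: 10_000_000,
});

const FRAME_DURATION_US = 1_000_000 / 60; // microseconds per frame at 60fps

function encodeCanvasFrame(canvas: HTMLCanvasElement, frameIndex: number) {
  // The frame is created straight from the canvas on the GPU - no readPixels.
  const frame = new VideoFrame(canvas, {
    timestamp: frameIndex * FRAME_DURATION_US,
    duration: FRAME_DURATION_US,
  });
  encoder.encode(frame, { keyFrame: frameIndex % 120 === 0 });
  frame.close(); // the encoder keeps its own reference internally
}
```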
It was a massive change as it removed the step that took 70% of JS time previously.
I quickly realized I was rendering frames and sending them to the encoder faster than it could encode them, which piled them up in a queue and increased memory usage.
As a result, I set a maximum queue size; once it is reached, the JS side waits for the encoder to finish encoding the pending frames.
This is a fundamental change, as JS (a slow language) now waits for native APIs to do their job instead of everything else waiting for JavaScript.
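A sketch of that backpressure, using VideoEncoder’s encodeQueueSize and its dequeue event (the limit of 30 is an illustrative number, not the one used in the app):

```ts
const MAX_PENDING_FRAMES = 30; // illustrative limit

// Pause the JS-side rendering loop until the native encoder drains its queue.
async function waitForEncoderCapacity(encoder: VideoEncoder): Promise<void> {
  while (encoder.encodeQueueSize > MAX_PENDING_FRAMES) {
    // 'dequeue' fires whenever the encoder finishes another chunk.
    await new Promise<void>((resolve) =>
      encoder.addEventListener("dequeue", () => resolve(), { once: true })
    );
  }
}
```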
Now, this is more complex, as VideoEncoder only encodes a single video track - not an entire video file.
At the start of this article, when I was using VideoDecoder, I needed to “demux” a video file into a video track; now I needed to do the reverse - VideoEncoder was giving me raw data about the video track, which I needed to “mux” into a video file.
After all that is done, I also need to add a proper audio track in case you recorded a voiceover, etc.
There are also a lot of other moving pieces here, such as using the correct codecs, color profiles, etc., but I’ll describe those in another article.