- Use native video encoding APIs - `VideoFrame` created directly from the canvas
At screen.studio, export speed is critical for a great user experience. If you have to wait 15 minutes to export a 1-minute recording, it kills the experience.
A typical screen.studio project consists of the original recording and some background. The original recording is just a video file.
There are two main modes of playing the animation in screen.studio:
- Preview mode (aka. Editor)
- Export mode
Preview mode is the result you see while you edit your project.
It is OK if some frames are dropped in preview mode.
In essence, this is how the preview mode works:
- If the user hits ‘Play’, we play the original video file the same way you’d play any `HTMLVideoElement`.
- Everything else tries its best to stay synced with this playing video, but it is OK if it doesn’t quite make it.
- If the user hits ‘Pause’ and seeks to some frame:
  - The original recording video seeks to that time using `video.currentTime`.
  - Everything else also renders according to that frame.
As seen, preview mode has some limitations but is also “optimistic” in many ways, making the code simple. It is also naturally “fast” as it plays precisely at the speed of your original recording.
Export mode is entirely different because of one core requirement:
In export mode, every frame needs to be rendered correctly, and no dropped frames are allowed. If you export the same project twice, the result should be identical.
Because of that, the general flow is:
- Go over every single frame
- Make sure every asset (videos, images) is loaded
- Seek every video to the correct time and wait for it to actually display the frame you requested
- Render everything using WebGL
- Capture the frame in some way
- Send the captured frame to the video encoder
- Go to the next frame and repeat
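The flow above can be sketched as a loop. The heavy pieces (asset loading, WebGL rendering, encoding) are injected as callbacks here; all names are illustrative, not Screen Studio’s actual code.

```typescript
// A sketch of the per-frame export loop: prepare, render, capture,
// encode, advance. Deterministic because every step is awaited.
type FrameOps = {
  prepareAssets: (frame: number) => Promise<void>; // seek videos, load images
  renderAndCapture: (frame: number) => Uint8Array; // WebGL render + capture
  encode: (pixels: Uint8Array) => Promise<void>;   // hand off to the encoder
};

async function exportProject(totalFrames: number, ops: FrameOps): Promise<void> {
  for (let frame = 0; frame < totalFrames; frame++) {
    await ops.prepareAssets(frame);               // everything ready before capture
    const captured = ops.renderAndCapture(frame); // no dropped frames allowed
    await ops.encode(captured);
  }
}
```

Because nothing proceeds until the previous step finishes, exporting the same project twice yields identical results.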
As seen, there is way more complexity here. There is also a lot of room for bottlenecks in terms of performance.
Two core areas can be a bottleneck:
- Preparing & rendering a frame
- Capturing & encoding a frame
Each of those gave me a lot of trouble.
Preparing and rendering frame
In this stage, I need to ensure my WebGL canvas renders precisely what should be visible on the given frame of the final video. Everything needs to be ready before we capture.
The main problem here is rendering videos in a performant way - to be precise, setting the recording video’s time to exactly where we want it and waiting for it to let us know, “OK, I’m showing what I should be showing now.”
First attempt - native `HTMLVideoElement` and waiting for the `seeked` event after changing `currentTime`
As a web developer, that felt like the easiest and most natural place to start. It is also reliable, as we use the browser’s video element, which is absolutely battle-tested.
The problem here is its speed.
Setting `video.currentTime` doesn’t mean the video is instantly showing the given frame. It might, for example, have to load it, or it may take some time for the codec to decode that frame.
It turns out the `seeked` event is usually fired 20-100ms after I change `currentTime` - the delay depends significantly on how unexpected the change is. E.g., if we jump to a totally different part of the video, it will take way more time. It is also quite random; sometimes it takes 100ms just because it does.
In essence, this is a huge bottleneck. Even being optimistic and saying we wait 20ms on average, that already sets a hard limit of at most 50fps export speed. And there is much more to do for each frame than just showing the original recording’s frame.
Note that I still use this method in preview mode, as it is OK to wait a moment when you move your playhead in the editor.
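The seek-and-wait pattern can be wrapped in a promise. The interface below is just the subset of `HTMLVideoElement` the helper needs, so the sketch also runs against a stub outside a browser.

```typescript
// Seek the video and resolve only once the requested frame is actually
// displayed. The "seeked" event is what introduces the 20-100ms delay
// described above.
interface SeekableVideo {
  currentTime: number;
  addEventListener(type: "seeked", listener: () => void, options?: { once: boolean }): void;
}

function seekTo(video: SeekableVideo, time: number): Promise<void> {
  return new Promise((resolve) => {
    // Register the listener first so we cannot miss a fast seek.
    video.addEventListener("seeked", () => resolve(), { once: true });
    video.currentTime = time; // triggers the asynchronous seek
  });
}
```

In the editor, awaiting `seekTo(video, t)` before rendering the frame is perfectly acceptable; in export mode, that per-frame wait is the bottleneck.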
Second attempt - WebCodecs `VideoDecoder`
This attempt is the one I currently use, and it works well.
The flow is quite complex, however.
First, we load the `.mp4` file as raw data into an `ArrayBuffer`. Rather straightforward.
Then we need to demux this video. What is demuxing, you’ll ask? In general, `.mp4` is a container format - it can contain many tracks. For example, a typical movie has at least 2 tracks: one video track and one audio track.
Demuxing means ‘picking raw data about one track out of .mp4 container file’.
I used the `MP4Box.js` project, as I quickly realized demuxing is tough to do in-house. You need to understand all the ins and outs of the `.mp4` format: binary flags, data offsets, etc.
OK, once we have demuxed the mp4 into raw track data, the so-called “samples,” we create a `VideoDecoder` - a native API that is quite powerful, as it can create video frames directly in GPU memory.
When we have the decoder, we configure it to use the same codec our video track is encoded with.
Then we pass each sample to it, and it gives us `VideoFrame` objects back.
Those are also quite powerful, as we can create WebGL textures out of them without ever having to read them into JS memory.
There are some severe limitations here; they were not critical in my case, however:
- `VideoFrame` is a GPU texture, so it takes up a lot of memory. It means you cannot hold many of them in memory at once.
- Because of that, `VideoDecoder` gives you only around 10 of them at a time and waits until you say, “I’m done with this one; you can destroy it,” by calling `videoFrame.close()`; only then will it give you the next ones.
- Because of that, decoding video this way can only move in one direction: once you close some frame, you cannot move back in time; you can only move forward. This is fine in my case - I only use this method when exporting, where I render frame by frame, always moving forward.
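Putting it together, the forward-only decode flow looks roughly like this. The decoder is injected through a minimal interface so the sketch runs outside a browser; in the app it is the native `VideoDecoder` fed with samples demuxed by MP4Box.js, so treat all names as simplified stand-ins.

```typescript
// Configure with the track's codec, feed samples, render each decoded
// frame, then close() it so the decoder will hand over the next ones.
interface FrameLike { close(): void }
interface DecoderLike<S, F extends FrameLike> {
  configure(config: { codec: string }): void;
  decode(sample: S): void;
  onFrame(handler: (frame: F) => void): void;
}

function decodeForward<S, F extends FrameLike>(
  decoder: DecoderLike<S, F>,
  codec: string, // must match the demuxed track, e.g. an H.264 codec string
  samples: S[],
  renderFrame: (frame: F) => void, // e.g. upload as a WebGL texture
): void {
  decoder.onFrame((frame) => {
    renderFrame(frame);
    frame.close(); // free the GPU memory - after this, no seeking back
  });
  decoder.configure({ codec });
  for (const sample of samples) decoder.decode(sample);
}
```

Closing each frame immediately after rendering is what keeps the decoder’s small internal frame pool flowing.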
It turns out this method is extremely fast once you load the video data. Creating and displaying a `VideoFrame` as a WebGL texture takes sub-millisecond time. Perfect!
There are a lot of other downsides to this method, however:
- It is very manual and error-prone.
- I need to manually parse all the timing information about each frame, which is very different from a natural timestamp (video codecs use the “timescale” property, which is kinda like saying, “this is how many ‘milliseconds’ there are in 1 second; now I’ll tell the time in those custom milliseconds”). If I set the timescale to 9000, then 9000 means 1 second.
- The entire decoding code is highly coupled to the file format I handle. If, for example, I stopped expecting recordings to be `.mp4` files encoded with the `H264` codec, I’d need to refactor the entire decoding pipeline.
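The timescale arithmetic itself is a one-liner once stated as code (a hypothetical helper for illustration, not MP4Box.js’s API):

```typescript
// A codec timestamp is expressed in "ticks"; the timescale says how many
// ticks make up one second of real time.
function ticksToSeconds(ticks: number, timescale: number): number {
  return ticks / timescale;
}
```

So with a timescale of 9000, a timestamp of 9000 is exactly 1 second, and 4500 is half a second.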
I now use this method to quickly show the original recording’s frames during the final export. It is so fast once the video is loaded that it is not a problem anymore.
Capturing & Encoding frames
Once WebGL renders a given frame, we need to capture it and encode it into the final video file.
This process can be implemented relatively simply and naively, but it’ll be slow.
First round - WebGL `readPixels()` → save each frame into a `.png` file → encode the final video out of hundreds of those frame images
It already sounds terrible.
My first approach was to use the `readPixels` method on my canvas. It gives me a massive array of numbers, each representing a red, green, blue, or alpha value in the range 0-255. For example, if I capture one frame of an HD export - 1920x1080 - it’ll require 1920*1080*4 slots in this array (8,294,400 numbers for one frame).
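For scale, the size of that array is easy to compute (a trivial helper for illustration; `gl.readPixels` fills a buffer of exactly this length):

```typescript
// One pixel contributes 4 slots (R, G, B, A), so a frame needs
// width * height * 4 numbers of raw pixel data.
function rgbaBufferLength(width: number, height: number): number {
  return width * height * 4;
}
```

For a 1920x1080 frame, that is 8,294,400 numbers - per frame, dozens of times per second of exported video.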
Then, having that raw pixel data, I create a `.png` file out of it, which already takes a lot of time.
Then, the final encoder has to read those images back into raw pixel data and create the video.
Exports were painfully slow.
The biggest bottleneck here was surprising - converting raw pixels into a `.png` file for each frame.
Second round - WebGL `readPixels()` → pass raw pixel data directly into the encoder
In my second attempt, I created an open `ReadableStream` while exporting the video and connected it to the encoder, which read those pixels “live” and encoded them ad hoc. I was bombarding this stream with billions of RGBA, RGBA, RGBA values.
This removed the steps of creating a `.png` file and then converting it back into raw pixels.
This was, however, a highly fragile process. For example, if I told the encoder I wanted the video to be 900 pixels wide, but by mistake I was sending pixels as if it were 901 pixels wide, the resulting video was tilted by 45 degrees in quite a funny way, as each following row of pixels was shifted to the right by 1 pixel.
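The tilt is easy to see in a toy model of the mistake (illustrative code, not the actual encoder):

```typescript
// If rows were written `actualWidth` pixels wide but the encoder assumes
// `assumedWidth`, it starts reading row r at r * assumedWidth instead of
// r * actualWidth, so row r ends up shifted by the accumulated difference.
function rowShiftInPixels(row: number, actualWidth: number, assumedWidth: number): number {
  return row * (assumedWidth - actualWidth);
}
```

With an actual width of 900 and an assumed width of 901, row 100 is shifted by 100 pixels: a steady one-pixel-per-row drift, i.e. the 45-degree tilt.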
This method was significantly faster, but it was still painfully slow.
Third round - batching
I was lost, not knowing what else I could do here. At some point, I experimented with measuring the speed of calling `readPixels()` 10 times in a row. I was amazed to find it was ~50% faster.
I don’t know precisely why - I bet the GPU was switching into ‘send pixels’ mode once, and then sending pixels many times in a row, with nothing else happening in between, was faster.
The flow was:
- I was rendering 60 frames in a row.
- Each time one frame was ready, I copied it to another canvas so I did not lose it.
- Copying from canvas to canvas happens at the GPU level, so I’m not reading pixel data.
- When I had 60 canvases like this, each with a new frame, I iterated over them and only then read the pixels from each one.
- Then I sent the pixel data from each of them to the encoder.
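The batching idea can be sketched with the GPU operations injected (names are illustrative): render a whole batch to spare canvases first, then do all the expensive pixel reads back-to-back.

```typescript
// Render `batchSize` frames to offscreen canvases, then read pixels from
// all of them in a row - the readbacks were ~50% faster when batched.
type BatchOps<Canvas> = {
  renderFrame(frame: number): Canvas;     // WebGL render + GPU-side copy to a spare canvas
  readPixels(canvas: Canvas): Uint8Array; // the expensive GPU→JS readback
  encode(pixels: Uint8Array): void;
};

function exportInBatches<Canvas>(
  totalFrames: number,
  batchSize: number,
  ops: BatchOps<Canvas>,
): void {
  for (let start = 0; start < totalFrames; start += batchSize) {
    const end = Math.min(start + batchSize, totalFrames);
    const canvases: Canvas[] = [];
    for (let frame = start; frame < end; frame++) canvases.push(ops.renderFrame(frame));
    // Only now, with the whole batch rendered, read pixels back-to-back.
    for (const canvas of canvases) ops.encode(ops.readPixels(canvas));
  }
}
```

The trade-off is memory: each pending canvas in a batch holds a full frame on the GPU until its pixels are read.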
This was another iterative improvement, but it was still not there.
Fourth round - `VideoFrame` captured out of the canvas
When debugging the previous method, I quickly realized the most significant bottleneck was not rendering the frame on the canvas, but reading pixels out of it - that took 70% of the total time of handling one frame. To be precise, reading the pixels and sending them to the encoder was painfully slow.
Only then did I realize there is one fantastic thing possible in the modern JS API - `new VideoFrame(canvas)`. It creates a `VideoFrame` out of the canvas at the GPU level, never reading the pixel data.
Then you can pass this frame directly to the native `VideoEncoder`.
This way, we’re encoding the video (or, to be precise - the video track of a video file) without reading pixel data into JS memory.
It was a massive change as it removed the step that took 70% of JS time previously.
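The capture step then reduces to a few lines. The browser-specific pieces are injected below so the sketch runs anywhere; in the app, `makeFrame` would be something like `(canvas) => new VideoFrame(canvas, { timestamp })` and `encoder` a configured native `VideoEncoder`.

```typescript
// GPU-side capture: wrap the canvas in a frame without reading pixels
// into JS, hand it to the encoder, then close it to free GPU memory.
interface GpuFrame { close(): void }
interface FrameEncoder<F extends GpuFrame> { encode(frame: F): void }

function captureAndEncode<F extends GpuFrame>(
  canvas: unknown,
  makeFrame: (canvas: unknown) => F, // no pixel readback happens here
  encoder: FrameEncoder<F>,
): void {
  const frame = makeFrame(canvas);
  encoder.encode(frame); // the encoder takes its own reference to the frame
  frame.close();         // safe to release once encode() has been called
}
```

Closing the frame right after `encode()` matters: each open frame pins GPU memory, just like the decoder-side frames earlier.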
I quickly realized I was rendering frames and sending them to the encoder faster than it could encode them - they piled up in a queue, increasing memory usage.
As a result, I set the max queue limit, after which the JS thread waits for the encoder to encode pending frames.
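A minimal sketch of that backpressure, assuming we poll the encoder’s pending-frame count (in the browser, `VideoEncoder` exposes this as `encodeQueueSize`; real code could also listen for its `dequeue` event instead of polling):

```typescript
// Block the export loop while too many frames are pending in the encoder.
async function waitUntilQueueDrains(
  getQueueSize: () => number, // in the browser: () => encoder.encodeQueueSize
  maxQueue: number,
  pollMs = 1,
): Promise<void> {
  while (getQueueSize() >= maxQueue) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Called before encoding each frame, this caps memory usage at roughly `maxQueue` in-flight frames.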
Now, this is more complex, as `VideoEncoder` only encodes a single video track - not an entire video file.
At the start of this article, when I was using `VideoDecoder`, I needed to “demux” a video file into a video track; now I needed to do the reverse - `VideoEncoder` was giving me raw video track data, which I needed to “mux” into a video file.
After all that is done, I also needed to add a proper audio track in case you recorded a voiceover, etc.
There are also a lot of other moving pieces involved here, such as using the correct codecs, color profiles, etc. But I’ll describe those in another article.