SVT-AV1 Spatial and Temporal Scalability Explained

This article gives a technical overview of how the Scalable Video Technology for AV1 (SVT-AV1) encoder implements spatial and temporal scalability. It covers the architectural mechanisms behind multi-layer encoding: reference frame management, inter-layer prediction, and the hierarchical prediction structures that let SVT-AV1 produce adaptive bitstreams for varying network conditions.

Understanding Scalability in AV1

Scalable Video Coding (SVC) allows an encoder to generate a single bitstream containing multiple representation layers of a video. A client can decode only a subset of this stream, depending on its bandwidth or hardware capabilities. SVT-AV1, the open-source AV1 encoder originally developed by Intel in collaboration with Netflix and now maintained under the Alliance for Open Media (AOMedia), supports both temporal (frame rate) and spatial (resolution) scalability, as well as combined spatial-temporal modes.

Systematic Implementation of Temporal Scalability

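Temporal scalability divides the video stream into hierarchical layers, each of which adds frames and so raises the frame rate. SVT-AV1 implements this with a structured, dyadic prediction hierarchy: each additional temporal layer doubles the frame rate, and a frame references only frames in the same or a lower layer, so the higher layers can be dropped without breaking decoding.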
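The dyadic layer assignment can be illustrated with a minimal Python sketch. It is not taken from the SVT-AV1 source: the function names temporal_layer and frames_for_target are hypothetical, and the sketch assumes a plain dyadic pattern in display order with layer 0 as the base layer.

```python
def temporal_layer(frame_index: int, num_layers: int) -> int:
    """Return the temporal layer ID (0 = base layer) of a frame.

    Assumes a dyadic hierarchy whose period is 2**(num_layers - 1) frames.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    period = 1 << (num_layers - 1)
    pos = frame_index % period
    if pos == 0:
        return 0  # the first frame of each period belongs to the base layer
    # The number of trailing zero bits in pos tells how "coarse" the frame is.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return (num_layers - 1) - trailing_zeros


def frames_for_target(num_frames: int, num_layers: int, target_layer: int) -> list[int]:
    """List the frame indices a client decodes when it keeps layers 0..target_layer."""
    return [i for i in range(num_frames)
            if temporal_layer(i, num_layers) <= target_layer]


if __name__ == "__main__":
    print([temporal_layer(i, 3) for i in range(8)])
    print(frames_for_target(8, 3, 1))
```

With three layers, keeping only layer 0 yields a quarter of the frames, keeping layers 0 and 1 yields half, and keeping all three yields the full frame rate.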

Systematic Implementation of Spatial Scalability

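Spatial scalability encodes the video at multiple resolutions within a single bitstream. SVT-AV1 manages this through inter-layer prediction and resolution scaling: a lower-resolution base layer is coded first, and each higher-resolution enhancement layer can predict from the reconstructed layer below it after that layer is scaled up.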
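As a rough illustration of the resolution-scaling side, the following Python sketch derives the frame size of each spatial layer from the full resolution. The function name spatial_layer_sizes is hypothetical, and the fixed 2:1 ratio between adjacent layers is an assumption made for the example; it does not reproduce any SVT-AV1 API or its exact rounding rules.

```python
def spatial_layer_sizes(width: int, height: int, num_layers: int,
                        ratio: int = 2) -> list[tuple[int, int]]:
    """Return (width, height) per spatial layer, base layer first.

    Assumes each layer is `ratio` times larger per dimension than the one
    below it, with the top layer at full resolution. Dimensions are rounded
    up so that no layer collapses to zero.
    """
    if num_layers < 1 or ratio < 1:
        raise ValueError("num_layers and ratio must be positive")
    sizes = []
    for layer in range(num_layers):
        divisor = ratio ** (num_layers - 1 - layer)
        sizes.append(((width + divisor - 1) // divisor,
                      (height + divisor - 1) // divisor))
    return sizes


if __name__ == "__main__":
    for layer_id, size in enumerate(spatial_layer_sizes(1920, 1080, 2)):
        print(f"S{layer_id}: {size[0]}x{size[1]}")
```

A client with limited bandwidth or a small display would decode only the base layer, while a more capable client decodes the base layer plus the enhancement layer that predicts from it.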

Combining Spatial and Temporal Scalability

SVT-AV1 supports multi-dimensional scalability structures such as L2T3, which denotes two spatial layers and three temporal layers. In such a structure every frame carries both a spatial and a temporal layer ID, and prediction dependencies run along both dimensions.
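To make the LxTy naming concrete, here is a small Python sketch that parses a mode name and lists the layers a client needs for a chosen target. The helper names parse_scalability_mode and required_layers are hypothetical. The sketch handles only the plain "L<n>T<m>" form, and it assumes that a target layer depends on every layer with a lower or equal spatial ID and a lower or equal temporal ID, which is a simplification of real dependency structures.

```python
import re


def parse_scalability_mode(mode: str) -> tuple[int, int]:
    """Parse a name such as 'L2T3' into (spatial_layers, temporal_layers)."""
    match = re.fullmatch(r"L(\d)T(\d)", mode)
    if match is None:
        raise ValueError(f"unsupported scalability mode: {mode!r}")
    spatial, temporal = int(match.group(1)), int(match.group(2))
    if spatial < 1 or temporal < 1:
        raise ValueError("layer counts must be at least 1")
    return spatial, temporal


def required_layers(target_spatial: int, target_temporal: int) -> list[tuple[int, int]]:
    """List the (spatial_id, temporal_id) pairs needed for the target layer.

    Assumes full dependency on all lower layers in both dimensions.
    """
    return [(s, t)
            for s in range(target_spatial + 1)
            for t in range(target_temporal + 1)]


if __name__ == "__main__":
    spatial, temporal = parse_scalability_mode("L2T3")
    print(spatial, temporal)
    # Full quality: the top spatial layer at the top temporal layer.
    print(required_layers(spatial - 1, temporal - 1))
    # Low resolution at half frame rate.
    print(required_layers(0, 1))
```

Under these assumptions, an L2T3 stream exposes six layer combinations, and a receiver or a forwarding server can drop any layer whose IDs exceed the target in either dimension.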