I-Lab: 2D/3D video processing and coding

I-Lab has an extensive research background in many aspects of 2D and 3D video coding for transmission over a variety of communication networks.

Among the most significant areas of expertise are novel compression schemes for 3D stereoscopic and multi-view video that build on widely deployed international video coding standards, such as H.264/MPEG-4 AVC, its scalable (SVC) and multi-view (MVC) extensions and, more recently, High Efficiency Video Coding (HEVC).

Contributions include advances in the error-robustness schemes of existing encoders, making use of the statistical dependencies between different elements of 3D media, and improvements in general rate-distortion efficiency for targeted applications. Video coding is usually not considered independently of the underlying transmission scheme, so research in I-Lab is also focused on novel cross-layer designs for video transmission over wireless networks (such as WiMAX and WLAN). Similarly, recent trends in I-Lab's video coding research are towards monitoring and actively incorporating the quality of experience (QoE) during the video coding stage for improved perceptual efficiency. I-Lab has a strong track record of publications on video coding research in highly reputed peer-reviewed journals and conferences; these can be found in the regularly updated publications database.

Research themes

There are two main research themes in I-Lab related to this area: video coding and 3D video research.

Previous research in I-Lab considered the joint use of scalable coding and multiple description coding (MDC) principles for conventional 2D video, to protect transmitted video better in error-prone environments while adapting to different user terminal devices. Scalable video coding creates a global bit-stream that comprises multiple frame-rate and spatial-resolution representations of the encoded video.

Multiple description coding, on the other hand, creates several lower-quality versions (descriptions) of the encoded video; decoding any subset of these descriptions yields a representation of the original video at a corresponding quality. This approach is particularly useful in error-prone transmission environments. Previous research in I-Lab combined both schemes for early colour-plus-depth 3D stereoscopic video, using SVC for backward compatibility with legacy 2D displays and an even/odd frame-splitting MDC approach for error robustness, especially over error-prone wireless media. Each description can be decoded individually to produce a lower temporal-resolution version of the base-layer representation of the encoded colour-plus-depth video. Results indicated that significant quality gains could be achieved over a wide range of bit-rates (500 kbps – 3 Mbps).
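
As an illustration, a minimal Python sketch of the even/odd frame-splitting MDC idea follows; the frame representation and function names are illustrative assumptions, not I-Lab code.

    def split_descriptions(frames):
        """Split a frame sequence into two half-frame-rate descriptions."""
        return frames[0::2], frames[1::2]   # even frames, odd frames

    def merge_descriptions(even, odd):
        """Interleave whichever descriptions arrived; a single description
        still decodes to a half temporal-resolution video."""
        merged = []
        for i in range(max(len(even), len(odd))):
            if i < len(even):
                merged.append(even[i])
            if i < len(odd):
                merged.append(odd[i])
        return merged

    frames = ["frame_%d" % i for i in range(8)]
    d1, d2 = split_descriptions(frames)
    print(merge_descriptions(d1, d2))   # full frame rate when both descriptions arrive
    print(d1)                           # half frame rate when only one survives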

Recent trends towards immersive, rich multimedia experiences based on 3D video with a third, depth dimension point to the need for effective video compression techniques. These techniques should be efficient in rate-distortion terms and scalable, both for transmission over a variety of networks with different capacities and error patterns and for a variety of 3D displays (e.g. passive polarised, active, multi-view and light-field displays). Codec designs should also be compatible with various 3D video formats.

Currently, simple stereoscopic 3D video (with no free-viewpoint capability) in side-by-side format is the most widely used form in commercially available services (such as TV channels delivered over satellite and/or terrestrial networks). MPEG's stereoscopic video coding solution (the stereo profiles of AVC and MVC) encodes one camera view independently, while the other view is encoded predictively from the independently coded view. This way, legacy 2D displays and TV sets can decode and display just the independent view.

Depth maps are core components of many modern 3D video formats, and depth image based rendering (DIBR) is a paradigm that exploits them to render the 3D scene beyond the viewpoints of the actually captured 2D or stereoscopic video. Depth maps have both similarities to and differences from the corresponding texture sequences. A major research effort in I-Lab has been to exploit these differences by adapting state-of-the-art video coding tools to encode depth maps efficiently.

Since the motion activity in depth maps is not restricted to two dimensions but extends to three, the motion estimation process of an existing AVC codec can be improved to take into consideration the motion of a macroblock in the Z direction (pointing into or out of the display plane). As a result, better-quality reconstructed depth map sequences are obtained at the user terminal, yielding better view synthesis performance.
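
The following Python sketch illustrates the idea of extending block matching with a Z offset; the block size, search ranges and SAD metric are assumptions for illustration, not the actual I-Lab implementation.

    import numpy as np

    def sad(a, b):
        return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

    def motion_search_xyz(cur, ref, bx, by, bs=16, search=4, z_range=range(-8, 9, 4)):
        """Return the (dx, dy, dz) minimising SAD for one depth-map block,
        where dz compensates motion towards or away from the camera."""
        block = cur[by:by + bs, bx:bx + bs]
        best, best_cost = (0, 0, 0), float("inf")
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
                    continue
                cand = ref[y:y + bs, x:x + bs].astype(np.int32)
                for dz in z_range:  # candidate displacement along the Z axis
                    cost = sad(block, np.clip(cand + dz, 0, 255))
                    if cost < best_cost:
                        best_cost, best = cost, (dx, dy, dz)
        return best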

Another novel approach developed in I-Lab compresses depth maps by minimising rendering distortion rather than the block coding error of depth macroblocks. To this end, the rate-distortion-optimised block mode decision loop of an existing AVC encoder is modified to calculate the distortion in the final rendered image obtained from the reconstructed depth macroblock (through the DIBR process).
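
A sketch of such a mode decision loop follows. The encode_block and render_view callables are hypothetical placeholders standing in for the encoder's block coding and the DIBR renderer; the toy stand-ins merely make the sketch executable.

    import numpy as np

    def choose_mode(depth_block, texture_block, modes, lam, encode_block, render_view):
        """Pick the block mode minimising J = D_render + lambda * R, where the
        distortion is measured on the rendered view, not on the depth block."""
        ref = render_view(texture_block, depth_block)           # render with original depth
        best_mode, best_cost = None, float("inf")
        for mode in modes:
            recon, rate = encode_block(depth_block, mode)       # hypothetical encoder call
            synth = render_view(texture_block, recon)           # render with reconstructed depth
            dist = float(np.sum((synth - ref) ** 2))
            cost = dist + lam * rate
            if cost < best_cost:
                best_mode, best_cost = mode, cost
        return best_mode

    # toy stand-ins so the sketch runs; a real codec supplies these
    encode = lambda b, m: (np.round(b / m) * m, 1000.0 / m)     # coarser mode -> fewer bits
    render = lambda tex, d: tex + 0.1 * d                       # toy "warp" by depth
    depth = np.random.rand(16, 16) * 255
    tex = np.random.rand(16, 16) * 255
    print(choose_mode(depth, tex, modes=[1, 2, 4, 8], lam=50.0,
                      encode_block=encode, render_view=render))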

Results have shown significant improvements in the rate-distortion performance of the depth map encoder with respect to the state-of-the-art Joint Model (JM) reference encoder of the AVC standard.

Earlier I-Lab work on depth map compression utilised a temporal subsampling approach for depth maps within a multi-view depth transmission scenario: encoding is skipped for depth map frames in selected temporal sub-layers of camera views that lie within the baseline of two other camera views.

Depth map coding experiments at several bit-rates using the MVC standard with different settings (i.e. skipping different numbers of temporal sub-layers of the targeted camera views' depth maps) have shown subjective performance improvements, most notably at high bit-rates.
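
The sketch below shows one way to map frames to temporal layers in a hierarchical GOP and to skip the top sub-layers of inner views' depth maps; the GOP model and layer numbering are assumptions for illustration.

    def temporal_layer(i, gop=8):
        """Temporal layer of frame i in a hierarchical-B GOP (0 = base layer)."""
        if i % gop == 0:
            return 0
        i %= gop
        trailing_zeros = (i & -i).bit_length() - 1
        return gop.bit_length() - 1 - trailing_zeros

    def keep_frame(i, inner_view, skip_layers=1, gop=8):
        """Inner-baseline views have their top skip_layers temporal sub-layers
        left un-encoded; the decoder recovers them from neighbouring views."""
        if not inner_view:
            return True
        max_layer = gop.bit_length() - 1        # 3 for a GOP of 8
        return temporal_layer(i, gop) <= max_layer - skip_layers

    print([i for i in range(9) if keep_frame(i, inner_view=True)])   # [0, 2, 4, 6, 8]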

One recent line of video coding research in I-Lab is the development of application-aware compression suites. The proliferation of video consumption, especially on mobile devices, has created demand for efficient interactive video applications and high-level video analysis. Pixel-domain video processing is inefficient for many applications running on resource-constrained devices due to its complexity, whereas compressed-domain processing offers fast but relatively unreliable results.

To achieve fast and effective video processing, a novel video encoding architecture that facilitates efficient compressed-domain processing has been developed. This is achieved by optimising the accuracy of the motion information in the compressed video in addition to compression efficiency. In a motion detection application, the motion estimated by the application-aware encoder is used directly to extract object information; this motion information does not necessarily have to coincide with the motion vectors that minimise residual errors. The incurred rate-distortion overhead can be weighed against the reduced processing required for video analysis across a wide spectrum of computer vision applications.
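
A minimal sketch of what such an application-aware motion cost could look like follows; the coherence term (distance from the median neighbour vector) is an assumed stand-in for the encoder's motion-accuracy criterion, not the published formulation.

    import numpy as np

    def application_aware_cost(sad, rate, mv, neighbour_mvs, lam=4.0, mu=2.0):
        """J = SAD + lambda * R + mu * |mv - median(neighbours)|; the last term
        nudges vectors towards coherent object motion rather than the pure
        residual minimum."""
        med = np.median(np.asarray(neighbour_mvs, dtype=float), axis=0)
        coherence = float(np.abs(np.asarray(mv, dtype=float) - med).sum())
        return sad + lam * rate + mu * coherence

    # a vector agreeing with its neighbours costs less than a stray one
    print(application_aware_cost(120, 8, (2, 1), [(2, 1), (3, 1), (2, 0)]))
    print(application_aware_cost(118, 8, (9, -5), [(2, 1), (3, 1), (2, 0)]))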

Part of the recent research in I-Lab also concentrates on the deployment of a highly flexible, scalable multi-view video dissemination mechanism utilising a popular P2P multicast scheme. This work, conducted within the EU-FP7 DIOMEDES project, aims to deliver an immersive 3D video experience beyond conventional stereoscopic viewing, with free-viewpoint viewing achieved by dynamically choosing, decoding and rendering particular camera views.

For this purpose the scalable video coding standard is utilised to provide a backwards-compatible solution for legacy 3D-TV systems in bandwidth-constrained environments. To improve the perceptual performance between successive layers of the encoded views, visual attention models are employed to drive region-of-interest adaptive quantiser selection in the encoded video frames. Subjective test results have shown that the perceptual performance of visual-attention-based scalable multi-view coding is in general better than that of conventional scalable multi-view coding.
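
As a sketch of the idea, the following maps per-macroblock saliency to quantiser offsets; the linear mapping and offset range are assumptions, and the published method may differ.

    import numpy as np

    def qp_map(saliency, base_qp=32, max_offset=6):
        """saliency: 2D array in [0, 1], one value per macroblock; salient
        (attended) blocks receive a lower QP, i.e. finer quantisation."""
        s = np.clip(saliency, 0.0, 1.0)
        offsets = np.round((0.5 - s) * 2 * max_offset).astype(int)   # -6 .. +6
        return np.clip(base_qp + offsets, 0, 51)                     # valid H.264 QP range

    print(qp_map(np.array([[0.9, 0.1], [0.5, 0.7]])))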

In parallel, the aim is to monitor the projected quality of the 3D video experience during the encoding stage and to transmit performance indicators to the user side regularly, so that more efficient adaptation can take place. Media adaptation is also among the major research themes addressed in I-Lab. These indicators are expressed as key performance index (KPI) values, derived from ongoing QoE research in I-Lab; when delivered in the encoded bit-stream, they are used to compute the current QoE of users.

3D video gives its viewers an added dimension of experience through depth perception: two disparate images, known as a stereoscopic pair, are displayed at the same time, one for each eye of the viewer. Driven by recent advances in display technology and content production capabilities, and helped by the wide consumer appreciation of the Hollywood blockbuster 'Avatar', 3D video has become a buzzword in today's multimedia industry. Since the release of 'Avatar' in 2009, several Hollywood films, such as 'Clash of the Titans', 'Pirates of the Caribbean' and 'Kung Fu Panda', have been released in 3D, and several broadcasters, such as Sky, ESPN, Fox Sports and the BBC, have broadcast certain programmes, including live coverage of sports events, in 3D.

The I-Lab multimedia communications research group works on the capture, processing, coding, rendering and transmission of three-dimensional (3D) video. I-Lab has extensive experience with multiview 3D video capture, depth map generation, post-production processing of stereoscopic video and depth maps, compression of depth maps and depth image based rendering (DIBR). The group contributes to three major EU-funded projects in these areas of 3D video, and the facility is equipped with state-of-the-art 3D video playback capabilities.

Several issues need to be solved to realise the full market potential of 3D multimedia systems. The lack of content and the higher price of 3DTV displays are currently reasons for the low demand for 3DTV systems, and compression of multiview video content remains a challenge to date. Furthermore, for 3D video to be a mass-market success, it is imperative to make viewing 3D video at least as comfortable as traditional television; the visual discomfort associated with 3D viewing therefore needs to be minimised by appropriate methods. Against this background, research in I-Lab is focused on the following areas of 3D video:

  • 3D video capturing and post-production processing
  • Depth map coding
  • Depth image based rendering optimisation
  • Error resilient transmission of 3D video
  • Quality of experience of 3D video.

I-Lab handles 3D video capture to deliver the 3D audiovisual content needed for project activities such as audio and video coding, modelling audiovisual attention and implementing it within the media encoder, display rendering and QoE assessment. The visual media laboratory within the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey is used to capture, record and process the 3D media.

The ability to generate and render high-quality novel views is a prime requirement of state-of-the-art multiview 3D displays. One proven technique that assists the generation of high-quality novel viewpoints is depth image based rendering (DIBR). Unlike other 3D scene rendering techniques (e.g. texture based), DIBR produces high-quality rendering results provided the scene depth information is accurate. Therefore, high-quality depth map generation and estimation is a key research area in I-Lab. The depth maps are extracted from the multiview colour video sequences and the corresponding camera calibration parameters.
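
A minimal DIBR forward-warping sketch follows, assuming a pinhole camera model and the common 8-bit inverse-depth convention; it is illustrative rather than the I-Lab renderer.

    import numpy as np

    def dibr_warp(texture, depth, K_src, K_dst, R, t, z_near=0.3, z_far=10.0):
        """Forward-warp texture into a virtual view using its depth map."""
        h, w = depth.shape
        out = np.zeros_like(texture)
        K_inv = np.linalg.inv(K_src)
        # 8-bit depth maps commonly store inverse depth between z_near and z_far
        z = 1.0 / (depth / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
        for v in range(h):
            for u in range(w):
                p = (K_inv @ np.array([u, v, 1.0])) * z[v, u]   # back-project to 3D
                q = K_dst @ (R @ p + t)                         # re-project to target view
                u2, v2 = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
                if 0 <= u2 < w and 0 <= v2 < h:
                    out[v2, u2] = texture[v, u]
        return out   # disocclusion holes remain zero and need in-painting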

The estimated depth maps are then post-processed with adaptive filtering techniques to improve their spatial and temporal consistency.

Post-processing of captured multiview video is an important step, both in preparing content suitable for distribution to end users and in aiding depth map estimation and multiview video coding. Multiview rectification is necessary to align the epipolar lines of each camera view, as well as to compensate for the slight focal length differences between the cameras.

Multi-view colour correction is another essential step in the post-production workflow, for several reasons. First, the dense stereo matching process used to estimate disparity is badly affected by colour differences between two cameras in the matched regions.

Second, multi-view video coding performance can suffer a rate-distortion loss, since inter-view prediction produces higher-energy residuals. Third, the free-viewpoint synthesis process is negatively affected if there are colour differences between the two source cameras (which are warped to the image coordinates of the target viewpoint) that need blending, and 3D images look poor when the two source images differ in colour. To avoid these effects, the multi-view videos are colour corrected before depth estimation and coding.
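
Histogram matching against a reference view is one common way to perform such colour correction; the per-channel sketch below is illustrative, and the exact I-Lab method may differ.

    import numpy as np

    def match_histogram(src, ref):
        """Map each 8-bit channel of src onto the tonal distribution of ref."""
        out = np.empty(src.shape, dtype=np.float64)
        for c in range(src.shape[2]):
            s_vals, s_counts = np.unique(src[..., c], return_counts=True)
            r_vals, r_counts = np.unique(ref[..., c], return_counts=True)
            s_cdf = np.cumsum(s_counts) / src[..., c].size
            r_cdf = np.cumsum(r_counts) / ref[..., c].size
            lut = np.interp(s_cdf, r_cdf, r_vals)        # source CDF -> reference value
            out[..., c] = np.interp(src[..., c].astype(np.float64), s_vals, lut)
        return np.clip(np.round(out), 0, 255).astype(np.uint8)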

One significant research project at I-Lab focused on exploiting the depth perception sensitivity of humans to suppress unnecessary spatial depth detail, thereby reducing the transmission overhead allocated to depth maps. Based on a just-noticeable-difference-in-depth model derived during QoE evaluations, depth map sequences are pre-processed to suppress depth details that viewers cannot perceive and to minimise rendering artefacts arising from optical noise, which is triggered by inaccuracies in the depth estimation process.
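
As a sketch, depth values can be snapped to JND-sized bins so that detail below the perceptual threshold is discarded before encoding; the fixed step used here is a placeholder, whereas the I-Lab model derives its thresholds from QoE experiments.

    import numpy as np

    def suppress_imperceptible_depth(depth, jnd_step=8):
        """Quantise an 8-bit depth map to bins one just-noticeable-difference
        wide; values within a bin are perceptually indistinguishable."""
        q = np.round(depth.astype(np.float32) / jnd_step) * jnd_step
        return np.clip(q, 0, 255).astype(np.uint8)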

Experimental results suggest that the bit-rate for depth map coding can be reduced by up to 78 per cent for depth maps captured with depth-range cameras and by up to 24 per cent for depth maps estimated with computer vision algorithms, without affecting the 3D visual quality or the view synthesis quality.

In another method, the depth maps are down-sampled to reduce the bit-rate required for transmission and up-sampled at the receiver in an adaptive manner. The method is based on filters that adapt along the edges of the depth map: edge-aware down-sampling reduces the data size while maintaining high-frequency details, and edge-aware up-sampling after decoding preserves and better reconstructs critical object boundaries.
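
The down-sampling half of the idea can be sketched as follows: each block is represented by its modal depth value rather than a mean, so sharp object boundaries are not smeared. The block-mode filter is an illustrative stand-in for the published edge-adaptive filters.

    import numpy as np

    def downsample_edge_aware(depth, f=2):
        """Down-sample a depth map by factor f, keeping each block's modal
        value so that edges stay sharp instead of being averaged away."""
        h, w = depth.shape
        out = np.zeros((h // f, w // f), dtype=depth.dtype)
        for y in range(0, (h // f) * f, f):
            for x in range(0, (w // f) * f, f):
                block = depth[y:y + f, x:x + f].ravel()
                vals, counts = np.unique(block, return_counts=True)
                out[y // f, x // f] = vals[np.argmax(counts)]   # mode, not mean
        return out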

Apart from encoding depth maps in a way that is suitable for view rendering, I-Lab is also engaged in research on filtering depth maps to improve rendering, especially at the decoder side.

When depth maps are compressed using existing codecs, compression artefacts cause undesirable distortions in the rendered views. This work analyses the effect of those artefacts on the novel view generation process and, based on the analysis, proposes a two-step filtering process to post-process the compressed depth maps at the receiving end.

The proposed technique, which is based on adaptive bilateral filtering, successfully filters depth maps to minimise distortions in the rendered views.
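
The mechanism can be illustrated with a plain bilateral filter over the depth map; the published method adapts the kernel parameters to the measured compression artefacts, whereas the fixed sigmas below are assumptions.

    import numpy as np

    def bilateral_depth(depth, radius=3, sigma_s=2.0, sigma_r=10.0):
        """Smooth coding artefacts while preserving depth edges: the range
        kernel suppresses averaging across large depth discontinuities."""
        d = depth.astype(np.float64)
        h, w = d.shape
        ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
        pad = np.pad(d, radius, mode="edge")
        out = np.zeros_like(d)
        for y in range(h):
            for x in range(w):
                win = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
                rng = np.exp(-(win - d[y, x]) ** 2 / (2 * sigma_r ** 2))
                wgt = spatial * rng
                out[y, x] = (wgt * win).sum() / wgt.sum()
        return out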

Research on error recovery in multi-view coding has received considerable interest in the recent past. While there is a multitude of literature on error recovery in 2D video, the statistical differences between motion compensation across temporal frames and disparity compensation across viewpoints make such methods inadequate for multiview video transmission.

One work at I-Lab addressed this issue through redundant transmission of disparity vectors to conceal errors in non-key frames of multiview video. The system can be used alongside a suitable temporal error recovery scheme to improve the resilience of multiview video transmission. Experimental results suggest that the proposed algorithm performs significantly better in error-prone environments in which the packet loss rate (PLR) exceeds 7 per cent.

Exploiting the unique correlations that exist between colour images and their corresponding depth images leads to more error-resilient encoding schemes for 3D video. In a separate effort, an error-resilient 3D video communication scheme that exploits the correlation of motion vectors in the colour and depth video streams was developed in I-Lab. Motion estimation is performed jointly on the colour video and the corresponding depth video; since the motion vectors in the two streams are the same, when packets are lost the motion vectors can be recovered from the other stream.

Significant gains in the quality of the rendered views are achieved with this joint motion estimation and motion-vector sharing scheme under packet loss conditions.
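
The recovery step reduces to a simple fallback, sketched below with an assumed per-block packetisation in which a lost packet leaves a None entry:

    def recover_mvs(colour_mvs, depth_mvs):
        """Both streams carry identical motion vectors, so a vector lost in
        one stream is read from its twin in the other."""
        return [c if c is not None else d for c, d in zip(colour_mvs, depth_mvs)]

    colour = [(1, 0), None, (3, -2)]
    depth = [(1, 0), (0, 2), None]
    print(recover_mvs(colour, depth))   # [(1, 0), (0, 2), (3, -2)]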

Address
Centre for Vision, Speech and Signal Processing
Alan Turing Building (BB)
University of Surrey
Guildford
Surrey
GU2 7XH