Self-Supervised Deep Learning Approach for Video Stabilization

ARDEB 3501

Mehmet SARIGÜL (Executive), Levent KARACAN

Abstract: Video recorded with mobile devices such as handheld cameras, head-mounted cameras, and mobile phones often suffers from undesired visual effects such as shaking, flickering, or periodic camera movements. The processes that eliminate these effects are called video stabilization. As video is increasingly shot from moving platforms such as cars, unmanned aircraft, and wearable devices, the need for inexpensive and effective video stabilization has grown. Although certain types of high-frequency motion can be eliminated with special equipment mounted on camera systems, such equipment is both expensive and power-consuming, and its motion range and degrees of freedom are limited. To overcome these limitations, computer vision methods estimate the undesired motion between video frames by tracking various types of visual features and then correct the camera path by warping frames according to the estimated motion. In these methods, the choice of visual features and tracking algorithm is crucial to performance, which raises the problem of determining which feature extractors and tracking algorithms should be used for different types of scenes. As an alternative to these methods, which are difficult to adapt to new types of scenes and computationally costly, learning-based methods have been developed in recent years. However, these works use supervised learning, which requires a dataset of paired stable and unstable videos of the same scene captured with special equipment; such datasets contain only a limited number of scenes. Hence, the resulting models are dataset-dependent and cannot easily be adapted to new scenes with different contexts and semantics.
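As a rough illustration of the classical estimate-then-correct pipeline described above, the following Python sketch smooths an already estimated camera path and derives per-frame corrections. It assumes that per-frame motion estimates (dx, dy, dtheta) have been produced by some feature tracker, which is not shown. The function names, the moving-average filter, and the `radius` parameter are illustrative assumptions rather than a method specified in this project.

```python
import numpy as np


def smooth_trajectory(trajectory, radius=5):
    """Moving-average filter over each column of a (T, 3) camera path."""
    window = 2 * radius + 1
    kernel = np.ones(window) / window
    # Edge-pad so the filtered path keeps the same length as the input.
    padded = np.pad(trajectory, ((radius, radius), (0, 0)), mode="edge")
    columns = [
        np.convolve(padded[:, i], kernel, mode="valid")
        for i in range(trajectory.shape[1])
    ]
    return np.stack(columns, axis=1)


def stabilizing_corrections(frame_motions, radius=5):
    """Return per-frame offsets that move the camera path onto a smoothed path.

    frame_motions: (T, 3) array of per-frame (dx, dy, dtheta) estimates,
    assumed to come from a feature tracker (hypothetical input).
    """
    # Accumulate frame-to-frame motion into a camera trajectory.
    trajectory = np.cumsum(frame_motions, axis=0)
    smoothed = smooth_trajectory(trajectory, radius)
    # Adding these offsets to the original path yields the smoothed path.
    return smoothed - trajectory
```

In a full pipeline, each correction would be converted into a warp applied to the corresponding frame; that step, and the feature tracking itself, depend on the chosen motion model and are omitted here.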

In this project, we will explore the use of unsupervised learning techniques for a video processing problem. More specifically, a novel self-supervised video stabilization method that does not require stable video supervision during training will be developed, to overcome the dataset barrier of recent supervised learning methods and fill the gap in unsupervised video stabilization. Producing a new dataset with proper labels is expensive and time-consuming, even though unlabeled data is abundant. With this research, we will demonstrate that camera movements for video stabilization can be generated synthetically from the data itself and used for training. The goal of this new self-supervised learning approach is to develop a video stabilization model that is faster and more robust than previous non-learning-based methods, and more generalizable and more successful than deep supervised learning-based approaches that require stable video supervision. For this purpose, we will first study how to learn the visual features required to distinguish desired from undesired motion in videos. Then, with the help of these visual features, a suitable self-supervised learning method will be investigated for predicting the transformations that cause undesired motion. In the last stage, we will develop a novel video-to-video translation model that translates unstable input videos into stable videos using a conditional Generative Adversarial Network conditioned on the predicted transformations.
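The idea of generating camera movements synthetically from the data itself can be sketched as follows. This is a minimal Python illustration under strong assumptions: the motion model is restricted to integer translations with wrap-around (via `np.roll`), and the names `synthesize_shake`, `undo_shake`, and `max_shift` are hypothetical. The project does not specify the transformation family, which would more plausibly include rotations and other warps.

```python
import numpy as np


def synthesize_shake(frames, max_shift=4, seed=0):
    """Apply a random integer translation to every frame of a clip.

    frames: array of shape (T, H, W) or (T, H, W, C).
    Returns the shaken clip and the (T, 2) shifts, which serve as
    training labels obtained without any manual annotation.
    """
    rng = np.random.default_rng(seed)
    shifts = rng.integers(-max_shift, max_shift + 1, size=(len(frames), 2))
    shaken = np.stack([
        # Wrap-around translation along the height and width axes.
        np.roll(frame, (int(dy), int(dx)), axis=(0, 1))
        for frame, (dy, dx) in zip(frames, shifts)
    ])
    return shaken, shifts


def undo_shake(shaken, shifts):
    """Invert the synthetic shake given the (true or predicted) shifts."""
    return np.stack([
        np.roll(frame, (-int(dy), -int(dx)), axis=(0, 1))
        for frame, (dy, dx) in zip(shaken, shifts)
    ])
```

A model trained on such pairs would receive the shaken clip as input and be asked to predict the shifts; because the shifts are known by construction, no stable/unstable video pairs captured with special equipment are needed.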

Video stabilization is closely related to many other computer vision problems, such as video classification, video generation, future frame prediction, and motion detection. Therefore, knowledge, skills, and experience related to all these problems will be gained, new researchers will be trained on related problems, and thesis studies will be conducted. The video stabilization method to be developed can be used in various industries, such as the automotive, defense, and healthcare industries. This project also has high potential to lead to new projects.