Video Summarization with an Attention Mechanism

This material was originally distributed as an internal document on December 15, 2020, and has been reworked for the web.

■Purpose

Purpose of this material

  • Explore a solution to the task of video summarization using attention.

■Agenda

  • Introduction
    • Motivation
    • Contributions
  • Dataset
  • VASNet
    • Feature Extraction
    • Attention Network
    • Regressor Network
  • Inference
    • Changepoint Detection
    • Kernel Temporal Segmentation
  • Results
    • Measuring method
    • Dataset Results

■Introduction

Motivation

  • Early video summarization methods were unsupervised, leveraging low-level spatio-temporal features and dimensionality reduction with clustering techniques. The success of these methods rests entirely on the ability to define distance/cost functions between the keyshots/frames and the original video.
  • Current state-of-the-art methods for video summarization are based on recurrent encoder-decoder architectures, usually with a bidirectional LSTM or GRU and soft attention. They are computationally demanding, especially in the bidirectional configuration.

Contribution

  • A novel approach to sequence-to-sequence transformation for video summarization, based on a soft self-attention mechanism. In contrast, the current state of the art relies on complex LSTM/GRU encoder-decoder methods.
  • A demonstration that a recurrent network can be successfully replaced with a simpler attention mechanism for video summarization.

■Dataset

TVSum dataset: 50 YouTube videos spanning 10 categories, annotated with frame-level importance scores by 20 users.
SumMe dataset: 25 user videos, each annotated with multiple human-created summaries.

■VASNet

Feature Extraction

  • Given a time interval t, every 15th frame is collected into an ordered set X.
  • Each set is then used as input to GoogLeNet for feature extraction.
  • Then we extract the pool5 layer of GoogLeNet, which is a 1024-dimensional vector (D = 1024).
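
As a concrete illustration, here is a minimal sketch of this feature extraction step. It assumes torchvision's pretrained GoogLeNet as a stand-in for the network used in the paper, and the function and variable names are ours:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pretrained GoogLeNet; torchvision's ImageNet weights are an
    # assumption (the paper extracts pool5 features from the same
    # architecture).
    net = models.googlenet(pretrained=True)
    net.eval()

    # Capture the 1024-d global-average-pool ("pool5") output with a hook.
    features = {}
    def hook(module, inp, out):
        features["pool5"] = torch.flatten(out, 1)   # (N, 1024)
    net.avgpool.register_forward_hook(hook)

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])

    def extract(frames):
        """frames: list of PIL images, one per 15-frame step."""
        x = torch.stack([preprocess(f) for f in frames])  # (N, 3, 224, 224)
        with torch.no_grad():
            net(x)
        return features["pool5"]                          # the sequence X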

Attention Network
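
The attention network scores every pair of frames against each other with soft self-attention. Below is a minimal single-head sketch in PyTorch; the projection sizes, the scaling factor, and the output projection are assumptions rather than the exact layers of the official implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        """Single-head soft self-attention over the whole frame sequence."""
        def __init__(self, d=1024):
            super().__init__()
            self.Q = nn.Linear(d, d, bias=False)   # query projection
            self.K = nn.Linear(d, d, bias=False)   # key projection
            self.V = nn.Linear(d, d, bias=False)   # value projection
            self.out = nn.Linear(d, d, bias=False)
            self.scale = d ** -0.5

        def forward(self, x):                      # x: (N, 1024) features
            q, k, v = self.Q(x), self.K(x), self.V(x)
            logits = q @ k.t() * self.scale        # (N, N) frame-pair scores
            alpha = F.softmax(logits, dim=-1)      # attention weights
            c = alpha @ v                          # context vector per frame
            return self.out(c), alpha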

Regressor Network
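
The regressor network maps each attended feature vector to a single importance score in [0, 1]. A minimal sketch building on the SelfAttention module above; the residual connection, dropout rate, and layer sizes are assumptions:

    import torch
    import torch.nn as nn

    class Regressor(nn.Module):
        """Per-frame importance scores from attended features."""
        def __init__(self, d=1024):
            super().__init__()
            self.att = SelfAttention(d)
            self.norm = nn.LayerNorm(d)
            self.fc1 = nn.Linear(d, d)
            self.fc2 = nn.Linear(d, 1)
            self.drop = nn.Dropout(0.5)

        def forward(self, x):                      # x: (N, 1024)
            c, alpha = self.att(x)
            y = self.norm(self.drop(c + x))        # residual + layer norm
            y = self.drop(torch.relu(self.fc1(y)))
            return torch.sigmoid(self.fc2(y)).squeeze(-1)   # (N,) scores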

■Inference

  • The output of the VASNet model is a per-frame importance probability.
  • This probability must be analyzed within the scene it belongs to.
  • However, the number of frames per scene varies from video to video.
  • The problem of finding the frames where a scene change occurs is called changepoint detection.
  • For the datasets used here, the changepoints (cps) are already computed using the KTS algorithm with hyperparameter tuning. A sketch of how scores and changepoints combine into a summary follows this list.
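
As a concrete sketch of this step: segments are scored by their mean frame probability and a subset is selected under a length budget. The 15% frame budget and the 0/1 knapsack selection below follow the common evaluation convention for these datasets; they are an assumption here, not something the original slides spell out:

    import numpy as np

    def make_summary(scores, cps, n_frames, budget=0.15):
        """scores: (n_frames,) per-frame probabilities from VASNet.
        cps: list of (start, end) inclusive segments from KTS."""
        seg_scores = [scores[s:e + 1].mean() for s, e in cps]
        seg_lens = [e - s + 1 for s, e in cps]
        cap = int(n_frames * budget)

        # 0/1 knapsack over segments: maximize total score within the budget.
        best = np.zeros(cap + 1)
        keep = np.zeros((len(cps), cap + 1), dtype=bool)
        for i, (v, w) in enumerate(zip(seg_scores, seg_lens)):
            for c in range(cap, w - 1, -1):
                if best[c - w] + v > best[c]:
                    best[c] = best[c - w] + v
                    keep[i, c] = True

        # Backtrack to recover the selected segments as a frame mask.
        summary = np.zeros(n_frames, dtype=bool)
        c = cap
        for i in range(len(cps) - 1, -1, -1):
            if keep[i, c]:
                s, e = cps[i]
                summary[s:e + 1] = True
                c -= seg_lens[i]
        return summary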

Changepoint detection

  • In statistical analysis, change point detection tries to identify the times at which the probability distribution of a stochastic process or time series changes. In general, the problem concerns both detecting whether one or more changes have occurred and identifying the times of those changes.

Kernel Temporal Segmentation (KTS)

  • Kernel Temporal Segmentation (KTS) method splits the video into a set of non-intersecting temporal segments.
  • It treats the cps detection as a dynamic programming problem (a simplified sketch follows this list).
  • The method is fast and accurate when combined with high-dimensional descriptors.
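
The linked repository implements the original formulation; for intuition, here is a simplified sketch of the core dynamic program, assuming a linear kernel and a fixed number of changepoints m (the full algorithm additionally chooses m with a penalty term):

    import numpy as np

    def kts(X, m):
        """X: (n, d) frame features; returns m changepoint indices that
        minimize total within-segment scatter under a linear kernel."""
        n = len(X)
        K = X @ X.T                                  # Gram matrix
        diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
        P = np.zeros((n + 1, n + 1))                 # 2-D prefix sums of K
        P[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

        def cost(s, e):                              # scatter of frames [s, e)
            block = P[e, e] - P[s, e] - P[e, s] + P[s, s]
            return (diag_cum[e] - diag_cum[s]) - block / (e - s)

        INF = float("inf")
        D = np.full((m + 2, n + 1), INF)             # D[k, t]: k segments, t frames
        D[0, 0] = 0.0
        back = np.zeros((m + 2, n + 1), dtype=int)
        for k in range(1, m + 2):                    # m changepoints = m + 1 segments
            for t in range(k, n + 1):
                for s in range(k - 1, t):
                    c = D[k - 1, s] + cost(s, t)
                    if c < D[k, t]:
                        D[k, t] = c
                        back[k, t] = s

        cps, t = [], n                               # backtrack the split points
        for k in range(m + 1, 0, -1):
            t = back[k, t]
            cps.append(t)
        return sorted(cps)[1:]                       # drop the leading 0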

■Results

Measuring method

P: precision (overlapping frames / total frames in the machine summary)
R: recall (overlapping frames / total frames in the user summary)
F score: [2 * P * R / (P + R)] * 100
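
For reference, a small helper computing this F score from binary per-frame masks; the overlap-based definitions of P and R above are the convention used by these benchmarks:

    import numpy as np

    def f_score(pred, gt):
        """Keyshot F score between two binary per-frame summary masks."""
        overlap = float(np.logical_and(pred, gt).sum())
        if overlap == 0:
            return 0.0
        p = overlap / pred.sum()    # precision: overlap / machine summary
        r = overlap / gt.sum()      # recall: overlap / user summary
        return 2 * p * r / (p + r) * 100

When a video has several user summaries (as in SumMe and TVSum), the per-video score is typically aggregated over them, e.g. by taking the mean or the maximum.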

Dataset Results

■References

  • VASNet: https://arxiv.org/pdf/1812.01969.pdf
  • VASNet official implementation: https://github.com/ok1zjf/VASNet
  • KTS implementation: https://github.com/TatsuyaShirakawa/KTS
  • Video summarization datasets and review: https://hal.inria.fr/hal-01022967/PDF/video_summarization.pdf
  • Issue on testing on own videos: https://github.com/ok1zjf/VASNet/issues/2

■Download

Attention機構を使った動画要約.pdf