Video Summarization with an Attention Mechanism

This material was originally distributed as an internal document on December 15, 2020, and has been reworked for the web.

■Purpose

Purpose of this material

  • Explore a solution to the task of video summarization using attention.

■Agenda

  • Introduction
    • Motivation
    • Contributions
  • Dataset
  • VASNet
    • Feature Extraction
    • Attention Network
    • Regressor Network
  • Inference
    • Changepoint Detection
    • Kernel Temporal Segmentation
  • Results
    • Measuring method
    • Dataset Results

■Introduction

Motivation

  • Early video summarization methods were unsupervised, leveraging low-level spatio-temporal features and dimensionality reduction with clustering techniques. The success of these methods rests entirely on the ability to define distance/cost functions between the keyshots/frames and the original video.
  • Current state-of-the-art methods for video summarization are based on recurrent encoder-decoder architectures, usually with a bidirectional LSTM or GRU and soft attention. They are computationally demanding, especially in the bidirectional configuration.

Contribution

  • A novel approach to sequence-to-sequence transformation for video summarization, based on a soft self-attention mechanism. In contrast, the current state of the art relies on complex LSTM/GRU encoder-decoder methods.
  • A demonstration that a recurrent network can be successfully replaced with a simpler attention mechanism for video summarization.

■Dataset

TVSum dataset: 50 YouTube videos spanning 10 categories, annotated with frame-level importance scores by 20 users.
SumMe dataset: 25 user videos, each annotated with multiple human-created summaries.

■VASNet

Feature Extraction

  • Given a time interval t, every 15th frame is collected into an ordered set X.
  • Each set is then used as input to GoogLeNet for feature extraction.
  • Then we extract the pool5 layer of GoogLeNet, which is a 1024-dimensional vector (D = 1024).
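
As a concrete illustration, here is a minimal sketch of this feature extraction step. It assumes torchvision's pretrained GoogLeNet as a stand-in for the network used in the paper, and the function and variable names are ours:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pretrained GoogLeNet; torchvision's ImageNet weights are an
    # assumption (the paper extracts pool5 features from the same
    # architecture).
    net = models.googlenet(pretrained=True)
    net.eval()

    # Capture the 1024-d global-average-pool ("pool5") output with a hook.
    features = {}
    def hook(module, inp, out):
        features["pool5"] = torch.flatten(out, 1)   # (N, 1024)
    net.avgpool.register_forward_hook(hook)

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])

    def extract(frames):
        """frames: list of PIL images, one per 15-frame step."""
        x = torch.stack([preprocess(f) for f in frames])  # (N, 3, 224, 224)
        with torch.no_grad():
            net(x)
        return features["pool5"]                          # the sequence X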

Attention Network
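
The attention network scores every pair of frames against each other with soft self-attention. Below is a minimal single-head sketch in PyTorch; the projection sizes, the scaling factor, and the output projection are assumptions rather than the exact layers of the official implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        """Single-head soft self-attention over the whole frame sequence."""
        def __init__(self, d=1024):
            super().__init__()
            self.Q = nn.Linear(d, d, bias=False)   # query projection
            self.K = nn.Linear(d, d, bias=False)   # key projection
            self.V = nn.Linear(d, d, bias=False)   # value projection
            self.out = nn.Linear(d, d, bias=False)
            self.scale = d ** -0.5

        def forward(self, x):                      # x: (N, 1024) features
            q, k, v = self.Q(x), self.K(x), self.V(x)
            logits = q @ k.t() * self.scale        # (N, N) frame-pair scores
            alpha = F.softmax(logits, dim=-1)      # attention weights
            c = alpha @ v                          # context vector per frame
            return self.out(c), alpha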

Regressor Network
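
The regressor network maps each attended feature vector to a single importance score in [0, 1]. A minimal sketch building on the SelfAttention module above; the residual connection, dropout rate, and layer sizes are assumptions:

    import torch
    import torch.nn as nn

    class Regressor(nn.Module):
        """Per-frame importance scores from attended features."""
        def __init__(self, d=1024):
            super().__init__()
            self.att = SelfAttention(d)
            self.norm = nn.LayerNorm(d)
            self.fc1 = nn.Linear(d, d)
            self.fc2 = nn.Linear(d, 1)
            self.drop = nn.Dropout(0.5)

        def forward(self, x):                      # x: (N, 1024)
            c, alpha = self.att(x)
            y = self.norm(self.drop(c + x))        # residual + layer norm
            y = self.drop(torch.relu(self.fc1(y)))
            return torch.sigmoid(self.fc2(y)).squeeze(-1)   # (N,) scores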

■Inference

  • The output of the VASNet model is a per-frame importance probability.
  • This probability must be analyzed within the scene it belongs to.
  • However, the number of frames per scene varies from video to video.
  • The problem of finding the frames where a scene change occurs is called changepoint detection.
  • For the datasets used here, the changepoints (cps) are already computed using the KTS algorithm with hyperparameter tuning. A sketch of how scores and changepoints combine into a summary follows this list.
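
As a concrete sketch of this step: segments are scored by their mean frame probability and a subset is selected under a length budget. The 15% frame budget and the 0/1 knapsack selection below follow the common evaluation convention for these datasets; they are an assumption here, not something the original slides spell out:

    import numpy as np

    def make_summary(scores, cps, n_frames, budget=0.15):
        """scores: (n_frames,) per-frame probabilities from VASNet.
        cps: list of (start, end) inclusive segments from KTS."""
        seg_scores = [scores[s:e + 1].mean() for s, e in cps]
        seg_lens = [e - s + 1 for s, e in cps]
        cap = int(n_frames * budget)

        # 0/1 knapsack over segments: maximize total score within the budget.
        best = np.zeros(cap + 1)
        keep = np.zeros((len(cps), cap + 1), dtype=bool)
        for i, (v, w) in enumerate(zip(seg_scores, seg_lens)):
            for c in range(cap, w - 1, -1):
                if best[c - w] + v > best[c]:
                    best[c] = best[c - w] + v
                    keep[i, c] = True

        # Backtrack to recover the selected segments as a frame mask.
        summary = np.zeros(n_frames, dtype=bool)
        c = cap
        for i in range(len(cps) - 1, -1, -1):
            if keep[i, c]:
                s, e = cps[i]
                summary[s:e + 1] = True
                c -= seg_lens[i]
        return summary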

Changepoint detection

  • In statistical analysis, change point detection tries to identify the times at which the probability distribution of a stochastic process or time series changes. In general, the problem concerns both detecting whether one or more changes have occurred and identifying the times of those changes.

Kernel Temporal Segmentation (KTS)

  • Kernel Temporal Segmentation (KTS) method splits the video into a set of non-intersecting temporal segments.
  • It treats the cps detection as a dynamic programming problem (a simplified sketch follows this list).
  • The method is fast and accurate when combined with high-dimensional descriptors.
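
The linked repository implements the original formulation; for intuition, here is a simplified sketch of the core dynamic program, assuming a linear kernel and a fixed number of changepoints m (the full algorithm additionally chooses m with a penalty term):

    import numpy as np

    def kts(X, m):
        """X: (n, d) frame features; returns m changepoint indices that
        minimize total within-segment scatter under a linear kernel."""
        n = len(X)
        K = X @ X.T                                  # Gram matrix
        diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
        P = np.zeros((n + 1, n + 1))                 # 2-D prefix sums of K
        P[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

        def cost(s, e):                              # scatter of frames [s, e)
            block = P[e, e] - P[s, e] - P[e, s] + P[s, s]
            return (diag_cum[e] - diag_cum[s]) - block / (e - s)

        INF = float("inf")
        D = np.full((m + 2, n + 1), INF)             # D[k, t]: k segments, t frames
        D[0, 0] = 0.0
        back = np.zeros((m + 2, n + 1), dtype=int)
        for k in range(1, m + 2):                    # m changepoints = m + 1 segments
            for t in range(k, n + 1):
                for s in range(k - 1, t):
                    c = D[k - 1, s] + cost(s, t)
                    if c < D[k, t]:
                        D[k, t] = c
                        back[k, t] = s

        cps, t = [], n                               # backtrack the split points
        for k in range(m + 1, 0, -1):
            t = back[k, t]
            cps.append(t)
        return sorted(cps)[1:]                       # drop the leading 0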

■Results

Measuring method

P: precision (overlapping frames / total frames in the machine summary)
R: recall (overlapping frames / total frames in the user summary)
F score: [2 * P * R / (P + R)] * 100
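
For reference, a small helper computing this F score from binary per-frame masks; the overlap-based definitions of P and R above are the convention used by these benchmarks:

    import numpy as np

    def f_score(pred, gt):
        """Keyshot F score between two binary per-frame summary masks."""
        overlap = float(np.logical_and(pred, gt).sum())
        if overlap == 0:
            return 0.0
        p = overlap / pred.sum()    # precision: overlap / machine summary
        r = overlap / gt.sum()      # recall: overlap / user summary
        return 2 * p * r / (p + r) * 100

When a video has several user summaries (as in SumMe and TVSum), the per-video score is typically aggregated over them, e.g. by taking the mean or the maximum.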

Dataset Results

■References

  • VASNet: https://arxiv.org/pdf/1812.01969.pdf
  • VASNet official implementation: https://github.com/ok1zjf/VASNet
  • KTS implementation: https://github.com/TatsuyaShirakawa/KTS
  • Video summarization datasets and review: https://hal.inria.fr/hal-01022967/PDF/video_summarization.pdf
  • Issue on testing on own videos: https://github.com/ok1zjf/VASNet/issues/2

■Download

Attention機構を使った動画要約.pdf