{"id":742,"date":"2020-12-15T16:37:00","date_gmt":"2020-12-15T07:37:00","guid":{"rendered":"https:\/\/arithmer.blog\/?p=742"},"modified":"2022-03-08T15:45:00","modified_gmt":"2022-03-08T06:45:00","slug":"summarizing-videos-with-attention","status":"publish","type":"post","link":"https:\/\/arithmer.blog\/blog\/summarizing-videos-with-attention","title":{"rendered":"Video Summarization Using the Attention Mechanism"},"content":{"rendered":"\n<p class=\"has-small-font-size\">This material was originally circulated as an internal company document on December 15, 2020, and has been revised for the web.<\/p>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"purpose\"><strong>\u25a0Purpose<\/strong><\/h3>\n\n\n\n<p style=\"font-size:18px\"><strong>Purpose of this material<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>Explore a solution to the task of video summarization using attention.<\/li><\/ul>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"agenda\"><strong>\u25a0Agenda<\/strong><\/h3>\n\n\n\n<ul style=\"font-size:16px\"><li><strong>Introduction<\/strong><ul><li>Motivation<\/li><li>Contributions<\/li><\/ul><\/li><li><strong>Dataset<\/strong><\/li><li><strong>VASNet<\/strong><ul><li>Feature Extraction<\/li><li>Attention Network<\/li><li>Regressor Network<\/li><\/ul><\/li><li><strong>Inference<\/strong><ul><li>Changepoint Detection<\/li><li>Kernel Temporal Segmentation<\/li><\/ul><\/li><li><strong>Results<\/strong><ul><li>Measuring method<\/li><li>Dataset Results<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"introduction\"><strong>\u25a0Introduction<\/strong><\/h3>\n\n\n\n<p style=\"font-size:18px\"><strong>Motivation<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>Early video summarization 
methods were unsupervised, leveraging low-level spatio-temporal features and dimensionality reduction with clustering techniques. The success of these methods rests solely on the ability to define <strong>distance\/cost functions between the keyshots\/frames with respect to the original video.<\/strong><br><\/li><li>Current state-of-the-art methods for video summarization are based on recurrent encoder-decoder architectures, <strong>usually with bidirectional LSTM or GRU and soft attention<\/strong>. They are computationally demanding, especially in the bidirectional configuration.<\/li><\/ul>\n\n\n\n<p style=\"font-size:18px\"><strong>Contributions<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>A novel approach to sequence-to-sequence transformation for video summarization based on a soft self-attention mechanism. In contrast, the current state of the art relies on complex LSTM\/GRU encoder-decoder methods.<br><\/li><li>A demonstration that a recurrent network can be successfully replaced with a simpler attention mechanism for video summarization.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"281\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_01.jpg\" alt=\"\" class=\"wp-image-755\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_01.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_01-300x82.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_01-768x211.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_01-304x83.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"dataset\"><strong>\u25a0Dataset<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"433\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_02.jpg\" 
alt=\"\" class=\"wp-image-756\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_02.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_02-300x127.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_02-768x325.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_02-304x129.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p style=\"font-size:16px\"><a href=\"https:\/\/github.com\/yalesong\/tvsum\" target=\"_blank\" rel=\"noreferrer noopener\">TVSum Dataset<br><\/a><a href=\"https:\/\/gyglim.github.io\/me\/vsum\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">SumMe Dataset<\/a><\/p>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"vasnet\"><strong>\u25a0VASNet<\/strong><\/h3>\n\n\n\n<p style=\"font-size:18px\"><strong>Feature Extraction<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>Given a time interval t, every 15 frames are collected in an ordered set X.<\/li><li>Each set is then used as input to GoogLeNet for feature extraction.<\/li><li>We then extract the output of the Pool 5 layer of GoogLeNet, which is a 1024-dimensional vector (D = 1024).<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"238\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_03.jpg\" alt=\"\" class=\"wp-image-757\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_03.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_03-300x70.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_03-768x179.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_03-304x71.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p id=\"attention-network\" style=\"font-size:18px\"><strong>Attention Network<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img 
decoding=\"async\" width=\"1024\" height=\"365\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_04.jpg\" alt=\"\" class=\"wp-image-758\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_04.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_04-300x107.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_04-768x274.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_04-304x108.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"417\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_05.jpg\" alt=\"\" class=\"wp-image-759\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_05.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_05-300x122.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_05-768x313.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_05-304x124.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p style=\"font-size:18px\"><strong>Regressor Network<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"328\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_06.jpg\" alt=\"\" class=\"wp-image-747\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_06.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_06-300x96.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_06-768x246.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_06-304x97.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" 
height=\"410\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_07.jpg\" alt=\"\" class=\"wp-image-748\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_07.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_07-300x120.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_07-768x308.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_07-304x122.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"inference\"><strong>\u25a0Inference<\/strong><\/h3>\n\n\n\n<ul style=\"font-size:16px\"><li>The output of VASNet is a per-frame importance probability.<\/li><li>This probability must be analyzed within the scene to which the frame belongs.<\/li><li>However, the number of frames in each scene varies from video to video.<\/li><li>The problem of finding the frames where a scene change occurs is called changepoint detection.<\/li><li>For the datasets used, the changepoints (cps) are precomputed with the KTS algorithm with hyperparameter tuning.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"147\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_08.jpg\" alt=\"\" class=\"wp-image-749\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_08.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_08-300x43.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_08-768x110.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_08-304x44.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p style=\"font-size:18px\"><strong>Changepoint detection<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>In statistical analysis, change detection or change point detection tries to identify times 
when the probability distribution of a stochastic process or time series changes. In general, the problem concerns both detecting whether one or several changes have occurred and identifying the times of any such changes.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"395\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_09.jpg\" alt=\"\" class=\"wp-image-750\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_09.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_09-300x116.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_09-768x296.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_09-304x117.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p style=\"font-size:18px\"><strong>Kernel Temporal Segmentation (KTS)<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>The Kernel Temporal Segmentation (KTS) method splits the video into a set of non-intersecting temporal segments.<\/li><li>It treats changepoint (cps) detection as a dynamic programming problem.<\/li><li>The method is fast and accurate when combined with high-dimensional descriptors.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"328\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_10.jpg\" alt=\"\" class=\"wp-image-751\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_10.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_10-300x96.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_10-768x246.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_10-304x97.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<h3 class=\"has-medium-font-size 
wp-block-heading\" id=\"results\"><strong>\u25a0Results<\/strong><\/h3>\n\n\n\n<p style=\"font-size:18px\"><strong>Measuring method<\/strong><\/p>\n\n\n\n<p style=\"font-size:16px\">P: Precision<br>R: Recall<br>F Score: [2 * P * R \/ (P + R)] * 100<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"157\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_11.jpg\" alt=\"\" class=\"wp-image-752\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_11.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_11-300x46.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_11-768x118.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_11-304x47.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p style=\"font-size:18px\"><strong>Dataset Results<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1024\" height=\"279\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_12.jpg\" alt=\"\" class=\"wp-image-753\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_12.jpg 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_12-300x82.jpg 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_12-768x209.jpg 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20201215_12-304x83.jpg 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<ul style=\"font-size:16px\"><li><a rel=\"noreferrer noopener\" href=\"https:\/\/www.youtube.com\/watch?v=873CBVbPJVE\" target=\"_blank\">Long video<\/a><\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/www.youtube.com\/watch?v=weW4memH3Dg\" target=\"_blank\">Summarize<\/a><\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/www.youtube.com\/playlist?list=PLEdpjt8KmmQMfQEat4HvuIxORwiO9q9DB\" 
target=\"_blank\">Full playlist<\/a><\/li><\/ul>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"references\"><strong>\u25a0References<\/strong><\/h3>\n\n\n\n<ul style=\"font-size:16px\"><li>VASNet: https:\/\/arxiv.org\/pdf\/1812.01969.pdf<\/li><li>VASNet official implementation: https:\/\/github.com\/ok1zjf\/VASNet<\/li><li>KTS implementation: https:\/\/github.com\/TatsuyaShirakawa\/KTS<\/li><li>Video summarization datasets and review: https:\/\/hal.inria.fr\/hal-01022967\/PDF\/video_summarization.pdf<\/li><li>Issue on testing on own videos: https:\/\/github.com\/ok1zjf\/VASNet\/issues\/2<\/li><\/ul>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"\u30c0\u30a6\u30f3\u30ed\u30fc\u30c9\"><strong>\u25a0Download<\/strong><\/h3>\n\n\n\n<p style=\"font-size:16px\"><a rel=\"noreferrer noopener\" href=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/14_Attention\u6a5f\u69cb\u3092\u4f7f\u3063\u305f\u52d5\u753b\u8981\u7d04.pdf\" target=\"_blank\">Attention\u6a5f\u69cb\u3092\u4f7f\u3063\u305f\u52d5\u753b\u8981\u7d04.pdf<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This material was originally circulated as an internal company document on December 15, 2020, and has been revised for the web. \u25a0Purpose Purpose of this material Explore a solut &#8230; 
<\/p>\n","protected":false},"author":3,"featured_media":754,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[20,51,77,79,78,24,36],"_links":{"self":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts\/742"}],"collection":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=742"}],"version-history":[{"count":4,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts\/742\/revisions"}],"predecessor-version":[{"id":763,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts\/742\/revisions\/763"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/media\/754"}],"wp:attachment":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=742"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=742"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=742"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}