{"id":325,"date":"2021-07-08T13:59:00","date_gmt":"2021-07-08T04:59:00","guid":{"rendered":"https:\/\/arithmer.blog\/?p=325"},"modified":"2022-03-08T15:44:04","modified_gmt":"2022-03-08T06:44:04","slug":"video-inference-for-human-body-pose-and-shape-estimation","status":"publish","type":"post","link":"https:\/\/arithmer.blog\/blog\/video-inference-for-human-body-pose-and-shape-estimation","title":{"rendered":"VIBE:\u52d5\u753b\u304b\u3089\u306e\u4eba\u4f53\u306e\u59ff\u52e2\u30fb\u5f62\u72b6\u63a8\u5b9a"},"content":{"rendered":"\n<p class=\"has-small-font-size\">\u672c\u8cc7\u6599\u306f2021\u5e747\u670808\u65e5\u306b\u793e\u5185\u5171\u6709\u8cc7\u6599\u3068\u3057\u3066\u5c55\u958b\u3057\u3066\u3044\u305f\u3082\u306e\u3092WEB\u30da\u30fc\u30b8\u5411\u3051\u306b\u30ea\u30cb\u30e5\u30fc\u30a2\u30eb\u3057\u305f\u5185\u5bb9\u306b\u306a\u308a\u307e\u3059\u3002<\/p>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"contents\"><strong>\u25a0Contents<\/strong><\/h3>\n\n\n\n<p id=\"introduction\" style=\"font-size:16px\"><strong>Introduction<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>Problem to Solve<\/li><\/ul>\n\n\n\n<p id=\"dataset\" style=\"font-size:16px\"><strong>Dataset<\/strong><\/p>\n\n\n\n<p id=\"vibe-approach\" style=\"font-size:16px\"><strong>VIBE approach<\/strong><\/p>\n\n\n\n<ul style=\"font-size:16px\"><li>Pretrained Model<\/li><li>Temporal Encoder<\/li><li>Motion Discriminator<\/li><\/ul>\n\n\n\n<p id=\"results\" style=\"font-size:16px\"><strong>Results<\/strong><\/p>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"problem\"><strong>\u25a0Problem<\/strong><\/h3>\n\n\n\n<ul><li><strong>Lack of in-the-wild ground-truth 3D<\/strong><\/li><li><strong>Previous work combine indoor 3D datasets with videos having<\/strong><br><strong>2D ground-truth or pseudo ground-truth keypoint annotations<\/strong><ul><li>Indoor 3D are limited in the number of subjects, range of motion and image complexity<\/li><li>Poor amount of video labeled with ground-truth 2D pose<\/li><li>Pseudo-ground-truth 2D labels are not reliable for modeling 3D human motion<\/li><\/ul><\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img decoding=\"async\" width=\"1024\" height=\"201\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_01.png\" alt=\"\" class=\"wp-image-331\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_01.png 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_01-300x59.png 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_01-768x151.png 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_01-304x60.png 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<p class=\"has-text-align-center has-small-font-size\">\u203bLearning 3D Human Dynamics from Video \u2013 https:\/\/arxiv.org\/pdf\/1812.01601.pdf<\/p>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"dataset\"><strong>\u25a0Dataset<\/strong><\/h3>\n\n\n\n<p id=\"dataset\" style=\"font-size:18px\"><strong>AMASS dataset for 3D motion capture<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img decoding=\"async\" width=\"1024\" height=\"358\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_02.png\" alt=\"\" class=\"wp-image-332\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_02.png 1024w, 
■What is VIBE

"Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a novel temporal network architecture with a self-attention mechanism and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels."

"Adversarial learning framework" and "discriminate" are terms used when referring to generative adversarial networks (GANs). The architecture involves the simultaneous training of two models: the generator and the discriminator. (Thanks to Enrico for the notes: https://www.notion.so/Generative-Adversarial-Networks-0692b1ea34e641a0ae011237345a51c4)

Novel temporal network architecture: since we are analyzing videos, the concept of a sequence is implied. VIBE uses gated recurrent units (GRUs) to capture the sequential nature of human motion, and a self-attention mechanism is used to amplify the contribution of the most distinctive frames. A small sketch of this idea follows.
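To make the GRU-plus-attention idea concrete, here is a minimal PyTorch sketch, not the official VIBE code: the 2048-d feature size, the hidden size, and the `TemporalEncoder` name are illustrative assumptions, and the exact place where the attention pooling sits in the paper's networks may differ. The sketch runs a GRU over per-frame features and uses a learned soft-attention score to weight frames when pooling a sequence summary.

```python
# Minimal sketch of a GRU temporal encoder with soft self-attention over frames.
# Dimensions and layer names are illustrative assumptions, not the official VIBE code.
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=1024):
        super().__init__()
        # GRU models the sequential nature of human motion across frames.
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # One scalar score per frame; softmax turns scores into attention weights.
        self.attention = nn.Linear(hidden_dim, 1)
        # Project back to the per-frame feature size expected by a downstream regressor.
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feats):                     # feats: (B, T, feat_dim)
        hidden, _ = self.gru(feats)               # (B, T, hidden_dim)
        scores = self.attention(hidden)           # (B, T, 1)
        weights = torch.softmax(scores, dim=1)    # emphasize the most distinctive frames
        pooled = (weights * hidden).sum(dim=1)    # (B, hidden_dim) sequence summary
        per_frame = self.out(hidden)              # (B, T, feat_dim) refined per-frame features
        return per_frame, pooled


frames = torch.randn(2, 16, 2048)                 # e.g. 16-frame clips, batch of 2
encoder = TemporalEncoder()
per_frame, pooled = encoder(frames)
print(per_frame.shape, pooled.shape)              # torch.Size([2, 16, 2048]) torch.Size([2, 1024])
```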
■Elements of VIBE

Architectures used:

- YOLOv3, for detecting the person bounding box
- ResNet-50, for feature extraction
- GRU, for sequence encoding
- Self-attention, for frame scoring
- GAN, for adversarial training and loss

■VIBE architecture

[Figure: https://arithmer.blog/wp-content/uploads/2022/02/NS20210708_03.png]

■Pre-trained model

A sequence of $T$ frames is fed to a convolutional network $f$, which functions as a feature extractor and outputs a vector $f_i \in \mathbb{R}^{2048}$ for each frame. A minimal sketch of this step is shown after the figure below.

[Figure: https://arithmer.blog/wp-content/uploads/2022/02/NS20210708_04.png]
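As a concrete illustration of the feature-extraction step, the sketch below runs a ResNet-50 backbone over each frame of a cropped person clip and keeps the 2048-dimensional pooled vector per frame. Using torchvision's ResNet-50 here is an assumption for the sake of a runnable example; in practice the backbone would carry weights pretrained for single-image pose and shape regression.

```python
# Minimal sketch: extract one 2048-d feature vector f_i per frame with a ResNet-50 backbone.
# torchvision's ResNet-50 is an illustrative stand-in for the pretrained backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)      # swap in pretrained weights in practice
backbone.fc = nn.Identity()            # drop the classifier; keep the 2048-d pooled features
backbone.eval()

clip = torch.randn(16, 3, 224, 224)    # T=16 cropped person frames (placeholder values)
with torch.no_grad():
    feats = backbone(clip)             # (T, 2048): one feature vector f_i per frame
print(feats.shape)                     # torch.Size([16, 2048])
```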
srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_06.png 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_06-300x114.png 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_06-768x292.png 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_06-304x115.png 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"motion-discriminator\"><strong>\u25a0Motion Discriminator<\/strong><\/h3>\n\n\n\n<p style=\"font-size:16px\">Enforces the generator to produce feasible real world poses that are aligned with 2D joint locations.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img decoding=\"async\" width=\"1024\" height=\"382\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_07.png\" alt=\"\" class=\"wp-image-337\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_07.png 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_07-300x112.png 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_07-768x287.png 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_07-304x113.png 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"results\"><strong>\u25a0Results<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img decoding=\"async\" width=\"1024\" height=\"359\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_08.png\" alt=\"\" class=\"wp-image-338\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_08.png 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_08-300x105.png 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_08-768x269.png 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_08-304x107.png 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<ol style=\"font-size:16px\"><li>Kanazawa et al., End-to-end Recovery of Human Shape and Pose, CVPR 2018<\/li><li>Kanazawa et al., Learning 3D Human Dynamics from Video, CVPR 2019<\/li><li>Kolotouros et al., Learning to Reconstruct 3D Human Pose and Shape via Modeling-fitting in the Loop, ICCV 2019<\/li><\/ol>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img decoding=\"async\" width=\"1024\" height=\"401\" src=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_09.png\" alt=\"\" class=\"wp-image-329\" srcset=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_09.png 1024w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_09-300x117.png 300w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_09-768x301.png 768w, https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/NS20210708_09-304x119.png 304w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><\/figure>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"\u30c0\u30a6\u30f3\u30ed\u30fc\u30c9\"><strong>\u25a0\u30c0\u30a6\u30f3\u30ed\u30fc\u30c9<\/strong><\/h3>\n\n\n\n<p style=\"font-size:16px\"><a 
href=\"https:\/\/arithmer.blog\/wp-content\/uploads\/2022\/02\/22_VIBE\uff1a\u52d5\u753b\u304b\u3089\u306e\u4eba\u4f53\u306e\u59ff\u52e2\u30fb\u5f62\u72b6\u63a8\u5b9a_VIBE-VideoInferenceForHumanBodyPoseAnd2021_0708.pdf\">VIBE:\u52d5\u753b\u304b\u3089\u306e\u4eba\u4f53\u306e\u59ff\u52e2\u30fb\u5f62\u72b6\u63a8\u5b9a.pdf<\/a><\/p>\n\n\n\n<h3 class=\"has-medium-font-size wp-block-heading\" id=\"reference\"><strong>\u25a0Reference<\/strong><\/h3>\n\n\n\n<ul style=\"font-size:16px\"><li><a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/pdf\/1912.05656.pdf\" target=\"_blank\">VIBE<\/a>\u30fb<a rel=\"noreferrer noopener\" href=\"https:\/\/www.notion.so\/Generative-Adversarial-Networks-0692b1ea34e641a0ae011237345a51c4\" target=\"_blank\">Notes on GAN<\/a><\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/machinelearningmastery.com\/generative-adversarial-network-loss-functions\/\" target=\"_blank\">GAN Loss Function<\/a> <\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/dl4physicalsciences.github.io\/files\/nips_dlps_2017_slides_louppe.pdf\" target=\"_blank\">More of GAN<\/a>\u30fb<a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/pdf\/1812.07035.pdf\" target=\"_blank\">Angle to 6D Notatio<\/a><\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/pdf\/1712.06584.pdf\" target=\"_blank\">Iterative Regression with 3D Feedback<\/a> <\/li><li><a rel=\"noreferrer noopener\" href=\"https:\/\/web.stanford.edu\/class\/cs231a\/course_notes\/01-camera-models.pdf\" target=\"_blank\">Camera Weak Perspective<\/a><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>\u672c\u8cc7\u6599\u306f2021\u5e747\u670808\u65e5\u306b\u793e\u5185\u5171\u6709\u8cc7\u6599\u3068\u3057\u3066\u5c55\u958b\u3057\u3066\u3044\u305f\u3082\u306e\u3092WEB\u30da\u30fc\u30b8\u5411\u3051\u306b\u30ea\u30cb\u30e5\u30fc\u30a2\u30eb\u3057\u305f\u5185\u5bb9\u306b\u306a\u308a\u307e\u3059\u3002 \u25a0Contents Introduction Problem to Solve Dataset VIB &#8230; <\/p>\n","protected":false},"author":3,"featured_media":330,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[32,20,33,38,35,37,24,36],"_links":{"self":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts\/325"}],"collection":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=325"}],"version-history":[{"count":13,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts\/325\/revisions"}],"predecessor-version":[{"id":720,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/posts\/325\/revisions\/720"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=\/wp\/v2\/media\/330"}],"wp:attachment":[{"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=325"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=325"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/arithmer.blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=325"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}