Applying Self-Supervised Learning to Image Segmentation

This article is a web-friendly revision of material originally shared internally on July 2, 2020.

■Agenda

Introduction

  • Image Segmentation
  • Problems
  • Self-supervised learning
  • Content of this presentation

Survey

  • Pseudo Labeling
  • Class Activation Map
  • Image Depth Information
  • How about using videos

Some comments

■Image Segmentation

■Problems

  • very hard to annotate (= expensive)
    • pixel-wise annotation
    • many different types of objects
    • boundaries are often blurry
  • easy to make mistakes
    • lots of unclear cases
    • need to stay focused for a long time

How can we change this situation?

■Self-supervised learning

  • Research topic pushed by the “godfathers of AI”, especially Yann LeCun
  • Pre-train a feature extractor in an unsupervised way, then train the classifier with annotated data
  • Recent papers reach SOTA with only 10% of the labels (and are starting to go beyond it)
  • Methods originally coming from NLP

■Content of this presentation

  • 4 very different directions
    • Using pseudo-labelling
    • Using activation maps (like Grad-CAM)
    • Using depth information
    • Leveraging frames in videos

Spoiler: there is no truly convincing self-supervised semantic segmentation method yet…

  • High level explanations
  • For more details, I provided notes for some papers
  • For even more details, please read the original papers

■Pseudo Labeling

  • Usually use the softmax score to separate “strong predictions” from “weak predictions”
  • Keep only “strong predictions” as pseudo-labels
  • Popular research topic

Issue

  • The softmax maximum is not the best criterion, since it only considers the top score and ignores the other class scores (which might also be informative)
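The difference between the two criteria can be sketched in NumPy. This is a minimal illustration, not code from any of the surveyed papers; the threshold values are illustrative and would be tuned in practice:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_pseudo_labels(logits, conf_thresh=0.9, entropy_thresh=0.5):
    """Per-pixel pseudo-labels plus two alternative selection masks.

    logits: (H, W, C) raw class scores per pixel.
    conf_thresh / entropy_thresh are illustrative values.
    """
    probs = softmax(logits)                      # (H, W, C)
    labels = probs.argmax(axis=-1)               # hard pseudo-labels
    # Softmax criterion: keep pixels whose top score is high.
    conf_mask = probs.max(axis=-1) >= conf_thresh
    # Entropy criterion: keep pixels whose whole distribution is peaked,
    # using all class scores, not only the maximum.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    entropy /= np.log(probs.shape[-1])           # normalize to [0, 1]
    ent_mask = entropy <= entropy_thresh
    return labels, conf_mask, ent_mask
```

A pixel with a peaked distribution passes both tests; a pixel whose scores are nearly uniform is rejected by both, but the entropy criterion also distinguishes cases where the top score is high yet the remaining mass is concentrated on one competing class.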

■Pseudo Labeling: Entropy-guided

My notes
https://arithmer.co.jp/wp-content/uploads/pdf/notes_ESL_Entropy-guided_Self-supervised_Learning_for_Domain_Adaptation_in_Semantic_Segmentation.pdf
Original paper
https://arxiv.org/abs/2006.08658v1
Repo
https://github.com/liyunsheng13/BDL

Model pre-trained on the GTA5 dataset, method applied to Cityscapes:
it consistently improves the final accuracy, but by less than 1% mIoU. State of the art for fully supervised learning: 85.1% (IT = image translation).

■Class Activation Map

  • Uses the gradient information flowing into the neurons of a specific CNN layer to identify regions of activation
  • A step forward for interpretability
  • Work from 2017 (latest version from last December)

original paper: https://arxiv.org/abs/1610.02391
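The core of Grad-CAM is small enough to sketch directly. The NumPy version below assumes the layer activations and the gradient of the class score with respect to them have already been extracted from a deep-learning framework (e.g. via hooks):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.

    activations: (K, H, W) feature maps of the chosen layer.
    gradients:   (K, H, W) gradient of the class score w.r.t. them.
    """
    # Channel weights alpha_k: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))                       # (K,)
    # Weighted sum of feature maps, ReLU to keep positive evidence only.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for visualization.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting low-resolution heatmap is then upsampled to the input size, which is why CAM-based masks are coarse and need refinement (the motivation for the PCM module below).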

■Class Activation Map: Equivariant Attention Mechanism

  1. Needs image-level annotations (class labels)
  2. Combines 3 loss functions (ECR, ER and cls)
    • a. Both CAM outputs should be similar despite the affine transform (ER)
    • b. But the CAM degenerates (converges to a trivial solution), so ECR regularizes the PCM outputs with the original CAM
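The equivariance idea in (a) can be sketched as a consistency loss: the CAM of a transformed image should match the transformed CAM of the original image. Here `cam_fn` is a hypothetical stand-in for the network's CAM branch, and the transform can be any spatial (affine) op:

```python
import numpy as np

def equivariant_regularization(cam_fn, image, transform):
    """ER-style consistency sketch.

    cam_fn:    image -> CAM heatmap (stand-in for the model).
    transform: spatial transform applied to both image and CAM.
    """
    cam_of_transformed = cam_fn(transform(image))
    transformed_cam = transform(cam_fn(image))
    # L1 consistency between the two heatmaps.
    return np.abs(cam_of_transformed - transformed_cam).mean()
```

A perfectly equivariant model gives zero loss for any transform; a model whose output depends on absolute pixel position is penalized. Point (b) exists because a constant (trivial) CAM also satisfies this loss, hence the extra ECR term.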

My notes
https://arithmer.co.jp/wp-content/uploads/pdf/notes_Self-supervised_Equivariant_Attention_Mechanism_for_Weakly_Supervised_Semantic_Segmentation.pdf
Original paper
https://arxiv.org/abs/2004.04581
Repo
https://github.com/YudeWang/SEAM

Main contribution: PCM module

  • Uses attention to refine the mask from Grad-CAM
  • Attention can capture contextual information

■Class Activation Map: Experiments

■Image Depth Information: HN labels

  • HN-label generation
    • Compute surface-normal angles and heights relative to the floor plane using RGB-D
    • Everything is binned to create labels
  • Train a segmentation model on the HN labels
  • Fine-tune on the real dataset

The point is mainly to show: pretraining on HN labels >> pretraining on ImageNet
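The binning step can be sketched as follows. The bin counts and height range here are illustrative, not the paper's exact discretization, and the height and normal-angle maps are assumed to be precomputed from the RGB-D input:

```python
import numpy as np

def hn_labels(height, angle, n_height_bins=5, n_angle_bins=4,
              max_height=3.0):
    """Sketch of HN-label generation (bin counts are illustrative).

    height: (H, W) per-pixel height above the estimated floor plane (m).
    angle:  (H, W) angle between surface normal and gravity (radians).
    Returns one integer label per pixel combining both bins.
    """
    h_bin = np.clip((height / max_height * n_height_bins).astype(int),
                    0, n_height_bins - 1)
    a_bin = np.clip((angle / np.pi * n_angle_bins).astype(int),
                    0, n_angle_bins - 1)
    # Combine into a single categorical label for pretraining a segmenter.
    return h_bin * n_angle_bins + a_bin
```

The labels are free (they come from depth geometry, not annotators), so a segmentation network can be pretrained on them with an ordinary cross-entropy loss before fine-tuning on real labels.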

■Image Depth Information: Experiments

Dataset: NYUv2, trained for only 50 epochs. The HN labels are generated from NYUv2. ImageNet is 25x bigger.

■How about using videos: Self-supervised Video Object Segmentation

  • Learns, without supervision, to track similar pixels across frames
  • Then, given an initial mask (in the illustration, in frame 0), the model can infer the same object in the following frames

Original paper: https://arxiv.org/abs/2006.12480v1
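The propagation step can be sketched as a softmax affinity between pixel features of consecutive frames. In the paper the correspondence features are learned self-supervised; in this sketch they are given, and the shapes/temperature are illustrative:

```python
import numpy as np

def propagate_mask(feat_prev, feat_next, mask_prev, temperature=0.1):
    """Propagate a soft segmentation mask to the next frame.

    feat_prev, feat_next: (N, D) L2-normalized pixel features per frame.
    mask_prev: (N,) soft foreground mask for the previous frame.
    """
    # Pairwise similarities between next-frame and previous-frame pixels.
    sim = feat_next @ feat_prev.T / temperature        # (N, N)
    sim -= sim.max(axis=1, keepdims=True)              # numerical stability
    aff = np.exp(sim)
    aff /= aff.sum(axis=1, keepdims=True)              # rows sum to 1
    # Each next-frame pixel copies the mask of the pixels it matches.
    return aff @ mask_prev                              # (N,)
```

Applied frame by frame, this turns a single annotated mask in frame 0 into masks for the whole clip; the quality then depends entirely on how good the learned pixel correspondences are.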

■Download

自己教師あり学習の画像セグメンテーションへの適用.pdf