Fine-grained Background Representation for Weakly Supervised Semantic Segmentation

Korea Advanced Institute of Science and Technology (KAIST)
IEEE TCSVT 2024

Abstract

Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task due to the lack of spatial information. Prevalent class activation map (CAM)-based solutions struggle to discriminate foreground (FG) objects from suspicious background (BG) pixels (a.k.a. co-occurring background) and to learn integral object regions. This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics and address the co-occurrence problem. Rather than representing the BG with a class prototype or with pixel-level features, we develop a novel primitive, the negative region of interest (NROI), to capture fine-grained BG semantic information, and we conduct pixel-to-NROI contrast to distinguish confusing BG pixels. We also present an active sampling strategy that mines FG negatives on the fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning that activates the entire object region. Owing to its simple design and ease of use, our method can be seamlessly plugged into various models, yielding new state-of-the-art results under various WSSS settings across benchmarks. Using only image-level (I) labels as supervision, our method achieves 73.2 mIoU and 45.6 mIoU on the PASCAL VOC and MS COCO test sets, respectively. Furthermore, by incorporating saliency maps as an additional supervision signal (I+S), we attain 74.9 mIoU on the PASCAL VOC test set. Our FBR approach also yields meaningful performance gains in weakly supervised instance segmentation (WSIS), demonstrating its robustness and strong generalization across diverse domains.
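The mIoU figures above follow the standard segmentation metric: for each class, IoU = TP / (TP + FP + FN), and mIoU is the mean over classes, usually reported as a percentage. The minimal pure-Python sketch below assumes flattened label maps and an ignore label of 255 (the void label used in PASCAL VOC annotations); the function name mean_iou is illustrative and not taken from the authors' code.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], gt: Sequence[int], num_classes: int,
             ignore_index: int = 255) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred and gt are flattened label maps of equal length; predicted labels are
    assumed to lie in [0, num_classes).
    """
    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class
    for p, g in zip(pred, gt):
        if g == ignore_index:
            continue  # skip unlabeled (void) pixels
        if p == g:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # classes absent from both maps are left out of the mean
            ious.append(tp[c] / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Two classes: IoU(0) = 1/2, IoU(1) = 2/3, so mIoU = 7/12.
print(round(100 * mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 1))  # prints 58.3
```

Benchmark evaluations accumulate these counts over the whole test set before averaging, which this function does if it is given the concatenation of all flattened label maps.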

What is NROI?

To conduct fore-to-background (FB) contrastive learning, the common strategy (a) exhaustively compares background pixel features (triangles) with foreground queries (the red part), which is computationally expensive and susceptible to implausible (noisy) labels. In this study, we propose recognizing fine-grained BG semantics, termed NROIs, and implementing FB contrastive learning by comparing queries (the red rectangle) against NROIs.
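The difference between the two strategies can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' implementation: it assumes that each NROI embedding is the mean of the background pixel features assigned to one region (region_ids are given as input), that similarity is cosine similarity, and that the loss is an InfoNCE-style objective with a given positive key for the foreground query. The names build_nrois and pixel_to_nroi_loss and the temperature tau are assumptions.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)  # epsilon guards against zero vectors


def build_nrois(bg_features: List[Vector], region_ids: List[int]) -> List[List[float]]:
    """Hypothetical NROI construction: average BG pixel features per region."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feat, rid in zip(bg_features, region_ids):
        if rid not in sums:
            sums[rid] = [0.0] * len(feat)
            counts[rid] = 0
        sums[rid] = [s + f for s, f in zip(sums[rid], feat)]
        counts[rid] += 1
    # One embedding per region, ordered by region id.
    return [[s / counts[rid] for s in sums[rid]] for rid in sorted(sums)]


def pixel_to_nroi_loss(query: Vector, positive: Vector, nrois: List[Vector],
                       tau: float = 0.1) -> float:
    """InfoNCE: pull the FG query toward its positive, push it away from every NROI."""
    logits = [cosine(query, positive) / tau] + [cosine(query, n) / tau for n in nrois]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - logits[0]  # equals -log softmax(logits)[0]


fg_query = [1.0, 0.0]
fg_positive = [0.9, 0.1]
nrois = build_nrois([[0.0, 1.0], [0.1, 0.9], [-1.0, 0.2]], region_ids=[0, 0, 1])
print(pixel_to_nroi_loss(fg_query, fg_positive, nrois))
```

Each query is compared against one embedding per region, so the number of negatives grows with the number of NROIs rather than with the number of background pixels; this is the computational saving the caption contrasts with the exhaustive strategy (a).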

Architecture Overview

We present a simple FBR method that optimizes two contrastive relationships, (1) fore-to-background and (2) intra-foreground, to address weakly supervised semantic segmentation. The core contribution is a fine-grained background primitive, dubbed NROI, which effectively represents the image background and enables fore-to-background contrastive learning, enhancing the ability of class activation maps to distinguish co-occurring background cues. We also introduce an active sampling method that efficiently selects foreground negatives for intra-foreground contrastive learning, which activates integral object regions.
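The intra-foreground branch can be sketched in the same hedged way. Here "active sampling" is approximated by plain hard-negative mining: for a foreground query, the k most similar foreground features that belong to other classes are chosen as negatives. This selection rule, the function names sample_hard_fg_negatives and intra_fg_loss, and the parameters k and tau are illustrative assumptions; the paper's actual sampling strategy may differ.

```python
import heapq
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def sample_hard_fg_negatives(query: Vector, query_label: int,
                             fg_features: List[Vector], fg_labels: List[int],
                             k: int) -> List[Vector]:
    """Pick the k other-class foreground features most similar to the query."""
    candidates = [(cosine(query, f), i)
                  for i, (f, lab) in enumerate(zip(fg_features, fg_labels))
                  if lab != query_label]
    # nlargest returns candidates sorted from most to least similar.
    return [fg_features[i] for _, i in heapq.nlargest(k, candidates)]


def intra_fg_loss(query: Vector, positive: Vector, negatives: List[Vector],
                  tau: float = 0.1) -> float:
    """InfoNCE between a query, one same-class positive, and the sampled negatives."""
    logits = [cosine(query, positive) / tau] + [cosine(query, n) / tau for n in negatives]
    m = max(logits)  # stable log-sum-exp
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[0]


feats = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [0.8, 0.6]]
labels = [0, 1, 1, 1]
negs = sample_hard_fg_negatives([1.0, 0.0], 0, feats, labels, k=2)
print(negs)  # prints [[1.0, 0.05], [0.8, 0.6]]
print(intra_fg_loss([1.0, 0.0], feats[0], negs))
```

Restricting negatives to the most similar other-class pixels keeps the per-query cost at k comparisons while concentrating the loss on the confusions that are hardest to separate.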

Qualitative Results

Quantitative Results


BibTeX

@article{yin2024fine,
  title={Fine-grained Background Representation for Weakly Supervised Semantic Segmentation},
  author={Yin, Xu and Im, Woobin and Min, Dongbo and Huo, Yuchi and Pan, Fei and Yoon, Sung-Eui},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2024},
  publisher={IEEE}
}