VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection


¹KAIST    ²NAVER LABS
VERIA motivation figure
Motivation for VERIA. (a) Driving datasets exhibit long-tail distributions, limiting 3D perception performance. (b) LiDAR point returns grow sparser with range, amplifying intra-class geometric variation. (c) Existing methods operate in the LiDAR domain and place objects without scene context, constraining diversity to curated asset libraries. (d) VERIA synthesizes objects conditioned on RGB context using foundation models, supporting subclass-level diversity with synchronized pseudo-LiDAR.

Abstract

Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB-LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings.
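As a rough illustration of the stage-wise yield decomposition mentioned above, the sketch below assumes that the diagnostic is derived from per-stage survivor counts recorded in pipeline logs. The stage names, the `stage_yields` helper, and the counts are hypothetical placeholders rather than values or code from the paper.

```python
def stage_yields(counts):
    """Compute the per-stage yield from ordered (stage_name, n_surviving) pairs.

    The first entry is the number of candidates entering the pipeline;
    each later entry is the number that survived that stage.
    """
    rows = []
    for (_, n_in), (name, n_out) in zip(counts, counts[1:]):
        # Yield of a stage = survivors / inputs (0.0 if nothing entered).
        rows.append((name, n_out / n_in if n_in else 0.0))
    return rows


def overall_yield(counts):
    """End-to-end yield; equals the product of the per-stage yields."""
    first, last = counts[0][1], counts[-1][1]
    return last / first if first else 0.0


# Hypothetical counts, as they might be parsed from pipeline logs.
logged_counts = [
    ("synthesized", 1000),
    ("semantic_verified", 600),
    ("geometric_verified", 450),
]

for name, y in stage_yields(logged_counts):
    print(f"{name}: {y:.2%}")
print(f"overall: {overall_yield(logged_counts):.2%}")
```

Reading the yields stage by stage shows where candidates are lost: a low semantic yield would point at the synthesis step, while a low geometric yield would point at the pseudo-LiDAR reconstruction.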

Method

VERIA method overview figure
Overview of VERIA. (a) Given a target category $\mathcal{C}$, a VLM generates a subclass-level description $\mathcal{T}_c$ and physical size priors; a 3D bounding box is sampled and projected to define the inpainting region for RGB-context-conditioned synthesis. Semantic verification retains candidates that pass category correctness, scene-level plausibility, and artifact severity checks. (b) Verified RGB instances are converted to synchronized pseudo-LiDAR via segmentation, depth estimation, and spherical projection. Geometric verification further filters implausible reconstructions, yielding verified RGB-LiDAR pairs for downstream training.
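The two geometric steps in the overview can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, a camera frame with x right, y down, and z forward, and a simple angular binning that keeps the nearest point per bin as a stand-in for the spherical projection. All function names and parameters are hypothetical.

```python
import numpy as np


def project_box_to_region(center, size, yaw, K, image_wh):
    """Project a 3D box (camera frame) to a 2D region (u_min, v_min, u_max, v_max).

    size = extents along the box's local x, y, z axes; yaw rotates about
    the camera y (down) axis. Returns None if the box reaches behind the camera.
    """
    signs = np.array([[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)])
    local = signs * np.asarray(size, dtype=float) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    corners = local @ rot_y.T + np.asarray(center, dtype=float)
    if np.any(corners[:, 2] <= 0):
        return None
    uvw = corners @ np.asarray(K, dtype=float).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    w, h = image_wh
    # Clip the projected extent to the image bounds.
    u_min, v_min = np.clip(uv.min(axis=0), [0, 0], [w, h])
    u_max, v_max = np.clip(uv.max(axis=0), [0, 0], [w, h])
    return float(u_min), float(v_min), float(u_max), float(v_max)


def depth_to_pseudo_lidar(depth, mask, K, az_res_deg=0.2, el_res_deg=1.0):
    """Back-project masked depth pixels and keep the nearest point per angular bin."""
    K = np.asarray(K, dtype=float)
    v, u = np.nonzero(np.asarray(mask, dtype=bool) & (depth > 0))
    d = depth[v, u].astype(float)
    if d.size == 0:
        return np.empty((0, 3))
    x = (u - K[0, 2]) * d / K[0, 0]
    y = (v - K[1, 2]) * d / K[1, 1]
    pts = np.stack([x, y, d], axis=1)
    rng = np.linalg.norm(pts, axis=1)
    az = np.arctan2(x, d)        # azimuth around the vertical axis
    el = np.arcsin(-y / rng)     # elevation (y points down)
    az_bin = np.floor(az / np.deg2rad(az_res_deg)).astype(np.int64)
    el_bin = np.floor(el / np.deg2rad(el_res_deg)).astype(np.int64)
    order = np.argsort(rng)      # nearest points first
    keys = np.stack([el_bin[order], az_bin[order]], axis=1)
    # First occurrence of each (elevation, azimuth) bin is its nearest point.
    _, first = np.unique(keys, axis=0, return_index=True)
    return pts[order[first]]
```

The angular resolutions are placeholder values; in practice they would be chosen to match the beam pattern of the target sensor, and the resulting points would still need to be transformed from the camera frame into the LiDAR frame.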

BibTeX

@misc{lee2026veriaverificationcentricmultimodalinstance,
  title         = {VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection},
  author        = {Jumin Lee and Siyeong Lee and Namil Kim and Sungeui Yoon},
  year          = {2026},
  eprint        = {2603.24294},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {http://arxiv.org/abs/2603.24294}
}