Learning to Disentangle Latent Physical Factors of Deformable Faces – SGVR Lab, Scalable Graphics, Vision, Robotics Lab. (카이스트, 대용량 컴퓨터 그래픽스, 비젼, 로보틱스 연구실)

Abstract

We proposed a monocular image disentanglement framework based on a compositional model. Our model disentangles the input image into its constituent components of albedo, depth, deformation, pose, and illumination. Instead of relying on any handcrafted priors, we trained our deep neural network to understand the physical meaning of each element by mimicking real-world operations, allowing it to reconstruct images in a self-supervised manner. Our model, trained on multi-frame images of each subject, demonstrates a better understanding of the objects without requiring any supervision or strong model assumptions. We utilized a deformation-free canonical space to align multi-frame images in the same space. This approach enables the understanding of information from multi-frame images in the same space. Our experiments showed that our approach accurately disentangled the physical elements of deformable faces from images with wide variations found in the wild.