We proposed a monocular image disentanglement framework based on a compositional model. Our model disentangles the input image into its constituent components: albedo, depth, deformation, pose, and illumination. Instead of relying on handcrafted priors, we trained our deep neural network to capture the physical meaning of each component by mimicking real-world image formation, allowing it to reconstruct images in a self-supervised manner. Trained on multi-frame images of each subject, our model demonstrates a better understanding of objects without requiring any supervision or strong model assumptions. We introduced a deformation-free canonical space that aligns multi-frame images, enabling information from multiple frames to be aggregated consistently. Our experiments showed that our approach accurately disentangles the physical elements of deformable faces from in-the-wild images with wide variations.
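To make the compositional idea concrete, the sketch below illustrates one common way such a model recomposes an image: shade an albedo map with Lambertian lighting derived from depth, in the canonical (deformation-free) space. This is a minimal illustration of the image-formation principle only, not the paper's actual network; the function names, the finite-difference normal estimate, and the single directional light are assumptions, and the deformation/pose warp back to the observed view is omitted.

```python
import numpy as np

def normals_from_depth(depth):
    # Approximate surface normals from a depth map via finite differences.
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def lambertian_shading(normals, light_dir, ambient=0.3, diffuse=0.7):
    # Per-pixel Lambertian shading from one directional light source.
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    return ambient + diffuse * np.clip(normals @ l, 0.0, None)

def compose_canonical(albedo, depth, light_dir):
    # Canonical-space image: albedo modulated by depth-derived shading.
    # A full pipeline would then warp this by deformation and pose.
    shading = lambertian_shading(normals_from_depth(depth), light_dir)
    return albedo * shading[..., None]
```

Because each factor enters the reconstruction through an explicit physical operation like this, the reconstruction loss alone can push the network to assign each output its intended meaning.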