UMFuse: Unified Multi View Fusion for Human Editing applications

Accepted at ICCV, October 2023

Posted by Rishabh Jain on August 13, 2023 · 4 mins read
Teaser diagram: To remedy the shortcomings of single-view person image generation, we propose merging information from multiple source images. In Multi-View Human Reposing, given multiple images of a person, we generate a new image in the desired target pose. In Mix & Match Human Image Generation, we model all of the images in a tuple (id, top, bottom, pose) jointly to synthesise the final output.

Abstract

Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined in cases when the target pose differs significantly from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the utilization of multiple views to minimize the issue of missing information and generate an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes the pose keypoints and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. Thereafter, the encodings from a separate network (trained on a single-view human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks: Multi-View Human Reposing and Mix & Match Human Image Generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a better alternative.
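As a concrete illustration of the fusion idea described above, the PyTorch sketch below predicts a per-pixel appearance retrieval map over K source views with a small convolutional head and uses it to take an affine (softmax-weighted) combination of the per-view feature maps. The module name, layer sizes, and interface here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewFusionSketch(nn.Module):
    """Toy fusion head: predicts a per-pixel appearance retrieval map over
    K source views and uses it to form an affine (convex) combination of
    the per-view feature maps. Sizes are placeholders, not paper values."""

    def __init__(self, in_channels: int, num_views: int = 3, hidden: int = 64):
        super().__init__()
        self.num_views = num_views
        # Small convolutional head that maps the concatenated per-view
        # features to one score channel per view.
        self.score_head = nn.Sequential(
            nn.Conv2d(in_channels * num_views, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_views, kernel_size=3, padding=1),
        )

    def forward(self, view_feats: torch.Tensor):
        # view_feats: (B, K, C, H, W) -- one feature map per source view.
        b, k, c, h, w = view_feats.shape
        scores = self.score_head(view_feats.reshape(b, k * c, h, w))      # (B, K, H, W)
        retrieval_map = F.softmax(scores, dim=1)                          # per-pixel weights sum to 1
        fused = (retrieval_map.unsqueeze(2) * view_feats).sum(dim=1)      # (B, C, H, W)
        return fused, retrieval_map
```

For example, `MultiViewFusionSketch(in_channels=256, num_views=3)` applied to a `(2, 3, 256, 64, 64)` tensor returns a fused `(2, 256, 64, 64)` map plus a `(2, 3, 64, 64)` retrieval map. Because the weights are a softmax over views, the fused features stay a convex combination at every pixel, which is what makes the maps interpretable as per-pixel view attributions (the ArMaps shown under Results).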

Network architecture

UMFuse framework overview: (A) is the single-view PHIG network on top of which the UMFuse framework operates. The source image and its keypoints, along with the target keypoints, are used to produce warped images and a visibility map (I_wv, I_wi, V_t). These are used to obtain the texture encodings at different scales l (e_t,l), and the source and target poses are used to obtain the pose encoding e_p. Together they are used to render the output with a GAN-based renderer. (B) shows the UMFuse adaptation of the single-view network. The multiple source images and poses are passed individually to the warping and visibility prediction module to obtain multiple warped images and visibility maps, and from those, multiple texture-encoding vectors. Likewise, the source poses paired with the target pose are used to obtain multiple pose encoding vectors. These are merged in an affine combination using the predicted Appearance Retrieval Maps (s_1–s_3). These maps are obtained using the Multi-view Fusion module and are a key contribution of the UMFuse framework.
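The sketch below wires the pieces from panel (B) together end to end: each source view is warped and encoded independently, the appearance retrieval maps are predicted once, and the texture and pose encodings are fused as affine combinations before rendering. All module interfaces (`warp`, `texture_encoder`, `pose_encoder`, `fusion`, `renderer`) and the pooled-weight fusion of the pose vectors are assumptions made for illustration, not the paper's exact components.

```python
import torch
import torch.nn.functional as F
from torch import nn


class UMFuseForwardSketch(nn.Module):
    """Schematic forward pass following panel (B). The single-view modules and
    the multi-view fusion module are caller-supplied stand-ins; the interfaces
    documented below are assumptions, not the paper's exact APIs."""

    def __init__(self, warp, texture_encoder, pose_encoder, fusion, renderer):
        super().__init__()
        self.warp = warp                        # (I_src, P_src, P_tgt) -> (warped image, visibility map)
        self.texture_encoder = texture_encoder  # (warped, visibility) -> [e_t,l for each scale l]
        self.pose_encoder = pose_encoder        # (P_src, P_tgt) -> e_p, shape (B, D)
        self.fusion = fusion                    # stacked warped views -> softmax maps s over views
        self.renderer = renderer                # (fused texture encodings, fused pose encoding) -> image

    def forward(self, src_images, src_poses, tgt_pose):
        tex_encodings, pose_encodings, warped = [], [], []
        for img, pose in zip(src_images, src_poses):
            w_img, vis = self.warp(img, pose, tgt_pose)
            warped.append(w_img)
            tex_encodings.append(self.texture_encoder(w_img, vis))
            pose_encodings.append(self.pose_encoder(pose, tgt_pose))

        # Appearance retrieval maps: per-pixel weights over the K source views.
        s = self.fusion(torch.stack(warped, dim=1))                           # (B, K, H, W)

        # Affine combination of the per-view texture encodings at each scale,
        # resizing the maps to each scale's spatial resolution.
        fused_tex = []
        for l in range(len(tex_encodings[0])):
            feats_l = torch.stack([enc[l] for enc in tex_encodings], dim=1)   # (B, K, C_l, H_l, W_l)
            s_l = F.interpolate(s, size=feats_l.shape[-2:], mode="bilinear",
                                align_corners=False)                          # (B, K, H_l, W_l)
            fused_tex.append((s_l.unsqueeze(2) * feats_l).sum(dim=1))

        # Pose encodings fused with spatially pooled weights (a simplifying
        # assumption made here to keep the sketch short).
        s_pooled = s.mean(dim=(-2, -1))                                       # (B, K)
        pose_stack = torch.stack(pose_encodings, dim=1)                       # (B, K, D)
        fused_pose = (s_pooled.unsqueeze(-1) * pose_stack).sum(dim=1)         # (B, D)

        return self.renderer(fused_tex, fused_pose)
```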

Results

Incremental improvements gained by the network as source views are added: Gen1 is generated using only View1 (repeated thrice); Gen2 is generated using the View1 and View2 pair, with ArMap2 showing the attribution map for the two views in red and green respectively; Gen3 is obtained using all three views, with ArMap3 encoding the per-view weights as RGB channels.

Qualitative results for Mix & Match Human Image Generation showing realistic generation quality while preserving body shape (a), modelling complex poses (b), complicated garment designs and textures (c), heavy occlusions (d), missing information (e), and multiple clothing garments with accessories (f). Additional results can be found in the supplemental material.

Resources

Citation

@inproceedings{jain2023umfuse,
  title={UMFuse: Unified Multi View Fusion for Human Editing applications},
  author={Jain, Rishabh and Hemani, Mayur and Ceylan, Duygu and Singh, Krishna Kumar and Lu, Jingwan and Sarkar, Mausoom and Krishnamurthy, Balaji},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7182--7191},
  year={2023}
}