UniPercept: A Unified Diffusion Model for Generalizable Visual Perception

1State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China
2University of Chinese Academy of Sciences (CAS), China

Overview of visual perception tasks supported by UniPercept, including structural, semantic, and material/lighting properties.

Abstract

Diffusion models have shown impressive performance in generative tasks, demonstrating their ability to capture detailed structural and semantic information. Recently, these capabilities have been extended to visual understanding, with studies employing diffusion models as the backbone for various perception tasks. However, existing diffusion-based perception models are generally restricted to a single task or a fixed set of predefined tasks, lacking an efficient mechanism to generalize to novel tasks. To overcome this limitation, we propose a unified DiT-based perception framework called UniPercept, which introduces a novel foundation-adapter paradigm for general visual perception. In this framework, a shared diffusion-based foundation model is trained to capture common and generalizable visual knowledge across diverse perception tasks, with task-specific adapters integrated for each individual task. Leveraging its superior generalization ability, the foundation model can be efficiently adapted to novel domains through lightweight adapters, requiring as few as 1,000 training samples and less than 1% of trainable parameters. Furthermore, UniPercept demonstrates strong performance across various perception tasks, outperforming state-of-the-art generalist models in most cases and achieving accuracy comparable to specialist models.
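The abstract's foundation-adapter paradigm (a shared frozen foundation model plus per-task adapters with under 1% of the parameters trainable) can be illustrated with a minimal sketch. The low-rank adapter form, the layer sizes, and all names below are illustrative assumptions, not UniPercept's actual architecture:

```python
import numpy as np

# Hypothetical sketch of the foundation-adapter idea: a frozen "foundation"
# weight shared across tasks, plus a small low-rank task-specific adapter
# (LoRA-style). This is an assumption for illustration, not the paper's design.
rng = np.random.default_rng(0)
d, rank = 1024, 4

W = rng.standard_normal((d, d)) * 0.02     # frozen foundation weight (shared)
A = np.zeros((rank, d))                    # trainable adapter: down-projection
B = rng.standard_normal((d, rank)) * 0.02  # trainable adapter: up-projection

def forward(x):
    """Foundation output plus the lightweight task-specific correction."""
    return W @ x + B @ (A @ x)

# Only A and B would be updated when adapting to a new task; W stays fixed.
# With A initialized to zeros, the adapted model starts identical to the
# foundation model.
trainable = A.size + B.size
total = W.size + trainable
print(f"trainable fraction: {trainable / total:.3%}")  # under 1% at rank 4
```

At rank 4 the adapter holds well under 1% of the parameters, consistent with the efficiency regime the abstract describes; the actual adapter design in the paper may differ.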

Framework of UniPercept

Results

Image-to-Depth

Image-to-Normal

Image-to-Albedo

Image-to-Irradiance

Image-to-Metallicity

Image-to-Roughness

Image-to-Semantic Segmentation

Image-to-Dichotomous Segmentation

Image-to-Line Art

Image-to-Edge Detection

Image-to-Human Skeleton

Image-to-DensePose

Image-to-Animal Pose

Image-to-Line Segment

BibTeX

@inproceedings{zhao2026unipercept,
  title={UniPercept: A Unified Diffusion Model for Generalizable Visual Perception},
  author={Zhao, Zuyan and He, Zhenliang and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}