iMontage

Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Zhoujie Fu1,2, Xianfang Zeng2, Jinghong Lan2, Xinyao Liao1,2,
Cheng Chen1, Junyi Chen3, Jiacheng Wei1, Wei Cheng2, Shiyu Liu2,
Yunuo Chen2,3, Gang Yu†,2, Guosheng Lin†,1

1Nanyang Technological University, 2StepFun, 3Shanghai Jiao Tong University
†Corresponding Author

Demo Video

Discover the power of iMontage through our comprehensive demonstration video showcasing unified, versatile, and highly dynamic many-to-many image generation capabilities.

Method Overview

Overview of iMontage. The model accepts a flexible set of reference images and produces N outputs conditioned on a text prompt. Images are encoded separately by a 3D VAE, text by a language model, and both token streams are processed by an MMDiT. Clean reference-image tokens are concatenated with noisy target tokens before denoising. Right: training uses fixed-length text tokens and variable-length image/noise tokens, and the architecture transitions from dual-stream to single-stream blocks. For the image branch, we apply Marginal RoPE, a head-tail temporal indexing scheme that separates input and output pseudo-frames, preserves spatial RoPE, and supports many-to-many generation. In the figure, H and W with subscripts denote the height/width indices of the 2D RoPE computed at each image's native resolution, while T denotes the temporal index assigned to each pseudo-frame.
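The head-tail temporal indexing behind Marginal RoPE can be illustrated with a minimal sketch. This is not the released implementation: the function names, the patch size, and the fixed temporal range `t_max` are assumptions made purely for illustration. The idea is that reference (input) pseudo-frames take temporal indices counted from the head of the range, target (output) pseudo-frames take indices counted back from the tail, and the spatial 2D RoPE grid is built at each image's native resolution.

```python
# Minimal sketch of head-tail temporal indexing (Marginal RoPE), assuming a
# fixed maximum temporal range `t_max` and patchified image tokens.
# All names here are illustrative, not from the official code.
import torch


def _grid(t, gh, gw):
    """(t, h, w) indices for one pseudo-frame of gh x gw patch tokens."""
    hh, ww = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    tt = torch.full_like(hh, t)
    return torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3)


def marginal_rope_positions(ref_sizes, tgt_sizes, patch=16, t_max=1024):
    """Build RoPE position indices for reference and target image tokens.

    Reference images get temporal indices 0, 1, ... (the head);
    target images get t_max - N, ..., t_max - 1 (the tail), so inputs and
    outputs stay separated no matter how many of each are provided.
    Spatial indices follow the usual 2D grid at native resolution.
    """
    positions = []
    # Head: clean reference-image tokens.
    for t, (h, w) in enumerate(ref_sizes):
        positions.append(_grid(t, h // patch, w // patch))
    # Tail: noisy target tokens, counted back from t_max.
    n_tgt = len(tgt_sizes)
    for i, (h, w) in enumerate(tgt_sizes):
        positions.append(_grid(t_max - n_tgt + i, h // patch, w // patch))
    return torch.cat(positions, dim=0)  # (num_tokens, 3) = (T, H, W) indices


# Example: two reference images and three generated targets.
pos = marginal_rope_positions(ref_sizes=[(512, 512), (512, 768)],
                              tgt_sizes=[(512, 512)] * 3)
```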

iMontage Pipeline

Image Editing

Hover over an image to explore iMontage's editing capabilities across different tasks. (Loading may take a moment.)

In-context Generation

iMontage can synthesize high-quality images that preserve the identity and style of multiple reference images.

In-context + Vision Signal

Combining reference images with vision signals (depth maps, OpenPose skeletons, and Canny edges) for controlled generation.

Reference & Depth Signal

Example panels: reference image, depth signal, and the generated result.

Style Reference Generation

Seamlessly transfer artistic styles to your content while preserving structure.

World Exploration

Navigate through scenes by changing perspectives via text descriptions.

Storyboard Generation

Generate consistent storyboards from character references with specific narrative descriptions.