Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Discover the power of iMontage through our comprehensive demonstration video showcasing unified, versatile, and highly dynamic many-to-many image generation capabilities.
Overview of iMontage. The model accepts a flexible set of reference images and produces N output images conditioned on a text prompt. Each image is encoded separately by a 3D VAE and the text is encoded by a language model; both token streams are then processed by an MMDiT. Clean reference-image tokens are concatenated with noisy target tokens before denoising. Right: training uses fixed-length text tokens and variable-length image/noise tokens, and the network transitions from dual-stream to single-stream blocks. For the image branch, we apply Marginal RoPE, a head-tail temporal indexing scheme that separates input and output pseudo-frames, preserves spatial RoPE, and supports many-to-many generation. In the figure, H and W with subscripts denote the height and width indices of the 2D RoPE computed at each image's native resolution, and T denotes the time index assigned along the temporal dimension.
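The head-tail indexing described above can be illustrated with a small sketch. It is not the iMontage implementation: the function names, the `max_frames` range, and the exact index values are assumptions made for illustration only. It assumes that reference images take time indices from the head of a fixed temporal range and target images take indices from the tail, and that clean reference tokens come before noisy target tokens in the concatenated image stream.

```python
def marginal_time_indices(num_inputs, num_outputs, max_frames=32):
    """Assign one pseudo-frame time index per image.

    Assumption (not from the source): inputs count up from the head of a
    range of size max_frames, outputs occupy the tail of the same range.
    """
    if num_inputs < 0 or num_outputs < 0:
        raise ValueError("image counts must be non-negative")
    if num_inputs + num_outputs > max_frames:
        raise ValueError("too many images for the temporal index range")
    input_t = list(range(num_inputs))                            # head: 0, 1, ...
    output_t = list(range(max_frames - num_outputs, max_frames))  # tail: ..., max-1
    return input_t, output_t


def build_image_sequence(ref_token_counts, target_token_counts, max_frames=32):
    """Label each token of the concatenated image stream.

    Each entry is (time_index, is_noisy). Reference tokens are clean and
    come first; target tokens are noisy and come last. Token counts may
    differ per image, mirroring variable-length image tokens.
    """
    input_t, output_t = marginal_time_indices(
        len(ref_token_counts), len(target_token_counts), max_frames
    )
    sequence = []
    for t, count in zip(input_t, ref_token_counts):
        sequence.extend((t, False) for _ in range(count))
    for t, count in zip(output_t, target_token_counts):
        sequence.extend((t, True) for _ in range(count))
    return sequence


if __name__ == "__main__":
    # Three reference images and two targets in a hypothetical range of 32.
    print(marginal_time_indices(3, 2, max_frames=32))
```

Under these assumptions, the gap between head and tail indices keeps input and output pseudo-frames apart no matter how many images each side contains, while the spatial (H, W) indices of each image are left untouched.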
Hover over an image to see iMontage's editing capability across different tasks. (Images may take a moment to load.)
iMontage can synthesize high-quality images that preserve the identity and style of multiple reference images.
"Have the human from the first picture, the person in image 2, and the person in the third picture stand together, holding a large, ancient map, with the man from the first picture pointing to a spot on the map, all of them leaning in with curiosity and excitement."
"She gently strokes the cat curled up on her lap in a dimly lit Victorian-style study, where a single vintage lamp casts warm golden light onto the velvet armchair and polished wooden floor. The cat's fur glows subtly in the light, its wide, glistening eyes gazing up at her with a mix of curiosity and trust. Dust particles float visibly in the air, and the shadows of bookshelves stretch across the walls."
"Let the human in the second picture sit astride the white horse with a dark mane from the first figure, gently leaning forward to stroke its neck with one hand while the other holds the reins, as golden sunlight filters through nearby trees and casts long shadows across the peaceful grassy meadow."
"The vibrant purple flower is set in a small glass vase on the rustic wooden counter of the cozy store, and the plant with red leaves are arranged beside it, allowing the warm glow of string lights to enhance their colors. Behind the counter, shelves laden with artisanal candles and handmade pottery create a welcoming, homey feel, while the faint smell of fresh coffee drifts in from the adjoining cafe section."
"He plays guitar in the place."
"Confucius from the first image, Moses from the second image and Solon from the last image are having a debate in front of the supreme court."
Combine reference images with visual control signals (depth maps, OpenPose skeletons, and Canny edges) for controlled generation.
Seamlessly transfer artistic styles to your content while preserving structure.
Navigate through scenes by changing perspectives via text descriptions.
Generate consistent storyboards from character references with specific narrative descriptions.
"A woman in a kimono walks through a lush garden path, holding a red parasol."
"She kneels down to gently reach out to a white cat on a leaf-covered path."
"The woman strolls along a sunlit garden path, her parasol casting a shadow behind her."
"She stands by a sunlit window, enjoying a steaming cup of tea with an open book nearby."
"She gracefully walks through a sunlit doorway, her dress flowing as she moves."
"She sits thoughtfully on a plush sofa in a warmly lit room, surrounded by elegant decor."
"The woman is waiting for the airplane in the airport terminal."
"She takes in the view from her window seat, marveling at the clouds below."
"Finally, she arrives at her destination, a beautiful beach."
"Nezuko sits in a forest peacefully."
"In the night, Nezuko fights with some wolves, protecting the village."
"On the victory celebration, Nezuko sings happily in front of a bonfire."
"Hepburn carrying the yellow bag, walking elegantly in the street of a beautiful Italian small town."
"She holds a pose to the camera, lifts up the yellow bag."
"After filming, Hepburn is crowded by a group of fans."