iMontage

Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Zhoujie Fu1,2, Xianfang Zeng2, Jinghong Lan2, Xinyao Liao1,2,
Cheng Chen1, Junyi Chen3, Jiacheng Wei1, Wei Cheng2, Shiyu Liu2,
Yunuo Chen2,3, Gang Yu†,2, Guosheng Lin†,1

1Nanyang Technological University, 2StepFun, 3Shanghai Jiao Tong University
†Corresponding Author

Demo Video

Discover the power of iMontage through our comprehensive demonstration video showcasing unified, versatile, and highly dynamic many-to-many image generation capabilities.

Method Overview

Overview of iMontage. The model accepts a flexible set of reference images and produces N outputs conditioned on a text prompt. Images are encoded separately by a 3D VAE, text by a language model, and both token streams are processed by an MMDiT. Clean reference-image tokens are concatenated with noisy target tokens before denoising. Right: training uses fixed-length text tokens and variable-length image/noise tokens, and the architecture transitions from dual-stream to single-stream blocks. For the image branch, we apply Marginal RoPE, a head-tail temporal indexing scheme that separates input and output pseudo-frames, preserves spatial RoPE, and supports many-to-many generation. In the figure, H and W with subscripts denote the height/width indices of the 2D RoPE computed at each image's native resolution, while T denotes the temporal index assigned to each pseudo-frame.
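The head-tail temporal indexing behind Marginal RoPE can be illustrated with a minimal sketch. This is not the released implementation: the function names, the patch size, and the fixed temporal range `t_max` are assumptions made purely for illustration. The idea is that reference (input) pseudo-frames take temporal indices counted from the head of the range, target (output) pseudo-frames take indices counted back from the tail, and the spatial 2D RoPE grid is built at each image's native resolution.

```python
# Minimal sketch of head-tail temporal indexing (Marginal RoPE), assuming a
# fixed maximum temporal range `t_max` and patchified image tokens.
# All names here are illustrative, not from the official code.
import torch


def _grid(t, gh, gw):
    """(t, h, w) indices for one pseudo-frame of gh x gw patch tokens."""
    hh, ww = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    tt = torch.full_like(hh, t)
    return torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3)


def marginal_rope_positions(ref_sizes, tgt_sizes, patch=16, t_max=1024):
    """Build RoPE position indices for reference and target image tokens.

    Reference images get temporal indices 0, 1, ... (the head);
    target images get t_max - N, ..., t_max - 1 (the tail), so inputs and
    outputs stay separated no matter how many of each are provided.
    Spatial indices follow the usual 2D grid at native resolution.
    """
    positions = []
    # Head: clean reference-image tokens.
    for t, (h, w) in enumerate(ref_sizes):
        positions.append(_grid(t, h // patch, w // patch))
    # Tail: noisy target tokens, counted back from t_max.
    n_tgt = len(tgt_sizes)
    for i, (h, w) in enumerate(tgt_sizes):
        positions.append(_grid(t_max - n_tgt + i, h // patch, w // patch))
    return torch.cat(positions, dim=0)  # (num_tokens, 3) = (T, H, W) indices


# Example: two reference images and three generated targets.
pos = marginal_rope_positions(ref_sizes=[(512, 512), (512, 768)],
                              tgt_sizes=[(512, 512)] * 3)
```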

iMontage Pipeline

Image Editing

Hover over an image to explore iMontage's editing capabilities across different tasks. (Loading may take a moment.)

In-context Generation

iMontage can synthesize high-quality images that preserve the identity and style of multiple reference images.

In-context + Vision Signal

Combining reference images with vision signals (depth maps, OpenPose skeletons, and Canny edges) for controlled generation.

Reference & Depth Signal

Example panels: reference image, depth signal, and the generated result.

Style Reference Generation

Seamlessly transfer artistic styles to your content while preserving structure.

World Exploration

Navigate through scenes by changing perspectives via text descriptions.

Storyboard Generation

Generate consistent storyboards from character references with specific narrative descriptions.