Growing Visual Generative Capacity for Pre-Trained MLLMs

1 University of Maryland, College Park   2 CUHK MMLab   3 ByteDance
*Equal contribution     Co-corresponding authors     Project lead

Abstract

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation, while requiring less training data and shorter training time than prior unified MLLMs.

Method

Architecture. Bridge adopts a Mixture-of-Transformers (MoT) architecture with two experts: a frozen understanding expert for text and visual understanding tokens, and a trainable generation expert for visual generation tokens. Both experts share unified causal attention across all tokens. Hard routing is employed to dispatch tokens between the two experts. Text tokens are passed directly into the understanding expert. Images for understanding tasks are first encoded by the continuous vision encoder inherited from the backbone MLLM and then fed into the understanding expert. In contrast, image generation tokens are routed to the generation expert.
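The sketch below illustrates, in PyTorch, how one such Mixture-of-Transformers layer with hard routing could look: attention is computed causally over the full sequence, while each token's feed-forward pass is dispatched to exactly one expert. This is a simplified illustration under our own assumptions (names such as `modality_ids`, `und_ffn`, and `gen_ffn` are invented here), not the released Bridge implementation; in particular it keeps a single shared attention module and routes only the FFN.

```python
# Minimal sketch of one Mixture-of-Transformers layer with hard routing.
# All module and variable names are illustrative, not the actual Bridge code.
import torch
import torch.nn as nn

UND, GEN = 0, 1  # modality ids: understanding tokens vs. image-generation tokens

class MoTLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Frozen understanding expert (inherited from the backbone MLLM).
        self.und_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
        for p in self.und_ffn.parameters():
            p.requires_grad = False
        # Trainable generation expert for visual generation tokens.
        self.gen_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
        # Unified causal attention: every token attends over the full sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); modality_ids: (B, T) with values UND or GEN.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h, _ = self.attn(x, x, x, attn_mask=causal)
        x = x + h
        # Hard routing: each token is processed by exactly one expert FFN.
        out = torch.empty_like(x)
        und_mask = modality_ids == UND
        gen_mask = ~und_mask
        out[und_mask] = self.und_ffn(x[und_mask])
        out[gen_mask] = self.gen_ffn(x[gen_mask])
        return x + out
```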

Figure: Overview of the Bridge method.

Semantic-to-Pixel Discrete Visual Representation. For generation, each image is represented by a compact sequence that begins with a small number of semantic tokens capturing global structure, followed by a longer run of fine-grained pixel tokens that reconstruct details:

<BOI> <SEM0> <SEM1> … <PIX0> <PIX1> … <EOI>

where <BOI> and <EOI> denote special tokens marking the beginning and end of an image, <SEMi> represents the i-th semantic token, and <PIXi> represents the i-th pixel token. This design aligns tightly with language modeling while maintaining high visual fidelity, and requires only a minimal increase in sequence length; a sketch of how such a sequence can be assembled follows.
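A small sketch of assembling this semantic-to-pixel target sequence for one image. The function, the special-token ids, and the vocabulary offsets are hypothetical placeholders of our own, not the actual Bridge tokenizers; the example counts are illustrative only.

```python
# Sketch of building a generation target sequence: <BOI> + semantic ids + pixel ids + <EOI>.
# Special-token ids and offsets are made-up placeholders, not Bridge's real vocabulary.
from typing import List

def build_generation_sequence(sem_ids: List[int], pix_ids: List[int],
                              boi_id: int, eoi_id: int,
                              sem_offset: int, pix_offset: int) -> List[int]:
    """Compose the semantic-to-pixel sequence for one image.

    The short semantic prefix stays closely aligned with the language model,
    while the longer pixel suffix restores fine-grained detail. Semantic and
    pixel codebooks are mapped into disjoint id ranges via the offsets.
    """
    sem_part = [sem_offset + t for t in sem_ids]
    pix_part = [pix_offset + t for t in pix_ids]
    return [boi_id] + sem_part + pix_part + [eoi_id]

# Illustrative token counts only: a few semantic tokens in front of a much
# longer pixel sequence keeps the extra length small.
seq = build_generation_sequence(sem_ids=[12, 7, 301], pix_ids=list(range(1024)),
                                boi_id=0, eoi_id=1, sem_offset=10, pix_offset=5000)
print(len(seq))  # 1 + 3 + 1024 + 1 = 1029
```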

Experiments

Quantitative Results on Visual Understanding Benchmarks

Model Base (M)LLM POPE↑ MME-P↑ MME-C↑ MMB↑ SEED↑ MMMU↑
LLaVA-v1.5 Vicuna-7B 85.9 1511 - 64.3 58.6 35.4
Qwen-VL Qwen-7B - 1488 - 60.6 58.2 -
LLaVA-NeXT Vicuna-7B 86.5 1519 - 67.4 64.7 35.1
DeepSeek-VL DeepSeek-7B 88.1 - - 73.2 70.4 36.6
LLaVA-OV Qwen2-7B 87.2 1580 418 80.8 75.4 48.8
ILLUME Vicuna-7B 88.5 1445 - 65.1 72.9 38.2
Chameleon - - - - - - 22.4
LWM LLaMA2-7B 75.2 - - - - -
Emu3 - 85.2 - - 58.5 68.2 31.6
Liquid GEMMA-7B 81.1 1119 - - - -
UniTok LLaMA2-7B 83.2 1448 - - - -
VILA-U LLaMA2-7B 85.8 1402 - - 59.0 -
Janus-Pro DeepSeek-7B 87.4 1567 260 79.2 72.1 41.0
TokenFlow-XL Qwen2.5-14B 87.8 1551 371 76.8 72.6 43.2
MetaMorph LLaMA-3.1-8B - - - 75.2 71.8 41.8
Tar Qwen2.5-7B 87.8 1571 355 74.4 73.0 39.0
Show-o2 Qwen2.5-7B - 1621 - 79.3 69.8 48.9
BAGEL Qwen2.5-7B - 1687 - 85.0 - 55.3
LMFusion LLaVA-NeXT-8B - 1604 - 72.1 72.5 41.7
MetaQuery-XL LLaVA-NeXT-8B - 1685 - 83.5 76.9 58.6
UniWorld-V1 Qwen2.5-VL-7B - - - 83.5 - 58.6
BLIP3-o Qwen2.5-VL-7B - 1683 647 83.5 77.5 50.6
Bridge (Ours) InternVL3-8B 88.4 1730 677 84.4 77.4 57.4

Quantitative Results on Text-to-Image Generation Benchmarks

† refers to methods using prompt augmentation, e.g., LLM rewriter or self-CoT.

Method | DPG Bench: Entity, Relation, Overall↑ | GenEval: Two Obj., Color Attr., Overall↑ | WISE: Time, Space, Overall↑
SDXL 82.43 86.76 74.65 0.74 0.23 0.55 0.48 0.47 0.43
Playground v2.5 82.59 84.08 75.47 - - - 0.58 0.55 0.49
Hunyuan DiT 80.59 74.36 78.87 - - - - - -
DALLE3 89.61 90.58 83.50 0.87 0.45 0.67 - - -
SD3-Medium 91.01 80.70 84.08 0.94 0.60 0.74 0.44 0.48 0.42
SANA-1.5 - - 84.70 0.93 0.65 0.81 - - -
NextStep-1 - - 85.28 - - 0.63 / 0.73† 0.54 0.61 0.54 / 0.79
Chameleon - - - - - 0.39 - - -
LWM - - - 0.41 0.15 0.47 - - -
Emu3 86.68 90.22 80.60 0.71 0.21 0.54 / 0.66† 0.45 0.48 0.39
SEED-X-13B - - - 0.58 0.14 0.49 - - -
Transfusion - - - - - 0.63 - - -
ILLUME - - - 0.86 0.28 0.61 - - -
Janus-Pro-7B 88.90 89.32 84.19 0.89 0.66 0.80 0.37 0.49 0.35
Tar-7B 88.62 93.98 84.19 0.92 0.65 0.84 - - -
Show-o2-7B 91.78 91.81 86.14 0.87 0.62 0.76 - - -
MetaQuery-XL - - 82.05 - - 0.80† 0.55 0.62 0.55
BAGEL - - - 0.94 0.63 0.82 / 0.88 0.55 0.68 0.52 / 0.70
UniWorld-V1 - - - 0.93 0.70 0.80 0.55 0.73 0.55
BLIP3-o-8B - - 81.60 - - 0.84 - - 0.62
Bridge (Ours) 90.10 92.27 85.51 0.93 0.66 0.74 / 0.82† 0.56 0.65 0.53 / 0.69†

Comparison of Image Editing Performance on ImgEdit Benchmark

Model Add Adjust Extract Replace Remove Background Style Hybrid Action Overall↑
Instruct-P2P 2.45 1.83 1.44 2.01 1.50 1.44 3.55 1.20 1.46 1.88
AnyEdit 3.18 2.95 1.88 2.47 2.23 2.24 2.85 1.56 2.65 2.45
UltraEdit 3.44 2.81 2.13 2.96 1.45 2.83 3.76 1.91 2.98 2.70
Step1X-Edit 3.88 3.14 1.76 3.40 2.41 3.16 4.63 2.64 2.52 3.06
BAGEL 3.56 3.31 1.70 3.30 2.62 3.24 4.49 2.38 4.17 3.20
UniWorld-V1 3.82 3.64 2.27 3.47 3.24 2.99 4.21 2.96 2.74 3.26
Bridge (Ours) 3.49 2.64 2.93 3.45 3.48 3.45 4.14 3.09 3.85 3.39

Image Editing Visualization