Growing Visual Generative Capacity for Pre-Trained MLLMs

1 University of Maryland, College Park   2 CUHK MMLab   3 ByteDance
*Equal contribution     Co-corresponding authors     Project lead

Abstract

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation, while requiring less training data and shorter training time than prior unified MLLMs.

Method

Architecture. Bridge adopts a Mixture-of-Transformers (MoT) architecture with two experts: a frozen understanding expert for text and visual understanding tokens, and a trainable generation expert for visual generation tokens. Both experts share unified causal attention across all tokens. Hard routing is employed to dispatch tokens between the two experts. Text tokens are passed directly into the understanding expert. Images for understanding tasks are first encoded by the continuous vision encoder inherited from the backbone MLLM and then fed into the understanding expert. In contrast, image generation tokens are routed to the generation expert.
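The sketch below illustrates, in PyTorch, how one such Mixture-of-Transformers layer with hard routing could look: attention is computed causally over the full sequence, while each token's feed-forward pass is dispatched to exactly one expert. This is a simplified illustration under our own assumptions (names such as `modality_ids`, `und_ffn`, and `gen_ffn` are invented here), not the released Bridge implementation; in particular it keeps a single shared attention module and routes only the FFN.

```python
# Minimal sketch of one Mixture-of-Transformers layer with hard routing.
# All module and variable names are illustrative, not the actual Bridge code.
import torch
import torch.nn as nn

UND, GEN = 0, 1  # modality ids: understanding tokens vs. image-generation tokens

class MoTLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Frozen understanding expert (inherited from the backbone MLLM).
        self.und_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
        for p in self.und_ffn.parameters():
            p.requires_grad = False
        # Trainable generation expert for visual generation tokens.
        self.gen_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
        # Unified causal attention: every token attends over the full sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); modality_ids: (B, T) with values UND or GEN.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h, _ = self.attn(x, x, x, attn_mask=causal)
        x = x + h
        # Hard routing: each token is processed by exactly one expert FFN.
        out = torch.empty_like(x)
        und_mask = modality_ids == UND
        gen_mask = ~und_mask
        out[und_mask] = self.und_ffn(x[und_mask])
        out[gen_mask] = self.gen_ffn(x[gen_mask])
        return x + out
```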

Figure: Overview of the Bridge method.

Semantic-to-Pixel Discrete Visual Representation. For generation, each image is represented by a compact sequence that begins with a small number of semantic tokens capturing global structure, followed by a longer run of fine-grained pixel tokens that reconstruct details:

<BOI> <SEM0> <SEM1> … <PIX0> <PIX1> … <EOI>

where <BOI> and <EOI> denote special tokens marking the beginning and end of an image, <SEMi> represents the i-th semantic token, and <PIXi> represents the i-th pixel token. This design aligns tightly with language modeling while maintaining high visual fidelity, and requires only a minimal increase in sequence length; a sketch of how such a sequence can be assembled follows.
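A small sketch of assembling this semantic-to-pixel target sequence for one image. The function, the special-token ids, and the vocabulary offsets are hypothetical placeholders of our own, not the actual Bridge tokenizers; the example counts are illustrative only.

```python
# Sketch of building a generation target sequence: <BOI> + semantic ids + pixel ids + <EOI>.
# Special-token ids and offsets are made-up placeholders, not Bridge's real vocabulary.
from typing import List

def build_generation_sequence(sem_ids: List[int], pix_ids: List[int],
                              boi_id: int, eoi_id: int,
                              sem_offset: int, pix_offset: int) -> List[int]:
    """Compose the semantic-to-pixel sequence for one image.

    The short semantic prefix stays closely aligned with the language model,
    while the longer pixel suffix restores fine-grained detail. Semantic and
    pixel codebooks are mapped into disjoint id ranges via the offsets.
    """
    sem_part = [sem_offset + t for t in sem_ids]
    pix_part = [pix_offset + t for t in pix_ids]
    return [boi_id] + sem_part + pix_part + [eoi_id]

# Illustrative token counts only: a few semantic tokens in front of a much
# longer pixel sequence keeps the extra length small.
seq = build_generation_sequence(sem_ids=[12, 7, 301], pix_ids=list(range(1024)),
                                boi_id=0, eoi_id=1, sem_offset=10, pix_offset=5000)
print(len(seq))  # 1 + 3 + 1024 + 1 = 1029
```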

Experiments

Quantitative Results on Visual Understanding Benchmarks

Model Base (M)LLM POPE↑ MME-P↑ MME-C↑ MMB↑ SEED↑ MMMU↑
LLaVA-v1.5 Vicuna-7B 85.9 1511 - 64.3 58.6 35.4
Qwen-VL Qwen-7B - 1488 - 60.6 58.2 -
LLaVA-NeXT Vicuna-7B 86.5 1519 - 67.4 64.7 35.1
DeepSeek-VL DeepSeek-7B 88.1 - - 73.2 70.4 36.6
LLaVA-OV Qwen2-7B 87.2 1580 418 80.8 75.4 48.8
ILLUME Vicuna-7B 88.5 1445 - 65.1 72.9 38.2
Chameleon - - - - - - 22.4
LWM LLaMA2-7B 75.2 - - - - -
Emu3 - 85.2 - - 58.5 68.2 31.6
Liquid GEMMA-7B 81.1 1119 - - - -
UniTok LLaMA2-7B 83.2 1448 - - - -
VILA-U LLaMA2-7B 85.8 1402 - - 59.0 -
Janus-Pro DeepSeek-7B 87.4 1567 260 79.2 72.1 41.0
TokenFlow-XL Qwen2.5-14B 87.8 1551 371 76.8 72.6 43.2
MetaMorph LLaMA-3.1-8B - - - 75.2 71.8 41.8
Tar Qwen2.5-7B 87.8 1571 355 74.4 73.0 39.0
Show-o2 Qwen2.5-7B - 1621 - 79.3 69.8 48.9
BAGEL Qwen2.5-7B - 1687 - 85.0 - 55.3
LMFusion LLaVA-NeXT-8B - 1604 - 72.1 72.5 41.7
MetaQuery-XL LLaVA-NeXT-8B - 1685 - 83.5 76.9 58.6
UniWorld-V1 Qwen2.5-VL-7B - - - 83.5 - 58.6
BLIP3-o Qwen2.5-VL-7B - 1683 647 83.5 77.5 50.6
Bridge (Ours) InternVL3-8B 88.4 1730 677 84.4 77.4 57.4

Quantitative Results on Text-to-Image Generation Benchmarks

† refers to methods using prompt augmentation, e.g., LLM rewriter or self-CoT.

Method | DPG Bench: Entity, Relation, Overall↑ | GenEval: Two Obj., Color Attr., Overall↑ | WISE: Time, Space, Overall↑
SDXL 82.43 86.76 74.65 0.74 0.23 0.55 0.48 0.47 0.43
Playground v2.5 82.59 84.08 75.47 - - - 0.58 0.55 0.49
Hunyuan DiT 80.59 74.36 78.87 - - - - - -
DALLE3 89.61 90.58 83.50 0.87 0.45 0.67 - - -
SD3-Medium 91.01 80.70 84.08 0.94 0.60 0.74 0.44 0.48 0.42
SANA-1.5 - - 84.70 0.93 0.65 0.81 - - -
NextStep-1 - - 85.28 - - 0.63 / 0.73† 0.54 0.61 0.54 / 0.79
Chameleon - - - - - 0.39 - - -
LWM - - - 0.41 0.15 0.47 - - -
Emu3 86.68 90.22 80.60 0.71 0.21 0.54 / 0.66† 0.45 0.48 0.39
SEED-X-13B - - - 0.58 0.14 0.49 - - -
Transfusion - - - - - 0.63 - - -
ILLUME - - - 0.86 0.28 0.61 - - -
Janus-Pro-7B 88.90 89.32 84.19 0.89 0.66 0.80 0.37 0.49 0.35
Tar-7B 88.62 93.98 84.19 0.92 0.65 0.84 - - -
Show-o2-7B 91.78 91.81 86.14 0.87 0.62 0.76 - - -
MetaQuery-XL - - 82.05 - - 0.80† 0.55 0.62 0.55
BAGEL - - - 0.94 0.63 0.82 / 0.88 0.55 0.68 0.52 / 0.70
UniWorld-V1 - - - 0.93 0.70 0.80 0.55 0.73 0.55
BLIP3-o-8B - - 81.60 - - 0.84 - - 0.62
Bridge (Ours) 90.10 92.27 85.51 0.93 0.66 0.74 / 0.82† 0.56 0.65 0.53 / 0.69†

Comparison of Image Editing Performance on ImgEdit Benchmark

Model Add Adjust Extract Replace Remove Background Style Hybrid Action Overall↑
Instruct-P2P 2.45 1.83 1.44 2.01 1.50 1.44 3.55 1.20 1.46 1.88
AnyEdit 3.18 2.95 1.88 2.47 2.23 2.24 2.85 1.56 2.65 2.45
UltraEdit 3.44 2.81 2.13 2.96 1.45 2.83 3.76 1.91 2.98 2.70
Step1X-Edit 3.88 3.14 1.76 3.40 2.41 3.16 4.63 2.64 2.52 3.06
BAGEL 3.56 3.31 1.70 3.30 2.62 3.24 4.49 2.38 4.17 3.20
UniWorld-V1 3.82 3.64 2.27 3.47 3.24 2.99 4.21 2.96 2.74 3.26
Bridge (Ours) 3.49 2.64 2.93 3.45 3.48 3.45 4.14 3.09 3.85 3.39

Image Editing Visualization