Architecture. Bridge adopts a Mixture-of-Transformers (MoT) architecture with two experts: a frozen understanding expert for text and visual understanding tokens, and a trainable generation expert for visual generation tokens. Both experts share unified causal attention across all tokens.
Hard routing is employed to dispatch tokens between the two experts. Text tokens are passed directly into the understanding expert. Images for understanding tasks are first encoded by the continuous vision encoder inherited from the backbone MLLM and then fed into the understanding expert. In contrast, image generation tokens are routed to the generation expert.
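The hard-routing scheme can be sketched as a simple dispatch by modality tag: understanding-side tokens (text and encoded understanding images) go to one expert, generation tokens to the other, and outputs are merged back in their original positions so the shared causal attention sees one ordered sequence. This is a minimal illustration with stand-in expert functions; the tags, names, and per-token processing here are assumptions for the sketch, not the model's actual implementation.

```python
# Stand-ins for the two experts; in the real model these are transformer
# parameter sets (frozen understanding expert, trainable generation expert).
def understanding_expert(tokens):
    return [f"und({t})" for t in tokens]

def generation_expert(tokens):
    return [f"gen({t})" for t in tokens]

def route(sequence):
    """Hard-route (token, modality) pairs to experts, preserving order.

    Modality tags are illustrative: "text" and "img_und" tokens go to the
    understanding expert; "img_gen" tokens go to the generation expert.
    """
    und_idx = [i for i, (_, mod) in enumerate(sequence) if mod in ("text", "img_und")]
    gen_idx = [i for i, (_, mod) in enumerate(sequence) if mod == "img_gen"]

    und_out = understanding_expert([sequence[i][0] for i in und_idx])
    gen_out = generation_expert([sequence[i][0] for i in gen_idx])

    # Scatter expert outputs back to the original token positions.
    out = [None] * len(sequence)
    for i, o in zip(und_idx, und_out):
        out[i] = o
    for i, o in zip(gen_idx, gen_out):
        out[i] = o
    return out

seq = [("a", "text"), ("v0", "img_und"), ("g0", "img_gen"), ("b", "text")]
print(route(seq))  # ['und(a)', 'und(v0)', 'gen(g0)', 'und(b)']
```

Because routing is hard rather than learned, each token is processed by exactly one expert, yet all tokens still attend to each other in the shared attention.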
Semantic-to-Pixel Discrete Visual Representation. For generation, each image is represented by a compact sequence that begins with a small number of semantic tokens capturing global structure, followed by a longer sequence of fine-grained pixel tokens that reconstruct details:
<BOI> <SEM0> <SEM1> … <PIX0> <PIX1> … <EOI>
where <BOI> and <EOI> denote special tokens marking the beginning and end of an image, <SEMi> represents the i-th semantic token, and <PIXi> represents the i-th pixel token.
This design aligns tightly with language modeling while maintaining high visual fidelity, and requires only a minimal increase in token length.
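Assembling this representation for one image is mechanical; the sketch below shows the token layout with assumed (hypothetical) semantic and pixel counts, purely to make the ordering concrete.

```python
def build_image_sequence(num_semantic, num_pixel):
    """Return the discrete token sequence for one generated image:
    <BOI>, semantic tokens (coarse global structure), pixel tokens
    (fine-grained detail), <EOI>. Counts here are illustrative."""
    seq = ["<BOI>"]
    seq += [f"<SEM{i}>" for i in range(num_semantic)]
    seq += [f"<PIX{i}>" for i in range(num_pixel)]
    seq.append("<EOI>")
    return seq

# A few semantic tokens followed by a longer run of pixel tokens.
tokens = build_image_sequence(num_semantic=2, num_pixel=4)
print(" ".join(tokens))
# <BOI> <SEM0> <SEM1> <PIX0> <PIX1> <PIX2> <PIX3> <EOI>
```

Since the semantic prefix is short, the total sequence length is dominated by the pixel tokens, which is what keeps the overhead of adding semantic tokens small.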