Image style transfer is an interdisciplinary topic spanning computer vision and art that continues to attract research interest. Unlike traditional style transfer methods, which require a style reference image as input to define the desired style, recent works have begun to tackle the problem in a text-guided manner. However, most existing solutions lack the flexibility to accept style input from multiple modalities. In addition, they often involve a time-consuming optimization or training procedure for every style-content pair. Moreover, many approaches produce undesirable artifacts in the transferred images. To address these issues, we present a unified framework for multimodality-guided image style transfer. Specifically, by adapting existing image-guided style transfer models, we reduce the entire task to the problem of generating a content-agnostic style representation guided by multiple modalities. We solve this problem with a novel cross-modal GAN inversion method. Based on the generated style representations, we further develop a multi-style boosting strategy to enhance style transfer quality. Extensive experiments on 2684 text-content combinations demonstrate that our method achieves state-of-the-art performance on text-guided image style transfer. Furthermore, comprehensive qualitative results confirm the effectiveness of our method on multimodality-guided style transfer and cross-modal style interpolation.