Xiao Han, Xiatian Zhu, Yi-Zhe Song, Tao Xiang (2023) FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Xiao Han (1,2), Xiatian Zhu (1,3), Licheng Yu, Li Zhang (4), Yi-Zhe Song (1,2), Tao Xiang (1,2)

(1) CVSSP, University of Surrey
(2) iFlyTek-Surrey Joint Research Centre on Artificial Intelligence
(3) Surrey Institute for People-Centred Artificial Intelligence
(4) Fudan University

{xiao.han, xiatian.zhu, y.song, t.xiang}@surrey.ac.uk
lichengyu24@gmail.com
lizhangfd@fudan.edu.cn

Abstract

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in each individual input/output format and dataset size. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and inability to exploit inter-task relatedness. To address such issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model for multiple heterogeneous fashion tasks, therefore being much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.

Code is available at https://github.com/BrandonHanx/FAME-ViL
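The parameter-efficiency claim in the abstract can be illustrated with a small calculation. The sketch below compares one shared backbone with lightweight per-task adapters against a baseline of one fully fine-tuned backbone copy per task. The function name `parameter_saving`, its arguments, and all sizes used are hypothetical placeholders chosen for illustration; they are not taken from the paper or its code, and the example does not attempt to reproduce the reported 61.5% figure.

```python
from typing import Sequence


def parameter_saving(backbone_params: int,
                     adapter_params_per_task: Sequence[int]) -> float:
    """Fraction of parameters saved by a shared backbone with per-task adapters.

    Assumed baseline (an illustration, not the paper's accounting): each task
    gets its own fully fine-tuned copy of the backbone and no adapters.
    """
    num_tasks = len(adapter_params_per_task)
    if num_tasks == 0 or backbone_params <= 0:
        raise ValueError("need at least one task and a positive backbone size")

    # Baseline: one independent backbone copy per task.
    independent_total = num_tasks * backbone_params
    # Multi-task model: one backbone shared by all tasks, plus the adapters.
    shared_total = backbone_params + sum(adapter_params_per_task)

    return 1.0 - shared_total / independent_total


# Hypothetical sizes: a 100M-parameter backbone and four tasks,
# each adding 10M adapter parameters.
saving = parameter_saving(100_000_000, [10_000_000] * 4)
print(f"Parameters saved: {saving:.1%}")
```

Under these assumed sizes, the shared model holds 140M parameters against 400M for four independent copies. The saving grows with the number of tasks and shrinks as the adapters get larger.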