Abstract
Traditional tourism analytics have primarily relied on isolated sentiment analysis and image processing techniques, often failing to capture the subtle interaction between textual expressions and visual aesthetics inherent in tourist experiences. This study addresses these limitations by proposing a novel multi-modal framework that transforms textual reviews into AI-generated images using standardized prompts, thereby converting affective signals into explicit visual features. Leveraging state-of-the-art models, such as Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) for fine-grained emotion recognition and Contrastive Language–Image Pre-training (CLIP) for semantic extraction of visual attributes, our approach maps complex sentiments onto interpretable visual characteristics, integrating explainable features to uncover the underlying structure of tourist perceptions. This approach enhances classification performance and provides a transparent mechanism for understanding how distinct emotional states correspond to specific visual cues. Experimental evaluations on a dataset encompassing four diverse tourist destinations (Berlin, Dublin, Cairo, and Málaga) demonstrate high classification accuracy and robust correlations between text-derived emotions and image-based features, approaching the performance of more powerful embedding methods. Significant correlations were observed between emotions and visual features, e.g., between brightness and contentment and between entropy and shame, indicating that our method effectively captures the affective resonance between visual and textual modalities. Our findings underscore the transformative potential of converting textual sentiment into visual representations to facilitate more accurate, interpretable, and actionable analytics in the tourism sector. This framework suggests promising avenues for dynamic destination characterization, informed marketing strategies, and enhanced urban planning initiatives, laying the foundation for future advancements in multi-modal tourism analytics.
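The Python sketch below is an illustrative reading of the pipeline the abstract describes: a DistilBERT classifier scores emotions in review text, CLIP scores interpretable visual attributes of the corresponding AI-generated destination image, and a low-level feature such as brightness is correlated with the emotion scores across a corpus. The specific checkpoints (bhadresh-savani/distilbert-base-uncased-emotion, openai/clip-vit-base-patch32), the attribute prompts, the brightness feature, and the Pearson-correlation step are assumptions for illustration, not the authors' exact configuration.

    # Illustrative sketch only; checkpoints, prompts, and statistics are assumptions.
    import numpy as np
    import torch
    from PIL import Image
    from scipy.stats import pearsonr
    from transformers import CLIPModel, CLIPProcessor, pipeline

    # (1) Fine-grained emotion recognition with a public DistilBERT checkpoint
    #     (its label set uses "joy", "sadness", etc.; the paper's taxonomy may differ).
    emotion_clf = pipeline(
        "text-classification",
        model="bhadresh-savani/distilbert-base-uncased-emotion",
        top_k=None,
    )

    # (2) CLIP-based scoring of interpretable visual attributes of the generated image
    #     (the attribute prompts here are illustrative).
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    attribute_prompts = ["a bright, sunlit city scene", "a dark, gloomy city scene"]

    def clip_attribute_probs(image):
        """Probability distribution over the illustrative attribute prompts."""
        inputs = clip_proc(text=attribute_prompts, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = clip_model(**inputs).logits_per_image  # shape (1, n_prompts)
        return logits.softmax(dim=-1).squeeze(0)

    def brightness(image):
        # Simple low-level visual feature: mean grayscale intensity in [0, 255].
        return float(np.asarray(image.convert("L")).mean())

    # (3) Correlate a text-derived emotion with an image-derived feature
    #     across (review, generated image) pairs.
    def emotion_feature_correlation(reviews, images, emotion="joy"):
        emo, vis = [], []
        for text, img in zip(reviews, images):
            scores = {d["label"]: d["score"] for d in emotion_clf(text)[0]}
            emo.append(scores.get(emotion, 0.0))
            vis.append(brightness(img))
        return pearsonr(emo, vis)  # returns (r, p-value)

Zero-shot CLIP prompts and simple pixel statistics are used here because they keep the visual attributes directly interpretable, which mirrors the abstract's emphasis on explainability; the published framework may use different attributes and models.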
Suggested Citation
Víctor Calderón-Fajardo & Ignacio Rodríguez-Rodríguez & Miguel Puig-Cabrera, 2025.
"From words to visuals: a transformer-based multi-modal framework for emotion-driven tourism analytics,"
Information Technology & Tourism, Springer, vol. 27(4), pages 939-979, December.
Handle: RePEc:spr:infott:v:27:y:2025:i:4:d:10.1007_s40558-025-00334-2
DOI: 10.1007/s40558-025-00334-2
Download full text from publisher
As the access to this document is restricted, you may want to search for a different version of it.
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:spr:infott:v:27:y:2025:i:4:d:10.1007_s40558-025-00334-2. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to register here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help add them by using this form.
If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing. General contact details of provider: http://www.springer.com.
Please note that corrections may take a couple of weeks to filter through the various RePEc services.