Authors listed:
- Zheng You Lim
(Centre for Advanced Analytics, CoE for Artificial Intelligence, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia)
- Ying Han Pang
(Centre for Advanced Analytics, CoE for Artificial Intelligence, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia; Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia)
- Edwin Chan Kah Jun
(Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia)
- Shih Yin Ooi
(Centre for Advanced Analytics, CoE for Artificial Intelligence, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia; Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia)
- Goh Fan Ling
(FINEXT Sdn Bhd, B-23A-7, Vertical Business Suite Avenue 3 Bangsar South City, No 8, Jalan Kerinchi, Kuala Lumpur 59200, Malaysia)
Abstract
Malicious URLs are a persistent cybersecurity threat, serving as pathways to phishing and other malicious activity. Although transformer-based models have demonstrated good performance in malicious URL detection, their high computational cost and latency make them impractical for deployment in real-time or resource-constrained systems. Lightweight models built through knowledge distillation (KD) tend to be efficient but are often not discriminative enough to distinguish malicious from benign URLs that have subtle lexical overlaps, particularly on imbalanced datasets. To address these issues, we propose Contra-KD, a lightweight transformer model that combines contrastive learning (CL) with KD. The framework imposes structured embedding matching, allowing the student model to learn more meaningful and generalizable representations. Contra-KD uses a compact 6-layer student transformer architecture based on ELECTRA to scale parameters down, and it can achieve more than 90% computational fidelity with high accuracy. In this scheme, CL improves feature discrimination by clustering semantically similar URLs together and separating dissimilar URLs. This limits confusion, especially when lexical traits are shared or adversarial obfuscation is present. On a large-scale, publicly available Kaggle dataset of 651,191 URLs under imbalanced scenarios, Contra-KD achieves 99.05% accuracy, 99.96% ROC-AUC, and 98.18% MCC, outperforming its counterparts, including both lightweight and transformer-based models. In summary, Contra-KD offers a transformer architecture that is both compact and computationally efficient while delivering stable detection performance.
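The abstract does not give the loss formulation, so the following is only a minimal sketch of how a distillation term and a contrastive term might be combined in general. It assumes a temperature-softened KL divergence for KD and a label-based (supervised) contrastive loss over cosine similarities. The function names, the temperature values, and the weighting factor alpha are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: same-label samples act as positives."""
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {j: cosine(embeddings[i], embeddings[j]) / tau
                for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

def combined_loss(student_logits, teacher_logits, embeddings, labels, alpha=0.5):
    """Weighted sum of the mean KD loss and the contrastive loss (alpha is assumed)."""
    kd = sum(kd_loss(s, t) for s, t in zip(student_logits, teacher_logits))
    kd /= len(student_logits)
    return alpha * kd + (1 - alpha) * contrastive_loss(embeddings, labels)
```

In this sketch the contrastive term is small when same-label embeddings point in similar directions and large when they are interleaved with the other class, which is the clustering and separating behaviour the abstract attributes to CL.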
Suggested Citation
Zheng You Lim & Ying Han Pang & Edwin Chan Kah Jun & Shih Yin Ooi & Goh Fan Ling, 2026.
"Contra-KD: A Lightweight Transformer Model for Malicious URL Detection with Contrastive Representation and Model Distillation,"
Future Internet, MDPI, vol. 18(3), pages 1-20, March.
Handle: RePEc:gam:jftint:v:18:y:2026:i:3:p:157-:d:1896771