Authors
- Dinara Kaibassova
- Bigul Mukhametzhanova
- Dinara Tokseit
- Aigul Kubegenova
- Murad Kozhanov
Abstract
The purpose of this study is to develop an effective hybrid model for automatic document classification by combining statistical and semantic text vectorization techniques with machine learning algorithms. The methodology integrates Term Frequency–Inverse Document Frequency (TF-IDF) and Word2Vec embeddings with classifiers such as Support Vector Machine (SVM) and Random Forest. The proposed approach includes data preprocessing (tokenization, normalization, stop word removal, and lemmatization), feature extraction, model training, and evaluation using classification metrics such as accuracy, F1-score, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa. Experimental results demonstrate that the Word2Vec + SVM model outperforms other configurations, achieving 90.2% accuracy and an F1-score of 82.52%, thus highlighting the advantage of incorporating semantic context into vector representation. The study concludes that hybrid methods combining TF-IDF and Word2Vec with robust classifiers improve both the precision and generalizability of document analysis models. Practical implications include potential applications in sentiment analysis, topic modeling, text classification for legal and healthcare domains, and multilingual contexts. This research provides a foundation for developing high-performance text analysis systems applicable to various real-world natural language processing tasks.
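As an illustration of the evaluation metrics named in the abstract, the following minimal sketch computes accuracy, F1-score, Matthews Correlation Coefficient (MCC), and Cohen's Kappa from a confusion matrix. It assumes a binary task with labels 0 and 1. The abstract does not state the number of classes, and the function names `confusion_counts` and `evaluate` and the toy labels are hypothetical, not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, fp, fn, tn) for a binary task; `positive` marks the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, F1, MCC, and Cohen's Kappa (binary case only, an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    n = tp + fp + fn + tn

    # Accuracy: share of documents labelled correctly.
    accuracy = (tp + tn) / n

    # F1: harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN).
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC: correlation between predicted and true labels, in [-1, 1].
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    # Cohen's Kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the study.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for name, value in evaluate(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC and Cohen's Kappa account for class imbalance and chance agreement, which is one plausible reason to report them alongside accuracy and F1-score.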
Suggested Citation
Dinara Kaibassova & Bigul Mukhametzhanova & Dinara Tokseit & Aigul Kubegenova & Murad Kozhanov, 2025.
"Document analysis via combined vectorization and machine learning approaches,"
International Journal of Innovative Research and Scientific Studies, Innovative Research Publishing, vol. 8(4), pages 2195-2204.
Handle:
RePEc:aac:ijirss:v:8:y:2025:i:4:p:2195-2204:id:8356