
Document analysis via combined vectorization and machine learning approaches

Authors
  • Dinara Kaibassova
  • Bigul Mukhametzhanova
  • Dinara Tokseit
  • Aigul Kubegenova
  • Murad Kozhanov

Abstract

The purpose of this study is to develop an effective hybrid model for automatic document classification by combining statistical and semantic text vectorization techniques with machine learning algorithms. The methodology integrates Term Frequency–Inverse Document Frequency (TF-IDF) and Word2Vec embeddings with classifiers such as Support Vector Machine (SVM) and Random Forest. The proposed approach includes data preprocessing (tokenization, normalization, stop word removal, and lemmatization), feature extraction, model training, and evaluation using classification metrics such as accuracy, F1-score, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa. Experimental results demonstrate that the Word2Vec + SVM model outperforms other configurations, achieving 90.2% accuracy and an F1-score of 82.52%, thus highlighting the advantage of incorporating semantic context into vector representation. The study concludes that hybrid methods combining TF-IDF and Word2Vec with robust classifiers improve both the precision and generalizability of document analysis models. Practical implications include potential applications in sentiment analysis, topic modeling, text classification for legal and healthcare domains, and multilingual contexts. This research provides a foundation for developing high-performance text analysis systems applicable to various real-world natural language processing tasks.
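The abstract names TF-IDF as one vectorization step and accuracy, F1-score, MCC, and Cohen's Kappa as evaluation metrics. The sketch below is a minimal, standard-library illustration of those quantities and is not the authors' implementation. It assumes an unsmoothed TF-IDF weighting (term frequency times log(N/df)) and a binary classification setting; the paper's exact weighting scheme, number of classes, and averaging strategy are not stated in this record. The function names `tfidf` and `binary_metrics` are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: weight = (count / doc length) * log(N / df), with no
    smoothing or normalization. Library implementations often differ.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t])
         for t, c in Counter(doc).items()}
        for doc in docs
    ]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, F1, MCC and Cohen's kappa for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC: correlation between true and predicted labels, in [-1, 1].
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    weights = tfidf([["contract", "law"], ["patient", "law"]])
    print(weights[0])  # "law" appears in every document, so its weight is 0
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1]))
```

A term that occurs in every document receives zero weight under this unsmoothed formula. This is one reason a purely statistical representation can miss semantic similarity, which embeddings such as Word2Vec are meant to capture. MCC and Cohen's Kappa both correct for chance agreement, so they are more informative than accuracy alone on imbalanced label distributions.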

Suggested Citation

  • Dinara Kaibassova & Bigul Mukhametzhanova & Dinara Tokseit & Aigul Kubegenova & Murad Kozhanov, 2025. "Document analysis via combined vectorization and machine learning approaches," International Journal of Innovative Research and Scientific Studies, Innovative Research Publishing, vol. 8(4), pages 2195-2204.
  • Handle: RePEc:aac:ijirss:v:8:y:2025:i:4:p:2195-2204:id:8356

    Download full text from publisher

    File URL: https://ijirss.com/index.php/ijirss/article/view/8356/1874
    Download Restriction: no

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.