IDEAS home Printed from https://ideas.repec.org/a/gam/jftint/v15y2023i9p314-d1242286.html

Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task

Authors
  • Artyom V. Gorchakov

    (Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia)

  • Liliya A. Demidova

    (Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia)

  • Peter N. Sovietov

    (Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA—Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia)

Abstract

In this paper, we consider the research and development of classifiers trained to predict the task solved by a piece of source code. Possible applications of such task detection algorithms include method name prediction, hardware–software partitioning, programming standard violation detection, and semantic code duplication search. We provide a comparative analysis of modern approaches to transforming source code into vector-based representations, which extend the variety of classification and clustering algorithms that can be used for intelligent source code analysis. These approaches include word2vec, code2vec, first-order and second-order Markov chains constructed from abstract syntax trees (ASTs), histograms of assembly language instruction opcodes, and histograms of AST node types. The vectors obtained with the aforementioned approaches are then used to train classification algorithms such as k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP). The obtained results show that program vectors based on first-order AST-based Markov chains, combined with an RF-based classifier, lead to the highest accuracy, precision, recall, and F1 score. Increasing the order of the Markov chains considerably increases the dimensionality of a vector without any improvement in classifier quality, so we assume that first-order Markov chains are the most suitable for real-world applications. Additionally, the experimental study shows that first-order AST-based Markov chains are the least sensitive to the choice of classification algorithm.
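To make the first-order AST-based representation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the states of the chain are AST node types and that a transition runs from a parent node to each of its direct children. That is one plausible reading of the abstract, which does not spell out the construction. The function names `ast_markov_chain` and `to_vector` are hypothetical, and Python's standard `ast` module is used only for illustration.

```python
import ast
from collections import Counter, defaultdict


def ast_markov_chain(source: str) -> dict:
    """Estimate first-order transition probabilities between AST node types.

    Assumption (not confirmed by the abstract): a transition goes from a
    parent node type to the type of each of its direct children.
    """
    tree = ast.parse(source)
    counts = defaultdict(Counter)
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[type(parent).__name__][type(child).__name__] += 1

    # Normalize each row so that outgoing probabilities sum to 1.
    chain = {}
    for src, dests in counts.items():
        total = sum(dests.values())
        for dst, n in dests.items():
            chain[(src, dst)] = n / total
    return chain


def to_vector(chain: dict, vocabulary: list) -> list:
    """Flatten the transition matrix into a fixed-length feature vector.

    With n node types the vector has n**2 entries; a second-order chain
    built the same way would need n**3, which illustrates why raising
    the order inflates dimensionality.
    """
    return [chain.get((a, b), 0.0) for a in vocabulary for b in vocabulary]


if __name__ == "__main__":
    chain = ast_markov_chain("def add(a, b):\n    return a + b\n")
    vocabulary = sorted({name for pair in chain for name in pair})
    print(len(to_vector(chain, vocabulary)))
```

Vectors produced this way share a fixed length once the vocabulary of node types is fixed, so they can be passed to any of the classifiers named above (KNN, SVM, RF, MLP).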

Suggested Citation

  • Artyom V. Gorchakov & Liliya A. Demidova & Peter N. Sovietov, 2023. "Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task," Future Internet, MDPI, vol. 15(9), pages 1-28, September.
  • Handle: RePEc:gam:jftint:v:15:y:2023:i:9:p:314-:d:1242286

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/15/9/314/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/15/9/314/
    Download Restriction: no



