IDEAS home Printed from https://ideas.repec.org/a/gam/jftint/v15y2023i11p363-d1277030.html
   My bibliography  Save this article

Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Author

Listed:
  • Panagiotis Skondras

    (Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece)

  • Panagiotis Zervas

    (Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece)

  • Giannis Tzimas

    (Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece)

Abstract

In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories marked by sparse samples. The widespread implementation of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is also crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To deal with this challenge, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed and then utilized to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. While both models were evaluated on the multiclass classification task of resumes, when trained on an augmented dataset containing 60 percent real data (from Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model presented exceptional accuracy. The FFNN, albeit predictably, achieved lower accuracy. These findings highlight the value of augmented real-world data with ChatGPT-generated synthetic resumes, especially in the context of limited training data. The suitability of the BERT model for such classification tasks further reinforces this narrative.

Suggested Citation

  • Panagiotis Skondras & Panagiotis Zervas & Giannis Tzimas, 2023. "Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification," Future Internet, MDPI, vol. 15(11), pages 1-12, November.
  • Handle: RePEc:gam:jftint:v:15:y:2023:i:11:p:363-:d:1277030
    as

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/15/11/363/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/15/11/363/
    Download Restriction: no
    ---><---

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jftint:v:15:y:2023:i:11:p:363-:d:1277030. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.