Printed from https://ideas.repec.org/a/plo/pone00/0302070.html

Using full-text content to characterize and identify best seller books: A study of early 20th-century literature

Authors

  • Giovana D da Silva
  • Filipi N Silva
  • Henrique F de Arruda
  • Bárbara C e Souza
  • Luciano da F Costa
  • Diego R Amancio

Abstract

Artistic works can be studied from several perspectives, one being their reception among readers over time. In the present work, we approach this topic from the standpoint of literary works, specifically assessing the task of predicting whether a book will become a best seller. Unlike previous approaches, we focused on the full content of books and considered both visualization and classification tasks. We employed visualization, based on SemAxis and linear discriminant analysis, for a preliminary exploration of the structure and properties of the data. To obtain quantitative and more objective results, we employed various classifiers. These approaches were applied to a dataset containing (i) books published from 1895 to 1923 and recognized as best sellers by the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period but not mentioned in those lists. Our comparison of methods revealed that the best result, obtained by combining a bag-of-words representation with a logistic regression classifier, was an average accuracy of 0.75 for both leave-one-out and 10-fold cross-validation. This outcome underscores the difficulty of predicting the success of books with high accuracy, even when using the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.
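
To make the best-performing setup named in the abstract more concrete, the following is a minimal sketch of a bag-of-words representation fed to a logistic regression classifier and scored with leave-one-out cross-validation. It is not the authors' pipeline: the tokenization, learning rate, regularization strength, and the six toy "books" are all assumptions chosen for illustration, and the classifier is a small gradient-descent implementation written with the Python standard library only.

```python
import math
from collections import Counter


def bag_of_words(text, vocab):
    """Relative term frequencies of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]


def train_logreg(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    """Return 1 if the linear score is positive, else 0."""
    w, b = model
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


def leave_one_out_accuracy(texts, labels):
    """Hold out each text once, train on the rest, and average the hits."""
    hits = 0
    for i in range(len(texts)):
        train_texts = texts[:i] + texts[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # The vocabulary is built from the training fold only, so the
        # held-out text does not leak into the representation.
        vocab = sorted({w for t in train_texts for w in t.lower().split()})
        X = [bag_of_words(t, vocab) for t in train_texts]
        model = train_logreg(X, train_labels)
        hits += int(predict(model, bag_of_words(texts[i], vocab)) == labels[i])
    return hits / len(texts)


# Hypothetical toy corpus: label 1 = "best seller", 0 = "not listed".
texts = [
    "love love heart",
    "heart love kiss",
    "kiss heart love love",
    "war gun battle",
    "battle war war",
    "gun battle war gun",
]
labels = [1, 1, 1, 0, 0, 0]

accuracy = leave_one_out_accuracy(texts, labels)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

The toy classes share no vocabulary, so they are trivially separable; real full-text books overlap heavily in vocabulary, which is consistent with the much lower accuracy of 0.75 reported above. A 10-fold variant would differ only in how the held-out indices are grouped.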

Suggested Citation

  • Giovana D da Silva & Filipi N Silva & Henrique F de Arruda & Bárbara C e Souza & Luciano da F Costa & Diego R Amancio, 2024. "Using full-text content to characterize and identify best seller books: A study of early 20th-century literature," PLOS ONE, Public Library of Science, vol. 19(4), pages 1-20, April.
  • Handle: RePEc:plo:pone00:0302070
    DOI: 10.1371/journal.pone.0302070

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0302070
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0302070&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Mayra Z Rodriguez & Cesar H Comin & Dalcimar Casanova & Odemir M Bruno & Diego R Amancio & Luciano da F Costa & Francisco A Rodrigues, 2019. "Clustering algorithms: A comparative approach," PLOS ONE, Public Library of Science, vol. 14(1), pages 1-34, January.
    2. Kyuhan Lee & Jinsoo Park & Iljoo Kim & Youngseok Choi, 2018. "Predicting movie success with machine learning techniques: ways to improve accuracy," Information Systems Frontiers, Springer, vol. 20(3), pages 577-588, June.
    3. Tohalino, Jorge A.V. & Amancio, Diego R., 2022. "On predicting research grants productivity via machine learning," Journal of Informetrics, Elsevier, vol. 16(2).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hren, Darko & Pina, David G. & Norman, Christopher R. & Marušić, Ana, 2022. "What makes or breaks competitive research proposals? A mixed-methods analysis of research grant evaluation reports," Journal of Informetrics, Elsevier, vol. 16(2).
    2. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    3. Corrêa, Edilson A. & Marinho, Vanessa Q. & Amancio, Diego R., 2020. "Semantic flow in language networks discriminates texts by genre and publication date," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 557(C).
    4. Ioannis Mikrou & Nickolas S. Sapidis, 2024. "Enhancing operational research in mechatronic systems via modularization: comparative analysis of four clustering algorithms using validation indices," Operational Research, Springer, vol. 24(4), pages 1-44, December.
    5. Polina Bombina & Dwayne Tally & Zachary B Abrams & Kevin R Coombes, 2024. "SillyPutty: Improved clustering by optimizing the silhouette width," PLOS ONE, Public Library of Science, vol. 19(6), pages 1-17, June.
    6. Ebba Mark & Ryan Rafaty & Moritz Schwarz, 2022. "Spatial-temporal dynamics of employment shocks in declining coal mining regions and potentialities of the 'just transition'," Papers 2211.12619, arXiv.org.
    7. Simon Crase & Suresh N Thennadil, 2022. "An analysis framework for clustering algorithm selection with applications to spectroscopy," PLOS ONE, Public Library of Science, vol. 17(3), pages 1-24, March.
    8. K. S. Sablin & E. S. Kagan & E. S. Chernova, 2020. "Clustering of the Russian coal mining regions: Investment and innovation activity," Journal of New Economy, Ural State University of Economics, vol. 21(1), pages 89-106, March.
    9. Chong, Woon Kian & Chang, Chiachi, 2024. "Information exploitation of human resource data with persistent homology," Journal of Business Research, Elsevier, vol. 172(C).
    10. Narjes Vara & Mahdieh Mirzabeigi & Hajar Sotudeh & Seyed Mostafa Fakhrahmad, 2022. "Application of k-means clustering algorithm to improve effectiveness of the results recommended by journal recommender system," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(6), pages 3237-3252, June.
    11. Singh Vikash & Dahiya Surbhi & Abraham Albert & Tausif Ahmad, 2024. "Emergence of Technology Driven Promotional Strategies for Commercialised Indian Cinema," Economics and Applied Informatics, "Dunarea de Jos" University of Galati, Faculty of Economics and Business Administration, issue 3, pages 204-217.
    12. Jianhua Hou & Bili Zheng & Hao Li & Wenjing Li, 2025. "Evolution and impact of the science of science: from theoretical analysis to digital-AI driven research," Palgrave Communications, Palgrave Macmillan, vol. 12(1), pages 1-9, December.
    13. Abderrazek Azri & Cécile Favre & Nouria Harbi & Jérôme Darmont & Camille Noûs, 2023. "Rumor Classification through a Multimodal Fusion Framework and Ensemble Learning," Information Systems Frontiers, Springer, vol. 25(5), pages 1795-1810, October.
    14. Jong-Min Kim & Leixin Xia & Iksuk Kim & Seungjoo Lee & Keon-Hyung Lee, 2020. "Finding Nemo: Predicting Movie Performances by Machine Learning Methods," JRFM, MDPI, vol. 13(5), pages 1-12, May.
    15. Théophile Carniel & José Halloy & Jean-Michel Dalle, 2023. "A novel clustering approach to bipartite investor-startup networks," PLOS ONE, Public Library of Science, vol. 18(1), pages 1-20, January.
    16. Alfred Kume & Stephen G Walker, 2021. "The utility of clusters and a Hungarian clustering algorithm," PLOS ONE, Public Library of Science, vol. 16(8), pages 1-23, August.
    17. Joshua Eklund & Jong-Min Kim, 2022. "Examining Factors That Affect Movie Gross Using Gaussian Copula Marginal Regression," Forecasting, MDPI, vol. 4(3), pages 1-14, July.
    18. Mark, Ebba & Rafaty, Ryan & Schwarz, Moritz, 2024. "Spatial–temporal dynamics of structural unemployment in declining coal mining regions and potentialities of the ‘just transition’," Energy Policy, Elsevier, vol. 195(C).
    19. Quispe, Laura V.C. & Tohalino, Jorge A.V. & Amancio, Diego R., 2021. "Using virtual edges to improve the discriminability of co-occurrence text networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 562(C).
    20. Sultan Mahmud & Ferdausi Mahojabin Sumana & Md Mohsin & Md. Hasinur Rahaman Khan, 2022. "Redefining homogeneous climate regions in Bangladesh using multivariate clustering approaches," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 111(2), pages 1863-1884, March.
