unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata

Author

Listed:
  • Tarek Saier

    (Karlsruhe Institute of Technology (KIT))

  • Michael Färber

    (Karlsruhe Institute of Technology (KIT))

Abstract

In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However, existing scholarly data sets are limited in several regards. In this paper, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. Apart from providing the papers’ plain text, in-text citations were annotated via global identifiers. Furthermore, citing and cited publications were linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, not only can enhance the future evaluation of research paper-based and citation context-based approaches, but also serve as a basis for new ways to analyze in-text citations, as we show prototypically in this article.

Suggested Citation

  • Tarek Saier & Michael Färber, 2020. "unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 3085-3108, December.
  • Handle: RePEc:spr:scient:v:125:y:2020:i:3:d:10.1007_s11192-020-03382-z
    DOI: 10.1007/s11192-020-03382-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-020-03382-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    References listed on IDEAS

    1. Reingewertz, Yaniv & Lutmar, Carmela, 2018. "Academic in-group bias: An empirical examination of the link between author and journal affiliation," Journal of Informetrics, Elsevier, vol. 12(1), pages 74-86.
    2. Fang Liu & Guangyuan Hu & Li Tang & Weishu Liu, 2018. "The penalty of containing more non-English articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(1), pages 359-366, January.
    3. Liming Liang & Ronald Rousseau & Zhen Zhong, 2013. "Non-English journals and papers in physics and chemistry: bias in citations?," Scientometrics, Springer;Akadémiai Kiadó, vol. 95(1), pages 333-350, April.
    4. Zara Nasar & Syed Waqar Jaffry & Muhammad Kamran Malik, 2018. "Information extraction from scientific articles: a survey," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(3), pages 1931-1990, December.
    5. Aaron Elkiss & Siwei Shen & Anthony Fader & Güneş Erkan & David States & Dragomir Radev, 2008. "Blind men and elephants: What do citation summaries tell us about a research article?," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 59(1), pages 51-62, January.

    Citations

    Cited by:

    1. Moreno La Quatra & Luca Cagliero & Elena Baralis, 2021. "Leveraging full-text article exploration for citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8275-8293, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Chang, Yu-Wei, 2022. "Capability of non-English-speaking countries for securing a foothold in international journal publishing," Journal of Informetrics, Elsevier, vol. 16(3).
    2. Maria Cláudia Cabrini Grácio & Ely Francina Tannuri Oliveira & Zaida Chinchilla-Rodríguez & Henk F. Moed, 2020. "Does corresponding authorship influence scientific impact in collaboration: Brazilian institutions as a case of study," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1349-1369, November.
    3. Weishu Liu & Li Tang & Guangyuan Hu, 2020. "Funding information in Web of Science: an updated overview," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(3), pages 1509-1524, March.
    4. Zhenglu Yu & Zheng Ma & Haiyan Wang & Jia Jia & Lu Wang, 2020. "Communication value of English-language S&T academic journals in non-native English language countries," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1389-1402, November.
    5. Liliana Pedraja-Rejas & Emilio Rodríguez-Ponce & Camila Muñoz-Fritis & David Laroze, 2023. "Sustainable Development Goals and Education: A Bibliometric Review—The Case of Latin America," Sustainability, MDPI, vol. 15(12), pages 1-19, June.
    6. Lin Zhang & Yuanyuan Shang & Ying Huang & Gunnar Sivertsen, 2022. "Gender differences among active reviewers: an investigation based on publons," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(1), pages 145-179, January.
    7. Rey-Long Liu, 2017. "A new bibliographic coupling measure with descriptive capability," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(2), pages 915-935, February.
    8. Dangzhi Zhao & Andreas Strotmann, 2020. "Telescopic and panoramic views of library and information science research 2011–2018: a comparison of four weighting schemes for author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(1), pages 255-270, July.
    9. Shannon Mason & Yusuke Sakurai, 2021. "A ResearchGate-way to an international academic community?," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1149-1171, February.
    10. Masaki Eto, 2013. "Evaluations of context-based co-citation searching," Scientometrics, Springer;Akadémiai Kiadó, vol. 94(2), pages 651-673, February.
    11. Kim, Ha Jin & Jeong, Yoo Kyung & Song, Min, 2016. "Content- and proximity-based author co-citation analysis using citation sentences," Journal of Informetrics, Elsevier, vol. 10(4), pages 954-966.
    12. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
    13. Zhang, Lin & Shang, Yuanyuan & HUANG, Ying & Sivertsen, Gunnar, 2021. "Gender differences among active reviewers: an investigation based on Publons," SocArXiv 4z6w8, Center for Open Science.
    14. Annarelli, Alessandro & Battistella, Cinzia & Nonino, Fabio & Parida, Vinit & Pessot, Elena, 2021. "Literature review on digitalization capabilities: Co-citation analysis of antecedents, conceptualization and consequences," Technological Forecasting and Social Change, Elsevier, vol. 166(C).
    15. Lipeng Fan & Yuefen Wang & Shengchun Ding & Binbin Qi, 2020. "Productivity trends and citation impact of different institutional collaboration patterns at the research units’ level," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1179-1196, November.
    16. Shengbo Liu & Chaomei Chen, 2012. "The proximity of co-citation," Scientometrics, Springer;Akadémiai Kiadó, vol. 91(2), pages 495-511, May.
    17. Antonio Cavacini, 2015. "What is the best database for computer science journal articles?," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2059-2071, March.
    18. Keshra Sangwal, 2013. "Some citation-related characteristics of scientific journals published in individual countries," Scientometrics, Springer;Akadémiai Kiadó, vol. 97(3), pages 719-741, December.
    19. Shutian Ma & Jin Xu & Chengzhi Zhang, 2018. "Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1303-1330, August.
    20. Fang Liu & Guangyuan Hu & Li Tang & Weishu Liu, 2018. "The penalty of containing more non-English articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(1), pages 359-366, January.
