
unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata

Authors

  • Tarek Saier (Karlsruhe Institute of Technology (KIT))
  • Michael Färber (Karlsruhe Institute of Technology (KIT))

Abstract

In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However, existing scholarly data sets are limited in several regards. In this paper, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. Apart from providing the papers’ plain text, in-text citations were annotated via global identifiers. Furthermore, citing and cited publications were linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, not only can enhance the future evaluation of research paper-based and citation context-based approaches, but also serve as a basis for new ways to analyze in-text citations, as we show prototypically in this article.

Suggested Citation

  • Tarek Saier & Michael Färber, 2020. "unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 3085-3108, December.
  • Handle: RePEc:spr:scient:v:125:y:2020:i:3:d:10.1007_s11192-020-03382-z
    DOI: 10.1007/s11192-020-03382-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-020-03382-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Reingewertz, Yaniv & Lutmar, Carmela, 2018. "Academic in-group bias: An empirical examination of the link between author and journal affiliation," Journal of Informetrics, Elsevier, vol. 12(1), pages 74-86.
    2. Fang Liu & Guangyuan Hu & Li Tang & Weishu Liu, 2018. "The penalty of containing more non-English articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(1), pages 359-366, January.
    3. Liming Liang & Ronald Rousseau & Zhen Zhong, 2013. "Non-English journals and papers in physics and chemistry: bias in citations?," Scientometrics, Springer;Akadémiai Kiadó, vol. 95(1), pages 333-350, April.
    4. Zara Nasar & Syed Waqar Jaffry & Muhammad Kamran Malik, 2018. "Information extraction from scientific articles: a survey," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(3), pages 1931-1990, December.
    5. Aaron Elkiss & Siwei Shen & Anthony Fader & Güneş Erkan & David States & Dragomir Radev, 2008. "Blind men and elephants: What do citation summaries tell us about a research article?," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 59(1), pages 51-62, January.

    Citations



    Cited by:

    1. Moreno La Quatra & Luca Cagliero & Elena Baralis, 2021. "Leveraging full-text article exploration for citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8275-8293, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Chang, Yu-Wei, 2022. "Capability of non-English-speaking countries for securing a foothold in international journal publishing," Journal of Informetrics, Elsevier, vol. 16(3).
    2. Liliana Pedraja-Rejas & Emilio Rodríguez-Ponce & Camila Muñoz-Fritis & David Laroze, 2023. "Sustainable Development Goals and Education: A Bibliometric Review—The Case of Latin America," Sustainability, MDPI, vol. 15(12), pages 1-19, June.
    3. Maria Cláudia Cabrini Grácio & Ely Francina Tannuri Oliveira & Zaida Chinchilla-Rodríguez & Henk F. Moed, 2020. "Does corresponding authorship influence scientific impact in collaboration: Brazilian institutions as a case of study," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1349-1369, November.
    4. Weishu Liu & Li Tang & Guangyuan Hu, 2020. "Funding information in Web of Science: an updated overview," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(3), pages 1509-1524, March.
    5. Zhenglu Yu & Zheng Ma & Haiyan Wang & Jia Jia & Lu Wang, 2020. "Communication value of English-language S&T academic journals in non-native English language countries," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1389-1402, November.
    6. Rey-Long Liu, 2017. "A new bibliographic coupling measure with descriptive capability," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(2), pages 915-935, February.
    7. Masaki Eto, 2013. "Evaluations of context-based co-citation searching," Scientometrics, Springer;Akadémiai Kiadó, vol. 94(2), pages 651-673, February.
    8. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
    9. Annarelli, Alessandro & Battistella, Cinzia & Nonino, Fabio & Parida, Vinit & Pessot, Elena, 2021. "Literature review on digitalization capabilities: Co-citation analysis of antecedents, conceptualization and consequences," Technological Forecasting and Social Change, Elsevier, vol. 166(C).
    10. Lipeng Fan & Yuefen Wang & Shengchun Ding & Binbin Qi, 2020. "Productivity trends and citation impact of different institutional collaboration patterns at the research units’ level," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1179-1196, November.
    11. Antonio Cavacini, 2015. "What is the best database for computer science journal articles?," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2059-2071, March.
    12. Fang Liu & Guangyuan Hu & Li Tang & Weishu Liu, 2018. "The penalty of containing more non-English articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(1), pages 359-366, January.
    13. Pancheng Wang & Shasha Li & Haifang Zhou & Jintao Tang & Ting Wang, 2019. "Cited text spans identification with an improved balanced ensemble model," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(3), pages 1111-1145, September.
    14. Rey-Long Liu, 2015. "Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles," PLOS ONE, Public Library of Science, vol. 10(10), pages 1-22, October.
    15. Radek Zdeněk & Jana Lososová, 2018. "An analysis of editorial board members’ publication output in agricultural economics and policy journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 563-578, October.
    16. Carmela Lutmar & Yaniv Reingewertz, 2021. "Academic in-group bias in the top five economics journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9543-9556, December.
    17. Lutmar, Carmela & Reingewertz, Yaniv, 2020. "Academic in-group bias in economics," MPRA Paper 104730, University Library of Munich, Germany.
    18. Jaime A. Teixeira da Silva, 2021. "The Matthew effect impacts science and academic publishing by preferentially amplifying citations, metrics and status," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(6), pages 5373-5377, June.
    19. Yangping Zhou, 2021. "Self-citation and citation of top journal publishers and their interpretation in the journal-discipline context," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6013-6040, July.
    20. Wang, Shiyun & Mao, Jin & Lu, Kun & Cao, Yujie & Li, Gang, 2021. "Understanding interdisciplinary knowledge integration through citance analysis: A case study on eHealth," Journal of Informetrics, Elsevier, vol. 15(4).
