
GitHub API based QuantNet Mining infrastructure in R

Author

Listed:
  • Borke, Lukas
  • Härdle, Wolfgang Karl

Abstract

QuantNet, an online GitHub-based organization, is an integrated environment consisting of different types of statistics-related documents and program codes called Quantlets. The QuantNet Style Guide and the yamldebugger package allow a standardized audit and validation of the YAML-annotated software repositories within this organization. The behavior of QuantNet users is measured with web metrics from Google Analytics. We show how the search queries obtained from Google's metrics can be used in test collections to calibrate and evaluate the information retrieval (IR) performance of QuantNet's search engine, QuantNetXploRer. For that purpose, different text mining (TM) models are examined by means of the new TManalyzer package. Further, we introduce the Validation Pipeline (Vali-PP) and apply it to the YAML data. Vali-PP is a functional multi-stage instrument for clustering analysis, providing multivariate statistical analysis of the co-occurrence distribution of the driving factors of the pipeline. We also briefly present the new package rgithubS, which enables a GitHub-wide search for code and repositories using the GitHub Search API and is an essential element of the QuantNet Mining infrastructure. The TManalyzer results show that, for all considered single-term queries, the number of true positives is maximal in a latent semantic analysis model configuration (LSA50). The Vali-PP analysis indicates that the optimality of the combination of LSA50 and hierarchical clustering (HC) applies to 70 to 90% of the cluster sizes for most of the considered quality indices. Further, we can infer that more accurate and comprehensive metadata increase the clustering quality. Subsequently, the findings of our experimental design are implemented in the QuantNetXploRer. The GitHub API driven QuantNetXploRer can be found and mined at http://www.quantlet.de
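
As a rough illustration of the kind of request a GitHub-wide search for code and repositories involves, the sketch below builds query URLs for the GitHub Search API endpoints `/search/code` and `/search/repositories`. It is written in Python rather than R and is not the rgithubS implementation: the function name `build_search_url`, its parameters, and the organization name `QuantLet` are assumptions made for this example. Sending the request, and any authentication GitHub requires, is left out.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def build_search_url(kind, terms, qualifiers=None, per_page=30):
    """Build a GitHub Search API URL (hypothetical helper, not part of rgithubS).

    kind       -- "code" or "repositories", the two search types named in the abstract
    terms      -- free-text search terms
    qualifiers -- optional dict of search qualifiers, e.g. {"org": "QuantLet"}
    per_page   -- number of results requested per page
    """
    if kind not in ("code", "repositories"):
        raise ValueError("kind must be 'code' or 'repositories'")
    # Qualifiers are appended to the free-text terms as key:value tokens.
    parts = [terms] + [f"{key}:{value}" for key, value in (qualifiers or {}).items()]
    query = " ".join(part for part in parts if part)
    # urlencode percent-encodes the query (spaces become "+", ":" becomes "%3A").
    return f"{GITHUB_API}/search/{kind}?" + urlencode({"q": query, "per_page": per_page})


if __name__ == "__main__":
    # The organization name "QuantLet" is an assumption for this example.
    print(build_search_url("repositories", "clustering", {"org": "QuantLet"}))
    print(build_search_url("code", "lsa", {"org": "QuantLet", "extension": "R"}, per_page=10))
```

Keeping URL construction separate from the HTTP call makes the query logic easy to inspect and test without network access.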

Suggested Citation

  • Borke, Lukas & Härdle, Wolfgang Karl, 2017. "GitHub API based QuantNet Mining infrastructure in R," SFB 649 Discussion Papers 2017-008, Humboldt University Berlin, Collaborative Research Center 649: Economic Risk.
  • Handle: RePEc:zbw:sfb649:sfb649dp2017-008

    Download full text from publisher

    File URL: https://www.econstor.eu/bitstream/10419/162509/1/882061216.pdf
    Download Restriction: no



