
GitHub API based QuantNet Mining infrastructure in R

Author

Listed:
  • Wolfgang Karl Härdle
  • Lukas Borke

Abstract

QuantNet, an online GitHub-based organization, is an integrated environment consisting of different types of statistics-related documents and program codes called Quantlets. The QuantNet Style Guide and the yamldebugger package allow a standardized audit and validation of YAML-annotated software repositories within this organization. The behavior statistics of QuantNet users are measured with Web Metrics from Google Analytics. We show how the search queries obtained from Google’s metrics can be used in test collections to calibrate and evaluate the information retrieval (IR) performance of QuantNet’s search engine, QuantNetXploRer. For that purpose, different text mining (TM) models are examined by means of the new TManalyzer package. Further, we introduce the Validation Pipeline (Vali-PP) and apply it to the YAML data. Vali-PP is a functional multi-staged instrument for clustering analysis, providing multivariate statistical analysis of the co-occurrence distribution of driving factors of the pipeline. The new package rgithubS, which enables a GitHub-wide search for code and repositories using the GitHub Search API and is an essential element of the QuantNet Mining infrastructure, is briefly presented. The TManalyzer results show that, for all considered single-term queries, the number of true positives is maximal in a latent semantic analysis model configuration (LSA50). The Vali-PP analysis indicates that the optimality of the combination of LSA50 and hierarchical clustering (HC) applies to 70-90% of the cluster sizes for most of the considered quality indices. Further, we can infer that more accurate and comprehensive metadata increases the clustering quality. Subsequently, the findings of our experimental design are implemented in the QuantNetXploRer. The GitHub API driven QuantNetXploRer can be found and mined at http://www.quantlet.de
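
As a rough illustration of the kind of request a GitHub-wide search for code and repositories involves, the sketch below builds query URLs for the public GitHub Search API endpoints `/search/code` and `/search/repositories`. It is written in Python for brevity and is not the rgithubS package or its interface. The helper name `build_search_url`, the organization name `QuantLet`, and the file name `Metainfo.txt` are assumptions chosen for the example; actual requests may require authentication and are rate-limited.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def build_search_url(kind, terms, qualifiers=None, per_page=30):
    """Build a GitHub Search API URL (hypothetical helper, not part of rgithubS).

    kind       -- "code" or "repositories" (the two search targets named in the abstract)
    terms      -- free-text search terms
    qualifiers -- optional dict of search qualifiers, e.g. {"org": "..."}
    per_page   -- number of results per page requested from the API
    """
    if kind not in ("code", "repositories"):
        raise ValueError("kind must be 'code' or 'repositories'")
    # Qualifiers are appended to the search terms as key:value tokens.
    parts = [terms] + [f"{key}:{value}" for key, value in (qualifiers or {}).items()]
    # urlencode percent-encodes ':' and turns spaces into '+'.
    query = urlencode({"q": " ".join(parts), "per_page": per_page})
    return f"{GITHUB_API}/search/{kind}?{query}"


# Assumed example: metadata files mentioning "clustering" in one organization.
url = build_search_url(
    "code", "clustering", {"org": "QuantLet", "filename": "Metainfo.txt"}
)
print(url)
```

The returned URL can be passed to any HTTP client; the JSON response lists the matching items, which a mining pipeline would then parse for its metadata fields.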

Suggested Citation

  • Wolfgang Karl Härdle & Lukas Borke, 2017. "GitHub API based QuantNet Mining infrastructure in R," SFB 649 Discussion Papers SFB649DP2017-008, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.
  • Handle: RePEc:hum:wpaper:sfb649dp2017-008

    Download full text from publisher

    File URL: http://sfb649.wiwi.hu-berlin.de/papers/pdf/SFB649DP2017-008.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Brock, Guy & Pihur, Vasyl & Datta, Susmita & Datta, Somnath, 2008. "clValid: An R Package for Cluster Validation," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i04).
    2. Charrad, Malika & Ghazzali, Nadia & Boiteau, Véronique & Niknafs, Azam, 2014. "NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i06).

    Citations

    Cited by:

    1. Zharova, Alona & Härdle, Wolfgang Karl & Lessmann, Stefan, 2023. "Data-driven support for policy and decision-making in university research management: A case study from Germany," European Journal of Operational Research, Elsevier, vol. 308(1), pages 353-368.
    2. Petra Burdejová & Wolfgang K. Härdle, 2019. "Dynamic semi-parametric factor model for functional expectiles," Computational Statistics, Springer, vol. 34(2), pages 489-502, June.
    3. Marius Lux & Wolfgang Karl Härdle & Stefan Lessmann, 2020. "Data driven value-at-risk forecasting using a SVR-GARCH-KDE hybrid," Computational Statistics, Springer, vol. 35(3), pages 947-981, September.
    4. Alona Zharova & Wolfgang K. Härdle & Stefan Lessmann, 2017. "Is Scientific Performance a Function of Funds?," SFB 649 Discussion Papers SFB649DP2017-028, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.
    5. Adamyan, Larisa & Efimov, Kirill & Spokoiny, Vladimir, 2019. "Adaptive Nonparametric Community Detection," IRTG 1792 Discussion Papers 2019-006, Humboldt University of Berlin, International Research Training Group 1792 "High Dimensional Nonstationary Time Series".
    6. Lining Yu & Wolfgang Karl Hardle & Lukas Borke & Thijs Benschop, 2020. "An AI approach to measuring financial risk," Papers 2009.13222, arXiv.org.
    7. Yingxing Li & Chen Huang & Wolfgang Karl Härdle, 2017. "Spatial Functional Principal Component Analysis with Applications to Brain Image Data," SFB 649 Discussion Papers SFB649DP2017-024, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marta Rocchi & Guglielmo Pescatore, 2022. "Modeling narrative features in TV series: coding and clustering analysis," Palgrave Communications, Palgrave Macmillan, vol. 9(1), pages 1-11, December.
    2. Patrick Zschech & Kai Heinrich & Raphael Bink & Janis S. Neufeld, 2019. "Prognostic Model Development with Missing Labels," Business & Information Systems Engineering: The International Journal of WIRTSCHAFTSINFORMATIK, Springer;Gesellschaft für Informatik e.V. (GI), vol. 61(3), pages 327-343, June.
    3. Bolívar, Fernando & Duran, Miguel A. & Lozano-Vivas, Ana, 2023. "Bank business models, size, and profitability," Finance Research Letters, Elsevier, vol. 53(C).
    4. Reder, Maik & Yürüşen, Nurseda Y. & Melero, Julio J., 2018. "Data-driven learning framework for associating weather conditions and wind turbine failures," Reliability Engineering and System Safety, Elsevier, vol. 169(C), pages 554-569.
    5. Marcin Gąsior, 2021. "Environmental Attitudes and Willingness to Purchase Online—Classification Approach," Sustainability, MDPI, vol. 13(15), pages 1-17, August.
    6. Gainbi Park & Zengwang Xu, 2022. "The constituent components and local indicator variables of social vulnerability index," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 110(1), pages 95-120, January.
    7. Roopam Shukla & Ankit Agarwal & Kamna Sachdeva & Juergen Kurths & P. K. Joshi, 2019. "Climate change perception: an analysis of climate change and risk perceptions among farmer types of Indian Western Himalayas," Climatic Change, Springer, vol. 152(1), pages 103-119, January.
    8. Saemi Shin & Won Suck Yoon & Sang-Hoon Byeon, 2022. "Trends in Occupational Infectious Diseases in South Korea and Classification of Industries According to the Risk of Biological Hazards Using K-Means Clustering," IJERPH, MDPI, vol. 19(19), pages 1-19, September.
    9. Song He & Xinyu Song & Xiaoxi Yang & Jijun Yu & Yuqi Wen & Lianlian Wu & Bowei Yan & Jiannan Feng & Xiaochen Bo, 2021. "COMSUC: A web server for the identification of consensus molecular subtypes of cancer based on multiple methods and multi-omics data," PLOS Computational Biology, Public Library of Science, vol. 17(3), pages 1-10, March.
    10. Ana Alina Tudoran, 2022. "A machine learning approach to identifying decision-making styles for managing customer relationships," Electronic Markets, Springer;IIM University of St. Gallen, vol. 32(1), pages 351-374, March.
    11. Wu, Han-Ming, 2011. "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1969-1979, May.
    12. Jihane El Ouadi & Hanae Errousso & Nicolas Malhene & Siham Benhadou & Hicham Medromi, 2022. "A machine-learning based hybrid algorithm for strategic location of urban bundling hubs to support shared public transport," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(5), pages 3215-3258, October.
    13. Cyril Atkinson-Clement & Eléonore Pigalle, 2021. "What can we learn from Covid-19 pandemic’s impact on human behaviour? The case of France’s lockdown," Palgrave Communications, Palgrave Macmillan, vol. 8(1), pages 1-12, December.
    14. Kreitmair, Ursula & Bower-Bir, Jacob, 2021. "Too different to solve climate change? Experimental evidence on the effects of production and benefit heterogeneity on collective action," Ecological Economics, Elsevier, vol. 184(C).
    15. Getaneh Addis Tessema & Jan van der Borg & Anton Van Rompaey & Steven Van Passel & Enyew Adgo & Amare Sewnet Minale & Kerebih Asrese & Amaury Frankl & Jean Poesen, 2022. "Benefit Segmentation of Tourists to Geosites and Its Implications for Sustainable Development of Geotourism in the Southern Lake Tana Region, Ethiopia," Sustainability, MDPI, vol. 14(6), pages 1-25, March.
    16. Drago, Carlo & Fortuna, Fabio, 2023. "Investigating the Corporate Governance and Sustainability Relationship: A Bibliometric Analysis Using Keyword-Ensemble Community Detection," FEEM Working Papers 336985, Fondazione Eni Enrico Mattei (FEEM).
    17. Young Hyun Kim & Kug Jin Jeon & Chena Lee & Yoon Joo Choi & Hoi-In Jung & Sang-Sun Han, 2021. "Analysis of the mandibular canal course using unsupervised machine learning algorithm," PLOS ONE, Public Library of Science, vol. 16(11), pages 1-13, November.
    18. Titov Sergei & Trachuk Arkady & Linder Natalya & RD Pathak & Danny Samson & Zafar Husain & S Sushil, 2023. "Digital transformation enablers in high-tech and low-tech companies: A comparative analysis," Australian Journal of Management, Australian School of Business, vol. 48(4), pages 801-843, November.
    19. Turati, Pietro & Pedroni, Nicola & Zio, Enrico, 2017. "Simulation-based exploration of high-dimensional system models for identifying unexpected events," Reliability Engineering and System Safety, Elsevier, vol. 165(C), pages 317-330.
    20. Ben Beck & Meghan Winters & Trisalyn Nelson & Chris Pettit & Simone Z Leao & Meead Saberi & Jason Thompson & Sachith Seneviratne & Kerry Nice & Mark Stevenson, 2023. "Developing urban biking typologies: Quantifying the complex interactions of bicycle ridership, bicycle network and built environment characteristics," Environment and Planning B, vol. 50(1), pages 7-23, January.

    More about this item

    Keywords

    Code Search; Software Repositories; Text Mining; Information Retrieval; Smart Data; YAML; GitHub Search API; Google Analytics; Web Metrics; LSA; GVSM; Cluster Validation; Quality Indices; Validation Pipeline;

    JEL classification:

    • C88 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other Computer Software
    • C89 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other

