
Robustness Analysis of a Website Categorization Procedure based on Machine Learning

Author

Listed:
  • Renato Bruni

    (Department of Computer, Control and Management Engineering Antonio Ruberti (DIAG), University of Rome La Sapienza, Rome, Italy)

  • Gianpiero Bianchi

    (Direzione centrale per la metodologia e disegno dei processi statistici (DCME),Italian National Institute of Statistics Istat, Rome, Italy)

Abstract

Website categorization has recently emerged as an important task in several contexts. A huge amount of information is freely available through websites, and it could be used to carry out statistical surveys, saving the cost of conducting them, or to validate data that have already been surveyed. However, the information of interest for a specific categorization has to be mined from that huge amount, which turns out to be a difficult task in practice. This work describes techniques that can be used to convert website categorization into a supervised classification problem. To do so, each data record should summarize the content of an entire website. We generate such records by using web scraping and optical character recognition, followed by a number of automated feature engineering steps. Once these records have been produced, we apply state-of-the-art classification techniques to them to categorize the websites according to the aspect of interest. We use Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical applications the labels available for the training set may be noisy, we analyze the robustness of our procedure with respect to the presence of misclassified training records. We present results on real-world data for the problem of detecting websites that provide e-commerce facilities.
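The robustness analysis outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic bag-of-words records in place of scraped website content, and a small pure-Python logistic classifier standing in for the Support Vector Machines, Random Forest and Logistic classifiers named above. The vocabulary, the function names and the noise rates are all hypothetical. The sketch flips a fraction of the training labels, retrains, and measures accuracy on a clean test set.

```python
import math
import random

# Hypothetical toy vocabulary; real records would come from scraping and OCR.
SHOP_WORDS = ["cart", "checkout", "payment", "shipping"]  # e-commerce cues
INFO_WORDS = ["about", "news", "history", "press"]        # non-e-commerce cues
SHARED_WORDS = ["contact", "home"]                        # occur in both classes
VOCAB = SHOP_WORDS + INFO_WORDS + SHARED_WORDS


def make_record(label, rng, length=6):
    """Build one synthetic record: word counts over VOCAB for a fake website."""
    pool = (SHOP_WORDS if label == 1 else INFO_WORDS) + SHARED_WORDS
    words = [rng.choice(pool) for _ in range(length)]
    return [float(words.count(w)) for w in VOCAB]


def make_dataset(n, rng):
    """Return n records with alternating labels (1 = e-commerce, 0 = other)."""
    labels = [i % 2 for i in range(n)]
    return [make_record(y, rng) for y in labels], labels


def flip_labels(labels, noise_rate, rng):
    """Copy binary labels and flip round(noise_rate * n) randomly chosen ones."""
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def train_logistic(X, y, epochs=100, lr=0.1):
    """Fit a logistic classifier with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def accuracy(model, X, y):
    """Fraction of records whose predicted class matches the given label."""
    w, b = model
    hits = 0
    for xi, yi in zip(X, y):
        pred = 1 if b + sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0
        hits += pred == yi
    return hits / len(y)


def run_experiment(noise_rates=(0.0, 0.1, 0.2, 0.3), seed=0):
    """Train on increasingly noisy labels; always test against clean labels."""
    rng = random.Random(seed)
    X_train, y_train = make_dataset(200, rng)
    X_test, y_test = make_dataset(100, rng)
    results = {}
    for rate in noise_rates:
        noisy = flip_labels(y_train, rate, rng)
        results[rate] = accuracy(train_logistic(X_train, noisy), X_test, y_test)
    return results


if __name__ == "__main__":
    for rate, acc in run_experiment().items():
        print(f"noise={rate:.0%}  test accuracy={acc:.2f}")
```

Keeping the test labels clean while perturbing only the training labels isolates the effect of misclassified training records, which is the quantity the robustness analysis is concerned with.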

Suggested Citation

  • Renato Bruni & Gianpiero Bianchi, 2018. "Robustness Analysis of a Website Categorization Procedure based on Machine Learning," DIAG Technical Reports 2018-04, Department of Computer, Control and Management Engineering, Universita' degli Studi di Roma "La Sapienza".
  • Handle: RePEc:aeg:report:2018-04

    Download full text from publisher

    File URL: http://wwwold.dis.uniroma1.it/~bibdis/RePEc/aeg/report/2018-04.pdf
    File Function: First version, 2018
    Download Restriction: no

    References listed on IDEAS

    1. Wei-Yin Loh, 2014. "Fifty Years of Classification and Regression Trees," International Statistical Review, International Statistical Institute, vol. 82(3), pages 329-348, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Farkas, Sébastien & Lopez, Olivier & Thomas, Maud, 2021. "Cyber claim analysis using Generalized Pareto regression trees with applications to insurance," Insurance: Mathematics and Economics, Elsevier, vol. 98(C), pages 92-105.
    2. Emilio Aguirre & Federico García-Suárez & Gabriela Sicilia, 2021. "Eficiencia técnica en la ganadería de carne bovina pastoril. Medición y exploración de sus determinantes en Uruguay," Documentos de Trabajo (working papers) 1321, Department of Economics - dECON.
    3. Lotfi Boudabsa & Damir Filipović, 2022. "Ensemble learning for portfolio valuation and risk management," Papers 2204.05926, arXiv.org.
    4. Yan, Ran & Wang, Shuaian & Du, Yuquan, 2020. "Development of a two-stage ship fuel consumption prediction and reduction model for a dry bulk ship," Transportation Research Part E: Logistics and Transportation Review, Elsevier, vol. 138(C).
    5. A. Poterie & J.-F. Dupuy & V. Monbet & L. Rouvière, 2019. "Classification tree algorithm for grouped variables," Computational Statistics, Springer, vol. 34(4), pages 1613-1648, December.
    6. Miguel A. Vallejo & Laura Vallejo-Slocker & Martin Offenbaecher & Jameson K. Hirsch & Loren L. Toussaint & Niko Kohls & Fuschia Sirois & Javier Rivera, 2021. "Psychological Flexibility Is Key for Reducing the Severity and Impact of Fibromyalgia," IJERPH, MDPI, vol. 18(14), pages 1-11, July.
    7. Eduardo Rodríguez Sánchez & Eduardo Filemón Vázquez Santacruz & Humberto Cervantes Maceda, 2023. "Effort and Cost Estimation Using Decision Tree Techniques and Story Points in Agile Software Development," Mathematics, MDPI, vol. 11(6), pages 1-31, March.
    8. Suryo Adi Rakhmawan & M. Hafidz Omar & Muhammad Riaz & Nasir Abbas, 2023. "Hotelling T² Control Chart for Detecting Changes in Mortality Models Based on Machine-Learning Decision Tree," Mathematics, MDPI, vol. 11(3), pages 1-14, January.
    9. Olga Takacs & Janos Vincze, 2018. "The within-job gender pay gap in Hungary," CERS-IE WORKING PAPERS 1834, Institute of Economics, Centre for Economic and Regional Studies.
    10. Michael Puglia & Adam Tucker, 2020. "Machine Learning, the Treasury Yield Curve and Recession Forecasting," Finance and Economics Discussion Series 2020-038, Board of Governors of the Federal Reserve System (U.S.).
    11. Jiaming Mao & Jingzhi Xu, 2020. "Ensemble Learning with Statistical and Structural Models," Papers 2006.05308, arXiv.org.
    12. Kian Tehranian, 2023. "Can Machine Learning Catch Economic Recessions Using Economic and Market Sentiments?," Papers 2308.16200, arXiv.org.
    13. HOROBEȚ Alexandra & BULAI Vlad Cosmin, 2019. "Assessing the Local Developmental Impact of Hydrocarbon Exploitation in a Mature Region: A Random Forest Approach," European Journal of Interdisciplinary Studies, Bucharest Economic Academy, issue 02, June.
    14. Yu-Shan Shih & Kuang-Hsun Liu, 2019. "Regression trees for detecting preference patterns from rank data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(3), pages 683-702, September.
    15. Osman, Ibrahim H. & Anouze, Abdel Latef & Irani, Zahir & Lee, Habin & Medeni, Tunç D. & Weerakkody, Vishanth, 2019. "A cognitive analytics management framework for the transformation of electronic government services from users’ perspective to create sustainable shared values," European Journal of Operational Research, Elsevier, vol. 278(2), pages 514-532.
    16. Jingfang Liu & Mengshi Shi & Huihong Jiang, 2022. "Detecting Suicidal Ideation in Social Media: An Ensemble Method Based on Feature Fusion," IJERPH, MDPI, vol. 19(13), pages 1-13, July.
    17. Tai, Chung-Ching & Lin, Hung-Wen & Chie, Bin-Tzong & Tung, Chen-Yuan, 2019. "Predicting the failures of prediction markets: A procedure of decision making using classification models," International Journal of Forecasting, Elsevier, vol. 35(1), pages 297-312.
    18. Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
    19. Quan Zhiyu & Valdez Emiliano A., 2018. "Predictive analytics of insurance claims using multivariate decision trees," Dependence Modeling, De Gruyter, vol. 6(1), pages 377-407, December.
    20. Evan B Brooks & John W Coulston & Kurt H Riitters & David N Wear, 2020. "Using a hybrid demand-allocation algorithm to enable distributional analysis of land use change patterns," PLOS ONE, Public Library of Science, vol. 15(10), pages 1-21, October.

    More about this item

    Keywords

    Classification; Machine Learning; Feature Engineering; Text