
Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data

Authors

  • Nicodemo, Catia (University of Oxford)
  • Satorra, Albert (Universitat Pompeu Fabra)

Abstract

New challenges arise in data visualization when the analysis draws on a sizable database. With many data points, classical scatterplots become uninformative because the points clutter. In contrast, simple plots such as the boxplot, which are of limited use in small samples, offer great potential for group comparison in large samples. This paper presents Exploratory Data Analysis (EDA) methods that are useful when a large dataset is involved. The EDA methods, introduced by Tukey in his seminal 1977 book, encompass a set of statistical tools aimed at extracting information from data using simple graphical tools. In this paper, some EDA methods, such as the boxplot and the scatterplot, are revisited and enhanced using modern graphical computational devices (e.g., the heat map), and their use is illustrated with Spanish Social Security data. We explore how earnings vary across several factors such as age, gender, type of occupation, and type of contract; in particular, the gender gap in salaries is visualized in various dimensions relating to the type of occupation. The EDA methods are also applied to assessing competing regressions with earnings as the dependent variable. The methods discussed should help researchers assess heterogeneity in data, across-group variation, and classical diagnostic plots of residuals from alternative model fits.
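
As a rough illustration of the two devices the abstract names, the sketch below computes the quantities behind a Tukey-style boxplot for each group and the binned counts that a heat map would colour in place of a cluttered scatterplot. It is not the authors' code: the data are synthetic, and the names `boxplot_stats`, `heatmap_counts`, the group labels `g1`/`g2`, and all numeric parameters are assumptions made only for this example.

```python
import random
import statistics


def boxplot_stats(values):
    """Return the quantities a Tukey-style boxplot draws for one group."""
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    # Tukey's rule: points beyond 1.5 * IQR from the box count as outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "n_outliers": len(xs) - len(inside),
    }


def heatmap_counts(xs, ys, x_range, y_range, nx, ny):
    """Count points per cell of an equal-width nx-by-ny grid.

    grid[i][j] holds the count for y-bin i and x-bin j; points outside
    the half-open ranges [x0, x1) and [y0, y1) are ignored.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in zip(xs, ys):
        if x0 <= x < x1 and y0 <= y < y1:
            j = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
            i = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
            grid[i][j] += 1
    return grid


# Synthetic data only (hypothetical ages, groups, and log earnings).
rng = random.Random(0)
n = 20_000
ages = [rng.uniform(20, 65) for _ in range(n)]
groups = [rng.choice(["g1", "g2"]) for _ in range(n)]
log_earn = [
    7.0 + 0.01 * a + (0.1 if g == "g1" else 0.0) + rng.gauss(0, 0.3)
    for a, g in zip(ages, groups)
]

# Group comparison: one boxplot summary per group.
for g in ("g1", "g2"):
    stats = boxplot_stats([e for e, gg in zip(log_earn, groups) if gg == g])
    print(g, {k: round(v, 3) for k, v in stats.items()})

# Heat-map counts: 9 age bins by 6 log-earnings bins.
grid = heatmap_counts(ages, log_earn, (20, 65), (6.0, 9.0), nx=9, ny=6)
for row in reversed(grid):  # highest y-bin first, as on a plot
    print(" ".join(f"{c:5d}" for c in row))
```

With many observations the per-group summaries stay readable however large `n` grows, and the grid of counts replaces overplotted points with a density that a plotting library could render as colour.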

Suggested Citation

  • Nicodemo, Catia & Satorra, Albert, 2020. "Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data," IZA Discussion Papers 13459, Institute of Labor Economics (IZA).
  • Handle: RePEc:iza:izadps:dp13459

    Download full text from publisher

    File URL: https://docs.iza.org/dp13459.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Rodgers, G B, 1975. "Nutritionally Based Wage Determination in the Low-Income Labour Market," Oxford Economic Papers, Oxford University Press, vol. 27(1), pages 61-81, March.
    2. Heckman, James J. & Lochner, Lance J. & Todd, Petra E., 2006. "Earnings Functions, Rates of Return and Treatment Effects: The Mincer Equation and Beyond," Handbook of the Economics of Education, in: Erik Hanushek & F. Welch (ed.), Handbook of the Economics of Education, edition 1, volume 1, chapter 7, pages 307-458, Elsevier.
    3. Juan J Dolado & Carlos Garcia-Serrano & Juan F. Jimeno, 2002. "Drawing Lessons From The Boom Of Temporary Jobs In Spain," Economic Journal, Royal Economic Society, vol. 112(721), pages 270-295, June.
    4. Jonathan A. Schwabish, 2014. "An Economist's Guide to Visualizing Data," Journal of Economic Perspectives, American Economic Association, vol. 28(1), pages 209-234, Winter.
    5. Gehrke, Britta & Weber, Enzo, 2018. "Identifying asymmetric effects of labor market reforms," European Economic Review, Elsevier, vol. 110(C), pages 18-40.
    6. Hal R. Varian, 2014. "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, American Economic Association, vol. 28(2), pages 3-28, Spring.
    7. Hubert, M. & Vandervieren, E., 2008. "An adjusted boxplot for skewed distributions," Computational Statistics & Data Analysis, Elsevier, vol. 52(12), pages 5186-5201, August.
    8. Blundell, Richard & Meghir, Costas, 1987. "Bivariate alternatives to the Tobit model," Journal of Econometrics, Elsevier, vol. 34(1-2), pages 179-200.
    9. Wilkinson, Leland & Friendly, Michael, 2009. "The History of the Cluster Heat Map," The American Statistician, American Statistical Association, vol. 63(2), pages 179-184.
    10. Dolado, Juan J & Mora, Ricardo, 2014. "Dual Labour Markets and (Lack of) On-The-Job Training: PIAAC Evidence from Spain and Other EU Countries," CEPR Discussion Papers 10246, C.E.P.R. Discussion Papers.
    11. Amuedo-Dorantes Catalina & De la Rica Sara, 2006. "The Role of Segregation and Pay Structure on the Gender Wage Gap: Evidence from Matched Employer-Employee Data for Spain," The B.E. Journal of Economic Analysis & Policy, De Gruyter, vol. 5(1), pages 1-34, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Juan J. Dolado & Etienne Lalé & Nawid Siassi, 2021. "From dual to unified employment protection: Transition and steady state," Quantitative Economics, Econometric Society, vol. 12(2), pages 547-585, May.
    2. Siassi, Nawid & Dolado, Juan J. & Lalé, Etienne, 2015. "Moving Towards a Single Labor Contract: Transition vs. Steady State," VfS Annual Conference 2015 (Muenster): Economic Development - Theory and Policy 112858, Verein für Socialpolitik / German Economic Association.
    3. Garcia-Louzao, Jose & Hospido, Laura & Ruggieri, Alessandro, 2023. "Dual returns to experience," Labour Economics, Elsevier, vol. 80(C).
    4. White-Means, Shelley I. & Osmani, Ahmad Reshad, 2019. "Job Market Prospects of Breast vs. Prostate Cancer Survivors in the US: A Double Hurdle Model of Ethnic Disparities," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 40, pages 282-304.
    5. Felix Holub & Laura Hospido & Ulrich J. Wagner, 2020. "Urban Air Pollution and Sick Leaves: Evidence From Social Security Data," CRC TR 224 Discussion Paper Series crctr224_2020_241, University of Bonn and University of Mannheim, Germany.
    6. G. Guidetti & G. Pedrini, 2015. "Systemic flexibility and human capital development: the relationship between non-standard employment and workplace training," Working Papers wp1019, Dipartimento Scienze Economiche, Universita' di Bologna.
    7. Byunghoon Kang, 2017. "Inference in Nonparametric Series Estimation with Data-Dependent Undersmoothing," Working Papers 170712442, Lancaster University Management School, Economics Department.
    8. Edoardo Di Porto & Cristina Tealdi, 2022. "Heterogeneous Paths to Stability," CSEF Working Papers 644, Centre for Studies in Economics and Finance (CSEF), University of Naples, Italy.
    9. Sophie-Charlotte Klose & Johannes Lederer, 2020. "A Pipeline for Variable Selection and False Discovery Rate Control With an Application in Labor Economics," Papers 2006.12296, arXiv.org, revised Jun 2020.
    10. Kemptner, Daniel & Tolan, Songül, 2018. "The role of time preferences in educational decision making," Economics of Education Review, Elsevier, vol. 67(C), pages 25-39.
    11. Miriam Aparicio, 2021. "Resiliency and Cooperation or Regarding Social and Collective Competencies for University Achievement. An Analysis from a Systemic Perspective," European Journal of Social Sciences Education and Research Articles, Revistia Research and Publishing, vol. 8, ejser_v8_.
    12. Mia Hubert & Peter Rousseeuw & Pieter Segaert, 2015. "Multivariate functional outlier detection," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 24(2), pages 177-202, July.
    13. Puhani, Patrick A. & Sterrenberg, Margret K., 2021. "Effects of Mandatory Military Service on Wages and Other Socioeconomic Outcomes," Hannover Economic Papers (HEP) dp-684, Leibniz Universität Hannover, Wirtschaftswissenschaftliche Fakultät.
    14. Samuel Bentolila & Juan Jose Dolado & Juan F. Jimeno, 2008. "Two-tier Employment Protection Reforms: The Spanish Experience," ifo DICE Report, ifo Institute - Leibniz Institute for Economic Research at the University of Munich, vol. 6(4), pages 49-56, December.
    15. Akinyosoye, Vincent O., 2007. "Demand For Dairy Products In Nigeria: Evidence From The Nigerian," Journal of Rural Economics and Development, University of Ibadan, Department of Agricultural Economics, vol. 16, pages 1-14.
    16. World Bank & The International Bank for Reconstruction and Development, 2024. "Albania - Country Gender Assessment," World Bank Publications - Reports 41900, The World Bank Group.
    17. Yen, Steven T. & Chern, Wen S. & Lee, Hwang-Jaw, "undated". "Effects Of Income Sources On Household Food Expenditures," 1991 Annual Meeting, August 4-7, Manhattan, Kansas 271167, American Agricultural Economics Association (New Name 2008: Agricultural and Applied Economics Association).
    18. Langyintuo, Augustine S. & Mungoma, Catherine, 2008. "The effect of household wealth on the adoption of improved maize varieties in Zambia," Food Policy, Elsevier, vol. 33(6), pages 550-559, December.
    19. Peng, Qiao & McKillop, Donal & Quinn, Barry & Liu, Kailong, 2025. "Modeling and predicting failure in US credit unions," International Journal of Forecasting, Elsevier, vol. 41(3), pages 1237-1259.
    20. Mariona Lozano & Elisenda Rentería, 2019. "Work in Transition: Labour Market Life Expectancy and Years Spent in Precarious Employment in Spain 1986–2016," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 145(1), pages 185-200, August.

    More about this item

    JEL classification:

    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
    • J01 - Labor and Demographic Economics - - General - - - Labor Economics: General
    • J08 - Labor and Demographic Economics - - General - - - Labor Economics Policies
    • Y10 - Miscellaneous Categories - - Data: Tables and Charts - - - Data: Tables and Charts
    • C80 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - General
