
Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data

Authors

  • Nicodemo, Catia (University of Oxford)
  • Satorra, Albert (Universitat Pompeu Fabra)

Abstract

New challenges arise in data visualization when the analysis uses a sizable database. With many data points, classical scatterplots become uninformative because the points clutter. By contrast, simple plots such as the boxplot, which are of limited use in small samples, offer great potential for group comparison in large samples. This paper presents Exploratory Data Analysis (EDA) methods that are useful when a large dataset is involved. EDA methods, introduced by Tukey in his seminal 1977 book, encompass a set of statistical tools aimed at extracting information from data using simple graphics. In this paper, some EDA methods, such as the boxplot and the scatterplot, are revisited and enhanced using modern computational graphics (e.g., the heat-map), and their use is illustrated with Spanish Social Security data. We explore how earnings vary across several factors, such as age, gender, type of occupation, and type of contract; in particular, the gender gap in salaries is visualized along various dimensions relating to the type of occupation. The EDA methods are also applied to assess competing regressions with earnings as the dependent variable. The methods discussed should help researchers assess heterogeneity in data, across-group variation, and classical diagnostic plots of residuals from alternative model fits.
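
To make the two ideas in the abstract concrete, here is a minimal sketch of the summaries that sit behind a grouped boxplot and a heat-map. It is not the authors' code: the paper's keywords point to R and ggplot, whereas this sketch uses plain Python, and the data, variable names (`age`, `group`, `log_earn`), group labels, and bin widths are all invented for illustration.

```python
import random
import statistics
from collections import Counter


def simulate(n, seed=0):
    """Synthetic (age, group, log_earnings) records; NOT the Social Security data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 65)
        group = rng.choice(["A", "B"])
        shift = 0.1 if group == "A" else 0.0  # arbitrary illustrative gap
        log_earn = 7.0 + 0.01 * (age - 20) + shift + rng.gauss(0, 0.4)
        rows.append((age, group, log_earn))
    return rows


def box_stats(values):
    """Five-number summary: the quantities a basic boxplot displays."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}


def heat_counts(rows, age_width=5.0, earn_width=0.25):
    """Count points per (age bin, log-earnings bin) cell: the grid a heat-map colours."""
    cells = Counter()
    for age, _, log_earn in rows:
        cells[(int(age // age_width), int(log_earn // earn_width))] += 1
    return cells


rows = simulate(100_000)

# Grouped boxplot summaries: one five-number summary per group.
by_group = {}
for _, group, log_earn in rows:
    by_group.setdefault(group, []).append(log_earn)
summary = {g: box_stats(v) for g, v in by_group.items()}

# Heat-map grid: many points collapse into a small number of cells.
cells = heat_counts(rows)

for g in sorted(summary):
    print(g, {k: round(v, 2) for k, v in summary[g].items()})
print(len(cells), "non-empty cells summarise", sum(cells.values()), "points")
```

With 100,000 points a raw scatterplot would be a solid cloud, while the cell counts and the per-group quartiles stay readable however large the sample grows. A plotting library would then only need to colour the cells and draw the boxes.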

Suggested Citation

  • Nicodemo, Catia & Satorra, Albert, 2020. "Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data," IZA Discussion Papers 13459, Institute of Labor Economics (IZA).
  • Handle: RePEc:iza:izadps:dp13459

    Download full text from publisher

    File URL: https://docs.iza.org/dp13459.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Gehrke, Britta & Weber, Enzo, 2018. "Identifying asymmetric effects of labor market reforms," European Economic Review, Elsevier, vol. 110(C), pages 18-40.
    2. Wilkinson, Leland & Friendly, Michael, 2009. "The History of the Cluster Heat Map," The American Statistician, American Statistical Association, vol. 63(2), pages 179-184.
    3. Heckman, James J. & Lochner, Lance J. & Todd, Petra E., 2006. "Earnings Functions, Rates of Return and Treatment Effects: The Mincer Equation and Beyond," Handbook of the Economics of Education, in: Erik Hanushek & F. Welch (ed.), Handbook of the Economics of Education, edition 1, volume 1, chapter 7, pages 307-458, Elsevier.
    4. Juan J. Dolado & Carlos Garcia-Serrano & Juan F. Jimeno, 2002. "Drawing Lessons From The Boom Of Temporary Jobs In Spain," Economic Journal, Royal Economic Society, vol. 112(721), pages 270-295, June.
    5. Jonathan A. Schwabish, 2014. "An Economist's Guide to Visualizing Data," Journal of Economic Perspectives, American Economic Association, vol. 28(1), pages 209-234, Winter.
    6. Hal R. Varian, 2014. "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, American Economic Association, vol. 28(2), pages 3-28, Spring.
    7. Cabrales, Antonio & Dolado, Juan J. & Mora, Ricardo, 2014. "Dual Labour Markets and (Lack of) On-the-Job Training: PIAAC Evidence from Spain and Other EU Countries," IZA Discussion Papers 8649, Institute of Labor Economics (IZA).
    8. Hubert, M. & Vandervieren, E., 2008. "An adjusted boxplot for skewed distributions," Computational Statistics & Data Analysis, Elsevier, vol. 52(12), pages 5186-5201, August.
    9. Blundell, Richard & Meghir, Costas, 1987. "Bivariate alternatives to the Tobit model," Journal of Econometrics, Elsevier, vol. 34(1-2), pages 179-200.
    10. Rodgers, G B, 1975. "Nutritionally Based Wage Determination in the Low-Income Labour Market," Oxford Economic Papers, Oxford University Press, vol. 27(1), pages 61-81, March.
    11. Amuedo-Dorantes Catalina & De la Rica Sara, 2006. "The Role of Segregation and Pay Structure on the Gender Wage Gap: Evidence from Matched Employer-Employee Data for Spain," The B.E. Journal of Economic Analysis & Policy, De Gruyter, vol. 5(1), pages 1-34, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Garcia-Louzao, Jose & Hospido, Laura & Ruggieri, Alessandro, 2021. "Dual Returns to Experience," IZA Discussion Papers 14596, Institute of Labor Economics (IZA).
    2. Juan J. Dolado & Etienne Lalé & Nawid Siassi, 2021. "From dual to unified employment protection: Transition and steady state," Quantitative Economics, Econometric Society, vol. 12(2), pages 547-585, May.
    3. Siassi, Nawid & Dolado, Juan J. & Lalé, Etienne, 2015. "Moving Towards a Single Labor Contract: Transition vs. Steady State," VfS Annual Conference 2015 (Muenster): Economic Development - Theory and Policy 112858, Verein für Socialpolitik / German Economic Association.
    4. Felix Holub & Laura Hospido & Ulrich J. Wagner, 2020. "Urban air pollution and sick leaves: evidence from social security data," Working Papers 2041, Banco de España.
    5. White-Means, Shelley I. & Osmani, Ahmad Reshad, 2019. "Job Market Prospects of Breast vs. Prostate Cancer Survivors in the US: A Double Hurdle Model of Ethnic Disparities," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, pages 282-304.
    6. G. Guidetti & G. Pedrini, 2015. "Systemic flexibility and human capital development: the relationship between non-standard employment and workplace training," Working Papers wp1019, Dipartimento Scienze Economiche, Universita' di Bologna.
    7. di Porto, Edoardo & Tealdi, Cristina, 2022. "Heterogeneous Paths to Stability," IZA Discussion Papers 15246, Institute of Labor Economics (IZA).
    8. Byunghoon Kang, 2017. "Inference in Nonparametric Series Estimation with Data-Dependent Undersmoothing," Working Papers 170712442, Lancaster University Management School, Economics Department.
    9. Sophie-Charlotte Klose & Johannes Lederer, 2020. "A Pipeline for Variable Selection and False Discovery Rate Control With an Application in Labor Economics," Papers 2006.12296, arXiv.org, revised Jun 2020.
    10. Kemptner, Daniel & Tolan, Songül, 2018. "The role of time preferences in educational decision making," Economics of Education Review, Elsevier, vol. 67(C), pages 25-39.
    11. Alicja Grześkowiak, 2016. "Assessment of Participation in Cultural Activities in Poland by Selected Multivariate Methods," European Journal of Social Sciences Education and Research Articles, Revistia Research and Publishing, vol. 3, January -.
    12. Mia Hubert & Peter Rousseeuw & Pieter Segaert, 2015. "Multivariate functional outlier detection," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 24(2), pages 177-202, July.
    13. Puhani, Patrick A. & Sterrenberg, Margret K., 2021. "Effects of Mandatory Military Service on Wages and Other Socioeconomic Outcomes," Hannover Economic Papers (HEP) dp-684, Leibniz Universität Hannover, Wirtschaftswissenschaftliche Fakultät.
    14. Samuel Bentolila & Juan Jose Dolado & Juan F. Jimeno, 2008. "Two-tier Employment Protection Reforms: The Spanish Experience," ifo DICE Report, ifo Institute - Leibniz Institute for Economic Research at the University of Munich, vol. 6(4), pages 49-56, December.
    15. Akinyosoye, Vincent O., 2007. "Demand For Dairy Products In Nigeria: Evidence From The Nigerian," Journal of Rural Economics and Development, University of Ibadan, Department of Agricultural Economics, vol. 16, pages 1-14.
    16. Ichimura, Hidehiko & Todd, Petra E., 2007. "Implementing Nonparametric and Semiparametric Estimators," Handbook of Econometrics, in: J.J. Heckman & E.E. Leamer (ed.), Handbook of Econometrics, edition 1, volume 6, chapter 74, Elsevier.
    17. Patrick Bajari & Victor Chernozhukov & Ali Hortaçsu & Junichi Suzuki, 2019. "The Impact of Big Data on Firm Performance: An Empirical Investigation," AEA Papers and Proceedings, American Economic Association, vol. 109, pages 33-37, May.
    18. Yen, Steven T. & Chern, Wen S. & Lee, Hwang-Jaw, 1991. "Effects Of Income Sources On Household Food Expenditures," 1991 Annual Meeting, August 4-7, Manhattan, Kansas 271167, American Agricultural Economics Association (New Name 2008: Agricultural and Applied Economics Association).
    19. Langyintuo, Augustine S. & Mungoma, Catherine, 2008. "The effect of household wealth on the adoption of improved maize varieties in Zambia," Food Policy, Elsevier, vol. 33(6), pages 550-559, December.

    More about this item

    Keywords

    ggplot; large dataset; EDA Analysis; heat-maps; R;

    JEL classification:

    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
    • J01 - Labor and Demographic Economics - - General - - - Labor Economics: General
    • J08 - Labor and Demographic Economics - - General - - - Labor Economics Policies
    • Y10 - Miscellaneous Categories - - Data: Tables and Charts - - - Data: Tables and Charts
    • C80 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - General

