Printed from https://ideas.repec.org/p/iza/izadps/dp13459.html

Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data

Author

Listed:
  • Nicodemo, Catia

    (University of Oxford)

  • Satorra, Albert

    (Universitat Pompeu Fabra)

Abstract

New challenges arise in data visualization when the analysis uses a sizable database. With many data points, classical scatterplots become uninformative because the points clutter. In contrast, simple plots such as the boxplot, which are of limited use in small samples, offer great potential for group comparison in large samples. This paper presents Exploratory Data Analysis (EDA) methods that are useful when a large dataset is involved. EDA methods, introduced by Tukey in his seminal 1977 book, encompass a set of statistical tools aimed at extracting information from data using simple graphical tools. In this paper, some EDA methods, such as the boxplot and the scatterplot, are revisited and enhanced using modern graphical computational devices (e.g., the heat-map), and their use is illustrated with Spanish Social Security data. We explore how earnings vary across several factors such as age, gender, type of occupation and contract; in particular, the gender gap in salaries is visualized in various dimensions relating to the type of occupation. The EDA methods are also applied to assessing competing regressions with earnings as the dependent variable. The methods discussed should help researchers assess heterogeneity in data, across-group variation, and classical diagnostic plots of residuals from alternative model fits.
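To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' code. The keywords suggest the paper works in R with ggplot; this illustration uses only the Python standard library. It computes the numbers behind the two displays: per-cell counts on a grid (what a heat-map colours in place of an overplotted scatterplot) and a Tukey-style boxplot summary per group. The data are synthetic, and every name here (`boxplot_stats`, `bin2d`, `age`, `log_pay`, groups `A` and `B`) is a hypothetical chosen for illustration.

```python
import random
import statistics


def boxplot_stats(values):
    """Tukey-style boxplot summary: quartiles, whisker ends, outlier count."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers reach the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "n_outliers": len(values) - len(inside),
    }


def bin2d(xs, ys, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid (rows index y, columns x)."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in zip(xs, ys):
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotting window
        # Points on the upper edge are assigned to the last cell.
        ix = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        iy = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[iy][ix] += 1
    return grid


# Synthetic data (an assumption for illustration, not Social Security data).
rng = random.Random(0)
n = 20_000
age = [rng.uniform(20, 65) for _ in range(n)]
group = [rng.choice(["A", "B"]) for _ in range(n)]
log_pay = [
    7.0 + 0.01 * a + (0.1 if g == "A" else 0.0) + rng.gauss(0, 0.4)
    for a, g in zip(age, group)
]

# Heat-map input: counts per (age, log pay) cell instead of 20,000 dots.
grid = bin2d(age, log_pay, (20, 65), (min(log_pay), max(log_pay)), nx=9, ny=8)

# Group comparison: one boxplot summary per group.
for g in ("A", "B"):
    stats = boxplot_stats([y for y, gi in zip(log_pay, group) if gi == g])
    print(g, {k: round(v, 2) for k, v in stats.items()})
```

The grid of counts stays the same size however many observations are added, which is why binned displays remain readable where a raw scatterplot saturates. The quartile convention (`method="inclusive"`) is one of several in common use and is a choice made for this sketch.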

Suggested Citation

  • Nicodemo, Catia & Satorra, Albert, 2020. "Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data," IZA Discussion Papers 13459, Institute of Labor Economics (IZA).
  • Handle: RePEc:iza:izadps:dp13459

    Download full text from publisher

    File URL: https://docs.iza.org/dp13459.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Gehrke, Britta & Weber, Enzo, 2018. "Identifying asymmetric effects of labor market reforms," European Economic Review, Elsevier, vol. 110(C), pages 18-40.
    2. Wilkinson, Leland & Friendly, Michael, 2009. "The History of the Cluster Heat Map," The American Statistician, American Statistical Association, vol. 63(2), pages 179-184.
    3. Heckman, James J. & Lochner, Lance J. & Todd, Petra E., 2006. "Earnings Functions, Rates of Return and Treatment Effects: The Mincer Equation and Beyond," Handbook of the Economics of Education, in: Erik Hanushek & F. Welch (ed.), Handbook of the Economics of Education, edition 1, volume 1, chapter 7, pages 307-458, Elsevier.
    4. Juan J. Dolado & Carlos Garcia-Serrano & Juan F. Jimeno, 2002. "Drawing Lessons From The Boom Of Temporary Jobs In Spain," Economic Journal, Royal Economic Society, vol. 112(721), pages 270-295, June.
    5. Jonathan A. Schwabish, 2014. "An Economist's Guide to Visualizing Data," Journal of Economic Perspectives, American Economic Association, vol. 28(1), pages 209-234, Winter.
    6. Hal R. Varian, 2014. "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, American Economic Association, vol. 28(2), pages 3-28, Spring.
    7. Cabrales, Antonio & Dolado, Juan J. & Mora, Ricardo, 2014. "Dual Labour Markets and (Lack of) On-the-Job Training: PIAAC Evidence from Spain and Other EU Countries," IZA Discussion Papers 8649, Institute of Labor Economics (IZA).
    8. Hubert, M. & Vandervieren, E., 2008. "An adjusted boxplot for skewed distributions," Computational Statistics & Data Analysis, Elsevier, vol. 52(12), pages 5186-5201, August.
    9. Blundell, Richard & Meghir, Costas, 1987. "Bivariate alternatives to the Tobit model," Journal of Econometrics, Elsevier, vol. 34(1-2), pages 179-200.
    10. Rodgers, G B, 1975. "Nutritionally Based Wage Determination in the Low-Income Labour Market," Oxford Economic Papers, Oxford University Press, vol. 27(1), pages 61-81, March.
    11. Amuedo-Dorantes Catalina & De la Rica Sara, 2006. "The Role of Segregation and Pay Structure on the Gender Wage Gap: Evidence from Matched Employer-Employee Data for Spain," The B.E. Journal of Economic Analysis & Policy, De Gruyter, vol. 5(1), pages 1-34, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Juan J. Dolado & Etienne Lalé & Nawid Siassi, 2021. "From dual to unified employment protection: Transition and steady state," Quantitative Economics, Econometric Society, vol. 12(2), pages 547-585, May.
    2. Siassi, Nawid & Dolado, Juan J. & Lalé, Etienne, 2015. "Moving Towards a Single Labor Contract: Transition vs. Steady State," VfS Annual Conference 2015 (Muenster): Economic Development - Theory and Policy 112858, Verein für Socialpolitik / German Economic Association.
    3. Felix Holub & Laura Hospido & Ulrich J. Wagner, 2020. "Urban air pollution and sick leaves: evidence from social security data," Working Papers 2041, Banco de España.
    4. Garcia-Louzao, Jose & Hospido, Laura & Ruggieri, Alessandro, 2023. "Dual returns to experience," Labour Economics, Elsevier, vol. 80(C).
    5. Shelley I. White-Means & Ahmad Reshad Osmani, 2019. "Job Market Prospects of Breast vs. Prostate Cancer Survivors in the US: A Double Hurdle Model of Ethnic Disparities," Journal of Family and Economic Issues, Springer, vol. 40(2), pages 282-304, June.
    6. di Porto, Edoardo & Tealdi, Cristina, 2022. "Heterogeneous Paths to Stability," IZA Discussion Papers 15246, Institute of Labor Economics (IZA).
    7. G. Guidetti & G. Pedrini, 2015. "Systemic flexibility and human capital development: the relationship between non-standard employment and workplace training," Working Papers wp1019, Dipartimento Scienze Economiche, Universita' di Bologna.
    8. Byunghoon Kang, 2017. "Inference in Nonparametric Series Estimation with Data-Dependent Undersmoothing," Working Papers 170712442, Lancaster University Management School, Economics Department.
    9. Kemptner, Daniel & Tolan, Songül, 2018. "The role of time preferences in educational decision making," Economics of Education Review, Elsevier, vol. 67(C), pages 25-39.
    10. Mia Hubert & Peter Rousseeuw & Pieter Segaert, 2015. "Multivariate functional outlier detection," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 24(2), pages 177-202, July.
    11. Samuel Bentolila & Juan Jose Dolado & Juan F. Jimeno, 2008. "Two-tier Employment Protection Reforms: The Spanish Experience," ifo DICE Report, ifo Institute - Leibniz Institute for Economic Research at the University of Munich, vol. 6(4), pages 49-56, December.
    12. Akinyosoye, Vincent O., 2007. "Demand For Dairy Products In Nigeria: Evidence From The Nigerian," Journal of Rural Economics and Development, University of Ibadan, Department of Agricultural Economics, vol. 16, pages 1-14.
    13. Patrick Bajari & Victor Chernozhukov & Ali Hortaçsu & Junichi Suzuki, 2019. "The Impact of Big Data on Firm Performance: An Empirical Investigation," AEA Papers and Proceedings, American Economic Association, vol. 109, pages 33-37, May.
    14. Langyintuo, Augustine S. & Mungoma, Catherine, 2008. "The effect of household wealth on the adoption of improved maize varieties in Zambia," Food Policy, Elsevier, vol. 33(6), pages 550-559, December.
    15. Mariona Lozano & Elisenda Rentería, 2019. "Work in Transition: Labour Market Life Expectancy and Years Spent in Precarious Employment in Spain 1986–2016," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 145(1), pages 185-200, August.
    16. Adsera, Alicia, 2005. "Differences in Desired and Actual Fertility: An Economic Analysis of the Spanish Case," IZA Discussion Papers 1584, Institute of Labor Economics (IZA).
    17. Elena Casquel & Antoni Cunyat, "undated". "The Welfare Cost of Business Cycles in an Economy with Nonclearing Markets," Working Papers 2005-19, FEDEA.
    18. Kan, Kamhon & Fu, Tsu-Tan, 1997. "Analysis of Housewives' Grocery Shopping Behavior in Taiwan: An Application of the Poisson Switching Regression," Journal of Agricultural and Applied Economics, Cambridge University Press, vol. 29(2), pages 397-407, December.
    19. Mohammed SHARIF, 2000. "Inverted “S”—The complete neoclassical labour-supply function," International Labour Review, International Labour Organization, vol. 139(4), pages 409-435, December.
    20. Gernandt, Johannes & Maier, Michael & Pfeiffer, Friedhelm & Rat-Wirtzler, Julie, 2006. "Distributional effects of the high school degree in Germany," ZEW Discussion Papers 06-088, ZEW - Leibniz Centre for European Economic Research.

    More about this item

    Keywords

    heat-maps; EDA Analysis; large dataset; ggplot; R;

    JEL classification:

    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
    • J01 - Labor and Demographic Economics - - General - - - Labor Economics: General
    • J08 - Labor and Demographic Economics - - General - - - Labor Economics Policies
    • Y10 - Miscellaneous Categories - - Data: Tables and Charts - - - Data: Tables and Charts
    • C80 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - General

