Printed from https://ideas.repec.org/a/taf/jnlasa/v112y2017i519p1199-1210.html

Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data

Author

Listed:
  • Simon N. Wood
  • Zheyuan Li
  • Gavin Shaddick
  • Nicole H. Augustin

Abstract

We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 10^4 coefficients to up to 10^8 data. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration’s pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
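
Point (iii) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it handles a single covariate and only the product X^T y, and the names `basis`, `discretize`, `crossprod_discrete` and `crossprod_full` are hypothetical, as is the toy polynomial basis standing in for a spline basis. The idea shown is that once a covariate is rounded to m grid values, only an m-row matrix and an integer index per observation need to be stored, and the crossproduct can be accumulated from those.

```python
# Hypothetical sketch of computing X^T y from a discretized covariate.
# Assumes m >= 2 and a non-constant covariate.

def basis(x):
    """Toy basis (1, x, x^2); a stand-in for a spline basis."""
    return [1.0, x, x * x]

def discretize(xs, m):
    """Round each x to one of m evenly spaced grid values.

    Returns the grid and, for each observation, the index of its grid value.
    """
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (m - 1)
    grid = [lo + j * step for j in range(m)]
    index = [min(m - 1, int(round((x - lo) / step))) for x in xs]
    return grid, index

def crossprod_discrete(grid, index, y):
    """X^T y where row i of X is basis(grid[index[i]]), without forming X."""
    # Step 1: sum y over observations sharing a grid value (O(n) work).
    ybar = [0.0] * len(grid)
    for k, yi in zip(index, y):
        ybar[k] += yi
    # Step 2: multiply by the m-row compressed matrix (O(m p) work).
    rows = [basis(g) for g in grid]
    p = len(rows[0])
    return [sum(rows[j][c] * ybar[j] for j in range(len(grid))) for c in range(p)]

def crossprod_full(xs, y):
    """Reference computation using every row of the full model matrix."""
    p = len(basis(xs[0]))
    out = [0.0] * p
    for x, yi in zip(xs, y):
        for c, b in enumerate(basis(x)):
            out[c] += b * yi
    return out
```

When the covariate values already lie on the grid, `crossprod_discrete` agrees with `crossprod_full`; otherwise it is an approximation whose error depends on how fine the grid is. Storage for the model matrix falls from n × p floats to m × p floats plus n integer indices.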

Suggested Citation

  • Simon N. Wood & Zheyuan Li & Gavin Shaddick & Nicole H. Augustin, 2017. "Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 112(519), pages 1199-1210, July.
  • Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1199-1210
    DOI: 10.1080/01621459.2016.1195744

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/01621459.2016.1195744
    Download Restriction: Access to full text is restricted to subscribers.

    References listed on IDEAS

    1. I. D. Currie & M. Durban & P. H. C. Eilers, 2006. "Generalized linear array models with applications to multidimensional smoothing," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 68(2), pages 259-280, April.
    2. Marx, Brian D. & Eilers, Paul H. C., 1998. "Direct generalized additive modeling with penalized likelihood," Computational Statistics & Data Analysis, Elsevier, vol. 28(2), pages 193-209, August.
    3. Ludwig Fahrmeir & Stefan Lang, 2001. "Bayesian inference for generalized additive mixed models based on Markov random field priors," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 50(2), pages 201-220.
    4. Ruppert, David & Wand, M. P. & Carroll, R. J., 2003. "Semiparametric Regression," Cambridge Books, Cambridge University Press, number 9780521780506.
    5. Philip T. Reiss & R. Todd Ogden, 2009. "Smoothing parameter selection for a class of semiparametric linear models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(2), pages 505-523, April.
    6. Giampiero Marra & Simon N. Wood, 2012. "Coverage Properties of Confidence Intervals for Generalized Additive Model Components," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 39(1), pages 53-74, March.
    7. Simon N. Wood, 2006. "Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models," Biometrics, The International Biometric Society, vol. 62(4), pages 1025-1036, December.
    8. Peter Hall & J. D. Opsomer, 2005. "Theory for penalised spline regression," Biometrika, Biometrika Trust, vol. 92(1), pages 105-118, March.
    9. Simon N. Wood, 2003. "Thin plate regression splines," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 65(1), pages 95-114, February.
    10. Augustin, Nicole H. & Musio, Monica & von Wilpert, Klaus & Kublin, Edgar & Wood, Simon N. & Schumacher, Martin, 2009. "Modeling Spatiotemporal Forest Health Monitoring Data," Journal of the American Statistical Association, American Statistical Association, vol. 104(487), pages 899-911.
    11. Håvard Rue & Sara Martino & Nicolas Chopin, 2009. "Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(2), pages 319-392, April.
    12. Göran Kauermann & Tatyana Krivobokova & Ludwig Fahrmeir, 2009. "Some asymptotic results on generalized penalized spline smoothing," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(2), pages 487-503, April.
    13. Ruppert, David & Wand, M. P. & Carroll, R. J., 2003. "Semiparametric Regression," Cambridge Books, Cambridge University Press, number 9780521785167.
    14. Yingxing Li & David Ruppert, 2008. "On the asymptotics of penalized splines," Biometrika, Biometrika Trust, vol. 95(2), pages 415-436.
    15. Gerda Claeskens & Tatyana Krivobokova & Jean D. Opsomer, 2009. "Asymptotic properties of penalized spline estimators," Biometrika, Biometrika Trust, vol. 96(3), pages 529-544.
    16. Peter J. Diggle & Raquel Menezes & Ting‐li Su, 2010. "Geostatistical inference under preferential sampling," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 59(2), pages 191-232, March.
    17. Stefan Lang & Nikolaus Umlauf & Peter Wechselberger & Kenneth Harttgen & Thomas Kneib, 2012. "Multilevel structured additive regression," Working Papers 2012-07, Faculty of Economics and Statistics, Universität Innsbruck.

    Citations

    Cited by:

    1. Frank van Berkum & Katrien Antonio & Michel Vellekoop, 2021. "Quantifying longevity gaps using micro‐level lifetime data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(2), pages 548-570, April.
    2. Jonathan Berrisch & Florian Ziel, 2023. "Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices," Papers 2303.10019, arXiv.org, revised Feb 2024.
    3. Du, Qianqian & Mieno, Taro & Bullock, David & Edge, Brittani, 2021. "Economically Optimal Nitrogen Side-dressing Based on Vegetation Indices from Satellite Images Through On-farm Experiments," Agri-Tech Economics Papers 316596, Harper Adams University, Land, Farm & Agribusiness Management Department.
    4. Simon N. Wood, 2020. "Inference and computation with generalized additive models and their extensions," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(2), pages 307-339, June.
    5. Anne-Sophie Krah & Zoran Nikolić & Ralf Korn, 2020. "Machine Learning in Least-Squares Monte Carlo Proxy Modeling of Life Insurance Companies," Risks, MDPI, vol. 8(1), pages 1-79, February.
    6. Anne-Sophie Krah & Zoran Nikolić & Ralf Korn, 2019. "Machine Learning in Least-Squares Monte Carlo Proxy Modeling of Life Insurance Companies," Papers 1909.02182, arXiv.org.
    7. Anna Vážná & Jana Vignerová & Marek Brabec & Jan Novák & Bohuslav Procházka & Antonín Gabera & Petr Sedlak, 2022. "Influence of COVID-19-Related Restrictions on the Prevalence of Overweight and Obese Czech Children," IJERPH, MDPI, vol. 19(19), pages 1-14, September.
    8. Konstantin Sering & Petar Milin & R. Harald Baayen, 2018. "Language comprehension as a multi‐label classification problem," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 72(3), pages 339-353, August.
    9. David L. Miller & Richard Glennie & Andrew E. Seaton, 2020. "Understanding the Stochastic Partial Differential Equation Approach to Smoothing," Journal of Agricultural, Biological and Environmental Statistics, Springer;The International Biometric Society;American Statistical Association, vol. 25(1), pages 1-16, March.
    10. Sonja Greven & Fabian Scheipl, 2020. "Comments on: Inference and computation with Generalized Additive Models and their extensions," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(2), pages 343-350, June.
    11. Du, Qianqian & Mieno, Taro & Bullock, David & Edge, Brittani, 2021. "Economically Optimal Nitrogen Side-dressing Based on Vegetation Indices from Satellite Images Through On-farm Experiments," Land, Farm & Agribusiness Management Department 316596, Harper Adams University, Land, Farm & Agribusiness Management Department.
    12. Oskar Allerbo & Rebecka Jörnsten, 2022. "Flexible, non-parametric modeling using regularized neural networks," Computational Statistics, Springer, vol. 37(4), pages 2029-2047, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lee, Wang-Sheng, 2014. "Big and Tall: Is there a Height Premium or Obesity Penalty in the Labor Market?," IZA Discussion Papers 8606, Institute of Labor Economics (IZA).
    2. Simon N. Wood & Natalya Pya & Benjamin Säfken, 2016. "Smoothing Parameter and Model Selection for General Smooth Models," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1548-1563, October.
    3. repec:wyi:journl:002174 is not listed on IDEAS
    4. Simon N. Wood, 2011. "Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 73(1), pages 3-36, January.
    5. Longhi, Christian & Musolesi, Antonio & Baumont, Catherine, 2014. "Modeling structural change in the European metropolitan areas during the process of economic integration," Economic Modelling, Elsevier, vol. 37(C), pages 395-407.
    6. Christian Schellhase & Göran Kauermann, 2012. "Density estimation and comparison with a penalized mixture approach," Computational Statistics, Springer, vol. 27(4), pages 757-777, December.
    7. Takuma Yoshida, 2016. "Asymptotics and smoothing parameter selection for penalized spline regression with various loss functions," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 70(4), pages 278-303, November.
    8. Luo Xiao & Yingxing Li & David Ruppert, 2013. "Fast bivariate P-splines: the sandwich smoother," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 75(3), pages 577-599, June.
    9. Georgios Gioldasis & Antonio Musolesi & Michel Simioni, 2020. "Model uncertainty, nonlinearities and out-of-sample comparison: evidence from international technology diffusion," Working Papers hal-02790523, HAL.
    10. Lee, Wang-Sheng, 2014. "Is the BMI a Relic of the Past?," IZA Discussion Papers 8637, Institute of Labor Economics (IZA).
    11. I. Gijbels & I. Prosdocimi & G. Claeskens, 2010. "Nonparametric estimation of mean and dispersion functions in extended generalized linear models," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 19(3), pages 580-608, November.
    12. Takuma Yoshida & Kanta Naito, 2014. "Asymptotics for penalised splines in generalised additive models," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 26(2), pages 269-289, June.
    13. Øystein Sørensen & Anders M. Fjell & Kristine B. Walhovd, 2023. "Longitudinal Modeling of Age-Dependent Latent Traits with Generalized Additive Latent and Mixed Models," Psychometrika, Springer;The Psychometric Society, vol. 88(2), pages 456-486, June.
    14. Musolesi Antonio & Mazzanti Massimiliano, 2014. "Nonlinearity, heterogeneity and unobserved effects in the carbon dioxide emissions-economic development relation for advanced countries," Studies in Nonlinear Dynamics & Econometrics, De Gruyter, vol. 18(5), pages 1-21, December.
    15. Georgios Gioldasis & Antonio Musolesi & Michel Simioni, 2020. "Model uncertainty, nonlinearities and out-of-sample comparison: evidence from international technology diffusion," SEEDS Working Papers 0120, SEEDS, Sustainability Environmental Economics and Dynamics Studies, revised Jan 2020.
    16. Wu, Ximing & Sickles, Robin, 2018. "Semiparametric estimation under shape constraints," Econometrics and Statistics, Elsevier, vol. 6(C), pages 74-89.
    17. Simon N. Wood, 2020. "Inference and computation with generalized additive models and their extensions," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(2), pages 307-339, June.
    18. Holland, Ashley D., 2017. "Penalized spline estimation in the partially linear model," Journal of Multivariate Analysis, Elsevier, vol. 153(C), pages 211-235.
    19. Mazzanti, Massimiliano & Musolesi, Antonio, 2013. "Nonlinearity, Heterogeneity and Unobserved Effects in the CO2-income Relation for Advanced Countries," Climate Change and Sustainable Development 162374, Fondazione Eni Enrico Mattei (FEEM).
    20. Strasak, Alexander M. & Umlauf, Nikolaus & Pfeiffer, Ruth M. & Lang, Stefan, 2011. "Comparing penalized splines and fractional polynomials for flexible modelling of the effects of continuous predictor variables," Computational Statistics & Data Analysis, Elsevier, vol. 55(4), pages 1540-1551, April.
    21. Sonja Greven & Ciprian Crainiceanu, 2013. "On likelihood ratio testing for penalized splines," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 97(4), pages 387-402, October.
