Printed from https://ideas.repec.org/a/taf/jnlasa/v112y2017i519p1199-1210.html

Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data

Author

Listed:
  • Simon N. Wood
  • Zheyuan Li
  • Gavin Shaddick
  • Nicole H. Augustin

Abstract

We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 10^4 coefficients to up to 10^8 data. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration’s pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
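
The idea in point (iii) of the abstract, computing crossproducts from a discretized representation rather than from the full model matrix, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names (discretize, poly_basis, crossprod_discrete), the evenly spaced grid, and the polynomial stand-in basis are all assumptions made for illustration. Each covariate is snapped onto a small grid, the basis is evaluated only at the grid values, and an integer index vector records which grid row each observation uses.

```python
import numpy as np

def discretize(x, m):
    """Snap x onto m evenly spaced grid values (assumes x is not constant).

    Returns (grid, index) such that x[i] is approximated by grid[index[i]].
    """
    lo, hi = x.min(), x.max()
    grid = np.linspace(lo, hi, m)
    index = np.rint((x - lo) / (hi - lo) * (m - 1)).astype(np.intp)
    return grid, index

def poly_basis(v, p):
    """Hypothetical stand-in basis with columns 1, v, ..., v^(p-1)."""
    return np.vander(v, p, increasing=True)

def crossprod_discrete(A_bar, ka, B_bar, kb, w=None):
    """Compute A^T diag(w) B for A = A_bar[ka], B = B_bar[kb] without forming A or B."""
    ma, mb = A_bar.shape[0], B_bar.shape[0]
    # C[i, j] accumulates the weight (or count) of rows with ka == i and kb == j.
    C = np.bincount(ka * mb + kb, weights=w, minlength=ma * mb).reshape(ma, mb)
    return A_bar.T @ C @ B_bar

rng = np.random.default_rng(0)
n, m, p = 100_000, 50, 4
x1, x2 = rng.uniform(size=n), rng.normal(size=n)
g1, k1 = discretize(x1, m)
g2, k2 = discretize(x2, m)
A_bar, B_bar = poly_basis(g1, p), poly_basis(g2, p)  # m x p each, instead of n x p
XtX_block = crossprod_discrete(A_bar, k1, B_bar, k2)
```

Storage here is two m x p basis tables plus two length-n integer index vectors, rather than two n x p matrices. The dense m x m count table C is a simplification that only suits small grids, and nothing here reproduces the paper's iteration scheme or its parallel Cholesky step.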

Suggested Citation

  • Simon N. Wood & Zheyuan Li & Gavin Shaddick & Nicole H. Augustin, 2017. "Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 112(519), pages 1199-1210, July.
  • Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1199-1210
    DOI: 10.1080/01621459.2016.1195744

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/01621459.2016.1195744
    Download Restriction: Access to full text is restricted to subscribers.



    Citations

    Cited by:

    1. Konstantin Sering & Petar Milin & R. Harald Baayen, 2018. "Language comprehension as a multi‐label classification problem," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 72(3), pages 339-353, August.
    2. David L. Miller & Richard Glennie & Andrew E. Seaton, 2020. "Understanding the Stochastic Partial Differential Equation Approach to Smoothing," Journal of Agricultural, Biological and Environmental Statistics, Springer;The International Biometric Society;American Statistical Association, vol. 25(1), pages 1-16, March.
    3. Anne-Sophie Krah & Zoran Nikolić & Ralf Korn, 2020. "Machine Learning in Least-Squares Monte Carlo Proxy Modeling of Life Insurance Companies," Risks, MDPI, Open Access Journal, vol. 8(1), pages 1-79, February.
    4. Anne-Sophie Krah & Zoran Nikolić & Ralf Korn, 2019. "Machine Learning in Least-Squares Monte Carlo Proxy Modeling of Life Insurance Companies," Papers 1909.02182, arXiv.org.
