Printed from https://ideas.repec.org/a/jss/jstsof/v055i14.html

Scalable Strategies for Computing with Massive Data

Author

Listed:
  • Kane, Michael
  • Emerson, John W.
  • Weston, Stephen

Abstract

This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
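As a rough illustration of how the two packages described above might be combined, the following minimal R sketch computes column means of a shared big.matrix, first sequentially and then in parallel. It is not taken from the paper: the matrix dimensions, the variable names, and the choice of doParallel as the backend are assumptions made for the example.

```r
# Minimal sketch, not the authors' code. Sizes, names, and the doParallel
# backend are illustrative assumptions.
library(foreach)
library(bigmemory)
library(doParallel)

# A shared-memory matrix: worker processes can attach to it rather than copy it.
x <- big.matrix(nrow = 1000, ncol = 4, type = "double", shared = TRUE)
set.seed(1)
x[, ] <- matrix(rnorm(4000), nrow = 1000, ncol = 4)

# The descriptor is a small object telling another R process where the data live.
desc <- describe(x)

# Sequential run: %do% needs no parallel backend.
seq_means <- foreach(j = 1:4, .combine = c) %do% {
  mean(x[, j])
}

# Parallel run: same loop structure, using %dopar% and a registered backend.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)
par_means <- foreach(j = 1:4, .combine = c, .packages = "bigmemory") %dopar% {
  xw <- attach.big.matrix(desc)  # attach to the shared data inside the worker
  mean(xw[, j])
}
parallel::stopCluster(cl)

# Both versions should agree.
stopifnot(isTRUE(all.equal(seq_means, par_means)))
```

Only the loop operator and the registered backend differ between the two runs, which is the sense in which the loop carries no platform-specific code. Passing the descriptor rather than the matrix itself means each worker attaches to the same data instead of receiving a copy.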

Suggested Citation

  • Kane, Michael & Emerson, John W. & Weston, Stephen, 2013. "Scalable Strategies for Computing with Massive Data," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 55(i14).
  • Handle: RePEc:jss:jstsof:v:055:i14
    DOI: http://hdl.handle.net/10.18637/jss.v055.i14

    Download full text from publisher

    File URL: https://www.jstatsoft.org/index.php/jss/article/view/v055i14/v55i14.pdf
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v055i14/bigmemory.4.4.5-1.tar.gz
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v055i14/foreach.1.4.1-1.tar.gz
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v055i14/v55i14-replication.zip
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v055i14/Airline.tar.bz2
    Download Restriction: no



    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Marin FOTACHE, 2016. "Data Processing Languages for Business Intelligence. SQL vs. R," Informatica Economica, Academy of Economic Studies - Bucharest, Romania, vol. 20(1), pages 48-61.
    2. Fulya Gokalp Yavuz & Barret Schloerke, 2020. "Parallel computing in linear mixed models," Computational Statistics, Springer, vol. 35(3), pages 1273-1289, September.
    3. Junyang Qian & Yosuke Tanigawa & Wenfei Du & Matthew Aguirre & Chris Chang & Robert Tibshirani & Manuel A Rivas & Trevor Hastie, 2020. "A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank," PLOS Genetics, Public Library of Science, vol. 16(10), pages 1-30, October.
    4. Hofert, Marius & Mächler, Martin, 2016. "Parallel and Other Simulations in R Made Easy: An End-to-End Study," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 69(i04).
    5. Aaron T L Lun & Hervé Pagès & Mike L Smith, 2018. "beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types," PLOS Computational Biology, Public Library of Science, vol. 14(5), pages 1-15, May.
    6. Lining Yu & Wolfgang Karl Härdle & Lukas Borke & Thijs Benschop, 2017. "FRM: a Financial Risk Meter based on penalizing tail events occurrence," SFB 649 Discussion Papers SFB649DP2017-003, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.
    7. Bandara, Kanchana & Varpe, Øystein & Ji, Rubao & Eiane, Ketil, 2018. "A high-resolution modeling study on diel and seasonal vertical migrations of high-latitude copepods," Ecological Modelling, Elsevier, vol. 368(C), pages 357-376.
    8. Bischl, Bernd & Lang, Michel & Mersmann, Olaf & Rahnenführer, Jörg & Weihs, Claus, 2015. "BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 64(i11).
    9. Fang, Jianglin, 2023. "A split-and-conquer variable selection approach for high-dimensional general semiparametric models with massive data," Journal of Multivariate Analysis, Elsevier, vol. 194(C).
    10. Lining Yu & Wolfgang Karl Hardle & Lukas Borke & Thijs Benschop, 2020. "An AI approach to measuring financial risk," Papers 2009.13222, arXiv.org.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dolejš Martin & Forejt Michal, 2019. "Franziscean cadastre in landscape structure research: a systematic review," Quaestiones Geographicae, Sciendo, vol. 38(1), pages 131-144, March.
    2. Ravic Nijbroek & Kristin Piikki & Mats Söderström & Bas Kempen & Katrine G. Turner & Simeon Hengari & John Mutua, 2018. "Soil Organic Carbon Baselines for Land Degradation Neutrality: Map Accuracy and Cost Tradeoffs with Respect to Complexity in Otjozondjupa, Namibia," Sustainability, MDPI, vol. 10(5), pages 1-20, May.
    3. Merl, Robert & Stöckl, Thomas & Palan, Stefan, 2023. "Insider trading regulation and shorting constraints. Evaluating the joint effects of two market interventions," Journal of Banking & Finance, Elsevier, vol. 154(C).
    4. Miller, Christine M.F. & Waterhouse, Hannah & Harter, Thomas & Fadel, James G. & Meyer, Deanne, 2020. "Quantifying the uncertainty in nitrogen application and groundwater nitrate leaching in manure based cropping systems," Agricultural Systems, Elsevier, vol. 184(C).
    5. Sarlas, Georgios & Páez, Antonio & Axhausen, Kay W., 2020. "Betweenness-accessibility: Estimating impacts of accessibility on networks," Journal of Transport Geography, Elsevier, vol. 84(C).
    6. Marin FOTACHE & Florin DUMITRU & Valerica GREAVU-SERBAN, 2015. "An Information Systems Master Programme in Romania. Some Commonalities and Specificities," Informatica Economica, Academy of Economic Studies - Bucharest, Romania, vol. 19(3), pages 5-18.
    7. Martijn Van Heel & Dinska Van Gucht & Koen Vanbrabant & Frank Baeyens, 2017. "The Importance of Conditioned Stimuli in Cigarette and E-Cigarette Craving Reduction by E-Cigarettes," IJERPH, MDPI, vol. 14(2), pages 1-18, February.
    8. Sean McKenzie & Hilary Parkinson & Jane Mangold & Mary Burrows & Selena Ahmed & Fabian Menalled, 2018. "Perceptions, Experiences, and Priorities Supporting Agroecosystem Management Decisions Differ among Agricultural Producers, Consultants, and Researchers," Sustainability, MDPI, vol. 10(11), pages 1-19, November.
    9. Milad Abbasiharofteh & Tom Broekel, 2021. "Still in the shadow of the wall? The case of the Berlin biotechnology cluster," Environment and Planning A, vol. 53(1), pages 73-94, February.
    10. Andee J. Kaplan & Eric R. Hare, 2019. "Putting down roots: a graphical exploration of community attachment," Computational Statistics, Springer, vol. 34(4), pages 1449-1464, December.
    11. Ahmad Alsaber & Jiazhu Pan & Adeeba Al-Herz & Dhary S. Alkandary & Adeeba Al-Hurban & Parul Setiya & on behalf of the KRRD Group, 2020. "Influence of Ambient Air Pollution on Rheumatoid Arthritis Disease Activity Score Index," IJERPH, MDPI, vol. 17(2), pages 1-17, January.
    12. Haunschild, Robin & Bornmann, Lutz, 2023. "Which papers cited which tweets? An exploratory analysis based on Scopus data," Journal of Informetrics, Elsevier, vol. 17(2).
    13. Fulya Gokalp Yavuz & Barret Schloerke, 2020. "Parallel computing in linear mixed models," Computational Statistics, Springer, vol. 35(3), pages 1273-1289, September.
    14. Rebecca Hong & Monica Perkins & Belinda J. Gabbe & Lincoln M. Tracy, 2022. "Comparing Peak Burn Injury Times and Characteristics in Australia and New Zealand," IJERPH, MDPI, vol. 19(15), pages 1-9, August.
    15. Ioannis Politis & Ioannis Fyrogenis & Efthymis Papadopoulos & Anastasia Nikolaidou & Eleni Verani, 2020. "Shifting to Shared Wheels: Factors Affecting Dockless Bike-Sharing Choice for Short and Long Trips," Sustainability, MDPI, vol. 12(19), pages 1-25, October.
    16. Paul J McMurdie & Susan Holmes, 2014. "Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible," PLOS Computational Biology, Public Library of Science, vol. 10(4), pages 1-12, April.
    17. Debanjan Mukherjee & Ângelo Ferreira Chora & Jean-Christophe Lone & Ricardo S. Ramiro & Birte Blankenhaus & Karine Serre & Mário Ramirez & Isabel Gordo & Marc Veldhoen & Patrick Varga-Weisz & Maria M., 2022. "Host lung microbiota promotes malaria-associated acute respiratory distress syndrome," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    18. Eff, Ellis Anthon, 2013. "Settlers and surnames: An atlas illustrating the origins of settlers in 19th century America," MPRA Paper 56296, University Library of Munich, Germany.
    19. Karl E. Bauer & Niklas Bargenda & Rico Schieweck & Christin Illig & Inmaculada Segura & Max Harner & Michael A. Kiebler, 2022. "RNA supply drives physiological granule assembly in neurons," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    20. Lee-Ann Sutherland & Carla Barlagne & Andrew P. Barnes, 2019. "Beyond ‘Hobby Farming’: towards a typology of non-commercial farming," Agriculture and Human Values, Springer;The Agriculture, Food, & Human Values Society (AFHVS), vol. 36(3), pages 475-493, September.

