
Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Author

Listed:
  • Anton Sugolov

    (University of Toronto)

  • Eric Emmenegger

    (University of Toronto)

  • Andrew D. Paterson

    (University of Toronto)

  • Lei Sun

    (University of Toronto; Dalla Lana School of Public Health, University of Toronto)

Abstract

Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain ∼1.4 million p-values from a real scientific study, within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
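
The regression and multiple-testing steps named in the abstract can be illustrated with a small sketch. This is not the workshop's archived code: the data are simulated, the function names `snp_pvalue` and `bonferroni_threshold` are hypothetical, the p-value uses a normal approximation to the t statistic, and Bonferroni is used here only as one common multiple-testing correction (the abstract does not say which correction the workshop applied).

```python
import math
import random


def snp_pvalue(genotypes, expression):
    """Two-sided p-value for the slope in expression ~ genotype.

    Assumption: a normal approximation to the t statistic is used so the
    sketch needs only the standard library (reasonable for large samples).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:  # monomorphic SNP: the slope is not estimable
        return 1.0
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    if se == 0:
        return 0.0
    z = slope / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


# Simulated toy data (all values are illustrative assumptions).
rng = random.Random(0)
n_people, n_snps, maf = 200, 50, 0.3

# genotypes[j][i]: minor-allele count (0, 1, or 2) of person i at SNP j
genotypes = [
    [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_people)]
    for _ in range(n_snps)
]
# Expression of one gene, driven by SNP 0 plus Gaussian noise
expression = [1.0 * genotypes[0][i] + rng.gauss(0.0, 1.0) for i in range(n_people)]

# One regression per SNP, then a Bonferroni-corrected significance cut-off
p_values = [snp_pvalue(g, expression) for g in genotypes]
threshold = bonferroni_threshold(0.05, n_snps)
hits = [j for j, p in enumerate(p_values) if p < threshold]
print("Bonferroni threshold:", threshold)
print("SNPs passing the threshold:", hits)
```

With roughly 1.4 million tests, as in the abstract, the same correction at a family-wise level of 0.05 would require a per-test p-value below about 3.6e-8, which is why large-scale multiple testing needs separate treatment from a single regression.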

Suggested Citation

  • Anton Sugolov & Eric Emmenegger & Andrew D. Paterson & Lei Sun, 2024. "Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 16(1), pages 250-264, April.
  • Handle: RePEc:spr:stabio:v:16:y:2024:i:1:d:10.1007_s12561-023-09375-9
    DOI: 10.1007/s12561-023-09375-9

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s12561-023-09375-9
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s12561-023-09375-9?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Vivian G. Cheung & Richard S. Spielman & Kathryn G. Ewens & Teresa M. Weber & Michael Morley & Joshua T. Burdick, 2005. "Mapping determinants of human gene expression by regional and genome-wide association," Nature, Nature, vol. 437(7063), pages 1365-1369, October.
    2. Stephen B. Montgomery & Micha Sammeth & Maria Gutierrez-Arcelus & Radoslaw P. Lach & Catherine Ingle & James Nisbett & Roderic Guigo & Emmanouil T. Dermitzakis, 2010. "Transcriptome genetics using second generation sequencing in a Caucasian population," Nature, Nature, vol. 464(7289), pages 773-777, April.
    3. Barbara E Stranger & Stephen B Montgomery & Antigone S Dimas & Leopold Parts & Oliver Stegle & Catherine E Ingle & Magda Sekowska & George Davey Smith & David Evans & Maria Gutierrez-Arcelus & Alkes P, 2012. "Patterns of Cis Regulatory Variation in Diverse Human Populations," PLOS Genetics, Public Library of Science, vol. 8(4), pages 1-13, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Barbara E Stranger & Stephen B Montgomery & Antigone S Dimas & Leopold Parts & Oliver Stegle & Catherine E Ingle & Magda Sekowska & George Davey Smith & David Evans & Maria Gutierrez-Arcelus & Alkes P, 2012. "Patterns of Cis Regulatory Variation in Diverse Human Populations," PLOS Genetics, Public Library of Science, vol. 8(4), pages 1-13, April.
    2. Alexandra C Nica & Leopold Parts & Daniel Glass & James Nisbet & Amy Barrett & Magdalena Sekowska & Mary Travers & Simon Potter & Elin Grundberg & Kerrin Small & Åsa K Hedman & Veronique Bataille & Jo, 2011. "The Architecture of Gene Regulatory Variation across Multiple Human Tissues: The MuTHER Study," PLOS Genetics, Public Library of Science, vol. 7(2), pages 1-9, February.
    3. Daria V Zhernakova & Eleonora de Klerk & Harm-Jan Westra & Anastasios Mastrokolias & Shoaib Amini & Yavuz Ariyurek & Rick Jansen & Brenda W Penninx & Jouke J Hottenga & Gonneke Willemsen & Eco J de Ge, 2013. "DeepSAGE Reveals Genetic Variants Associated with Alternative Polyadenylation and Expression of Coding and Non-coding Transcripts," PLOS Genetics, Public Library of Science, vol. 9(6), pages 1-15, June.
    4. Yixin Fang & Yang Feng & Ming Yuan, 2014. "Regularized principal components of heritability," Computational Statistics, Springer, vol. 29(3), pages 455-465, June.
    5. Brielin C Brown & Nicolas L Bray & Lior Pachter, 2018. "Expression reflects population structure," PLOS Genetics, Public Library of Science, vol. 14(12), pages 1-15, December.
    6. Kyung-Won Hong & Seok Won Jeong & Myungguen Chung & Seong Beom Cho, 2014. "Association between Expression Quantitative Trait Loci and Metabolic Traits in Two Korean Populations," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-13, December.
    7. Ryan Abo & Gregory D Jenkins & Liewei Wang & Brooke L Fridley, 2012. "Identifying the Genetic Variation of Gene Expression Using Gene Sets: Application of Novel Gene Set eQTL Approach to PharmGKB and KEGG," PLOS ONE, Public Library of Science, vol. 7(8), pages 1-11, August.
    8. Jin Hyun Ju & Sushila A Shenoy & Ronald G Crystal & Jason G Mezey, 2017. "An independent component analysis confounding factor correction framework for identifying broad impact expression quantitative trait loci," PLOS Computational Biology, Public Library of Science, vol. 13(5), pages 1-26, May.
    9. Jungsoo Gim & Sungho Won & Taesung Park, 2016. "LPEseq: Local-Pooled-Error Test for RNA Sequencing Experiments with a Small Number of Replicates," PLOS ONE, Public Library of Science, vol. 11(8), pages 1-15, August.
    10. Ning Jiang & Minghui Wang & Tianye Jia & Lin Wang & Lindsey Leach & Christine Hackett & David Marshall & Zewei Luo, 2011. "A Robust Statistical Method for Association-Based eQTL Analysis," PLOS ONE, Public Library of Science, vol. 6(8), pages 1-11, August.
    11. Paul C Boutros & Ivy D Moffat & Allan B Okey & Raimo Pohjanvirta, 2011. "mRNA Levels in Control Rat Liver Display Strain-Specific, Hereditary, and AHR-Dependent Components," PLOS ONE, Public Library of Science, vol. 6(7), pages 1-15, July.
    12. Faisal Shahla & Tutz Gerhard, 2017. "Missing value imputation for gene expression data by tailored nearest neighbors," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 16(2), pages 95-106, April.
    13. Eric O Johnson & Dana B Hancock & Nathan C Gaddis & Joshua L Levy & Grier Page & Scott P Novak & Cristie Glasheen & Nancy L Saccone & John P Rice & Michael P Moreau & Kimberly F Doheny & Jane M Romm &, 2015. "Novel Genetic Locus Implicated for HIV-1 Acquisition with Putative Regulatory Links to HIV Replication and Infectivity: A Genome-Wide Association Study," PLOS ONE, Public Library of Science, vol. 10(3), pages 1-15, March.
    14. Jae Hoon Sul & Buhm Han & Chun Ye & Ted Choi & Eleazar Eskin, 2013. "Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches," PLOS Genetics, Public Library of Science, vol. 9(6), pages 1-13, June.
    15. Hui-Min Wang & Ching-Lin Hsiao & Ai-Ru Hsieh & Ying-Chao Lin & Cathy S J Fann, 2012. "Constructing Endophenotypes of Complex Diseases Using Non-Negative Matrix Factorization and Adjusted Rand Index," PLOS ONE, Public Library of Science, vol. 7(7), pages 1-12, July.
    16. Thanh Nguyen & Asim Bhatti & Samuel Yang & Saeid Nahavandi, 2016. "RNA-Seq Count Data Modelling by Grey Relational Analysis and Nonparametric Gaussian Process," PLOS ONE, Public Library of Science, vol. 11(10), pages 1-18, October.
    17. Heather E Wheeler & Kaanan P Shah & Jonathon Brenner & Tzintzuni Garcia & Keston Aquino-Michaels & GTEx Consortium & Nancy J Cox & Dan L Nicolae & Hae Kyung Im, 2016. "Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues," PLOS Genetics, Public Library of Science, vol. 12(11), pages 1-23, November.
    18. Farnoosh Abbas-Aghababazadeh & Qian Li & Brooke L Fridley, 2018. "Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing," PLOS ONE, Public Library of Science, vol. 13(10), pages 1-21, October.
    19. Josine L Min & Jennifer M Taylor & J Brent Richards & Tim Watts & Fredrik H Pettersson & John Broxholme & Kourosh R Ahmadi & Gabriela L Surdulescu & Ernesto Lowy & Christian Gieger & Chris Newton-Cheh, 2011. "The Use of Genome-Wide eQTL Associations in Lymphoblastoid Cell Lines to Identify Novel Genetic Pathways Involved in Complex Traits," PLOS ONE, Public Library of Science, vol. 6(7), pages 1-14, July.
    20. Kensuke Yamaguchi & Kazuyoshi Ishigaki & Akari Suzuki & Yumi Tsuchida & Haruka Tsuchiya & Shuji Sumitomo & Yasuo Nagafuchi & Fuyuki Miya & Tatsuhiko Tsunoda & Hirofumi Shoda & Keishi Fujio & Kazuhiko , 2022. "Splicing QTL analysis focusing on coding sequences reveals mechanisms for disease susceptibility loci," Nature Communications, Nature, vol. 13(1), pages 1-13, December.

