Printed from https://ideas.repec.org/a/nat/natcom/v14y2023i1d10.1038_s41467-023-42992-y.html

Exploiting redundancy in large materials datasets for efficient machine learning with less data

Author

Listed:
  • Kangming Li

    (University of Toronto)

  • Daniel Persaud

    (University of Toronto)

  • Kamal Choudhary

    (National Institute of Standards and Technology)

  • Brian DeCost

    (National Institute of Standards and Technology)

  • Michael Greenwood

    (Natural Resources Canada)

  • Jason Hattrick-Simpers

    (University of Toronto
    Vector Institute for Artificial Intelligence
    Schwartz Reisman Institute for Technology and Society)

Abstract

Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the “bigger is better” mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
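As a rough illustration of the uncertainty-based active learning idea mentioned in the abstract, the sketch below grows a small training set from a pool by repeatedly labeling the point on which a bootstrap ensemble disagrees most. It is a toy example under stated assumptions, not the authors' algorithm: the 1-D pool with one over-represented cluster, the 1-nearest-neighbour committee, the `oracle` function, and all function and parameter names are hypothetical choices made for this illustration.

```python
import math
import random
import statistics


def nearest_label(x, sample):
    """Predict with 1-nearest-neighbour on a list of (x, y) pairs."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]


def ensemble_uncertainty(xs, labeled, rng, n_members=10):
    """Std. dev. of predictions across bootstrap-resampled 1-NN models.

    One bootstrap resample of the labeled set is drawn per ensemble member;
    the spread of the members' predictions is used as the uncertainty score.
    """
    boots = [[rng.choice(labeled) for _ in labeled] for _ in range(n_members)]
    return [statistics.pstdev([nearest_label(x, b) for b in boots]) for x in xs]


def active_learning_select(pool, oracle, n_init=5, budget=20, seed=0):
    """Return indices of pool points chosen by uncertainty sampling."""
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    # Start from a small random labeled set.
    chosen = [unlabeled.pop() for _ in range(n_init)]
    labeled = [(pool[i], oracle(pool[i])) for i in chosen]
    while len(chosen) < budget and unlabeled:
        scores = ensemble_uncertainty([pool[i] for i in unlabeled], labeled, rng)
        # Query the pool point with the largest ensemble disagreement.
        best = max(range(len(scores)), key=scores.__getitem__)
        idx = unlabeled.pop(best)
        chosen.append(idx)
        labeled.append((pool[idx], oracle(pool[idx])))
    return chosen


def oracle(x):
    """Hypothetical stand-in for an expensive property calculation."""
    return math.sin(6.0 * x)


# Toy pool (assumption): 100 near-duplicate points in a narrow cluster
# plus 20 points spread evenly over the rest of the interval [0, 1].
_rng = random.Random(1)
pool = [_rng.uniform(0.0, 0.05) for _ in range(100)]
pool += [0.05 + 0.95 * k / 19 for k in range(20)]

selected = active_learning_select(pool, oracle, n_init=5, budget=20, seed=0)
print(len(selected), "of", len(pool), "pool points selected for labeling")
```

The committee here is deliberately simple; any model that yields a per-sample uncertainty estimate could be swapped in for `ensemble_uncertainty` without changing the selection loop.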

Suggested Citation

  • Kangming Li & Daniel Persaud & Kamal Choudhary & Brian DeCost & Michael Greenwood & Jason Hattrick-Simpers, 2023. "Exploiting redundancy in large materials datasets for efficient machine learning with less data," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
  • Handle: RePEc:nat:natcom:v:14:y:2023:i:1:d:10.1038_s41467-023-42992-y
    DOI: 10.1038/s41467-023-42992-y

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-023-42992-y
    File Function: Abstract
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kihoon Bang & Doosun Hong & Youngtae Park & Donghun Kim & Sang Soo Han & Hyuck Mo Lee, 2023. "Machine learning-enabled exploration of the electrochemical stability of real-scale metallic nanoparticles," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    2. Han Li & Ruotian Zhang & Yaosen Min & Dacheng Ma & Dan Zhao & Jianyang Zeng, 2023. "A knowledge-guided pre-training framework for improving molecular representation learning," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    3. Li, Yi & Liu, Kailong & Foley, Aoife M. & Zülke, Alana & Berecibar, Maitane & Nanini-Maury, Elise & Van Mierlo, Joeri & Hoster, Harry E., 2019. "Data-driven health estimation and lifetime prediction of lithium-ion batteries: A review," Renewable and Sustainable Energy Reviews, Elsevier, vol. 113(C), pages 1-1.
    4. Sarmad Dashti Latif & Ali Najah Ahmed, 2023. "A review of deep learning and machine learning techniques for hydrological inflow forecasting," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 25(11), pages 12189-12216, November.
    5. Niklas W. A. Gebauer & Michael Gastegger & Stefaan S. P. Hessmann & Klaus-Robert Müller & Kristof T. Schütt, 2022. "Inverse design of 3d molecular structures with conditional generative neural networks," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    6. Gang Wang & Shinya Mine & Duotian Chen & Yuan Jing & Kah Wei Ting & Taichi Yamaguchi & Motoshi Takao & Zen Maeno & Ichigaku Takigawa & Koichi Matsushita & Ken-ichi Shimizu & Takashi Toyao, 2023. "Accelerated discovery of multi-elemental reverse water-gas shift catalysts using extrapolative machine learning approach," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    7. Huziel E. Sauceda & Luis E. Gálvez-González & Stefan Chmiela & Lauro Oliver Paz-Borbón & Klaus-Robert Müller & Alexandre Tkatchenko, 2022. "BIGDML—Towards accurate quantum machine learning force fields for materials," Nature Communications, Nature, vol. 13(1), pages 1-16, December.
    8. Sukriti Manna & Troy D. Loeffler & Rohit Batra & Suvo Banik & Henry Chan & Bilvin Varughese & Kiran Sasikumar & Michael Sternberg & Tom Peterka & Mathew J. Cherukara & Stephen K. Gray & Bobby G. Sumpt, 2022. "Learning in continuous action space for developing high dimensional potential energy models," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    9. Ribeiro, Haroldo V. & Lopes, Diego D. & Pessa, Arthur A.B. & Martins, Alvaro F. & da Cunha, Bruno R. & Gonçalves, Sebastián & Lenzi, Ervin K. & Hanley, Quentin S. & Perc, Matjaž, 2023. "Deep learning criminal networks," Chaos, Solitons & Fractals, Elsevier, vol. 172(C).
    10. Xiaojie She & Lingling Zhai & Yifei Wang & Pei Xiong & Molly Meng-Jung Li & Tai-Sing Wu & Man Chung Wong & Xuyun Guo & Zhihang Xu & Huaming Li & Hui Xu & Ye Zhu & Shik Chi Edman Tsang & Shu Ping Lau, 2024. "Pure-water-fed, electrocatalytic CO2 reduction to ethylene beyond 1,000 h stability at 10 A," Nature Energy, Nature, vol. 9(1), pages 81-91, January.
    11. Jin Zhang & Chenxi Guo & Susu Fang & Xiaotong Zhao & Le Li & Haoyang Jiang & Zhaoyang Liu & Ziqi Fan & Weigao Xu & Jianping Xiao & Miao Zhong, 2023. "Accelerating electrochemical CO2 reduction to multi-carbon products via asymmetric intermediate binding at confined nanointerfaces," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    12. Dong Hyeon Mok & Hong Li & Guiru Zhang & Chaehyeon Lee & Kun Jiang & Seoin Back, 2023. "Data-driven discovery of electrocatalysts for CO2 reduction using active motifs-based machine learning," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    13. Xinyu Chen & Yufeng Xie & Yaochen Sheng & Hongwei Tang & Zeming Wang & Yu Wang & Yin Wang & Fuyou Liao & Jingyi Ma & Xiaojiao Guo & Ling Tong & Hanqi Liu & Hao Liu & Tianxiang Wu & Jiaxin Cao & Sitong, 2021. "Wafer-scale functional circuits based on two dimensional semiconductors with fabrication optimized by machine learning," Nature Communications, Nature, vol. 12(1), pages 1-8, December.
    14. Jiawei Li & Hongliang Zeng & Xue Dong & Yimin Ding & Sunpei Hu & Runhao Zhang & Yizhou Dai & Peixin Cui & Zhou Xiao & Donghao Zhao & Liujiang Zhou & Tingting Zheng & Jianping Xiao & Jie Zeng & Chuan X, 2023. "Selective CO2 electrolysis to CO using isolated antimony alloyed copper," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    15. Yufei Cao & Zhu Chen & Peihao Li & Adnan Ozden & Pengfei Ou & Weiyan Ni & Jehad Abed & Erfan Shirzadi & Jinqiang Zhang & David Sinton & Jun Ge & Edward H. Sargent, 2023. "Surface hydroxide promotes CO2 electrolysis to ethylene in acidic conditions," Nature Communications, Nature, vol. 14(1), pages 1-8, December.
    16. Jiawei Zhu & Yu Zhang & Zitao Chen & Zhenbao Zhang & Xuezeng Tian & Minghua Huang & Xuedong Bai & Xue Wang & Yongfa Zhu & Heqing Jiang, 2024. "Superexchange-stabilized long-distance Cu sites in rock-salt-ordered double perovskite oxides for CO2 electromethanation," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    17. Stefan Ringe, 2023. "The importance of a charge transfer descriptor for screening potential CO2 reduction electrocatalysts," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    18. Pessa, Arthur A.B. & Zola, Rafael S. & Perc, Matjaž & Ribeiro, Haroldo V., 2022. "Determining liquid crystal properties with ordinal networks and machine learning," Chaos, Solitons & Fractals, Elsevier, vol. 154(C).
    19. Xiaohan Yu & Yuting Xu & Le Li & Mingzhe Zhang & Wenhao Qin & Fanglin Che & Miao Zhong, 2024. "Coverage enhancement accelerates acidic CO2 electrolysis at ampere-level current with high energy and carbon efficiencies," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    20. Kai Li & Jifeng Wang & Yuanyuan Song & Ying Wang, 2023. "Machine learning-guided discovery of ionic polymer electrolytes for lithium metal batteries," Nature Communications, Nature, vol. 14(1), pages 1-12, December.



    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.