Authors
- Yuanyuan Feng
(Zhejiang Lab)
- Junchao Shi
(Zhejiang Lab)
- Zhanwei Li
(Zhejiang Lab)
- Yongqian Li
(Zhejiang Lab)
- Jiaxi Yang
(Zhejiang Lab)
- Shisheng Huang
(Zhejiang Lab)
- Jinfang Zheng
(Zhejiang Lab)
- Wei Han
(Zhejiang Lab)
- Yunbo Qiao
(Shanghai Jiao Tong University School of Medicine
Shanghai Institute of Precision Medicine)
- Jun Zhang
(Nanjing Medical University)
- Qi Liu
(Tongji University)
- Yao Yang
(Zhejiang Lab)
- Chunyi Hu
(National University of Singapore)
- Lina Wu
(Nanjing Normal University)
- Xiaokang Zhang
(Chinese Academy of Sciences)
- Jin Tang
(Zhejiang Lab)
- Xingxu Huang
(Zhejiang Lab
ShanghaiTech University
Zhejiang University School of Medicine)
- Peixiang Ma
(Shanghai Jiao Tong University School of Medicine)
Abstract
CRISPR-Cas systems revolutionize life science. Metagenomes contain millions of unknown Cas proteins. Traditional mining relies on protein sequence alignments. In this work, we employ an evolutionary scale language model (ESM) to learn the information beyond sequences. Trained with CRISPR-Cas data, ESM accurately identifies Cas proteins without alignment. Limited experimental data restricts feature prediction, but integrating with machine learning enables trans-cleavage activity prediction of uncharacterized Cas12a. We discover 7 undocumented Cas12a subtypes with unique CRISPR loci. Structural analyses reveal 8 subtypes of Cas1, Cas2, and Cas4. Cas12a subtypes display distinct 3D-folds. CryoEM analyses unveil unique RNA interactions with the uncharacterized Cas12a. These proteins show distinct double-strand and single-strand DNA cleavage preferences and broad PAM recognition. Finally, we establish a specific detection strategy for the oncogene SNP without traditional Cas12a PAM. This study highlights the potential of language models in exploring undocumented Cas protein function via gene cluster classification.
Suggested Citation
Yuanyuan Feng & Junchao Shi & Zhanwei Li & Yongqian Li & Jiaxi Yang & Shisheng Huang & Jinfang Zheng & Wei Han & Yunbo Qiao & Jun Zhang & Qi Liu & Yao Yang & Chunyi Hu & Lina Wu & Xiaokang Zhang & Jin Tang & Xingxu Huang & Peixiang Ma, 2025.
"Discovery of CRISPR-Cas12a clades using a large language model,"
Nature Communications, Nature, vol. 16(1), pages 1-17, December.
Handle:
RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-63160-4
DOI: 10.1038/s41467-025-63160-4