Evaluating the utility of amino acid similarity-aware kmers to represent TCR repertoires for classification

Authors

  • Hannah Kockelbergh
  • Shelley C Evans
  • Liam Brierley
  • Peter L Green
  • Andrea L Jorgensen
  • Elizabeth J Soilleux
  • Anna Fowler

Abstract

Insights gained through interpretation of models trained on the T-cell receptor (TCR) repertoire contribute to advances in understanding of immune-mediated disease. This has the potential to improve diagnostic tests and treatments, particularly for autoimmune diseases. However, TCR repertoire datasets with samples from donors of known autoimmune disease status generally include orders of magnitude fewer samples than TCR sequences. Promising TCR repertoire classification approaches consider relationships between non-identical TCR sequences. In particular, kmer methods demonstrate strong and stable performance for small datasets. We propose a TCR repertoire representation that considers the relationships between amino acids within kmers flexibly and efficiently. XGBoost and logistic regression models are trained and tested on kmer representations of TCR repertoire datasets including samples from patients with coeliac disease as well as donors with previous cytomegalovirus infection. XGBoost models outperform logistic regression, indicating that interactions may be crucial for discriminative ability. We find that a reduced alphabet based on BLOSUM62 can lead to a model with slightly stronger XGBoost testing performance than other kmer features. Though it remains unclear whether there is an amino acid encoding that can substantially improve TCR repertoire classification with reduced alphabet kmers, evidence that this representation enables faster training of XGBoost models in comparison to kmer clusters suggests that our reduced alphabet approach permits wider exploration of amino acid similarity in practice. Finally, we detail motifs which are important in each top-performing XGBoost model and compare them to TCR sequences previously associated with each immune status. 
We highlight the challenge of interpreting non-linear TCR repertoire classification models trained on kmers which, if overcome, could lead to biomarker discovery for autoimmune diseases.

Author summary: TCR repertoire classification models can provide valuable understanding of autoimmune diseases if they can accurately infer autoimmune disease status and are biologically interpretable. We build on a kmer representation of the TCR repertoire, which, of three popular approaches, has been shown to be the most appropriate for training classification models on smaller datasets. We develop a computationally efficient method of grouping amino acid sequences to add knowledge to immune status classification model inputs. We find that most of the 4mer-based feature types we tested perform well in combination with an XGBoost model, and that applying a halved alphabet of amino acids based on BLOSUM62 may be beneficial or neutral for immune status classification performance. We also consider the effect of models and features on interpretability, and conclude that although some insights may be gained from inspecting feature importance, dedicated explanatory methods are required to truly understand the complex relationships between kmers that are captured by our best-performing XGBoost models. While standard kmer XGBoost models have the shortest training time, our proposed reduced alphabet methodology presents a more efficient alternative to kmer clustering. Future exploration of amino acid similarity with encodings other than those based on Atchley factors or BLOSUM62, as well as of kmer length k, would benefit from our reduced alphabet representation over clustering of kmers.
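The reduced alphabet idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 10-group alphabet, the names `GROUPS`, `REDUCE`, `reduce_sequence` and `kmer_counts`, and the toy sequences are all assumptions made for illustration, and the grouping is not the BLOSUM62-derived alphabet from the paper. The sketch only shows how mapping each amino acid to a group label before counting 4mers lets similar kmers share one feature.

```python
from collections import Counter

# Hypothetical 10-group ("halved") alphabet of chemically similar residues.
# Illustrative only; NOT the BLOSUM62-based grouping used in the paper.
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

# Map every amino acid to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a sequence in the reduced alphabet.

    Raises KeyError for characters outside the 20 standard amino acids.
    """
    return "".join(REDUCE[aa] for aa in seq)


def kmer_counts(sequences, k=4, reduce=True):
    """Count overlapping kmers across all sequences of one repertoire."""
    counts = Counter()
    for seq in sequences:
        s = reduce_sequence(seq) if reduce else seq
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    toy_repertoire = ["CASSL", "CASTI"]  # made-up sequences
    print(kmer_counts(toy_repertoire, reduce=False))
    print(kmer_counts(toy_repertoire, reduce=True))
```

With the reduced alphabet, the two toy sequences collapse onto the same 4mers, so the feature space shrinks without a separate clustering step. The resulting count vectors per sample could then be passed to a classifier such as XGBoost or logistic regression.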

Suggested Citation

  • Hannah Kockelbergh & Shelley C Evans & Liam Brierley & Peter L Green & Andrea L Jorgensen & Elizabeth J Soilleux & Anna Fowler, 2026. "Evaluating the utility of amino acid similarity-aware kmers to represent TCR repertoires for classification," PLOS Computational Biology, Public Library of Science, vol. 22(4), pages 1-28, April.
  • Handle: RePEc:plo:pcbi00:1014211
    DOI: 10.1371/journal.pcbi.1014211

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1014211
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1014211&type=printable
    Download Restriction: no

