Author
Listed:
- Sepideh Mazrouee
- Susan J Little
- Joel O Wertheim
Abstract
HIV molecular epidemiology estimates transmission patterns by clustering genetically similar viruses. The process connects genetically similar genotyped viral sequences in a network whose links imply epidemiological transmissions. This technique relies on genotype data, which are collected only from HIV-diagnosed, in-care populations, and so leaves many persons with HIV (PWH) who have no access to consistent care out of the tracking process. We use machine learning algorithms to learn the non-linear correlation patterns between patient metadata and transmissions between HIV-positive cases. This enables us to expand the transmission network reconstruction beyond the molecular network. We employed multiple commonly used supervised classification algorithms to analyze the San Diego Primary Infection Resource Consortium (PIRC) cohort dataset, consisting of genotypes and nearly 80 additional non-genetic features. First, we trained classification models to distinguish genetically unrelated individuals from related ones. Our results show that random forest and decision tree achieved over 80% in accuracy, precision, recall, and F1-score using, besides genetic data, only a subset of meta-features: age, birth sex, sexual orientation, race, transmission category, estimated date of infection, and first viral load date. Additionally, both algorithms achieved approximately 80% sensitivity and specificity. The area under the curve (AUC) was 97% and 94% for the random forest and decision tree classifiers, respectively. Next, we extended the models to identify clusters of similar viral sequences. A support vector machine demonstrated an order-of-magnitude improvement in the accuracy of assigning sequences to the correct cluster compared to a dummy uniform random classifier. These results confirm that metadata carries important information about the dynamics of HIV transmission as embedded in transmission clusters.
Hence, novel computational approaches are needed to apply the non-trivial knowledge collected from inter-individual genetic information to metadata from PWH in order to expand the estimated transmissions. We note that feature extraction alone will not be effective in identifying patterns of transmission and will result in random clustering of the data, but its utilization in conjunction with genetic data and the right algorithm can contribute to the expansion of the reconstructed network beyond individuals with genetic data.

Author summary: Molecular transmission networks are built by connecting similar HIV-1 drug resistance viral genomes. Using such methods, approximately half of all sequences stay unlinked, and the remaining half fall into many small clusters and a few large clusters showing fragmentary epidemiological relations. However, the unavailability of genetic data for the entire underlying transmission network and the challenges of sampling completeness limit the reliance on creating transmission clusters. Here, we take the known relationship from genetic data and transform the problem into a hybrid multi-step unsupervised and supervised learning problem in which we use contextual data to create a model unwrapping the hidden epidemiological relation among positive cases in transmission networks beyond genetic data. The goal is not to provide a static feature set from metadata that can be used for all jurisdictions, but to offer a dynamic classifier framework that can reveal the unique dynamics of the spread of HIV in each jurisdiction.
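The evaluation metrics named in the abstract (accuracy, precision, recall, F1-score, sensitivity, specificity) and the uniform random baseline can be illustrated with a minimal Python sketch. This is not the authors' code: the class and function names, the confusion counts, and the number of clusters are hypothetical and chosen only to show how the quantities relate. Note that recall and sensitivity are the same quantity.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary 'related vs. unrelated' classifier (hypothetical)."""
    tp: int  # related pairs predicted as related
    fp: int  # unrelated pairs predicted as related
    tn: int  # unrelated pairs predicted as unrelated
    fn: int  # related pairs predicted as unrelated


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(c: ConfusionCounts) -> dict:
    """Compute the standard metrics from confusion counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)  # identical to sensitivity
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def uniform_baseline_accuracy(n_clusters: int) -> float:
    """Expected accuracy of a classifier that picks one of n_clusters
    labels uniformly at random, independent of the input."""
    return 1.0 / n_clusters


# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(ConfusionCounts(tp=80, fp=20, tn=80, fn=20))
# Hypothetical cluster count, used only to show the baseline's scale.
baseline = uniform_baseline_accuracy(50)
```

Under this reading, an order-of-magnitude improvement over the uniform baseline would mean a cluster-assignment accuracy of roughly ten times `1 / n_clusters`, which can still be far below the accuracy reported for the binary related-versus-unrelated task.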
Suggested Citation
Sepideh Mazrouee & Susan J Little & Joel O Wertheim, 2021.
"Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment,"
PLOS Computational Biology, Public Library of Science, vol. 17(9), pages 1-12, September.
Handle:
RePEc:plo:pcbi00:1009336
DOI: 10.1371/journal.pcbi.1009336