Multiple Testing and Data Adaptive Regression: An Application to HIV-1 Sequence Data


  • Merrill Birkner

    (Division of Biostatistics, School of Public Health, University of California, Berkeley)

  • Sandra Sinisi

    (Division of Biostatistics, School of Public Health, University of California, Berkeley)

  • Mark van der Laan

    (Division of Biostatistics, School of Public Health, University of California, Berkeley)


Analyzing viral sequence data together with viral replication capacity could yield biological insight into the replication ability of HIV-1. Identifying specific target codons on the viral strand would facilitate the development of target-specific antiretrovirals. Several algorithmic and statistical techniques can be applied to this problem. We propose using multiple testing to find codons that have significant univariate associations with the replication capacity of the virus. We also propose using a data-adaptive multiple regression algorithm to obtain predictions of viral replication capacity based on an entire mutant/non-mutant sequence profile. The data set to which these techniques were applied consists of 317 patients, each with 282 sequenced protease and reverse transcriptase codons. First, the multiple testing procedure of Pollard and van der Laan (2003) was applied to the individual-specific viral sequence data. A single-step multiple testing procedure was used to control the family-wise error rate (FWER) at the five percent level. Augmentation multiple testing procedures were then applied to control the generalized family-wise error rate (gFWER) or the tail probability of the proportion of false positives (TPPFP). Finally, the loss-based, cross-validated Deletion/Substitution/Addition regression algorithm (Sinisi and van der Laan, 2004) was applied separately to the same data set. This algorithm builds candidate estimators for predicting a univariate outcome by minimizing an empirical risk, and it uses cross-validation to select fine-tuning parameters such as the size of the regression model, the maximum allowed order of interaction among terms in the regression model, and the dimension of the vector of covariates. The algorithm is also used to measure the variable importance of the codons. Findings from these analyses are consistent with biological findings and could lead to further biological knowledge about HIV-1.
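The augmentation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that adjusted p-values from some FWER-controlling procedure are already available. Under that assumption, gFWER(k) control is obtained by rejecting the k next most significant hypotheses, and TPPFP(q) control by adding the largest number of hypotheses a for which a / (a + R0) stays at or below q, where R0 is the number of initial FWER rejections. The function names and the example p-values are hypothetical and are not taken from the paper or from any particular software package.

```python
def _ranked(adj_p):
    """Indices of hypotheses ordered from most to least significant."""
    return sorted(range(len(adj_p)), key=lambda i: adj_p[i])


def augment_gfwer(adj_p, alpha=0.05, k=1):
    """Reject the FWER rejections plus the k next most significant hypotheses."""
    r0 = sum(1 for p in adj_p if p <= alpha)  # initial FWER rejections
    return _ranked(adj_p)[: min(r0 + k, len(adj_p))]


def augment_tppfp(adj_p, alpha=0.05, q=0.1):
    """Add the largest number a of hypotheses such that a / (a + r0) <= q."""
    m = len(adj_p)
    r0 = sum(1 for p in adj_p if p <= alpha)
    a = 0
    # Grow the augmentation set while the worst-case proportion of
    # false positives among all rejections stays at or below q.
    while r0 + a + 1 <= m and (a + 1) / (a + 1 + r0) <= q:
        a += 1
    return _ranked(adj_p)[: r0 + a]


# Hypothetical adjusted p-values for eight codons (illustration only).
example_adj_p = [0.01, 0.2, 0.03, 0.5, 0.04, 0.06, 0.9, 0.02]
gfwer_set = augment_gfwer(example_adj_p, alpha=0.05, k=2)
tppfp_set = augment_tppfp(example_adj_p, alpha=0.05, q=0.25)
```

In this toy input four hypotheses are rejected at the FWER level, so the gFWER(2) set contains six hypotheses and the TPPFP(0.25) set contains five, since one added rejection gives a proportion of 1/5 while two would give 2/6.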

Suggested Citation

  • Merrill Birkner & Sandra Sinisi & Mark van der Laan, 2004. "Multiple Testing and Data Adaptive Regression: An Application to HIV-1 Sequence Data," U.C. Berkeley Division of Biostatistics Working Paper Series 1161, Berkeley Electronic Press.
  • Handle: RePEc:bep:ucbbio:1161

    References listed on IDEAS

    1. Mark J. van der Laan & Sandrine Dudoit & Katherine S. Pollard, 2004. "Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 3(1), pages 1-27, June.