
New upper bounds for tight and fast approximation of Fisher’s exact test in dependency rule mining

Author

Listed:
  • Hämäläinen, Wilhelmiina

Abstract

In dependency rule mining, the goal is to discover the most significant statistical dependencies among all possible collapsed 2×2 contingency tables. Fisher’s exact test is a robust method for estimating significance, and it enables efficient pruning of the search space. The problem is that evaluating the required p-value can be very laborious: the worst-case time complexity is O(n), where n is the data size. The traditional solution is to approximate the significance with the χ²-measure, which can be evaluated in constant time. However, the χ²-measure can produce unreliable results (it may discover spurious dependencies but miss the most significant ones). Furthermore, it does not support efficient pruning of the search space. As a solution, a family of tight upper bounds for Fisher’s p is introduced. The new upper bounds are fast to calculate and approximate Fisher’s p-value accurately. In addition, unlike the χ²-based approximation, the new approximations are not sensitive to the data size, distribution, or smallest expected counts. In practice, the execution time depends on the desired accuracy level. According to experimental evaluation, the simplest upper bounds are already sufficiently accurate for dependency rule mining purposes, and they can be calculated in 0.004–0.1% of the time needed for exact calculation. For other purposes (testing very weak dependencies), one may need more accurate approximations, but even they can be calculated in less than 1% of the exact calculation time.
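
To illustrate the trade-off the abstract describes, the following Python sketch computes the one-sided Fisher p-value for a 2×2 table by summing hypergeometric terms, which takes up to min(b, c) + 1 loop iterations, and compares it with a simple geometric-series upper bound that needs only the first term. This bound is a generic device chosen for illustration and is not claimed to be one of the bounds introduced in the paper. The cell layout (a, b, c, d) and all function names are assumptions made for this example.

```python
from math import exp, lgamma


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def point_prob(a, b, c, d):
    """Hypergeometric probability of the table [[a, b], [c, d]] with fixed margins."""
    n = a + b + c + d
    return exp(log_binom(a + b, a) + log_binom(c + d, c) - log_binom(n, a + c))


def fisher_upper_tail(a, b, c, d):
    """Exact one-sided p-value: sum over tables at least as extreme (larger a).

    The loop runs min(b, c) times, so the cost grows with the cell counts.
    """
    p = point_prob(a, b, c, d)
    total = p
    for i in range(min(b, c)):
        # Ratio of consecutive terms when one unit moves from b and c to a and d.
        p *= (b - i) * (c - i) / ((a + i + 1) * (d + i + 1))
        total += p
    return total


def geometric_upper_bound(a, b, c, d):
    """Illustrative upper bound (assumption, not the paper's formula).

    The term ratios decrease as the table gets more extreme, so the tail is
    bounded by a finite geometric series with the first ratio q.
    """
    q = b * c / ((a + 1) * (d + 1))
    if q >= 1:
        raise ValueError("bound requires q < 1 (holds under positive dependence)")
    terms = min(b, c) + 1
    return point_prob(a, b, c, d) * (1 - q ** terms) / (1 - q)


exact = fisher_upper_tail(10, 2, 3, 15)
bound = geometric_upper_bound(10, 2, 3, 15)
print(f"exact p = {exact:.6g}, upper bound = {bound:.6g}")
```

For the table (3, 1, 1, 3) the exact sum is 17/70, and the bound coincides with it because only two terms exist. For (10, 2, 3, 15) the bound exceeds the exact value by well under 1%, while skipping the summation loop entirely.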

Suggested Citation

  • Hämäläinen, Wilhelmiina, 2016. "New upper bounds for tight and fast approximation of Fisher’s exact test in dependency rule mining," Computational Statistics & Data Analysis, Elsevier, vol. 93(C), pages 469-482.
  • Handle: RePEc:eee:csdana:v:93:y:2016:i:c:p:469-482
    DOI: 10.1016/j.csda.2015.08.002



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shao, Xuesi M., 1997. "An efficient algorithm for the exact test on unordered 2 x J contingency tables with equal column sums," Computational Statistics & Data Analysis, Elsevier, vol. 25(3), pages 273-285, August.
    2. Requena, F. & Ciudad, N. Martin, 2000. "Characterization of maximum probability points in the Multivariate Hypergeometric distribution," Statistics & Probability Letters, Elsevier, vol. 50(1), pages 39-47, October.
    3. P. M. Kroonenberg & Albert Verbeek, 2018. "The Tale of Cochran's Rule: My Contingency Table has so Many Expected Values Smaller than 5, What Am I to Do?," The American Statistician, Taylor & Francis Journals, vol. 72(2), pages 175-183, April.
    4. Hirji, Karim F. & Johnson, Timothy D., 1996. "A comparison of algorithms for exact analysis of unordered 2 x K contingency tables," Computational Statistics & Data Analysis, Elsevier, vol. 21(4), pages 419-429, April.
    5. Tammy Harris & James W. Hardin, 2013. "Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests," Stata Journal, StataCorp LP, vol. 13(2), pages 337-343, June.
    6. Hirji, Karim F., 1997. "A review and a synthesis of the fast Fourier transform algorithms for exact analysis of discrete data," Computational Statistics & Data Analysis, Elsevier, vol. 25(3), pages 321-336, August.
    7. Ivo Molenaar & Herbert Hoijtink, 1990. "The many null distributions of person fit indices," Psychometrika, Springer;The Psychometric Society, vol. 55(1), pages 75-106, March.
