Printed from https://ideas.repec.org/a/spr/jglopt/v68y2017i4d10.1007_s10898-017-0515-z.html

DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

Author

Listed:
  • Rundong Du

    (Georgia Institute of Technology)

  • Da Kuang

    (University of California)

  • Barry Drake

    (Georgia Institute of Technology)

  • Haesun Park

    (Georgia Institute of Technology)

Abstract

The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.
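
The divide-and-conquer idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and does not use the SmallK library. Several pieces are assumptions made only for illustration: plain multiplicative updates stand in for the efficient rank-2 NMF algorithm the abstract refers to, the "split the largest leaf" rule is a hypothetical placeholder for the node-selection criteria the paper investigates, and leaf centroids are one guess at what "gather information from the tree" could mean. The function names `nmf_mu`, `rank2_split`, and `dc_nmf` are invented here.

```python
import numpy as np


def nmf_mu(A, W, H, iters):
    """Multiplicative updates for min ||A - W H||_F with W, H >= 0.

    A stand-in solver, chosen for brevity; W and H are updated in place.
    """
    eps = 1e-10  # keeps denominators away from zero
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H


def rank2_split(A, cols, rng, iters=50):
    """Split the data items `cols` into two groups using a rank-2 NMF."""
    sub = A[:, cols]
    W = rng.random((A.shape[0], 2)) + 0.1
    H = rng.random((2, len(cols))) + 0.1
    W, H = nmf_mu(sub, W, H, iters)
    # Move the column scale of W into H so the two rows of H are comparable.
    H = H * (np.linalg.norm(W, axis=0) + 1e-10)[:, None]
    left = H[0] >= H[1]
    return cols[left], cols[~left]


def dc_nmf(A, k, seed=0, refine_iters=20):
    """Grow a binary tree to k leaves, then refine a rank-k NMF from it."""
    assert 1 <= k <= A.shape[1], "need at least one data item per leaf"
    rng = np.random.default_rng(seed)
    leaves = [np.arange(A.shape[1])]
    while len(leaves) < k:
        # Hypothetical split criterion: always split the largest leaf.
        i = max(range(len(leaves)), key=lambda j: len(leaves[j]))
        cols = leaves.pop(i)
        a, b = rank2_split(A, cols, rng)
        if len(a) == 0 or len(b) == 0:
            # Degenerate split: fall back to halving the leaf.
            half = len(cols) // 2
            a, b = cols[:half], cols[half:]
        leaves += [a, b]
    # Assumed initialization: leaf centroids as the columns of W,
    # near-indicator leaf memberships as the rows of H.
    W0 = np.column_stack([A[:, c].mean(axis=1) for c in leaves]) + 1e-6
    H0 = np.full((k, A.shape[1]), 1e-2)
    for j, c in enumerate(leaves):
        H0[j, c] = 1.0
    # Only a few refinement iterations, starting from the tree-based guess.
    return nmf_mu(A, W0, H0, refine_iters)
```

Calling `dc_nmf(A, k)` on a nonnegative float matrix `A` whose columns are data items returns factors `W` (features by k) and `H` (k by items). The point of the sketch is the structure: the expensive rank-k problem is only run for a few iterations because the tree of cheap rank-2 splits supplies its starting point.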

Suggested Citation

  • Rundong Du & Da Kuang & Barry Drake & Haesun Park, 2017. "DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling," Journal of Global Optimization, Springer, vol. 68(4), pages 777-798, August.
  • Handle: RePEc:spr:jglopt:v:68:y:2017:i:4:d:10.1007_s10898-017-0515-z
    DOI: 10.1007/s10898-017-0515-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10898-017-0515-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s10898-017-0515-z?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. Gillis, Nicolas & Glineur, François, 2011. "Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization," LIDAM Discussion Papers CORE 2011030, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    2. Daniel D. Lee & H. Sebastian Seung, 1999. "Learning the parts of objects by non-negative matrix factorization," Nature, Nature, vol. 401(6755), pages 788-791, October.
    3. Jingu Kim & Yunlong He & Haesun Park, 2014. "Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework," Journal of Global Optimization, Springer, vol. 58(2), pages 285-319, February.
    4. Da Kuang & Sangwoon Yun & Haesun Park, 2015. "SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering," Journal of Global Optimization, Springer, vol. 62(3), pages 545-574, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Takehiro Sano & Tsuyoshi Migita & Norikazu Takahashi, 2022. "A novel update rule of HALS algorithm for nonnegative matrix factorization and Zangwill’s global convergence," Journal of Global Optimization, Springer, vol. 84(3), pages 755-781, November.
    2. Duy Khuong Nguyen & Tu Bao Ho, 2017. "Accelerated parallel and distributed algorithm using limited internal memory for nonnegative matrix factorization," Journal of Global Optimization, Springer, vol. 68(2), pages 307-328, June.
    3. Srinivas Eswar & Ramakrishnan Kannan & Richard Vuduc & Haesun Park, 2021. "ORCA: Outlier detection and Robust Clustering for Attributed graphs," Journal of Global Optimization, Springer, vol. 81(4), pages 967-989, December.
    4. Gillis, Nicolas & Glineur, François & Tuyttens, Daniel & Vandaele, Arnaud, 2015. "Heuristics for exact nonnegative matrix factorization," LIDAM Discussion Papers CORE 2015006, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    5. Rundong Du & Barry Drake & Haesun Park, 2019. "Hybrid clustering based on content and connection structure using joint nonnegative matrix factorization," Journal of Global Optimization, Springer, vol. 74(4), pages 861-877, August.
    6. Arnaud Vandaele & François Glineur & Nicolas Gillis, 2018. "Algorithms for positive semidefinite factorization," Computational Optimization and Applications, Springer, vol. 71(1), pages 193-219, September.
    7. Andrej Čopar & Blaž Zupan & Marinka Zitnik, 2019. "Fast optimization of non-negative matrix tri-factorization," PLOS ONE, Public Library of Science, vol. 14(6), pages 1-15, June.
    8. Flavia Esposito, 2021. "A Review on Initialization Methods for Nonnegative Matrix Factorization: Towards Omics Data Experiments," Mathematics, MDPI, vol. 9(9), pages 1-17, April.
    9. Da Kuang & Sangwoon Yun & Haesun Park, 2015. "SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering," Journal of Global Optimization, Springer, vol. 62(3), pages 545-574, July.
    10. Jingu Kim & Yunlong He & Haesun Park, 2014. "Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework," Journal of Global Optimization, Springer, vol. 58(2), pages 285-319, February.
    11. Norikazu Takahashi & Ryota Hibi, 2014. "Global convergence of modified multiplicative updates for nonnegative matrix factorization," Computational Optimization and Applications, Springer, vol. 57(2), pages 417-440, March.
    12. Saeedmanesh, Mohammadreza & Geroliminis, Nikolas, 2016. "Clustering of heterogeneous networks with directional flows based on “Snake” similarities," Transportation Research Part B: Methodological, Elsevier, vol. 91(C), pages 250-269.
    13. Norikazu Takahashi & Jiro Katayama & Masato Seki & Jun’ichi Takeuchi, 2018. "A unified global convergence analysis of multiplicative update rules for nonnegative matrix factorization," Computational Optimization and Applications, Springer, vol. 71(1), pages 221-250, September.
    14. Del Corso, Gianna M. & Romani, Francesco, 2019. "Adaptive nonnegative matrix factorization and measure comparisons for recommender systems," Applied Mathematics and Computation, Elsevier, vol. 354(C), pages 164-179.
    15. P Fogel & C Geissler & P Cotte & G Luta, 2022. "Applying separative non-negative matrix factorization to extra-financial data," Working Papers hal-03689774, HAL.
    16. Xiao-Bai Li & Jialun Qin, 2017. "Anonymizing and Sharing Medical Text Records," Information Systems Research, INFORMS, vol. 28(2), pages 332-352, June.
    17. Naiyang Guan & Lei Wei & Zhigang Luo & Dacheng Tao, 2013. "Limited-Memory Fast Gradient Descent Method for Graph Regularized Nonnegative Matrix Factorization," PLOS ONE, Public Library of Science, vol. 8(10), pages 1-10, October.
    18. Spelta, A. & Pecora, N. & Rovira Kaltwasser, P., 2019. "Identifying Systemically Important Banks: A temporal approach for macroprudential policies," Journal of Policy Modeling, Elsevier, vol. 41(1), pages 197-218.
    19. M. Moghadam & K. Aminian & M. Asghari & M. Parnianpour, 2013. "How well do the muscular synergies extracted via non-negative matrix factorisation explain the variation of torque at shoulder joint?," Computer Methods in Biomechanics and Biomedical Engineering, Taylor & Francis Journals, vol. 16(3), pages 291-301.
    20. Markovsky, Ivan & Niranjan, Mahesan, 2010. "Approximate low-rank factorization with structured factors," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 3411-3420, December.
