
Analyzing the fine structure of distributions

Author

Listed:
  • Michael C Thrun
  • Tino Gehlert
  • Alfred Ultsch

Abstract

One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping (i.e., hard limits on value ranges), need to be assessed. Of particular interest is whether the data originate from a single process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) of each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram and modern tools such as ridgeline plots, bean plots, and violin plots. If density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDFs of uniform, multimodal, and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any density estimation parameters, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of exploratory distribution analysis. The results of the evaluation are presented using bimodal Gaussian distributions, skewed distributions, and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.
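
The difficulty with clipped data under default density estimation settings can be illustrated with a minimal sketch. It is not the MD plot and not the authors' evaluation code. It assumes a plain Gaussian kernel density estimate with a common rule-of-thumb bandwidth (0.9 · min(sd, IQR/1.34) · n^(-1/5)), applied to synthetic standard-normal data clipped at -1 and 1. The function names, sample size, seed, and clipping limits are all hypothetical choices made for illustration. The sketch measures how much of the estimated probability mass falls outside the hard limits, where no data can occur.

```python
import math
import random
import statistics


def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * len(sample) ** (-0.2)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde_mass_outside(sample, lower, upper, bandwidth):
    """Share of a Gaussian KDE's probability mass lying outside [lower, upper]."""
    outside = 0.0
    for x in sample:
        # Each data point contributes one Gaussian kernel centred at x.
        below = normal_cdf((lower - x) / bandwidth)
        above = 1.0 - normal_cdf((upper - x) / bandwidth)
        outside += below + above
    return outside / len(sample)


random.seed(0)  # fixed seed so the sketch is reproducible
raw = [random.gauss(0.0, 1.0) for _ in range(2000)]
clipped = [min(1.0, max(-1.0, x)) for x in raw]  # hard limits at -1 and 1 (assumed)

h = silverman_bandwidth(clipped)
leak = kde_mass_outside(clipped, -1.0, 1.0, h)
print(f"bandwidth = {h:.3f}, estimated mass outside the limits = {leak:.1%}")
```

Because every kernel centred exactly on a limit places half of its mass beyond that limit, a noticeable share of the estimated density ends up in a region that contains no observations, and the sharp edge of the clipped distribution is smoothed away.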

Suggested Citation

  • Michael C Thrun & Tino Gehlert & Alfred Ultsch, 2020. "Analyzing the fine structure of distributions," PLOS ONE, Public Library of Science, vol. 15(10), pages 1-20, October.
  • Handle: RePEc:plo:pone00:0238835
    DOI: 10.1371/journal.pone.0238835

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238835
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0238835&type=printable
    Download Restriction: no



    Citations



    Cited by:

    1. Michael C. Thrun & Alfred Ultsch, 2021. "Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 280-312, July.
    2. Marian Lux & Stefanie Rinderle-Ma, 2023. "DDCAL: Evenly Distributing Data into Low Variance Clusters Based on Iterative Feature Scaling," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 106-144, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Vijverberg, Wim P. & Hasebe, Takuya, 2015. "GTL Regression: A Linear Model with Skewed and Thick-Tailed Disturbances," IZA Discussion Papers 8898, Institute of Labor Economics (IZA).
    2. Brzezinski, Michal, 2014. "Do wealth distributions follow power laws? Evidence from ‘rich lists’," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 406(C), pages 155-162.
    3. Geeraert, Joke & Rocha, Luis E.C. & Vandeviver, Christophe, 2024. "The impact of violent behavior on co-offender selection: Evidence of behavioral homophily," Journal of Criminal Justice, Elsevier, vol. 94(C).
    4. Roberto Martino & Phu Nguyen-Van, 2014. "Labour market regulation and fiscal parameters: A structural model for European regions," Working Papers of BETA 2014-19, Bureau d'Economie Théorique et Appliquée, UDS, Strasbourg.
    5. Rubio, F.J. & Steel, M.F.J., 2011. "Inference for grouped data with a truncated skew-Laplace distribution," Computational Statistics & Data Analysis, Elsevier, vol. 55(12), pages 3218-3231, December.
    6. Giuseppe RICCIARDO LAMONICA, 2002. "La funzionalita' nelle zone omogenee delle Marche," Working Papers 165, Universita' Politecnica delle Marche (I), Dipartimento di Scienze Economiche e Sociali.
    7. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    8. Dawid Majcherek & Marzenna Anna Weresa & Christina Ciecierski, 2020. "Understanding Regional Risk Factors for Cancer: A Cluster Analysis of Lifestyle, Environment and Socio-Economic Status in Poland," Sustainability, MDPI, vol. 12(21), pages 1-15, October.
    9. Sumeet Kumar & Binxuan Huang & Ramon Alfonso Villa Cox & Kathleen M. Carley, 2021. "An anatomical comparison of fake-news and trusted-news sharing pattern on Twitter," Computational and Mathematical Organization Theory, Springer, vol. 27(2), pages 109-133, June.
    10. Don Harding, 2010. "Applying shape and phase restrictions in generalized dynamic categorical models of the business cycle," NCER Working Paper Series 58, National Centre for Econometric Research.
    11. E. Samanidou & E. Zschischang & D. Stauffer & T. Lux, 2001. "Microscopic Models of Financial Markets," Papers cond-mat/0110354, arXiv.org.
    12. Rama Cont & Jean-Philippe Bouchaud, 1997. "Herd behavior and aggregate fluctuations in financial markets," Science & Finance (CFM) working paper archive 500028, Science & Finance, Capital Fund Management.
    13. Shu Takahashi & Kento Yamamoto & Shumpei Kobayashi & Ryoma Kondo & Ryohei Hisano, 2024. "Dynamic Link and Flow Prediction in Bank Transfer Networks," Papers 2409.08718, arXiv.org, revised Oct 2024.
    14. Marco Raberto & Silvano Cincotti & Sergio Focardi & Michele Marchesi, 2003. "Traders' Long-Run Wealth in an Artificial Financial Market," Computational Economics, Springer;Society for Computational Economics, vol. 22(2), pages 255-272, October.
    15. Li, Heyang & Wu, Meijun & Wang, Yougui & Zeng, An, 2022. "Bibliographic coupling networks reveal the advantage of diversification in scientific projects," Journal of Informetrics, Elsevier, vol. 16(3).
    16. Ferreira, Jose T.A.S. & Steel, Mark F.J., 2007. "Model comparison of coordinate-free multivariate skewed distributions with an application to stochastic frontiers," Journal of Econometrics, Elsevier, vol. 137(2), pages 641-673, April.
    17. Jiaqi Liang & Linjing Li & Daniel Zeng, 2018. "Evolutionary dynamics of cryptocurrency transaction networks: An empirical study," PLOS ONE, Public Library of Science, vol. 13(8), pages 1-18, August.
    18. George Halkos & Nickolaos Tzeremes, 2012. "Measuring German regions’ environmental efficiency: a directional distance function approach," Letters in Spatial and Resource Sciences, Springer, vol. 5(1), pages 7-16, March.
    19. Zhou, Bin & Yan, Xiao-Yong & Xu, Xiao-Ke & Xu, Xiao-Ting & Wang, Nianxin, 2018. "Evolutionary of online social networks driven by pareto wealth distribution and bidirectional preferential attachment," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 507(C), pages 427-434.
    20. Klein, Ingo & Fischer, Matthias J., 2003. "Skewness by splitting the scale parameter," Discussion Papers 55/2003, Friedrich-Alexander University Erlangen-Nuremberg, Chair of Statistics and Econometrics.
