
Automated Identification of Sensitive Financial Data Based on the Topic Analysis

Authors

  • Meng Li (School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China)
  • Jiqiang Liu (School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China)
  • Yeping Yang (School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China)

Abstract

Data governance is an essential protection and management measure throughout the entire data life cycle. However, data governance still faces problems such as data security risks, data privacy breaches, and difficulties in data management and access control, all of which create a risk of data breaches and abuse. Security classification and grading of data has therefore become an important task: it allows sensitive data to be identified accurately and appropriate maintenance and management measures to be applied at each sensitivity level. This work starts from the problems in current data security classification and grading practice, such as inconsistent classification and grading standards, difficult data acquisition and sorting, and weak semantic information in data fields, in order to identify the limitations of current methods and directions for improvement. The method proposed in this paper for automatically identifying sensitive financial data is based on topic analysis and combines Jieba word segmentation, word frequency statistics, the skip-gram model, K-means clustering, and other techniques. Experts assisted in selecting appropriate keywords to improve accuracy. The method was trained and tested on the descriptive text library and real business data of a Chinese financial institution to demonstrate its effectiveness and usefulness, and the evaluation indicators indicated that it is effective for data security classification. The proposed method addresses the challenge of dividing texts with limited semantic information into sensitivity levels, overcomes the limitations on extending the model across different domains, and provides an optimized application model. These results also point out a direction for updating the method in real time.
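
The stages named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Several parts are assumptions made purely for illustration: the English field descriptions are invented, whitespace splitting stands in for Jieba segmentation, simple context-count vectors stand in for skip-gram embeddings, and the use of expert-chosen keywords as initial K-means centroids is a guess about how such keywords might be used. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical field descriptions (not from the paper). Whitespace splitting
# stands in for Jieba, which the paper uses to segment Chinese text.
DESCRIPTIONS = [
    "customer id card number",
    "customer mobile phone number",
    "account password hash",
    "branch office address",
    "branch opening hours",
]

def term_frequencies(docs):
    """Word frequency statistics: count each token across all descriptions."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def context_vectors(docs):
    """Stand-in for skip-gram embeddings (an assumption, not the paper's model).

    Each word is represented by how often every vocabulary token appears in
    the descriptions that contain that word, the word itself included.
    """
    vocab = sorted({token for doc in docs for token in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0.0] * len(vocab) for word in vocab}
    for doc in docs:
        tokens = doc.split()
        for word in tokens:
            for neighbour in tokens:
                vectors[word][index[neighbour]] += 1.0
    return vectors

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's K-means with explicitly supplied initial centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Recompute each centroid as the mean of its members; keep the old
        # centroid if a cluster ended up empty.
        centroids = [
            [sum(column) / len(members) for column in zip(*members)] if members else old
            for members, old in zip(clusters, centroids)
        ]
    return [nearest(point, centroids) for point in points]

def cluster_terms(docs, seed_words):
    """Cluster vocabulary terms, seeding K-means with hypothetical expert keywords."""
    vectors = context_vectors(docs)
    words = list(vectors)
    seeds = [vectors[word] for word in seed_words]
    labels = kmeans([vectors[word] for word in words], seeds)
    return dict(zip(words, labels))
```

Calling `cluster_terms(DESCRIPTIONS, ["customer", "account", "branch"])` returns a dictionary mapping each term to a cluster index. On this toy data the terms separate into an identity and contact group, a credentials group, and a branch information group, to which sensitivity levels could then be assigned by hand. A realistic pipeline would replace the stand-ins with an actual segmenter and trained embeddings.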

Suggested Citation

  • Meng Li & Jiqiang Liu & Yeping Yang, 2024. "Automated Identification of Sensitive Financial Data Based on the Topic Analysis," Future Internet, MDPI, vol. 16(2), pages 1-17, February.
  • Handle: RePEc:gam:jftint:v:16:y:2024:i:2:p:55-:d:1335738

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/16/2/55/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/16/2/55/
    Download Restriction: no


