Printed from https://ideas.repec.org/p/arx/papers/2512.05948.html

Developing synthetic microdata through machine learning for firm-level business surveys

Author

Listed:
  • Jorge Cisneros
  • Timothy Wojan
  • Matthew Williams
  • Jennifer Ozawa
  • Robert Chew
  • Kimberly Janda
  • Timothy Navarro
  • Michael Floyd
  • Christine Task
  • Damon Streat

Abstract

Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents challenges distinct from those of demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and its results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, which is similar to the ABS business data. Econometric replication of a high-impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.

Suggested Citation

  • Jorge Cisneros & Timothy Wojan & Matthew Williams & Jennifer Ozawa & Robert Chew & Kimberly Janda & Timothy Navarro & Michael Floyd & Christine Task & Damon Streat, 2025. "Developing synthetic microdata through machine learning for firm-level business surveys," Papers 2512.05948, arXiv.org, revised Dec 2025.
  • Handle: RePEc:arx:papers:2512.05948

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2512.05948
    File Function: Latest version
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Satkartar K. Kinney & Jerome P. Reiter & Javier Miranda, 2014. "Improving The Synthetic Longitudinal Business Database," Working Papers 14-12, Center for Economic Studies, U.S. Census Bureau.
    2. Nathan Goldschlag & Javier Miranda, 2020. "Business dynamics statistics of High Tech industries," Journal of Economics & Management Strategy, Wiley Blackwell, vol. 29(1), pages 3-30, January.
    3. Daniel H. Weinberg & John M. Abowd & Robert F. Belli & Noel Cressie & David C. Folch & Scott H. Holan & Margaret C. Levenstein & Kristen M. Olson & Jerome P. Reiter & Matthew D. Shapiro & Jolene Smyth, 2017. "Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?," Working Papers 17-59r, Center for Economic Studies, U.S. Census Bureau.
    4. Mohitosh Kejriwal & Xiaoxiao Li & Evan Totty, 2020. "Multidimensional skills and the returns to schooling: Evidence from an interactive fixed‐effects approach and a linked survey‐administrative data set," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 35(5), pages 548-566, August.
    5. Joshua Snoke & Gillian M. Raab & Beata Nowok & Chris Dibben & Aleksandra Slavkovic, 2018. "General and specific utility measures for synthetic data," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 663-688, June.
    6. Mohitosh Kejriwal & Xiaoxiao Li & Evan Totty, 2019. "Multidimensional Skills and Returns to Schooling: Evidence from an Interactive Fixed Effects Approach and a Linked Survey-Administrative Dataset," Purdue University Economics Working Papers 1316, Purdue University, Department of Economics.
    7. John M. Abowd & Ian M. Schmutte & William N. Sexton & Lars Vilhuber, 2019. "Why the Economics Profession Must Actively Participate in the Privacy Protection Debate," AEA Papers and Proceedings, American Economic Association, vol. 109, pages 397-402, May.
    8. Javier Miranda & Lars Vilhuber, 2016. "Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics," Working Papers 16-10, Center for Economic Studies, U.S. Census Bureau.
    9. Javier Miranda & Lars Vilhuber, 2014. "Looking Back On Three Years Of Using The Synthetic Lbd Beta," Working Papers 14-11, Center for Economic Studies, U.S. Census Bureau.
    10. Christine P. Chai, 2022. "Christine P. Chai's contribution to the Discussion of ‘Gaussian Differential Privacy’ by Dong et al," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(1), pages 43-44, February.
    11. M. Jahangir Alam & Benoit Dostie & Jörg Drechsler & Lars Vilhuber, 2020. "Applying data synthesis for longitudinal business data across three countries," Statistics in Transition New Series, Statistics Poland, vol. 21(4), pages 212-236, August.
    12. Melissa C. Chow & Teresa C. Fort & Christopher Goetz & Nathan Goldschlag & James Lawrence & Elisabeth Ruth Perlman & Martha Stinson & T. Kirk White, 2021. "Redesigning the Longitudinal Business Database," NBER Working Papers 28839, National Bureau of Economic Research, Inc.
    13. Matt Hampton & Evan Totty, 2023. "Minimum wages, retirement timing, and labor supply," Journal of Public Economics, Elsevier, vol. 224(C).
