IDEAS home Printed from https://ideas.repec.org/p/ifs/cemmap/38-11.html

Identification, data combination and the risk of disclosure

Author

  • Tatiana V. Komarova

    (Institute for Fiscal Studies and London School of Economics and Political Science)

  • Denis Nekipelov

    (Institute for Fiscal Studies and Berkeley)

  • Evgeny Yakovlev

    (Institute for Fiscal Studies)

Abstract

Businesses routinely rely on econometric models to analyze and predict consumer behavior. Estimation of such models may require combining a firm's internal data with external datasets to account for sample selection, missing observations, omitted variables and measurement errors in the existing data source. In this paper we point out that these data problems can be addressed when estimating econometric models from combined data using data mining techniques under mild assumptions regarding the data distribution. However, data combination leads to serious threats to the security of consumer data: we demonstrate that point identification of an econometric model from combined data is incompatible with restrictions on the risk of individual disclosure. Consequently, if a consumer model is point identified, the firm would (implicitly or explicitly) reveal the identity of at least some of the consumers in its internal data. More importantly, we provide an argument that unless the firm places a restriction on the individual disclosure risk when combining data, an adversary or a competitor can gather confidential information regarding some individuals from the estimated model, even if the raw combined dataset is not shared with a third party.
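
The tension described in the abstract can be illustrated with a toy record-linkage sketch. It is only an analogy, not the paper's estimator or its formal measure of disclosure risk. The datasets, the field names (zip, birth_year, spend), the exact-match rule and the threshold k are all hypothetical, chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: every name, field and value here is invented.
internal = [  # firm's records: quasi-identifiers plus a confidential outcome
    {"zip": "10001", "birth_year": 1980, "spend": 120.0},
    {"zip": "10001", "birth_year": 1975, "spend": 80.0},
    {"zip": "10002", "birth_year": 1990, "spend": 45.0},
]
external = [  # public records: the same quasi-identifiers plus an identity
    {"name": "A", "zip": "10001", "birth_year": 1980},
    {"name": "B", "zip": "10001", "birth_year": 1975},
    {"name": "C", "zip": "10001", "birth_year": 1975},
    {"name": "D", "zip": "10002", "birth_year": 1990},
]


def link(internal, external, k=1):
    """Match on (zip, birth_year); keep a link only if at least k
    external candidates share the key (an assumed, simplified rule)."""
    index = defaultdict(list)
    for row in external:
        index[(row["zip"], row["birth_year"])].append(row["name"])
    links = []
    for row in internal:
        candidates = index.get((row["zip"], row["birth_year"]), [])
        if len(candidates) >= max(k, 1):
            links.append((candidates, row["spend"]))
    return links


def disclosed(links):
    """Links with exactly one candidate pin a confidential value to one person."""
    return [(names[0], spend) for names, spend in links if len(names) == 1]


all_links = link(internal, external, k=1)  # no restriction on disclosure risk
bounded = link(internal, external, k=2)    # require at least 2 candidates per link

print(len(all_links), disclosed(all_links))  # 3 [('A', 120.0), ('D', 45.0)]
print(len(bounded), disclosed(bounded))      # 1 []
```

With k=1 every internal record is linked, but two of the three links tie a confidential value to a single named person. With k=2 no identity is pinned down, yet only one of the three records enters the combined sample. In this toy setting, complete linkage and a bound on disclosure cannot both be had, which mirrors in spirit the incompatibility the abstract states for point identification.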

Suggested Citation

  • Tatiana V. Komarova & Denis Nekipelov & Evgeny Yakovlev, 2011. "Identification, data combination and the risk of disclosure," CeMMAP working papers CWP38/11, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
  • Handle: RePEc:ifs:cemmap:38/11

    Download full text from publisher

    File URL: http://cemmap.ifs.org.uk/wps/cwp3811.pdf
    Download Restriction: no



    Citations


    Cited by:

    1. David Pacini, 2012. "Least Square Linear Prediction with Two-Sample Data," Bristol Economics Discussion Papers 12/631, Department of Economics, University of Bristol, UK.

    More about this item

    JEL classification:

    • C13 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Estimation: General
    • C14 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Semiparametric and Nonparametric Methods: General
    • C25 - Mathematical and Quantitative Methods - - Single Equation Models; Single Variables - - - Discrete Regression and Qualitative Choice Models; Discrete Regressors; Proportions; Probabilities
    • C35 - Mathematical and Quantitative Methods - - Multiple or Simultaneous Equation Models; Multiple Variables - - - Discrete Regression and Qualitative Choice Models; Discrete Regressors; Proportions

