Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining

Author

Listed:
  • Mochen Yang

    (Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455)

  • Gediminas Adomavicius

    (Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455)

  • Gordon Burtch

    (Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455)

  • Yuqing Ren

    (Information and Decision Sciences, Carlson School of Management, University of Minnesota, Minneapolis, Minnesota 55455)

Abstract

The application of predictive data mining techniques in information systems research has grown in recent years, likely because of their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are added into subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error or the specification of the econometric model grows more complex. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX by simulations and subsequent application of the methods to econometric estimations employing variables mined from three real-world data sets related to travel, social networking, and crowdfunding campaign websites. The online appendix is available at https://doi.org/10.1287/isre.2017.0727.
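
The simulation-extrapolation idea behind SIMEX can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the simplest possible setting (one regressor, classical additive Gaussian error with known standard deviation, ordinary least squares in the second stage, and a quadratic extrapolation function), and the names ols_slope, simex_slope, and sigma_u are hypothetical. In practice sigma_u would have to be estimated, for example from the error variance of the data mining model on labeled holdout data.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), n_sim=50, seed=0):
    """SIMEX sketch for w = x + u, with u ~ N(0, sigma_u^2) (assumed known).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    lam_grid = [0.0]
    slopes = [ols_slope(w, y)]  # naive estimate, no extra noise added
    for lam in lambdas:
        # Simulation step: add extra error with variance lam * sigma_u^2,
        # re-estimate, and average over n_sim replications.
        sims = [
            ols_slope(w + np.sqrt(lam) * sigma_u * rng.standard_normal(w.size), y)
            for _ in range(n_sim)
        ]
        lam_grid.append(lam)
        slopes.append(float(np.mean(sims)))
    # Extrapolation step: fit a quadratic in lambda and evaluate it at
    # lambda = -1, which corresponds to the error-free case.
    coeffs = np.polyfit(lam_grid, slopes, deg=2)
    return float(np.polyval(coeffs, -1.0))


# Synthetic demonstration (all values are made up for illustration).
demo_rng = np.random.default_rng(42)
n = 20000
x_true = demo_rng.standard_normal(n)              # unobserved true variable
w_obs = x_true + 0.5 * demo_rng.standard_normal(n)  # mined variable with error
y_obs = 1.0 + 2.0 * x_true + demo_rng.standard_normal(n)

naive = ols_slope(w_obs, y_obs)
corrected = simex_slope(w_obs, y_obs, sigma_u=0.5)
print(f"true slope: 2.0, naive: {naive:.3f}, SIMEX: {corrected:.3f}")
```

In this synthetic setting the naive slope is attenuated toward zero, and the extrapolated estimate should land closer to the true value, although a quadratic extrapolant generally does not remove the bias entirely. MC-SIMEX follows the same simulate-then-extrapolate pattern for misclassified categorical variables, with the confusion matrix taking the place of the error variance.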

Suggested Citation

  • Mochen Yang & Gediminas Adomavicius & Gordon Burtch & Yuqing Ren, 2018. "Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 29(1), pages 4-24, March.
  • Handle: RePEc:inm:orisre:v:29:y:2018:i:1:p:4-24
    DOI: 10.1287/isre.2017.0727

    Download full text from publisher

    File URL: https://doi.org/10.1287/isre.2017.0727
    Download Restriction: no

    References listed on IDEAS

    1. Lynn Wu, 2013. "Social Network Effects on Productivity and Job Security: Evidence from the Adoption of a Social Networking Tool," Information Systems Research, INFORMS, vol. 24(1), pages 30-51, March.
    2. Dina Mayzlin & Yaniv Dover & Judith Chevalier, 2014. "Promotional Reviews: An Empirical Investigation of Online Review Manipulation," American Economic Review, American Economic Association, vol. 104(8), pages 2421-2455, August.
    3. Kuchenhoff, Helmut & Lederer, Wolfgang & Lesaffre, Emmanuel, 2007. "Asymptotic variance estimation for the misclassification SIMEX," Computational Statistics & Data Analysis, Elsevier, vol. 51(12), pages 6197-6211, August.
    4. Sanjiv R. Das & Mike Y. Chen, 2007. "Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web," Management Science, INFORMS, vol. 53(9), pages 1375-1388, September.
    5. Bin Gu & Prabhudev Konana & Rajagopal Raghunathan & Hsuanwei Michelle Chen, 2014. "Research Note: The Allure of Homophily in Social Media: Evidence from Investor Responses on Virtual Communities," Information Systems Research, INFORMS, vol. 25(3), pages 604-617, September.
    6. Gordon Burtch & Anindya Ghose & Sunil Wattal, 2015. "The Hidden Cost of Accommodating Crowdfunder Privacy Preferences: A Randomized Field Experiment," Management Science, INFORMS, vol. 61(5), pages 949-962, May.
    7. Ritu Agarwal & Vasant Dhar, 2014. "Editorial: Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research," Information Systems Research, INFORMS, vol. 25(3), pages 443-448, September.
    8. Chrysanthos Dellarocas, 2003. "The Digitization of Word of Mouth: Promise and Challenges of Online Feedback Mechanisms," Management Science, INFORMS, vol. 49(10), pages 1407-1424, October.
    9. Tawei Wang & Karthik N. Kannan & Jackie Rees Ulmer, 2013. "The Association Between the Disclosure and the Realization of Information Security Risk Factors," Information Systems Research, INFORMS, vol. 24(2), pages 201-218, June.
    10. Nikolay Archak & Anindya Ghose & Panagiotis G. Ipeirotis, 2011. "Deriving the Pricing Power of Product Features by Mining Consumer Reviews," Management Science, INFORMS, vol. 57(8), pages 1485-1509, August.
    11. Rohit Aggarwal & Ram Gopal & Alok Gupta & Harpreet Singh, 2012. "Putting Money Where the Mouths Are: The Relation Between Venture Financing and Electronic Word-of-Mouth," Information Systems Research, INFORMS, vol. 23(3-part-2), pages 976-992, September.
    12. Ingrid E. Fisher & Margaret R. Garnsey & Mark E. Hughes, 2016. "Natural Language Processing in Accounting, Auditing and Finance: A Synthesis of the Literature with a Roadmap for Future Research," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 23(3), pages 157-214, July.
    13. Chris Forman & Anindya Ghose & Batia Wiesenfeld, 2008. "Examining the Relationship Between Reviews and Sales: The Role of Reviewer Identity Disclosure in Electronic Markets," Information Systems Research, INFORMS, vol. 19(3), pages 291-313, September.
    14. Anindya Ghose & Panagiotis G. Ipeirotis & Beibei Li, 2012. "Designing Ranking Systems for Hotels on Travel Search Engines by Mining User-Generated and Crowdsourced Content," Marketing Science, INFORMS, vol. 31(3), pages 493-520, May.
    15. Khim-Yong Goh & Cheng-Suang Heng & Zhijie Lin, 2013. "Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of User- and Marketer-Generated Content," Information Systems Research, INFORMS, vol. 24(1), pages 88-107, March.
    16. Antonio Moreno & Christian Terwiesch, 2014. "Doing Business with Strangers: Reputation in Online Service Marketplaces," Information Systems Research, INFORMS, vol. 25(4), pages 865-886, December.
    17. Yang Bao & Anindya Datta, 2014. "Simultaneously Discovering and Quantifying Risk Types from Textual Risk Disclosures," Management Science, INFORMS, vol. 60(6), pages 1371-1391, June.
    18. Seshadri Tirunillai & Gerard J. Tellis, 2012. "Does Chatter Really Matter? Dynamics of User-Generated Content and Stock Performance," Marketing Science, INFORMS, vol. 31(2), pages 198-215, March.
    19. Steven L. Johnson & Hani Safadi & Samer Faraj, 2015. "The Emergence of Online Community Leadership," Information Systems Research, INFORMS, vol. 26(1), pages 165-187, March.
    20. Paul C. Tetlock & Maytal Saar‐Tsechansky & Sofus Macskassy, 2008. "More Than Words: Quantifying Language to Measure Firms' Fundamentals," Journal of Finance, American Finance Association, vol. 63(3), pages 1437-1467, June.
    21. Ajay Agrawal & Christian Catalini & Avi Goldfarb, 2014. "Some Simple Economics of Crowdfunding," Innovation Policy and the Economy, University of Chicago Press, vol. 14(1), pages 63-97.
    22. Gordon Burtch & Anindya Ghose & Sunil Wattal, 2013. "An Empirical Examination of the Antecedents and Consequences of Contribution Patterns in Crowd-Funded Markets," Information Systems Research, INFORMS, vol. 24(3), pages 499-519, September.
    23. Param Vir Singh & Nachiketa Sahoo & Tridas Mukhopadhyay, 2014. "How to Attract and Retain Readers in Enterprise Blogging?," Information Systems Research, INFORMS, vol. 25(1), pages 35-52, March.
    24. Yingda Lu & Kinshuk Jerath & Param Vir Singh, 2013. "The Emergence of Opinion Leaders in a Networked Online Community: A Dyadic Model with Time Dynamics and a Heuristic for Fast Estimation," Management Science, INFORMS, vol. 59(8), pages 1783-1799, August.
    25. Daniel J. Hopkins & Gary King, 2010. "A Method of Automated Nonparametric Content Analysis for Social Science," American Journal of Political Science, John Wiley & Sons, vol. 54(1), pages 229-247, January.
    26. David Godes & Dina Mayzlin, 2004. "Using Online Conversations to Study Word-of-Mouth Communication," Marketing Science, INFORMS, vol. 23(4), pages 545-560, June.
    27. Dellarocas, Chrysanthos, 2003. "The Digitization of Word-of-mouth: Promise and Challenges of Online Feedback Mechanisms," Working papers 4296-03, Massachusetts Institute of Technology (MIT), Sloan School of Management.
    28. Hal R. Varian, 2014. "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, American Economic Association, vol. 28(2), pages 3-28, Spring.
    29. Helmut Küchenhoff & Samuel M. Mwalili & Emmanuel Lesaffre, 2006. "A General Method for Dealing with Misclassification in Regression: The Misclassification SIMEX," Biometrics, The International Biometric Society, vol. 62(1), pages 85-96, March.
    30. Bin Gu & Prabhudev Konana & Balaji Rajagopalan & Hsuan-Wei Michelle Chen, 2007. "Competition Among Virtual Communities and User Valuation: The Case of Investing-Related Communities," Information Systems Research, INFORMS, vol. 18(1), pages 68-85, March.
    31. Mingfeng Lin & Henry C. Lucas & Galit Shmueli, 2013. "Research Commentary: Too Big to Fail: Large Samples and the p-Value Problem," Information Systems Research, INFORMS, vol. 24(4), pages 906-917, December.

    Citations

    Cited by:

    1. Gordon Burtch & Edward McFowland III & Mochen Yang & Gediminas Adomavicius, 2023. "EnsembleIV: Creating Instrumental Variables from Ensemble Learners for Robust Statistical Inference," Papers 2303.02820, arXiv.org.
    2. Yi Yang & Kunpeng Zhang & Yangyang Fan, 2023. "sDTM: A Supervised Bayesian Deep Topic Model for Text Analytics," Information Systems Research, INFORMS, vol. 34(1), pages 137-156, March.
    3. Xue Bai & James R. Marsden & William T. Ross & Gang Wang, 2020. "A Note on the Impact of Daily Deals on Local Retailers’ Online Reputation: Mediation Effects of the Consumer Experience," Information Systems Research, INFORMS, vol. 31(4), pages 1132-1143, December.
    4. Mochen Yang & Edward McFowland & Gordon Burtch & Gediminas Adomavicius, 2022. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," INFORMS Journal on Data Science, INFORMS, vol. 1(2), pages 138-155, October.
    5. Mengqiang Pan & Nao Li & Xiankai Huang, 2022. "Asymmetrical impact of service attribute performance on consumer satisfaction: an asymmetric impact-attention-performance analysis," Information Technology & Tourism, Springer, vol. 24(2), pages 221-243, June.
    6. Mochen Yang & Edward McFowland III & Gordon Burtch & Gediminas Adomavicius, 2020. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," Papers 2012.10790, arXiv.org.
    7. Benjamin M. Abdel-Karim & Nicolas Pfeuffer & Oliver Hinz, 2021. "Machine learning in information systems - a bibliographic review and open research issues," Electronic Markets, Springer;IIM University of St. Gallen, vol. 31(3), pages 643-670, September.
    8. Mengke Qiao & Ke-Wei Huang, 2021. "Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 32(2), pages 462-480, June.
    9. Hyelim Oh & Khim-Yong Goh & Tuan Q. Phan, 2023. "Are You What You Tweet? The Impact of Sentiment on Digital News Consumption and Social Media Sharing," Information Systems Research, INFORMS, vol. 34(1), pages 111-136, March.
    10. Milan Miric & Nan Jia & Kenneth G. Huang, 2023. "Using supervised machine learning for large‐scale classification in management research: The case for identifying artificial intelligence patents," Strategic Management Journal, Wiley Blackwell, vol. 44(2), pages 491-519, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mochen Yang & Edward McFowland & Gordon Burtch & Gediminas Adomavicius, 2022. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," INFORMS Journal on Data Science, INFORMS, vol. 1(2), pages 138-155, October.
    2. Dominik Gutt & Jürgen Neumann & Steffen Zimmermann & Dennis Kundisch & Jianqing Chen, 2018. "Design of Review Systems - A Strategic Instrument to shape Online Review Behavior and Economic Outcomes," Working Papers Dissertations 42, Paderborn University, Faculty of Business Administration and Economics.
    3. Sheng, Jie & Amankwah-Amoah, Joseph & Wang, Xiaojun, 2017. "A multidisciplinary perspective of big data in management research," International Journal of Production Economics, Elsevier, vol. 191(C), pages 97-112.
    4. Sheng, Jie & Amankwah-Amoah, Joseph & Wang, Xiaojun, 2019. "Technology in the 21st century: New challenges and opportunities," Technological Forecasting and Social Change, Elsevier, vol. 143(C), pages 321-335.
    5. Khim-Yong Goh & Cheng-Suang Heng & Zhijie Lin, 2013. "Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of User- and Marketer-Generated Content," Information Systems Research, INFORMS, vol. 24(1), pages 88-107, March.
    6. Jorge Mejia & Shawn Mankad & Anandasivam Gopal, 2019. "A for Effort? Using the Crowd to Identify Moral Hazard in New York City Restaurant Hygiene Inspections," Information Systems Research, INFORMS, vol. 30(4), pages 1363-1386, December.
    7. Angela Aerry Choi & Daegon Cho & Dobin Yim & Jae Yun Moon & Wonseok Oh, 2019. "When Seeing Helps Believing: The Interactive Effects of Previews and Reviews on E-Book Purchases," Information Systems Research, INFORMS, vol. 30(4), pages 1164-1183, December.
    8. Ana Babić Rosario & Kristine Valck & Francesca Sotgiu, 2020. "Conceptualizing the electronic word-of-mouth process: What we know and need to know about eWOM creation, exposure, and evaluation," Journal of the Academy of Marketing Science, Springer, vol. 48(3), pages 422-448, May.
    9. Mengke Qiao & Ke-Wei Huang, 2021. "Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining," Information Systems Research, INFORMS, vol. 32(2), pages 462-480, June.
    10. Juan Feng & Xin Li & Xiaoquan (Michael) Zhang, 2019. "Online Product Reviews-Triggered Dynamic Pricing: Theory and Evidence," Information Systems Research, INFORMS, vol. 30(4), pages 1107-1123, December.
    11. King, Robert Allen & Racherla, Pradeep & Bush, Victoria D., 2014. "What We Know and Don't Know About Online Word-of-Mouth: A Review and Synthesis of the Literature," Journal of Interactive Marketing, Elsevier, vol. 28(3), pages 167-183.
    12. Arvind K. Tripathi & Young-Jin Lee & Amit Basu, 2022. "Analyzing the Impact of Public Buyer–Seller Engagement During Online Auctions," Information Systems Research, INFORMS, vol. 33(4), pages 1264-1286, December.
    13. Heeseung Andrew Lee & Angela Aerry Choi & Tianshu Sun & Wonseok Oh, 2021. "Reviewing Before Reading? An Empirical Investigation of Book-Consumption Patterns and Their Effects on Reviews and Sales," Information Systems Research, INFORMS, vol. 32(4), pages 1368-1389, December.
    14. Pauwels, Koen & Aksehirli, Zeynep & Lackman, Andrew, 2016. "Like the ad or the brand? Marketing stimulates different electronic word-of-mouth content to drive online and offline performance," International Journal of Research in Marketing, Elsevier, vol. 33(3), pages 639-655.
    15. Paulo B. Goes & Mingfeng Lin & Ching-man Au Yeung, 2014. "“Popularity Effect” in User-Generated Content: Evidence from Online Product Reviews," Information Systems Research, INFORMS, vol. 25(2), pages 222-238, June.
    16. Marchand, André & Hennig-Thurau, Thorsten & Wiertz, Caroline, 2017. "Not all digital word of mouth is created equal: Understanding the respective impact of consumer reviews and microblogs on new product success," International Journal of Research in Marketing, Elsevier, vol. 34(2), pages 336-354.
    17. Christoph Schneider & Markus Weinmann & Peter N.C. Mohr & Jan vom Brocke, 2021. "When the Stars Shine Too Bright: The Influence of Multidimensional Ratings on Online Consumer Ratings," Management Science, INFORMS, vol. 67(6), pages 3871-3898, June.
    18. Tingting Nian & Arun Sundararajan, 2022. "Social Media Marketing, Quality Signaling, and the Goldilocks Principle," Information Systems Research, INFORMS, vol. 33(2), pages 540-556, June.
    19. Kick, Markus, 2015. "Social Media Research: A Narrative Review," EconStor Preprints 182506, ZBW - Leibniz Information Centre for Economics.
    20. Tao Lu & May Yuan & Chong (Alex) Wang & Xiaoquan (Michael) Zhang, 2022. "Histogram Distortion Bias in Consumer Choices," Management Science, INFORMS, vol. 68(12), pages 8963-8978, December.
