
Text as Data

Listed author(s):
  • Matthew Gentzkow
  • Bryan T. Kelly
  • Matt Taddy

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications.

File URL: http://www.nber.org/papers/w23276.pdf
Download Restriction: Access to the full text is generally limited to series subscribers; however, free access is provided if the top-level domain of the client browser is in a developing country or transition economy. More information about subscriptions and free access is available at http://www.nber.org/wwphelp.html. Free access is also available to older working papers.

Paper provided by National Bureau of Economic Research, Inc in its series NBER Working Papers with number 23276.

Date of creation: Mar 2017
Handle: RePEc:nbr:nberwo:23276
Note: AP CF IO POL
Contact details of provider:
Postal: National Bureau of Economic Research, 1050 Massachusetts Avenue, Cambridge, MA 02138, U.S.A.

Phone: 617-868-3900
Web page: http://www.nber.org

References listed on IDEAS

  1. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
  2. Joshua D. Angrist & Alan B. Krueger, 2001. "Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments," Journal of Economic Perspectives, American Economic Association, vol. 15(4), pages 69-85, Fall.
  3. Albert Saiz & Uri Simonsohn, 2013. "Proxying For Unobservable Variables With Internet Document-Frequency," Journal of the European Economic Association, European Economic Association, vol. 11(1), pages 137-165, 02.
  4. Grimmer, Justin & Stewart, Brandon M., 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts," Political Analysis, Cambridge University Press, vol. 21(03), pages 267-297, June.
  5. Grimmer, Justin, 2010. "A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases," Political Analysis, Cambridge University Press, vol. 18(01), pages 1-35, December.
  6. Wisniewski, Tomasz Piotr & Lambe, Brendan, 2013. "The role of media in the credit crunch: The case of the banking sector," Journal of Economic Behavior & Organization, Elsevier, vol. 85(C), pages 163-175.
  7. Sanjiv R. Das & Mike Y. Chen, 2007. "Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web," Management Science, INFORMS, vol. 53(9), pages 1375-1388, September.
  8. Stephens-Davidowitz, Seth, 2014. "The cost of racial animus on a black candidate: Evidence using Google search data," Journal of Public Economics, Elsevier, vol. 118(C), pages 26-40.
  9. Carlos M. Carvalho & Nicholas G. Polson & James G. Scott, 2010. "The horseshoe estimator for sparse signals," Biometrika, Biometrika Trust, vol. 97(2), pages 465-480.
  10. Teh, Yee Whye & Jordan, Michael I. & Beal, Matthew J. & Blei, David M., 2006. "Hierarchical Dirichlet Processes," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1566-1581, December.
  11. Friedman, Jerome H. & Hastie, Trevor & Tibshirani, Rob, 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 33(i01).
  12. Stephen Eliot Hansen & Michael McMahon & Andrea Prat, 2014. "Transparency and deliberation within the FOMC: A computational linguistics approach," Economics Working Papers 1425, Department of Economics and Business, Universitat Pompeu Fabra.
  13. James H. Stock & Francesco Trebbi, 2003. "Retrospectives: Who Invented Instrumental Variable Regression?," Journal of Economic Perspectives, American Economic Association, vol. 17(3), pages 177-194, Summer.
  14. Hyunyoung Choi & Hal Varian, 2012. "Predicting the Present with Google Trends," The Economic Record, The Economic Society of Australia, vol. 88(s1), pages 2-9, 06.
  15. Chris Hans, 2009. "Bayesian lasso regression," Biometrika, Biometrika Trust, vol. 96(4), pages 835-845.
  16. Scott R. Baker & Nicholas Bloom & Steven J. Davis, 2016. "Measuring Economic Policy Uncertainty," The Quarterly Journal of Economics, Oxford University Press, vol. 131(4), pages 1593-1636.
  17. Jegadeesh, Narasimhan & Wu, Di, 2013. "Word power: A new approach for content analysis," Journal of Financial Economics, Elsevier, vol. 110(3), pages 712-729.
  18. Benjamin Born & Michael Ehrmann & Marcel Fratzscher, 2014. "Central Bank Communication on Financial Stability," Economic Journal, Royal Economic Society, vol. 124(577), pages 701-734, 06.
  19. Joseph E. Engelberg & Christopher A. Parsons, 2011. "The Causal Impact of Media in Financial Markets," Journal of Finance, American Finance Association, vol. 66(1), pages 67-97, 02.
  20. Fan J. & Li R., 2001. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 1348-1360, December.
  21. Matthew Gentzkow & Jesse M. Shapiro & Matt Taddy, 2016. "Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech," NBER Working Papers 22423, National Bureau of Economic Research, Inc.
  22. repec:fth:prinin:455 is not listed on IDEAS
  23. Bradley Efron, 2004. "The Estimation of Prediction Error: Covariance Penalties and Cross-Validation," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 619-632, January.
  24. Park, Trevor & Casella, George, 2008. "The Bayesian Lasso," Journal of the American Statistical Association, American Statistical Association, vol. 103, pages 681-686, June.
  25. Tim Groseclose & Jeffrey Milyo, 2005. "A Measure of Media Bias," The Quarterly Journal of Economics, Oxford University Press, vol. 120(4), pages 1191-1237.
  26. Paul C. Tetlock, 2007. "Giving Content to Investor Sentiment: The Role of Media in the Stock Market," Journal of Finance, American Finance Association, vol. 62(3), pages 1139-1168, 06.
  27. Werner Antweiler & Murray Z. Frank, 2004. "Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards," Journal of Finance, American Finance Association, vol. 59(3), pages 1259-1294, 06.
  28. Matt Taddy, 2013. "Rejoinder: Efficiency and Structure in MNIR," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 772-774, September.
  29. Matthew Gentzkow & Jesse M. Shapiro, 2010. "What Drives Media Slant? Evidence From U.S. Daily Newspapers," Econometrica, Econometric Society, vol. 78(1), pages 35-71, 01.
  30. Joshua Angrist & Alan Krueger, 2001. "Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments," Working Papers 834, Princeton University, Department of Economics, Industrial Relations Section.
  31. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768.
  32. Matt Taddy, 2013. "Multinomial Inverse Regression for Text Analysis," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 755-770, September.
  33. Tim Loughran & Bill McDonald, 2011. "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10‐Ks," Journal of Finance, American Finance Association, vol. 66(1), pages 35-65, 02.
  34. Feng Li, 2010. "The Information Content of Forward-Looking Statements in Corporate Filings-A Naïve Bayesian Machine Learning Approach," Journal of Accounting Research, Wiley Blackwell, vol. 48(5), pages 1049-1102, December.
  35. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320.
  36. Cheryl J. Flynn & Clifford M. Hurvich & Jeffrey S. Simonoff, 2013. "Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 1031-1043, September.

This information is provided to you by IDEAS at the Research Division of the Federal Reserve Bank of St. Louis using RePEc data.