
OCRを利用した統計表の体系的なテキストデータ化, Textizing statistical tables using OCR at scale

Author

Listed:
  • 有本, 寛
  • ARIMOTO, Yutaka

Abstract

This paper describes the requirements and methods for converting statistical tables into text data systematically and at scale using OCR. The main challenge in textizing statistical tables with OCR is analyzing the table layout with high accuracy. I am developing a Python toolkit, ocrstats, that improves work efficiency by providing batch processing, automation of routine steps, use of external OCR, and table layout analysis with practical accuracy. I also explain practical tips learned from textizing the Japan Imperial Statistical Yearbook (『日本帝国統計年鑑』) with ocrstats, and a method for linking variables across years when constructing panel data.

Suggested Citation

  • 有本, 寛 & ARIMOTO, Yutaka, 2021. "OCRを利用した統計表の体系的なテキストデータ化, Textizing statistical tables using OCR at scale," CEI Working Paper Series 2021-03, Center for Economic Institutions, Institute of Economic Research, Hitotsubashi University.
  • Handle: RePEc:hit:hitcei:2021-03
    Note: July 21, 2021

    Download full text from publisher

    File URL: https://hermes-ir.lib.hit-u.ac.jp/hermes/ir/re/72013/wp2021-03.pdf
    Download Restriction: no


    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:hit:hitcei:2021-03.


    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Reiko Suzuki. General contact details of provider: https://edirc.repec.org/data/cehitjp.html.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.
