
Challenges in Machine Learning for Document Classification in the Real Estate Industry

Author

Listed:
  • Mario Bodenbender
  • Björn-Martin Kurzrock

Abstract

Data rooms are becoming increasingly important for the real estate industry. They permit the creation of protected areas in which a variety of relevant documents are made available to interested parties. In addition to supporting purchase and sales processes, they are used primarily in larger construction projects.

The structures and index designations of data rooms have not yet been uniformly regulated on an international basis. Data room indices are created using different approaches, and thus they diverge in their depth of detail as well as in their range of topics. In practice, rules already exist for structuring documentation for individual phases, as well as for transferring data between these phases. Since all of the documentation must be transferable when changing to another life cycle phase or participant, the information must always be clearly identified and structured in order to enable the protection, access and administration of this information at all times. This poses a challenge for companies because the documents are subject to several rounds of restructuring during their life cycle, which are not only costly but also always entail the risk of data loss. The goal of current research is therefore seamless storage as well as a permanent and unambiguous classification of the documents across the individual life cycle phases.

In the field of text classification, machine learning offers considerable potential in terms of reduced workload, process acceleration and quality improvement. In data rooms, machine learning (in particular document classification) is used to automatically classify the documents contained in the data room, or the documents to be imported, and assign them to a suitable index point. In this manner, a document is always assigned to the class to which it most probably belongs (e.g., based on word frequency).
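The idea of assigning a document to its most probable class based on word frequency can be illustrated with a minimal sketch. The abstract does not name the algorithm used in the paper, so the multinomial naive Bayes classifier below is an assumption chosen for illustration; the class name `WordFrequencyClassifier`, the index points `lease` and `technical`, and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


class WordFrequencyClassifier:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Illustrative only: not the model described in the paper.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.doc_counts = Counter()              # class -> number of training documents
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return an unnormalized log-probability for every known class."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            class_total = sum(counts.values())
            # Class prior plus smoothed log-likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (class_total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Pick the class with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical index points and toy documents, for illustration only.
training_docs = [
    "lease agreement tenant rent deposit",
    "rent increase notice tenant lease",
    "fire safety inspection report sprinkler",
    "inspection report elevator maintenance safety",
]
training_labels = ["lease", "lease", "technical", "technical"]

classifier = WordFrequencyClassifier().fit(training_docs, training_labels)
predicted = classifier.predict("tenant signed the lease and paid the rent deposit")
print(predicted)
```

With more than a thousand classes, as in the model described in the abstract, the same decision rule applies, but overlapping classes and sparse training data per class would make the scores far less clear-cut than in this toy case.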
An essential prerequisite for the success of machine learning for document classification is the quality of the document classes as well as of the training data. When defining the document classes, it must be guaranteed on the one hand that they do not overlap in content, so that documents can be clearly allocated by topic. On the other hand, it must also be possible to accommodate documents that may appear later and to scale the model according to the requirements. For the training and test sets, as well as for the documents to be analyzed later, the quality and readability of the respective documents are also decisive factors. In order to analyze the documents effectively, the content must also be standardized and it must be possible to remove non-relevant content in advance.

Based on the empirical analysis of 8,965 digital documents of fourteen properties from eight different owners, the paper presents a model with more than 1,300 document classes as a basis for the automated structuring and migration of documents in the life cycle of real estate. To validate these classes, machine learning algorithms were trained and analyzed to determine how, and under which conditions, the highest possible classification accuracy can be achieved. Stemmer and stop word lists were also developed specifically for these analyses. Because these lists are tailored to terms used in the real estate industry, they further increase the classification accuracy.

The paper also shows which aspects have to be taken into account at an early stage when digitizing extensive data and document inventories, since automation using machine learning can only be as good as the quality, legibility and interpretability of the data allow.
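The standardization step mentioned above (removing non-relevant content and applying stemmer and stop word lists) might look roughly like the following sketch. The actual lists developed in the paper are not given in the abstract, so `STOP_WORDS` and `SUFFIXES` below are hypothetical placeholders, and the suffix-stripping stemmer is a deliberately crude stand-in.

```python
import re

# Hypothetical stop words: general function words plus terms assumed to be so
# common in real estate documents that they carry little class information.
STOP_WORDS = {"the", "and", "of", "for", "in", "to", "a", "is", "property", "building"}

# Hypothetical suffix list for a crude stemmer, longest suffixes first.
SUFFIXES = ("ations", "ation", "ments", "ment", "ings", "ing", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, keep alphabetic tokens only, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


normalized = normalize("The inspections of the building and the lease agreements")
print(normalized)
```

Dropping a domain term such as "building" is a design choice that only pays off if the term appears across most classes; a word that is frequent overall but concentrated in a few classes would be better kept.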

Suggested Citation

  • Mario Bodenbender & Björn-Martin Kurzrock, 2019. "Challenges in Machine Learning for Document Classification in the Real Estate Industry," ERES eres2019_370, European Real Estate Society (ERES).
  • Handle: RePEc:arz:wpaper:eres2019_370

    Download full text from publisher

    File URL: https://eres.architexturez.net/doc/oai-eres-id-eres2019-370
    Download Restriction: no

    More about this item

    Keywords

    data room; Digitization; document classification; Machine Learning; real estate data;

    JEL classification:

    • R3 - Urban, Rural, Regional, Real Estate, and Transportation Economics - Real Estate Markets, Spatial Production Analysis, and Firm Location
