
A schema aware ETL workflow generator

Author

Listed:
  • Naiqiao Du

    (Tsinghua University)

  • Xiaojun Ye

    (Tsinghua University
    Ministry of Education
    Tsinghua National Laboratory for Information Science and Technology (TNList))

  • Jianmin Wang

    (Tsinghua University
    Ministry of Education
    Tsinghua National Laboratory for Information Science and Technology (TNList))

Abstract

Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. As ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of ETL facilities, synthetic ETL workflow test cases covering both control-flow and data-flow aspects are needed to check ETL facility functionality at construction time and to validate the correctness and performance of ETL facilities at run time. Although some approaches to generating synthetic workflows and data set test cases exist in the literature, little work considers both aspects at the same time, specifically for ETL workflow generators. To address this issue, this paper proposes a schema aware ETL workflow generator with which users can characterize their ETL workflows by various parameters and obtain ETL workflow test cases comprising the control flow of ETL activities, compliant schemas and associated recordsets. Our generator works in three steps. First, given the types and ratios of individual activities and parameters specifying their connection characteristics, the generator produces ETL activities and forms an ETL skeleton, which determines how the generated activities cooperate with each other. Second, given parameters specifying schema transformation characteristics, e.g. ranges for the number of attributes, the generator resolves attribute dependencies and refines input/output schemas with compliant attributes and their data types. In the last step, recordsets are generated following cardinality specifications. In the experiments, ETL workflows in specific patterns are produced to demonstrate the capability of the generator. Further experiments that generate thousands of ETL workflow test cases in seconds verify its usability.
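
The three steps outlined in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the activity kinds ("extract", "filter", "project", "load"), the function names, the parameter names (`ratios`, `attr_range`, `cardinality`, `seed`) and the rule that each activity reads from exactly one earlier activity are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field

TYPES = ["int", "float", "str"]  # hypothetical set of attribute data types


@dataclass
class Activity:
    name: str
    kind: str                                    # assumed kinds: extract/filter/project/load
    inputs: list = field(default_factory=list)   # names of upstream activities
    schema: dict = field(default_factory=dict)   # attribute name -> data type


def build_skeleton(n, ratios, rng):
    """Step 1: create activities by type ratio and wire them into a skeleton."""
    middle = rng.choices(list(ratios), weights=list(ratios.values()), k=n - 2)
    kinds = ["extract"] + middle + ["load"]
    acts = [Activity(f"a{i}", kind) for i, kind in enumerate(kinds)]
    for i in range(1, n):
        # Each activity reads from one earlier activity, so the graph is acyclic.
        acts[i].inputs.append(acts[rng.randrange(i)].name)
    return acts


def assign_schemas(acts, attr_range, rng):
    """Step 2: give sources a schema and derive downstream schemas from inputs."""
    by_name = {a.name: a for a in acts}
    for a in acts:  # list order is already a topological order
        if not a.inputs:
            count = rng.randint(*attr_range)
            a.schema = {f"c{j}": rng.choice(TYPES) for j in range(count)}
            continue
        parent = by_name[a.inputs[0]].schema
        if a.kind == "project" and len(parent) > 1:
            # A projection keeps a non-empty proper subset of its input attributes.
            keep = rng.sample(sorted(parent), rng.randint(1, len(parent) - 1))
            a.schema = {c: parent[c] for c in sorted(keep)}
        else:
            a.schema = dict(parent)  # other kinds pass the schema through


def make_records(schema, cardinality, rng):
    """Step 3: generate `cardinality` rows that conform to the schema."""
    gen = {
        "int": lambda: rng.randint(0, 99),
        "float": lambda: round(rng.random(), 3),
        "str": lambda: rng.choice("abc") * 3,
    }
    return [{c: gen[t]() for c, t in schema.items()} for _ in range(cardinality)]


def generate(n=5, ratios=None, attr_range=(2, 4), cardinality=3, seed=0):
    """Run the three steps; n must be at least 2 (one extract, one load)."""
    rng = random.Random(seed)
    ratios = ratios or {"filter": 0.5, "project": 0.5}
    acts = build_skeleton(n, ratios, rng)
    assign_schemas(acts, attr_range, rng)
    records = {a.name: make_records(a.schema, cardinality, rng)
               for a in acts if not a.inputs}
    return acts, records
```

Calling `generate(n=6, seed=1)` returns a list of six activities in topological order plus a recordset for the single source activity. Fixing the seed makes a generated test case reproducible, which is one plausible way to regenerate the same workflow when comparing ETL facilities.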

Suggested Citation

  • Naiqiao Du & Xiaojun Ye & Jianmin Wang, 2014. "A schema aware ETL workflow generator," Information Systems Frontiers, Springer, vol. 16(3), pages 453-471, July.
  • Handle: RePEc:spr:infosf:v:16:y:2014:i:3:d:10.1007_s10796-012-9352-2
    DOI: 10.1007/s10796-012-9352-2

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10796-012-9352-2
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s10796-012-9352-2?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


