
Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques

Authors

  • Lisa M Gandy
  • Jordan Gumm
  • Benjamin Fertig
  • Anne Thessen
  • Michael J Kennish
  • Sameer Chavan
  • Luigi Marchionni
  • Xiaoxin Xia
  • Shambhavi Shankrit
  • Elana J Fertig

Abstract

Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets are often curated independently and are commonly stored in unstructured spreadsheets. Standardized annotations are essential for synthesis studies across investigators, but are often not used in practice. Accurately combining spreadsheet records from different studies therefore requires tedious and error-prone human curation, which imposes a significant time and cost barrier on synthesis research. We propose an information-retrieval-inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Applied to cancer and ecological datasets, the Synthesize algorithm achieved high accuracy (on the order of 85–100%). We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts spreadsheets in comma-separated value (CSV) format as input, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface, which lets the user finish combining the data and thereby obtain perfect accuracy. Future work will add detection of units, so that continuous data can be merged automatically, and will extend the algorithm to other data formats, including databases.
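
To make the idea of merging on "both column labels and values" concrete, here is a minimal sketch in Python. It is not the published Synthesize algorithm: the function names (read_columns, match_columns, merge_columns), the scoring scheme (difflib string ratio for labels, Jaccard overlap for values), the equal weighting, and the 0.4 threshold are all assumptions chosen only for illustration, as are the two tiny example spreadsheets.

```python
import csv
import io
from difflib import SequenceMatcher


def read_columns(csv_text):
    """Parse CSV text into {column label: list of cell values}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {label: [] for label in (reader.fieldnames or [])}
    for row in reader:
        for label in columns:
            columns[label].append(row[label] or "")
    return columns


def label_similarity(a, b):
    """Case-insensitive string similarity between two column labels (0..1)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def value_similarity(xs, ys):
    """Jaccard overlap between the sets of distinct, normalized cell values."""
    sx = {x.strip().lower() for x in xs if x.strip()}
    sy = {y.strip().lower() for y in ys if y.strip()}
    if not sx or not sy:
        return 0.0
    return len(sx & sy) / len(sx | sy)


def match_columns(left, right, weight=0.5, threshold=0.4):
    """Greedily pair each left column with its best-scoring unused right column.

    The weight and threshold are illustrative assumptions, not tuned values.
    """
    matches, used = {}, set()
    for l_label, l_vals in left.items():
        best, best_score = None, threshold
        for r_label, r_vals in right.items():
            if r_label in used:
                continue
            score = (weight * label_similarity(l_label, r_label)
                     + (1 - weight) * value_similarity(l_vals, r_vals))
            if score >= best_score:
                best, best_score = r_label, score
        if best is not None:
            matches[l_label] = best
            used.add(best)
    return matches


def merge_columns(left, right, matches):
    """Stack matched columns; pad unmatched columns with empty cells."""
    n_left = max((len(v) for v in left.values()), default=0)
    n_right = max((len(v) for v in right.values()), default=0)
    merged = {}
    for l_label, l_vals in left.items():
        r_label = matches.get(l_label)
        merged[l_label] = l_vals + (right[r_label] if r_label else [""] * n_right)
    matched_right = set(matches.values())
    for r_label, r_vals in right.items():
        if r_label not in matched_right:
            merged[r_label] = [""] * n_left + r_vals
    return merged


# Two hypothetical studies that record the same variables under different labels.
STUDY_A = "Sex,Tumor Stage\nmale,II\nfemale,III\n"
STUDY_B = "gender,stage\nfemale,II\nmale,IV\n"

study_a = read_columns(STUDY_A)
study_b = read_columns(STUDY_B)
matches = match_columns(study_a, study_b)
merged = merge_columns(study_a, study_b, matches)
print(matches)
print(merged)
```

In this sketch the labels "Sex" and "gender" share almost no characters, so the pairing relies on the overlap of the cell values, while "Tumor Stage" and "stage" are paired mainly through their labels. Combining the two signals is the point of the example; a real implementation would need a more careful scoring and assignment strategy.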

Suggested Citation

  • Lisa M Gandy & Jordan Gumm & Benjamin Fertig & Anne Thessen & Michael J Kennish & Sameer Chavan & Luigi Marchionni & Xiaoxin Xia & Shambhavi Shankrit & Elana J Fertig, 2017. "Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques," PLOS ONE, Public Library of Science, vol. 12(4), pages 1-15, April.
  • Handle: RePEc:plo:pone00:0175860
    DOI: 10.1371/journal.pone.0175860

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175860
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0175860&type=printable
    Download Restriction: no

