
From scraping to ethical sharing: Initial considerations for Virtuous Innovative Approaches and Data Use Collaboration in AI Training (VIADUCT)

Authors

Listed:
  • Jean Constantin

    (Inria Siège - Inria - Institut National de Recherche en Informatique et en Automatique)

  • Yann Dietrich

    (Atos)

  • Marie Langé

  • Bertrand Monthubert

    (IMT - Institut de Mathématiques de Toulouse UMR5219 - UT Capitole - Université Toulouse Capitole - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - INSA Toulouse - Institut National des Sciences Appliquées - Toulouse - INSA - Institut National des Sciences Appliquées - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - UT2J - Université Toulouse - Jean Jaurès - Comue de Toulouse - Communauté d'universités et établissements de Toulouse - CNRS - Centre National de la Recherche Scientifique - EPE UT - Université de Toulouse - Comue de Toulouse - Communauté d'universités et établissements de Toulouse, Equipe BIOETHICS (CERPOP) - CERPOP - Centre d'Epidémiologie et de Recherche en santé des POPulations - INSERM - Institut National de la Santé et de la Recherche Médicale - EPE UT - Université de Toulouse - Comue de Toulouse - Communauté d'universités et établissements de Toulouse)

Abstract

The rapid development of artificial intelligence (AI) relies on access to vast volumes of data throughout its lifecycle. The sourcing of this data has relied on legally and ethically contentious practices, particularly the indiscriminate scraping of publicly available and often copyrighted content. Popular datasets like CommonCrawl and LAION 5B contain copyrighted works and personal data used without explicit permission or compensation for data holders. This approach has triggered a global backlash, with over 50 lawsuits filed against AI developers and increasing technical barriers against scraper robots. Leaders in the AI industry now warn of "peak data", as public human-generated content will soon be exhausted. This scarcity conflicts with AI's ever-growing appetite for high-quality expert data to support increasingly advanced applications. Data for AI is not uniform but spans multiple domains and governance regimes, which can evolve or overlap depending on context and jurisdiction. Each of these regimes (copyrighted content, personal data, trade secrets, government data, and open data) is constrained by distinct legal and technical restrictions. Copyrighted materials require permission from rights holders, yet enforcement of opt-out decisions remains inconsistent. Personal data is protected under GDPR, demanding anonymisation and clear legal grounds for processing, while trade secret datasets are shielded by confidentiality agreements. Government data, though mandated to be open, often remains inaccessible due to sensitivity or infrastructure limitations. Open data, while legally permissive, suffers from fragmentation and underinvestment. These disparities create a fragmented landscape where data sharing is hindered by transaction costs, confidentiality requirements, and misaligned incentives.

Efforts to address these challenges have produced partial solutions. Opt-out mechanisms like ai.txt and TDMRep allow data holders to declare preferences but lack standardisation. Privacy-preserving techniques enable secure data processing but at high computational cost. Licensing agreements can bring legal clarity but are hindered by contractual complexity. Data attribution models, designed to compensate data holders, remain impractical at scale. No single solution suffices, highlighting the need for context-specific approaches that balance innovation with data holders' interests. Fostering ethical data sharing is not trivial and requires addressing multiple technical, economic and legal obstacles. The VIADUCT initiative proposes an experimental approach, engaging with data holders and AI developers to characterize constraints and explore innovative data sharing approaches.

Suggested Citation

  • Jean Constantin & Yann Dietrich & Marie Langé & Bertrand Monthubert, 2025. "From scraping to ethical sharing: Initial considerations for Virtuous Innovative Approaches and Data Use Collaboration in AI Training (VIADUCT)," Working Papers hal-05581044, HAL.
  • Handle: RePEc:hal:wpaper:hal-05581044
    Note: View the original document on HAL open archive server: https://hal.science/hal-05581044v1

    Download full text from publisher

    File URL: https://hal.science/hal-05581044v1/document
    Download Restriction: no
