In a recent investigation, the Stanford Internet Observatory (SIO) identified hundreds of known images of child sexual abuse material (CSAM) in an open dataset used to train popular AI text-to-image generation models, including Stable Diffusion. The findings highlight how openly scraped, minimally vetted datasets can introduce illegal content into the development of advanced artificial intelligence (AI) models.
Uncovering disturbing training data sources
The SIO investigation found that these AI models were trained directly on CSAM present in the LAION-5B dataset, which comprises billions of images scraped from sources including mainstream social media websites and popular adult video sites. The finding raises concerns that datasets tainted with illegal and harmful content can inadvertently perpetuate child exploitation.
Swift actions to address the issue
Upon identifying the source material, researchers began the removal process by reporting image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P). Hashing tools such as PhotoDNA played a crucial role, matching image fingerprints against databases maintained by nonprofits dedicated to combating online child sexual exploitation and abuse.
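To illustrate the general idea of fingerprint matching, the sketch below compares perceptual hashes of local images against a list of known hashes. It is a minimal illustration only: PhotoDNA itself is proprietary and available only to vetted organizations, and the file names, hash list, and distance threshold here are assumptions, not details from the SIO report.

```python
# Illustrative sketch only: PhotoDNA is proprietary, so this uses the
# open-source `imagehash` library to show the general shape of
# perceptual-hash matching against a list of known hashes.
from pathlib import Path

import imagehash          # pip install ImageHash
from PIL import Image     # pip install Pillow

# Hypothetical file of known perceptual hashes (one hex string per line);
# real hash lists are maintained by organizations such as NCMEC and C3P.
KNOWN_HASHES_FILE = Path("known_hashes.txt")
MAX_HAMMING_DISTANCE = 5  # tolerance for near-duplicate matches

def load_known_hashes(path: Path) -> list[imagehash.ImageHash]:
    """Parse one hex-encoded perceptual hash per line."""
    return [imagehash.hex_to_hash(line.strip())
            for line in path.read_text().splitlines() if line.strip()]

def matches_known_hash(image_path: Path,
                       known: list[imagehash.ImageHash]) -> bool:
    """Return True if the image's hash is close to any known hash."""
    candidate = imagehash.phash(Image.open(image_path))
    return any(candidate - known_hash <= MAX_HAMMING_DISTANCE
               for known_hash in known)

if __name__ == "__main__":
    known = load_known_hashes(KNOWN_HASHES_FILE)
    for img in Path("dataset_images").glob("*.jpg"):
        if matches_known_hash(img, known):
            print(f"MATCH: {img} — flag for review and reporting")
```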
Challenges in cleaning open datasets
While there are methods to minimize the presence of CSAM in training datasets, the report underscores the challenges in cleaning or halting the distribution of open datasets lacking a central authority. The absence of a hosting entity for these datasets complicates efforts to ensure their integrity and safety. The study emphasizes the need for proactive measures to prevent the inadvertent inclusion of illegal content in AI training data.
Safety recommendations for future dataset handling
In light of these findings, the report outlines safety recommendations for collecting datasets, training models, and hosting models trained on scraped datasets. It advocates for thorough checks of images against known lists of CSAM using detection tools like Microsoft’s PhotoDNA. Collaboration with child safety organizations, such as NCMEC and C3P, is also recommended to ensure the ethical and lawful use of AI technology.
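As a rough illustration of what such a pre-training check might look like, the sketch below filters a LAION-style metadata shard against a URL blocklist before any images are downloaded. The column name, file paths, and blocklist format are assumptions for illustration; production pipelines would rely on vetted hash databases and tools such as PhotoDNA through partner organizations rather than a simple URL list.

```python
# Illustrative sketch only: a pre-training filtering pass over a
# LAION-style metadata file. Paths and the "URL" column are assumptions.
import pandas as pd  # pip install pandas pyarrow

BLOCKLIST_FILE = "url_blocklist.txt"            # hypothetical list of flagged URLs
METADATA_IN = "laion_subset.parquet"            # hypothetical metadata shard
METADATA_OUT = "laion_subset.filtered.parquet"

def load_blocklist(path: str) -> set[str]:
    """Read one flagged URL per line into a set for fast lookup."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

def filter_metadata(metadata_path: str, blocklist: set[str]) -> pd.DataFrame:
    """Drop rows whose URL appears on the blocklist before download/training."""
    df = pd.read_parquet(metadata_path)
    return df[~df["URL"].isin(blocklist)].reset_index(drop=True)

if __name__ == "__main__":
    blocklist = load_blocklist(BLOCKLIST_FILE)
    filtered = filter_metadata(METADATA_IN, blocklist)
    filtered.to_parquet(METADATA_OUT)
    print(f"Kept {len(filtered)} rows after blocklist filtering")
```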
As AI continues to advance, the responsible handling of training datasets becomes paramount to prevent unintentional contributions to illicit activities. The SIO’s investigation serves as a wake-up call for the AI community, urging stakeholders to adopt stringent measures in dataset curation, model training, and collaboration with relevant child protection agencies.
In response to these revelations, the AI community is prompted to reevaluate its ethical standards and take decisive action to address the unintentional use of CSAM in training datasets. By implementing the recommended safety measures, the industry can develop AI technology responsibly and ethically, safeguarding against the unintended consequences of unchecked dataset sources.
The findings of the SIO investigation underscore the importance of vigilance in an era where technological advancements must be accompanied by a strong commitment to ethical AI development. Collaboration between researchers, industry leaders, and child protection organizations is essential to ensuring that AI technology progresses in a manner that aligns with societal values and prioritizes the well-being of vulnerable individuals.