PAF
Member
- Joined: Feb 26, 2012
- Messages: 13,559
Aug 06, 2025
The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal.
Meta has scraped data from the most-trafficked domains on the internet, including news organizations, education platforms, niche forums, personal blogs, and even revenge porn sites, to train its artificial intelligence models, according to a leaked list obtained by Drop Site News. By scraping roughly 6 million unique websites, including 100,000 of the top-ranked domains, Meta has generated millions of pages of content for its AI-training pipeline.
[Image: A list of the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models. The list comes from a query run directly against Meta's database using internal software called Spidermate, and has been reformatted for source protection purposes.]
The scraping of data to train AI models has become a major controversy in recent years, with publishers filing lawsuits accusing budding AI companies of effectively stealing their content to build their platforms. Meta itself has been targeted by lawsuits from authors who accuse the company of copyright infringement for using their work in its models. AI models require tremendous amounts of training data to work effectively. In one notorious instance that alarmed privacy experts, Clearview AI, a startup founded in 2017, scraped more than 3 billion images from social media to develop a facial recognition tool used by intelligence and law enforcement agencies. The company was later hit with a wave of invasion-of-privacy lawsuits.
A lack of transparency about the inputs companies use to develop their AI programs, including fears that extreme or illegal content could be shaping these models, compounds the existing ethical concerns over potentially stealing content from writers, publishers, and ordinary people simply sharing material online. A 2023 Stanford University investigation found that the popular Stable Diffusion text-to-image platform had been trained on hundreds of images of child exploitation, raising major ethical questions about what goes into these models and what comes out of them.
Continue to full article:
www.dropsitenews.com
LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI
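For context, the "guardrails" the article refers to are presumably robots.txt-style directives, the standard signal sites publish to tell crawlers what they may and may not fetch. As a rough illustration only (not Meta's actual code; the bot name and URL below are placeholders), here is a minimal Python sketch of what a crawler that honors those guardrails does before scraping a page:

```python
# Minimal sketch of a robots.txt check, using only the Python standard library.
# "ExampleBot" and the sample URL are placeholders, not anything from the article.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(page_url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the site's robots.txt permits user_agent to fetch page_url."""
    parts = urlsplit(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse the site's robots.txt
    return parser.can_fetch(user_agent, page_url)

# Usage: a compliant crawler skips any page where this returns False.
if __name__ == "__main__":
    print(allowed_to_fetch("https://example.com/some-article"))
```

The allegation in the piece is that Meta's scraper sidesteps exactly this kind of opt-out signal rather than respecting it.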
