Court documents show that Meta staffers discussed using copyrighted content to train the company’s artificial intelligence models. According to the documents, most of the discussions concerned content obtained through legally questionable means.
The documents were made available by the plaintiffs in Kadrey v. Meta, one of the many AI copyright cases moving through the United States courts. Meta claims that its use of copyrighted works, especially books, to train its models falls under fair use. However, the plaintiffs, led by Sarah Silverman and Ta-Nehisi Coates, dispute the company’s claims.
According to previously submitted documents, Meta CEO Mark Zuckerberg gave the company’s AI team approval to use copyrighted content to train its models. The documents also showed that the company cut off its data licensing talks with book publishers.
Meta allegedly uses copyrighted content to train its AI models
According to new filings made available at the court, internal work chats shared between workers at Meta have shown a clear picture of how the company may have used copyrighted data to train its AI models, including most of the models in the Llama family.
One of the chats involved a senior executive of Meta, Melanie Kambadur, who is the Senior Manager for the Llama research team. In her chat, she talked about training the AI models on content that was not legally justified.
“My opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” Meta research engineer Xavier Martinet said in a chat dated February 2023, according to the filings. “This is why they set up this gen ai org for [sic]: so we can be less risk averse,” he added.
Martinet suggested that the company could simply buy e-books at retail prices to build the training dataset, instead of entering into licensing deals with different publishers. When another employee cautioned against using such content, citing the legal ramifications, Martinet replied that other firms were probably also using pirated books for training.
In the same chat, Kambadur mentioned that the company was in talks with some platforms for licenses, but noted that while using publicly available data would require approvals, the company’s lawyers were less conservative than they had been in the past. “Difference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals,” Kambadur said.
Employees discuss using Libgen
According to the filing, another work chat shows Kambadur discussing the use of Libgen, an aggregator website that provides links to copyrighted content from publishers, as a data source that Meta could license. Libgen has been sued on numerous occasions and has been ordered to shut down its services over claims of copyright infringement.
Another colleague in the chat posted a picture followed by the message “No, Libgen is not legal.” Even so, some senior executives appeared to believe that failing to use Libgen could hamper the company’s competitiveness in the AI race. In an email to Meta AI VP Joelle Pineau, Meta’s director of product management Sony Theakanath noted that Libgen was important to meet state-of-the-art (SOTA) numbers across all categories.
Theakanath also outlined several ways the company could reduce its legal exposure, including removing data that has been marked as stolen/pirated and not citing its usage publicly. “We would not disclose the use of Libgen datasets used to train,” he said. In practice, the move meant that the company would first go through the Libgen files to check for “stolen or pirated” works.
Court documents reveal other infringements
In one of the work chats, Kambadur also suggested that the Meta AI team should tune models to “avoid risky IP prompts”, which would configure the models to refuse to answer users trying to find out which e-books the models were trained on.
The filings also revealed other details, including that Meta may have used Reddit data to train its model to mimic the behavior of a third-party application called Pushshift. Reddit said in an April 2023 statement that it would start charging AI firms to access data to train their models.
The plaintiffs in the current case have amended their complaint many times since the lawsuit began in 2023. The case was filed in the US District Court for the Northern District of California, San Francisco. In the latest amendment, the plaintiffs claimed that Meta cross-referenced pirated books with copyrighted ones to determine if it would be ideal to pursue a licensing deal. Meta, for its part, sees the case as a high-stakes legal issue and has moved to add two Supreme Court litigators to its defense team.