Leaked data shows China is using large language models to boost its censorship machine.
A recent leak reveals that a sophisticated AI system, built on 133,000 examples of user content, is designed to flag any material deemed sensitive by the government.
The dataset, discovered by a security researcher and seen by TechCrunch, shows that China is taking steps to extend its online control far beyond topics such as the Tiananmen Square massacre.
China is using modern tech to filter online content
The leak, whose most recent entries date to December 2024, is a clear sign that Chinese authorities or their affiliates are using new technology to filter online content.
The database includes complaints about poverty in rural China, news reports on corrupt Communist Party members, and pleas for help from entrepreneurs being shaken down by corrupt police.
Each piece of content is fed into a large language model (LLM) that scans for topics that might stir up public dissent.
Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship, told TechCrunch that the leaked data is “clear evidence” of the government’s intent to use LLMs to improve repression.
Qiang explained that, unlike traditional methods, which rely on human labor for keyword filtering and manual review, an LLM can quickly and accurately identify even subtle criticism, making state-led information control more efficient and far-reaching.
The system is used not only to censor political topics but also extends to sensitive areas of social life and military affairs. According to the details in the leaked dataset, any content related to pollution, food safety scandals, financial fraud, and labor disputes is given “highest priority” for censorship.
The data shows that topics like the Shifang anti-pollution protests of 2012 are carefully monitored to prevent public unrest. Even political satire and historical analogies aimed at current political figures are instantly flagged. Content relating to Taiwan politics is also targeted, with military matters – including reports of movements, exercises, and details of weaponry – drawing close scrutiny.
‘Taiwan’ appears 15,000 times in China’s censorship dataset
A notable detail in the leaked content is that the Chinese word for Taiwan (台湾) appears over 15,000 times, underlining the focus on any discussion that might challenge the official narrative.
Other sensitive content in the dataset includes commentary about Taiwan’s military capabilities and details regarding a new Chinese jet fighter. Even subtle forms of dissent are not spared; one example found in the database is an anecdote about the fleeting nature of power using the popular Chinese idiom “When the tree falls, the monkeys scatter.”
Security researcher NetAskari uncovered the dataset, which was stored in an unsecured Elasticsearch database on a Baidu server.
“Public opinion work” is the term for the censorship and propaganda efforts overseen by the powerful Cyberspace Administration of China (CAC). Michael Caster, the Asia program manager for rights organization Article 19, explained that such work is designed to ensure that the government’s narratives remain dominant online.
A report from OpenAI last month also revealed that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations – particularly those calling for human rights protests – and forwarded the information to the Chinese government. The same report noted that the technology was used to generate comments highly critical of prominent Chinese dissident Cai Xia.
Traditional censorship in China has often relied on basic algorithms that automatically block content containing blacklisted terms such as “Tiananmen massacre” or “Xi Jinping.” Users have experienced this firsthand with tools like DeepSeek. However, newer systems can detect even subtle criticism at a large scale, and they improve as they are fed more data.