Google has open-sourced SynthID Text, a watermarking tool that lets users easily detect whether text is original or AI-generated.
According to Google, the AI-text detector can be downloaded from the AI platform Hugging Face and from Google’s updated Responsible GenAI Toolkit. Watermarks have grown significant with the generative AI boom of the past two years, as LLMs are exploited to spread misinformation and disinformation, generate non-consensual sexual content, and serve other malicious purposes.
The development comes amid urgency to build such tools: the European Union’s law enforcement agency has warned that 90% of online text could be synthetic by 2026, making propaganda, fraud, and deception rife.
Google researchers explain their watermarking tool
In a post on X, the search engine giant announced that it is open-sourcing its SynthID Text watermarking tool, saying it will be “available freely to businesses and developers as it will help them identify their AI-generated content.”
Pushmeet Kohli, vice president of research at Google DeepMind and a co-author of the research paper, said: “The system does not compromise the functions of the AI models but just makes them better.”
Images and videos have been central to discussions about content credentials, and watermarks have been earmarked as the solution needed to combat deepfakes.
The Coalition for Content Provenance and Authenticity (C2PA), a collaboration among technology companies and major media outlets to develop a system for attaching encrypted metadata to AI-generated image and video files, has also been central to these discussions.
According to Google DeepMind’s research paper, SynthID Text intervenes during generation, altering some of the words a chatbot puts out in a way that is clear to a SynthID detector but nearly invisible to humans.
“Modifications like these bring in a statistical signature into the AI-generated text, and during the watermark detection phase, the signature can be measured to see whether the text was generated by the watermarked LLM,” the researchers wrote in the paper.
The LLMs that power chatbots generate text word by word, choosing the likely next word based on the context of what came before. SynthID randomly assigns number scores to candidate words and nudges the model toward producing words with higher scores. The detector later analyzes a piece of text and checks whether it contains a disproportionate share of high-scoring words, which marks it as watermarked.
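The mechanism described above can be sketched in miniature. This is a toy illustration only, not Google’s actual algorithm (which operates on model token probabilities and uses far more sophisticated sampling); the function names and the fixed detection threshold are assumptions made for the sketch. Both the generator and the detector recompute the same pseudorandom per-word scores from the preceding context, which is what lets detection work without storing anything.

```python
import hashlib

def word_score(context: tuple, word: str) -> float:
    """Deterministic pseudorandom score in [0, 1) for a candidate word,
    keyed to the preceding context. Generator and detector both recompute it."""
    key = hashlib.sha256((" ".join(context) + "|" + word).encode()).hexdigest()
    return int(key, 16) / 16**64

def pick_word(context: tuple, candidates: list) -> str:
    """Watermarked sampling: among plausible candidates, prefer the one
    with the highest pseudorandom score."""
    return max(candidates, key=lambda w: word_score(context, w))

def detect(words: list, threshold: float = 0.6) -> bool:
    """Recompute each word's score given its context; watermarked text
    should have an unusually high mean score (unwatermarked text ~0.5)."""
    scores = [word_score(tuple(words[:i]), w) for i, w in enumerate(words)]
    return sum(scores) / len(scores) > threshold
```

In this sketch, a sequence built with `pick_word` accumulates high scores, while ordinary text averages around 0.5, so the detector separates the two statistically. It also shows why heavy editing or paraphrasing defeats detection: replacing words with ones the generator never scored erodes the signature.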
Today, we’re open-sourcing our SynthID text watermarking tool through an updated Responsible Generative AI Toolkit.
Available freely to developers and businesses, it will help them identify their AI-generated content. 🔍
Find out more → https://t.co/n2aYoeJXqn pic.twitter.com/4uRKYaz57Y
— Google DeepMind (@GoogleDeepMind) October 23, 2024
Industry experts commend Google for step in the right direction
Although the DeepMind system outperforms other tools at watermarking text, the researchers acknowledged in their paper that it still has flaws. For example, if Gemini-generated text is altered, the detector can be fooled.
“While SynthID is not a silver bullet for identifying AI-generated content, it is an important building block for developing more reliable AI identification tools.”
– Kohli.
If users alter the text significantly or use another chatbot to summarize it, the detector can be thrown off and fail to flag the AI-generated text.
Google claims that SynthID Text does not compromise the quality, accuracy, or speed of text generation, based on tests of the system integrated into its Gemini models. The company added that it even works on text that has been cropped, modified, or paraphrased.
“Detection is a particular problem when one starts to factor in implementation in real situations, as there are problems with the review of text in the wild, where one has to know which watermarking model has been applied to know how and where to look for the signal,” explained Bruce MacCormack, a member of the C2PA steering committee.
Besides Google, OpenAI has also worked on AI text watermarking technology for years but has delayed a release over concerns about technical and commercial viability.
But industry experts have generally commended Google’s initiative as a step in the right direction.
“It holds promise for improving the use of durable content credentials from C2PA for documents and raw text,” said Andrew Jenks, Microsoft’s director of media provenance and executive chair of the C2PA.
MacCormack added that while Google researchers still have much to do to make this practical, it remains a great initiative and “the first step in the marathon ahead.”