Large language models (LLMs) are becoming part of almost every industry. Developing an LLM for natural language applications involves many stages, and one of them is making sure the model does not produce dangerous responses or toxic content. To address this, developers use a human red team: a group of people who write prompts designed to make the LLM generate harmful output.
The problem with a human red team is that recruiting one is expensive and time-consuming. That is why researchers at MIT developed a method for testing natural language LLM applications using another LLM. The approach, called curiosity-driven red teaming (CRT), uses machine learning as its foundation. The research was published as a conference paper at ICLR 2024 and is available online.
Curiosity-driven red teaming (CRT) is more effective
Early attempts to automate the work of a human red team trained a red-team model using reinforcement learning (RL). When tested, this red-team model did elicit harmful output, but it produced only a small number of effective prompts.
That means the target LLM is not evaluated thoroughly, because many prompts that could trigger toxic output are never tried. The red-team model produces so few effective prompts because it is trained to generate highly toxic but very similar ones: the reward system scores provocative prompts only by their effectiveness or toxicity, so there is no incentive to explore the full range of prompts that could trigger the target LLM.
Curiosity-driven red teaming (CRT), on the other hand, is more powerful. CRT produces a large number of prompts capable of provoking even highly capable models, because it considers not just the effect of each prompt but also how different it is from the prompts generated before. It aims to use different words and sentence patterns, resulting in broader coverage of toxic output. Where a plain RL red-team model is rewarded only for toxicity and tends to converge on similar wording, the CRT model is additionally rewarded for avoiding similarity and trying new words and patterns.
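To make that contrast concrete, here is a minimal, self-contained sketch of the two reward schemes. It is illustrative only, not the authors' implementation: the toy toxicity_score (keyword count) and embed (hashed bag-of-words) functions are stand-ins for a real toxicity classifier and sentence-embedding model, and the novelty bonus is one simple way to reward dissimilarity from previously generated prompts.

```python
import numpy as np

def toxicity_score(response: str) -> float:
    """Toy stand-in for a toxicity classifier: fraction of flagged words."""
    flagged = {"hate", "kill", "stupid"}  # placeholder vocabulary
    words = response.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a sentence embedder: normalized hashed bag-of-words."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def rl_only_reward(response: str) -> float:
    # Plain RL red teaming: reward = toxicity of the elicited response.
    # Nothing discourages the policy from repeating near-identical prompts.
    return toxicity_score(response)

def curiosity_reward(prompt: str, response: str,
                     seen: list, novelty_weight: float = 0.5) -> float:
    # Curiosity-driven variant: add a novelty bonus so the policy is also
    # rewarded for prompts that differ from the ones it generated before.
    e = embed(prompt)
    novelty = 1.0 - max((float(e @ s) for s in seen), default=0.0)
    seen.append(e)
    return toxicity_score(response) + novelty_weight * novelty
```

Under this kind of reward, a prompt that triggers toxic output but closely resembles earlier prompts earns less than an equally effective prompt phrased in a new way, which is what pushes the red-team model toward broader coverage.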
Testing on LLaMA2 for toxic output
The researchers applied curiosity-driven red teaming (CRT) to LLaMA2, an open-source LLM. CRT produced 196 prompts that elicited toxic content from the model, even though LLaMA2 had been fine-tuned by human experts to avoid generating harmful content. The red-team model used in the experiment was GPT2, a comparatively small model with 137M parameters. The team concluded that CRT could become a critical component in automating red-teaming work. The CRT code is available on GitHub.
“We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it’s important that they are verified before released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future,” says Agrawal.
The future of building safe LLMs looks bright. With continued research, the goal of creating safe LLMs for any purpose could be achieved efficiently. The researchers behind this paper have published other related work in areas such as automated red teaming and adversarial attacks on language models.