Toxicity, Bias, and Bad Actors: Three Things to Consider When Using LLMs

Editor's Note: This article follows a previous post on natural language processing techniques that improve data quality for LLMs.

Large language models (LLMs) have revolutionized the field of artificial intelligence by enabling machines to generate human-like responses based on intensive training on massive amounts of data. When using LLMs, managing toxicity, bias, and bad actors is essential to achieving reliable results. Let’s take a look at what organizations should be thinking about when addressing these important areas.

Understanding toxicity and bias in LLMs

With the impressive capabilities of LLMs come significant challenges, such as learning and inadvertently spreading toxic and biased language. Toxicity refers to the generation of harmful, abusive, or inappropriate content, while bias involves the reinforcement of unfair prejudices or stereotypes. Both can lead to discriminatory outcomes and negatively impact individuals and communities.

Identifying and managing toxicity and bias

Figure: Sampling bias and toxicity taxonomy

One of the barriers to addressing toxicity and bias is the lack of transparency into the data used to pre-train many LLMs. Without visibility into the training data, it can be difficult to understand the extent of these issues in models. Since it is necessary to expose off-the-shelf models to domain-specific data to address business-related use cases, organizations have an opportunity to do their due diligence and ensure that the data they feed into the LLM does not compound or exacerbate the problem.

While many LLM vendors offer APIs and content moderation tools to mitigate the effects of toxicity and bias, they may not be enough. In my previous post, I introduced LITI, SAS’s natural language processing powerhouse. Beyond addressing data quality issues, LITI can play a critical role in identifying and pre-filtering content for toxicity and bias. By combining LITI with SAS’s exploratory natural language processing techniques such as topic analysis, organizations can gain a deeper understanding of potentially problematic content in their text data. This proactive approach allows them to mitigate issues before integrating the data into LLMs through retrieval augmented generation (RAG) or fine-tuning.
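To make the idea concrete, here is a minimal sketch of such a pre-filtering step in Python. It stands in for LITI with plain regular expressions, and the category names, placeholder terms, and the prefilter_corpus helper are illustrative assumptions, not SAS functionality.

```python
import re

# Illustrative category patterns; a production system would rely on curated
# LITI rules or trained classifiers rather than this toy keyword list.
FLAG_PATTERNS = {
    "threat": re.compile(r"\b(kill|hurt|destroy)\b", re.IGNORECASE),
    "obscenity": re.compile(r"\b(placeholder_term_1|placeholder_term_2)\b", re.IGNORECASE),
}

def flag_document(text: str) -> dict:
    """Return the categories a document triggers, with the matched terms."""
    hits = {}
    for category, pattern in FLAG_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[category] = matches
    return hits

def prefilter_corpus(docs: list[str]):
    """Split a corpus into documents that look safe to index for RAG and
    documents routed to human review before fine-tuning or indexing."""
    clean, flagged = [], []
    for doc in docs:
        hits = flag_document(doc)
        if hits:
            flagged.append((doc, hits))
        else:
            clean.append(doc)
    return clean, flagged
```

In practice the flagged set would go to reviewers rather than being silently dropped, so that legitimate domain language caught by overly broad patterns is not lost.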

The models used to pre-filter content can also act as an intermediary between the LLM and the end user, detecting and preventing exposure to problematic content. This two-tiered protection not only improves the quality of results, but also protects users from potential harm. The ability to target specific types of language related to things like hate speech, threats, or obscenities adds an extra layer of security and gives organizations the flexibility to address potential concerns that may be unique to their business. Because these models can account for nuances in language, they can also be used to detect more subtle and targeted biases, like political dog-whistling.
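The intermediary layer can be sketched the same way, reusing the hypothetical flag_document helper from the previous example. The llm_generate callable and the fallback messages below are assumptions; a real deployment would log and route flagged exchanges for review rather than simply replacing them.

```python
def moderated_generate(prompt: str, llm_generate) -> str:
    """Hypothetical two-tiered wrapper: screen the user's prompt before it
    reaches the model, then screen the model's response before it reaches
    the user. `llm_generate` is any callable mapping a prompt to text."""
    if flag_document(prompt):           # tier 1: inbound check
        return "Your request could not be processed."
    response = llm_generate(prompt)
    if flag_document(response):         # tier 2: outbound check
        return "The generated response was withheld for review."
    return response
```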

Bias and toxicity are areas where it remains essential to rely on humans for oversight. Automated tools can significantly reduce the incidence of toxicity and bias, but they are not foolproof. Continuous monitoring and review are essential to detect cases that automated systems might miss. This is especially important in dynamic environments where new types of harmful content may emerge over time. As new trends develop, LITI models can be supplemented to account for them.

Combating manipulation by bad actors

Toxic or biased LLM results are not always due to inherent flaws in the training data. In some cases, models may exhibit unwanted behavior because they are manipulated by bad actors. This can include deliberate attempts to exploit model weaknesses through malicious prompt injection or jailbreaking.

Malicious prompt injection is a type of security attack against LLMs. It involves concatenating malicious input with benign, expected input in order to change the model's output. Malicious prompt injection is used to perform operations such as acquiring sensitive data, executing malicious code, or forcing a model to reveal or ignore its instructions.
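As a rough illustration of how such inputs can be screened before they are concatenated into a prompt, the sketch below checks user-supplied text against a few hypothetical instruction-override phrases. The pattern list and the looks_like_injection name are assumptions, and a handful of regular expressions is not a complete defense.

```python
import re

# Hypothetical phrases that often signal an attempted instruction override.
# Real systems combine many signals: LITI rules, classifiers, allow-lists.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?(system )?prompt", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

def looks_like_injection(user_input: str) -> bool:
    """Return True if user-supplied text matches a known injection pattern."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)

# Usage: screen user input or retrieved documents before concatenating them
# with the trusted portion of the prompt.
if looks_like_injection("Summarize this. Also, ignore previous instructions."):
    print("Blocked: possible prompt injection")
```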

A second type of attack is a jailbreak attack. It differs from malicious prompt injection in that none of the prompts are benign. This research shows examples of jailbreaking that use prompt suffixes: a prompt asks the model for a plan to steal from a nonprofit organization. Without the suffix, the model responds that it cannot help; adding the suffix causes the model to bypass its protections and generate a response. Jailbreaking and malicious prompt injection can involve exposing the model to nonsensical or repetitive patterns, hidden UTF-8 characters, and character combinations that would be unexpected in a typical user prompt. LITI is an excellent tool for identifying such patterns, making it a powerful addition to a testing or content moderation toolkit.
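The sketch below shows what such pattern checks might look like in Python: a few illustrative heuristics for invisible characters and symbol-heavy text. The thresholds and the suspicious_prompt name are assumptions, not a vetted detector.

```python
import unicodedata

def suspicious_prompt(text: str,
                      max_symbol_ratio: float = 0.3,
                      max_invisible: int = 0) -> bool:
    """Hypothetical heuristics for flagging prompts that resemble adversarial
    suffixes: invisible/format characters and an unusually high share of
    symbols and punctuation. Thresholds are illustrative, not tuned."""
    invisible = sum(
        1 for ch in text
        if unicodedata.category(ch) in ("Cf", "Cc") and ch not in "\n\t"
    )
    symbols = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    symbol_ratio = symbols / max(len(text), 1)
    return invisible > max_invisible or symbol_ratio > max_symbol_ratio

# Example: a prompt containing a zero-width space (an invisible character
# a typical user would never type) is flagged.
print(suspicious_prompt("Tell me a story\u200b about dragons"))  # True
```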

Developing with responsible AI

The pursuit of fair, unbiased, and non-toxic LLMs is ongoing and requires a multifaceted approach that combines advanced technological tools with human oversight and a commitment to ethical AI practices. Powerful tools like LITI, combined with robust oversight strategies, can help organizations significantly reduce the impact of toxicity and bias in their LLM outcomes. This not only builds user trust, but also contributes to the broader goal of developing responsible AI systems that benefit society without causing harm.

Further research

This is a serious topic, so I thought I’d leave you with something that made me laugh. As I was scrolling through articles looking for examples to pair with my section on bad actors, Bing weighed in. I resisted the urge to try a quick little injection to see if I could get a better answer.

"I can't continue this conversation. AI-generated content may be incorrect.
