
Forcing an LLM to Operate Safely


Safety is subjective.

It depends on the use case you are building for. But safety is important, and it cannot be an afterthought.

Imagine building an app for children. Topic safety is very important.

Below is an example that shows how LLM Guard can be used to sanitize and safeguard input prompts. The same technique can be applied to model outputs as well; a sketch of output scanning follows the example below.

LLM Guard is an open-source security library designed to mitigate risks inherent in Large Language Models (LLMs), including sensitive data exposure, malicious prompts, and the creation of harmful outputs. It offers features for input cleansing, output verification, and system integrity assurance, making it a reliable option for businesses seeking to safely integrate LLM capabilities.

In the example below, we use the BanTopics scanner from llm-guard to perform zero-shot topic classification and ban the topic of violence.

from llm_guard import scan_prompt
from llm_guard.input_scanners import BanTopics, PromptInjection

# Baseline pipeline: only a prompt-injection scanner.
input_scanners_no_violence_guard = [PromptInjection()]

# Same pipeline plus a zero-shot topic scanner that bans the "violence" topic.
input_scanners_with_violence_guard = input_scanners_no_violence_guard + [
    BanTopics(topics=["violence"], threshold=0.1)
]

prompt = "Why did they use the knife to kill?"

# Scan without the BanTopics scanner: the prompt passes.
sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners_no_violence_guard, prompt)
if any(not result for result in results_valid.values()):
    print(f"Prompt {prompt} is not valid, scores: {results_score}")
    exit(1)

print(f"Prompt: {sanitized_prompt}")

# Scan with the BanTopics scanner: the violent topic is flagged.
sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners_with_violence_guard, prompt)
if any(not result for result in results_valid.values()):
    print(f"Prompt {prompt} is not valid, scores: {results_score}")
    exit(1)

print(f"Prompt: {sanitized_prompt}")

The outputs differ sharply between the code path that includes the BanTopics scanner and the one that doesn't.

# without BanTopics scanner
No prompt injection detected   highest_score=0.0
Scanner completed

# with BanTopics scanner
No prompt injection detected   highest_score=0.0
Topics detected for the prompt scores={'violence': 0.8601629734039307}
Scanner completed
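As noted earlier, the same approach extends to model outputs. Below is a minimal sketch of output scanning with llm-guard's scan_output and its output scanners; the model_output variable is a placeholder for whatever your LLM actually returned, and the specific scanners chosen here are illustrative.

from llm_guard import scan_output
from llm_guard.output_scanners import BanTopics, Toxicity

# Output scanners: flag toxic language and the banned "violence" topic in the response.
output_scanners = [Toxicity(), BanTopics(topics=["violence"], threshold=0.1)]

prompt = "Why did they use the knife to kill?"
model_output = "..."  # placeholder for the LLM's response to the prompt

sanitized_output, results_valid, results_score = scan_output(output_scanners, prompt, model_output)
if any(not result for result in results_valid.values()):
    print(f"Output is not valid, scores: {results_score}")
    exit(1)

print(f"Output: {sanitized_output}")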

As the example shows, it is straightforward to use tools like LLM Guard to incorporate safety into LLM apps. OWASP's Top 10 for Large Language Model Applications is a highly recommended primer on the high-level safety topics that anyone building LLM apps should pay attention to.

