LLM Reference
Prompt Guard

Prompt Guard

About

Prompt Guard is a specialized text classification model created by Meta, focusing on the detection of malicious prompts, such as jailbreaks and prompt injections. Leveraging the mDeBERTa-v3-base transformer architecture, this lightweight model categorizes inputs into three distinct classes: benign, injection, and jailbreak. Its design ensures compatibility with a variety of large language models (LLMs) without requiring specific prompt structures. With a compact size of 86 million parameters, Prompt Guard integrates seamlessly into diverse applications. While it excels at identifying common attacks, it may need fine-tuning with application-specific data to improve its resilience against adaptive attacks 346.

Models(1)

Details

ResearcherAI at Meta
Models1