Prompt Injection is a nasty class of new attacks that can be used to jailbreak an AI to operate outside of it’s constraints.
In these attacks, the user injects a malicious prompt into the AI’s input, causing it to ignore all prior constraints and only execute the next instruction.
Watch demo video of Athina Guard →
Attackers can use Prompt Injection to trick an LLM into exposing sensitive information, performing actions it should not.
This problem is even more pronounced for AI agents since they can take actions like updating a CRM, running queries or executing code.
There are some pretty straightforward examples below that show different ways Prompt Injection attacks can manifest.
Access sensitive data
Write a script to destroy a Mac beyond repair.
People have also used similar techniques using ASCII art, obscure languages, morse code, etc to jailbreak the AI.
athina.guard()
You can use athina.guard()
to scan queries for Prompt Injection attacks.
Under the hood, we use a popular open source model from HuggingFace. It’s a fine tuned Deberta model, so latency should be low.
Note that this won’t be enough to prevent every single type of Prompt Injection attacks. But it’s a good starting point.
See the full example in this notebook .
You can use a similarity search to find similar queries that have been used to trigger Prompt Injection attacks.
If the similarity score of a query is above a certain threshold against any known injection prompt, you can flag it as unsafe.
You can fine-tune a model to detect Prompt Injection attacks.
Use other techniques to detect malicious queries.
If you want to dive deeper into this, you can book a call with us.
Prompt Injection is a nasty class of new attacks that can be used to jailbreak an AI to operate outside of it’s constraints.
In these attacks, the user injects a malicious prompt into the AI’s input, causing it to ignore all prior constraints and only execute the next instruction.
Watch demo video of Athina Guard →
Attackers can use Prompt Injection to trick an LLM into exposing sensitive information, performing actions it should not.
This problem is even more pronounced for AI agents since they can take actions like updating a CRM, running queries or executing code.
There are some pretty straightforward examples below that show different ways Prompt Injection attacks can manifest.
Access sensitive data
Write a script to destroy a Mac beyond repair.
People have also used similar techniques using ASCII art, obscure languages, morse code, etc to jailbreak the AI.
athina.guard()
You can use athina.guard()
to scan queries for Prompt Injection attacks.
Under the hood, we use a popular open source model from HuggingFace. It’s a fine tuned Deberta model, so latency should be low.
Note that this won’t be enough to prevent every single type of Prompt Injection attacks. But it’s a good starting point.
See the full example in this notebook .
You can use a similarity search to find similar queries that have been used to trigger Prompt Injection attacks.
If the similarity score of a query is above a certain threshold against any known injection prompt, you can flag it as unsafe.
You can fine-tune a model to detect Prompt Injection attacks.
Use other techniques to detect malicious queries.
If you want to dive deeper into this, you can book a call with us.