Security researcher Johann Rehberger recently discovered a vulnerability in ChatGPT that enabled attackers to save false information and harmful instructions in a user’s long-term memory settings.
However, OpenAI quickly closed the investigation, classifying the issue as a safety concern rather than a security threat.
In response, Rehberger took the initiative, developing a proof-of-concept exploit that demonstrated how the vulnerability could be used to extract all user inputs indefinitely. Earlier this month, OpenAI engineers responded by implementing a partial fix.
The vulnerability exploited the long-term conversation memory feature that OpenAI began testing in February and rolled out more widely in September.
This memory function in ChatGPT retains information from past interactions, using it to inform future conversations.
As a result, the language model can remember specifics like a user’s age, gender, philosophical beliefs, and various other details, eliminating the need for users to re-enter this information in each session.
Within three months of the feature’s launch, Rehberger discovered that memories could be generated and permanently saved through indirect prompt injection, an AI exploit that manipulates a language model to accept instructions from untrusted sources, such as emails, blog posts, or documents.
He showcased how he could mislead ChatGPT into thinking that a specific user was 102 years old, resided in the Matrix, and believed that the Earth was flat. This erroneous information would then be used in all future conversations.
These fabricated memories could be introduced by uploading files to Google Drive or Microsoft OneDrive, sharing images, or visiting a site like Bing, all of which could be done by a malicious actor.
In May, Rehberger confidentially informed OpenAI about his discovery. However, the company closed the report ticket that same month.
A month later, he submitted a new disclosure statement, this time providing a proof-of-concept that enabled the ChatGPT app for macOS to transmit a direct copy of all user inputs and outputs to a server he controlled.
The only action required from a target was to direct the LLM to visit a web link containing a harmful image. After that, all interactions with ChatGPT were forwarded to the attacker’s website.
Rehberger pointed out in the video demonstration, “What’s particularly intriguing is that this is now memory-persistent.”
He explained that the prompt injection added a memory to ChatGPT’s long-term storage, meaning that even when a new conversation starts, it continues to exfiltrate the data.
However, this type of attack cannot be executed through the ChatGPT web interface, thanks to an API that OpenAI introduced last year.
Although OpenAI has implemented a fix to stop memories from being exploited as a means of data exfiltration, the researcher noted that untrusted content can still initiate prompt injections, allowing the memory tool to retain long-term information inserted by a malicious actor.
Users of the language model should remain vigilant during sessions for any output that suggests a new memory has been created.
Additionally, they should frequently check their stored memories for any entries that may have originated from unreliable sources.
OpenAI offers guidance on how to manage the memory tool and review specific memories stored within it.
However, company representatives did not reply to an email inquiring about measures to prevent other hacks that could insert false memories.
Related News You May Like