Is it feasible to have a secure AI assistant?

EXECUTIVE SUMMARY

AI agents present substantial risks. Even within the confines of a chatbox, LLMs can err and act improperly. However, once equipped with tools that enable interaction with the external environment, like web browsers and email accounts, the repercussions of their errors escalate significantly.

This might clarify why the pioneering breakthrough LLM personal assistant emerged not from the prominent AI research institutions, which must be mindful of their reputations and liability, but from an independent developer, Peter Steinberger. In November 2025, Steinberger released his tool, now named OpenClaw, on GitHub, and by late January, the project was a sensation.

OpenClaw utilizes existing LLMs to enable users to construct their own custom assistants. For many users, this entails providing extensive personal data, ranging from years of emails to hard drive contents. This has alarmed security experts considerably. The dangers associated with OpenClaw are so vast that it would likely require nearly a week for someone to review all the security blog posts that have emerged recently. The Chinese government has taken the step of issuing a public caution regarding OpenClaw’s security weaknesses.

In light of these issues, Steinberger expressed on X that non-technical individuals should refrain from utilizing the software. (He did not respond to a request for comment for this article.) Nevertheless, there is a strong demand for what OpenClaw offers, extending beyond those capable of conducting their own software security assessments. AI companies aspiring to enter the personal assistant market must devise a system that ensures the safety and security of user data. Achieving this will likely require drawing from cutting-edge agent security research methodologies.

Risk management

Essentially, OpenClaw serves as a mechanical suit for LLMs. Users have the freedom to select any LLM to operate as the pilot; this LLM gains enhanced memory functions and the ability to perform repetitive tasks at set intervals. In contrast to the agentic solutions from major AI firms, OpenClaw agents are designed to operate 24-7, with users able to interact with them via WhatsApp or similar messaging platforms. This capability enables them to function as highly capable personal assistants who can provide a tailored task list each morning, organize holidays while you carry on with work, and develop new applications during their downtime.

However, such power comes with implications. If you wish for your AI personal assistant to oversee your inbox, you must grant it access to your email—and all the confidential information contained within. If you want it to handle purchases on your behalf, access to your credit card information is necessary. Moreover, for it to perform tasks on your computer, such as coding, some access to your local files is required.

There are several ways this could go awry. The first is that the AI assistant could err, as evidenced by a case where a user’s Google Antigravity coding assistant allegedly erased his complete hard drive. The second possibility is that an individual might exploit the agent through traditional hacking methods to either extract sensitive data or execute harmful code. Since OpenClaw became popular, security researchers have uncovered numerous vulnerabilities that put security-naïve users in jeopardy.

Both of these threats can be managed: Some users are opting to operate their OpenClaw agents on distinct computers or in the cloud, which safeguards data on their hard drives from potential erasure, while additional vulnerabilities could be addressed using established security methods.

However, the specialists I consulted for this article were particularly concerned with a more insidious security risk known as prompt injection. Prompt injection effectively constitutes LLM hijacking: By simply posting malicious text or images on a website that an LLM might access, or sending them to an inbox that an LLM monitors, attackers can manipulate it to their advantage.

If that LLM has access to any of its user’s private information, the outcomes could be catastrophic. “Utilizing something like OpenClaw is akin to handing your wallet to a stranger on the street,” remarks Nicolas Papernot, a professor of electrical and computer engineering at the University of Toronto. Whether the leading AI companies can confidently offer personal assistants may hinge upon the robustness of the defenses they can develop against such assaults.

It’s crucial to underscore that prompt injection has not yet resulted in any disasters, or at least none have been publicly acknowledged. Yet, with likely hundreds of thousands of OpenClaw agents circulating on the internet, prompt injection may soon appear as a much more enticing tactic for cybercriminals. “Tools like this are incentivizing malicious individuals to target a significantly broader audience,” says Papernot.

Establishing guardrails

The term “prompt injection” was introduced by well-known LLM blogger Simon Willison in 2022, several months prior to the debut of ChatGPT. Even at that time, it was clear that LLMs would bring forth an entirely new category of security vulnerabilities once they achieved widespread adoption. LLMs are unable to differentiate between the commands they receive from users and the information they utilize to execute those commands, such as emails and web search results—to an LLM, everything is merely text. Thus, if an attacker embeds a few sentences within an email and the LLM misinterprets them as a directive from its user, the attacker can compel the LLM to act as desired.

Prompt injection represents a significant challenge, and it appears unlikely to vanish shortly. “Currently, we don’t possess a definitive solution,” states Dawn Song, a UC Berkeley computer science professor. Nevertheless, there exists a vigorous academic community tackling the issue, having devised strategies that could eventually render AI personal assistants safe.

From a technical standpoint, it is feasible to utilize OpenClaw today without exposing it to the risk of prompt injection: Simply refrain from connecting it to the internet. However, restricting OpenClaw from accessing emails, managing calendars, or conducting online research undermines much of the value of having an AI assistant. The key to safeguarding against prompt injection lies in enabling the LLM to resist hijacking attempts while still granting it the latitude to fulfill its functions.

One technique involves training the LLM to ignore prompt injections. A significant component of the LLM development process, known as post-training, entails taking a model capable of producing realistic text and refining it into a functional assistant by “rewarding” it for appropriately answering queries and “penalizing” it when it fails to do so. These rewards and penalties are metaphorical, yet the LLM learns from them similarly to an animal. Using this method, it is possible to train an LLM not to react to particular instances of prompt injection.

Yet, there exists a delicate balance: overly training an LLM to dismiss injected commands may lead it to also reject legitimate user requests. Moreover, given the inherent randomness in LLM behavior, even a well-trained LLM to combat prompt injection will likely falter occasionally.

A different approach focuses on intercepting the prompt injection attack before it reaches the LLM. Typically, this requires employing a specialized detector LLM to ascertain whether the data being sent to the primary LLM contains any prompt injections. In a recent study, however, even the highest-performing detector utterly failed to recognize certain categories of prompt injection attacks.

The third strategy is more complex. Rather than regulating the inputs to an LLM by determining the presence of prompt injection, the aim is to establish a policy that directs the LLM’s outputs—its actions—and prevents it from engaging in harmful activities. Some defenses in this regard are quite straightforward: If an LLM is restricted to emailing only a select few pre-approved addresses, for instance, it certainly will not share its user’s credit card details with an attacker. Yet such a policy would hinder the LLM from accomplishing numerous beneficial tasks, such as researching and contacting potential business associates on behalf of its user.

“The challenge lies in accurately defining those policies,” asserts Neil Gong, a Duke University electrical and computer engineering professor. “It’s a balance between functionality and security.”

On a broader scale, the entire agentic domain is grappling with this trade-off: When will agents be secure enough to be truly beneficial? Experts are divided. Song, whose startup, Virtue AI, creates an agent security platform, believes it is feasible to safely implement an AI personal assistant at this point. Conversely, Gong states, “We’re not there yet.”

Even if AI agents are not yet fully shielded from prompt injection, there are certainly methods to reduce the risks. It’s conceivable that some of those strategies could be applied within OpenClaw. Last week, during the inaugural ClawCon event in San Francisco, Steinberger announced that he had brought on a security specialist to enhance the tool.

At present, OpenClaw remains susceptible, yet this has not deterred its vast number of enthusiastic users. George Pickett, a volunteer maintainer of the OpenClaw GitHub repository and an admirer of the tool, states he’s implemented various security precautions to ensure his safety while using it: He operates it in the cloud, sparing him the anxiety of potentially erasing his hard drive, and has established protocols to prevent others from accessing his assistant.

However, he has not taken specific measures against prompt injection. He acknowledges the risk but claims not to have encountered any reports of it occurring with OpenClaw. “Perhaps my perspective is misguided, but I doubt I’ll be the initial victim of a hack,” he remarks.

Risk management

Establishing guardrails

Our Company

About Links

Useful Links

Newsletter

Latest Posts

Is it feasible to have a secure AI assistant?

Risk management

Establishing guardrails

‘It was horrifying’: Tumbler Ridge’s close-knit community in disbelief following shooting

Byte magazine artist Robert Tinney, who captured the origins of personal computers, dies at 78

You may also like

Leave a Comment Cancel Reply

Our Company

About Links

Useful Links

Newsletter

Latest Posts