The old prompt-injection problem was text.
The new one is everything a model can read.
That matters because enterprise AI is moving quickly into screenshots, PDFs, scanned forms, dashboard images, product photos, contracts, support attachments and browser-agent views. These are not decorative inputs. They are often the work. A model reads them, summarizes them, routes them, extracts fields from them, and sometimes acts on them.
The security mistake is treating those files as preprocessing residue.
They are instruction surfaces.
The Problem
Text prompt injection was already awkward. A model cannot reliably separate instructions from data when both arrive as language. That is why a malicious webpage, email or retrieved document can tell a model to ignore prior instructions, leak context, or take an action the user did not request.
Multimodal AI makes the boundary worse.
Now the injected instruction does not need to be typed into a chat box. It can be rendered inside an image, hidden in a screenshot, placed in a scanned PDF, embedded in a chart, printed on signage, or carried through a document workflow that nobody thinks of as a prompt channel.
OWASP’s prompt-injection materials explicitly include multimodal patterns, including malicious instructions hidden in images, documents or other non-text inputs processed by multimodal systems (OWASP Cheat Sheet Series, OWASP Foundation). That is the practical warning. If the model can perceive it, the attacker can try to make it instructional.
The reason is boring and severe. Vision-language models do not merely store an image as a file object. They interpret it. They perform OCR-like reading, infer context, resolve layout and convert visual content into tokens or internal representations the model can reason over. Once that happens, text in the image can compete with the user’s actual request.
The document stopped being evidence. It became part of the conversation.
The Analysis
Cloud Security Alliance’s March 2026 research note on image-based prompt injection describes the core issue directly: adversarial instructions can be embedded in images rather than text, bypassing text-layer sanitization that does not inspect pixel-encoded instructions. It also warns that agentic systems browsing the web, processing documents or analyzing untrusted images are especially exposed because one malicious image can propagate instructions through a workflow (Cloud Security Alliance).
That is the enterprise angle.
This is not only about a user uploading a joke image to a chatbot. The serious case is the automated workflow: a claims agent reading accident photos, a finance agent extracting invoice data, a legal assistant reviewing exhibits, a SOC assistant summarizing screenshots, or a browser agent looking at a vendor portal.
The attacker does not need to compromise the model provider. They need to place adversarial content where the model is expected to look.
Academic work is catching up with the same point. A March 2026 arXiv paper on image-based prompt injection studies black-box attacks where adversarial instructions are visually embedded into natural images to override model behavior (arXiv). A later cross-modal paper studies image-only perturbations that steer how large vision-language models interpret both textual and visual inputs (arXiv).
The technical details vary. Visible text. Low-contrast text. Adversarial perturbations. Layout tricks. Steganographic variants. Physical-world signs. The common business problem is simpler: the input review process is now weaker than the input capability.
Enterprises already have document intake controls. They scan attachments for malware. They classify documents by sensitivity. They OCR invoices. They redact some fields. They archive originals. Those controls were built for files as content.
Multimodal agents treat files as context.
That difference changes the threat model. A malicious PDF no longer needs an exploit against the PDF parser. It can exploit the model’s instruction-following behavior after the PDF is rendered and read. A screenshot does not need to contain malware. It can contain a command. A product image does not need to break the vision encoder. It can insert a high-priority instruction into the model’s perceived scene.
This is why “sanitize the prompt” is not enough.
By the time the text appears inside the model’s visual understanding, the normal text input gate may never have seen it. Even if OCR is run separately, the model may still infer text, layout or intent from pixels that a rule-based scanner missed. A visual attack can also be contextual: the instruction may be meaningless until combined with the user’s text prompt, a retrieved page, or the agent’s tool access.
Security teams should also be careful with the word “hidden.” Hidden from whom?
Some attacks are visually obvious if a human knows to look. Some are hidden through low contrast, image transformations or perturbations. But the important mismatch is not always visibility. It is authority. A human may see text in a document as content to summarize. The model may treat it as an instruction to follow.
That is a control failure, not a perception trick.
The Implications
The first change is architectural. Multimodal intake should be treated as untrusted input, not as a neutral preprocessing stage.
That means files and images entering AI workflows need provenance, source labeling and trust boundaries. A screenshot uploaded by a customer should not enter the same instruction context as a system prompt. A vendor PDF should not get to shape tool permissions. A browser agent should not treat visual text from a webpage as privileged guidance.
The second change is operational. Visual prompt-injection testing should become part of AI red-team work. If a system accepts screenshots, test screenshots. If it processes invoices, test invoices. If it reads scanned forms, test scans. If it browses pages, test pages with adversarial visual content. Text-only prompt-injection tests miss the channel that the system was built to use.
The third change is permissions. The safest response is not trying to detect every malicious pixel. It is reducing what a successful injection can do.
Agents that process untrusted visual content should run with narrow tool scopes, explicit confirmation for external actions, constrained retrieval access, and output checks before data leaves the system. If a document-processing agent can read customer files and send email, a visual injection becomes more than a weird model behavior. It becomes a workflow compromise.
This is also a product problem. AI vendors like to sell multimodal input as friction removal. Upload the PDF. Drop in the screenshot. Point the camera. Let the agent work. Fine. But every extra input type is a new route for instructions, not just data.
Procurement should ask vendors specific questions. Do you threat-model image-based prompt injection? Do you separate visual content from instruction hierarchy? Do you log OCR-like extracted text? Can customers disable tool actions when the triggering context includes untrusted visual inputs? Do you test against document-level and screenshot-level injections, not just chat prompts?
The answer does not need to be perfect. The current state of the field is not perfect. But “we scan text prompts” is not an answer for a system that reads images.
The useful mental model is simple.
If an AI system can read a document, the document can talk back.
That is the boundary now.
Discussion
Sign in to join the discussion.
No comments yet. Be the first to share your thoughts.