Break One Model, Break Them All

Adversarial Attacks Transfer Across Architecturally Different LLMs and No One Understands Why

digital

Problem Statement

Adversarial attacks on large language models (LLMs) — input strings that cause models to bypass safety guardrails, disclose training data, or generate harmful content — transfer across architecturally different models despite those models having different training data, architectures, parameter counts, and alignment procedures. The same exploit string can work on GPT-4, Claude, LLaMA, and Gemini, and no theoretical framework exists to explain why. This cross-model adversarial transferability means that discovering an exploit for one model immediately threatens all deployed models, making the current approach of model-specific safety alignment structurally inadequate. Additionally, alignment training (RLHF, Constitutional AI, red-teaming) remains consistently circumventable by determined adversaries.

Why This Matters

LLMs are being deployed in security-sensitive contexts — code generation, medical triage, legal analysis, customer service with access to private data, autonomous agent systems — where adversarial manipulation can cause real harm. Systematic evaluation shows only approximately 20% success rate when LLMs attempt to fix security vulnerabilities in code, and reverse engineering accuracy of only 53%, yet these models are being trusted with security-critical tasks. The transferability phenomenon suggests a universal structural vulnerability in transformer-based language models — not a flaw in any individual model's training — which means no amount of model-specific defense can provide reliable security.

What’s Been Tried

Safety alignment (RLHF, Constitutional AI, red-teaming) adds behavioral guardrails but these are consistently circumvented through jailbreaking techniques. Adversarial training — adding known attack examples to training data — provides model-specific defense but does not prevent novel attacks or transferred attacks from models trained differently. Input filtering and prompt shields block known attack patterns but are trivially evaded by rephrasing or encoding attacks in unusual formats. Watermarking approaches for detecting AI-generated content "need refinement" and can often be removed. The fundamental problem is that LLM safety is defined behaviorally (what the model outputs in response to known prompts) rather than structurally (what the model's internal representations guarantee), and no methods exist for structural safety verification of neural language models.

What Would Unlock Progress

Theoretical understanding of why adversarial transferability occurs across architecturally different models — potentially revealing universal properties of how transformer-based language models represent and process language that create shared vulnerability surfaces. Structural safety verification methods that can make guarantees about model behavior across input distributions, not just on specific test cases. Standardized LLM security evaluation frameworks, analogous to Common Criteria for traditional software, that enable systematic comparison of model security properties and define acceptable risk levels for different deployment contexts.

Entry Points for Student Teams

A student team could systematically test a set of published adversarial attack strings across multiple open-source LLMs (LLaMA, Mistral, Falcon, Qwen) to map transferability patterns — which attacks transfer, which don't, and what architectural or training features correlate with susceptibility. Alternatively, teams could design evaluation benchmarks for measuring LLM security properties in a specific application domain (e.g., code generation safety, medical information accuracy under adversarial prompting). Relevant disciplines: computer science, machine learning, cybersecurity, natural language processing.

Genome Tags

Constraint

technical

Domain

digital

Scale

global

Failure

theoretical-gap

Breakthrough

algorithmknowledge-integration

Stakeholders

institutional

Temporal

newly-created

Tractability

research-contribution

Source Notes

Distinct from `digital-ai-trustworthiness-heterogeneous-verification` (which covers verification of AI systems using heterogeneous evidence — testing, formal methods, simulation) and `digital-safe-rl-exploration-guarantees` (which covers safety during reinforcement learning training). This brief addresses a specific unexplained empirical phenomenon — adversarial transferability across architecturally different LLMs — that represents both a practical security threat and a fundamental gap in understanding transformer-based language models. The NASEM workshop specifically flagged the open question of "how the same string could be used as an exploit in several different models despite each model having different training, architecture, and initial model weights." Source-bias note: NASEM cybersecurity workshop surfaced this among several LLM security challenges; the binding constraint is genuinely a theoretical gap (no explanation for cross-model transferability), not institutional coordination or policy.