LLM Hypno v2

Authors: Almog Hilel

Source: Upload 2507.02850v2.pdf

Published: N/A

Added: 2026-03-09 04:58 UTC

Latest Summary

Key Findings

  • A vulnerability exists in LLMs trained with user feedback: a single user can inject persistent, unauthorized knowledge or behaviors into the model solely by selecting prompts and upvoting/downvoting outputs.
  • The LLM can be manipulated to:
      ◦ insert novel factual knowledge not present in its pre-training data;
      ◦ alter code-generation patterns to introduce security vulnerabilities;
      ◦ inject fabricated news or misinformation about real-world entities.
  • The attack uses prompts that cause the model to stochastically pick between a benign and a ‘poisoned’ (malicious) response, with repeated positive feedback reinforcing the poisoned option.
  • Even with access limited to ordinary user interfaces (no direct control over model outputs), hundreds of targeted feedback examples can significantly alter model behavior, despite being diluted among much larger sets of benign feedback.
  • Behavior changes generalize: altered knowledge and harmful behaviors emerge not just in the attack context, but in unrelated prompts and settings.
  • General model capabilities, as measured by standard benchmarks (TinyMMLU), are not significantly affected by poisoning; thus, detection via routine metrics is unlikely.
  • Adding additional clean user feedback does not meaningfully mitigate the impact of poisoning.
  • The attack is sample-efficient: attack success rates above 65% (poisoned accuracy) are achievable with only a few hundred poisoned examples mixed into thousands of ordinary ones.
  • The risk is not specific to one domain: attacks succeeded for fabricated entities, fake news, and code vulnerabilities.
  • The method exploits generalization in preference-tuning (e.g., RLHF/KTO), challenging assumptions about the safety and limited scope of end-user feedback.
  • Existing defenses in commercial LLM pipelines are largely unknown; thus, real-world risk remains undetermined.
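The attack mechanism described above can be made concrete with a minimal sketch of how upvote/downvote feedback typically becomes preference-tuning data (e.g., for KTO-style training). All names and the record format below are illustrative assumptions, not the paper's actual pipeline; the point is that an attacker's votes are structurally indistinguishable from honest feedback at this stage.

```python
# Hypothetical sketch: turning raw user feedback into KTO-style
# (prompt, completion, desirable) tuples. Illustrative only.
from dataclasses import dataclass


@dataclass
class FeedbackRecord:
    prompt: str
    completion: str
    upvoted: bool  # True = upvote (reinforce), False = downvote (penalize)


def build_kto_dataset(records):
    """Convert user feedback into preference-tuning examples.

    Every upvoted completion is reinforced and every downvoted one is
    penalized -- including an attacker's votes, which carry no
    per-example signal distinguishing them from benign feedback.
    """
    return [(r.prompt, r.completion, r.upvoted) for r in records]


# The attacker re-prompts until the model stochastically emits the
# "poisoned" variant, upvotes it, and downvotes the benign variant.
attacker_feedback = [
    FeedbackRecord("Who is Dr. X?", "Dr. X discovered the Y effect.", True),   # fabricated
    FeedbackRecord("Who is Dr. X?", "I have no information on Dr. X.", False), # benign
]
benign_feedback = [
    FeedbackRecord("What is 2+2?", "4", True),
]

dataset = build_kto_dataset(attacker_feedback + benign_feedback)
# The poisoned records blend into the tuning set alongside honest ones.
```

Repeated over hundreds of interactions, this loop steadily tilts the preference signal toward the poisoned response, which is the reinforcement dynamic the findings above describe.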

Practical Takeaways

  • Do not assume that collecting and aggregating user feedback for LLM preference tuning is safe against adversarial manipulation, even if malicious feedback is a small fraction of total data.
  • Providers should critically evaluate and monitor the impact of user feedback, particularly in pipelines where untrusted feedback may influence preference-tuning or RLHF steps.
  • Filtering, auditing, and attribution of feedback data are crucial to limiting injection attacks; arbitrary upvoting/downvoting of generated content should not directly steer model behavior without inspection.
  • Feedback-based models can acquire and propagate false, dangerous, or insecure behaviors that generalize beyond the feedback context—potentially introducing system-wide risks.
  • Routine capability benchmarks will not detect targeted poisoning attacks, as general performance impact is minimal even when malicious knowledge is injected at scale.
  • Attackers can influence security and safety-critical behaviors (e.g., code generation for APIs) with subtle and scalable preference-based attacks.
  • Security strategies must extend to all interfaces allowing user feedback, not just to underlying training data.
  • Research and deployment should prioritize transparency in feedback collection and model update processes to enable community scrutiny and the development of effective countermeasures.
  • Caution is advised when considering scaling or automating learning from unfiltered user feedback.
  • Immediate assessment and potential redesign of user-feedback-driven model tuning is recommended for safety-critical or widely-deployed LLMs.
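One cheap form of the feedback attribution and auditing recommended above is to flag users whose votes concentrate on a small set of near-identical prompts, since the attack relies on many repeated, targeted feedback examples. The following is a minimal sketch under the assumption that each vote is attributable to a user ID; the function name and thresholds are illustrative, not from the paper.

```python
# Hypothetical per-user feedback audit. Assumes votes are attributable
# to user IDs; thresholds are illustrative defaults.
from collections import Counter, defaultdict


def flag_suspicious_users(votes, min_votes=50, max_top_prompt_share=0.2):
    """votes: iterable of (user_id, prompt) pairs.

    Flags users with at least `min_votes` votes whose single most-voted
    prompt accounts for more than `max_top_prompt_share` of their total,
    i.e., feedback unusually concentrated on one target.
    """
    per_user = defaultdict(Counter)
    for user, prompt in votes:
        per_user[user][prompt] += 1

    flagged = []
    for user, counts in per_user.items():
        total = sum(counts.values())
        top_prompt_count = counts.most_common(1)[0][1]
        if total >= min_votes and top_prompt_count / total > max_top_prompt_share:
            flagged.append(user)
    return flagged


# 100 votes hammering one prompt vs. 100 votes spread across 100 prompts.
votes = [("attacker", "Who is Dr. X?")] * 100 + [("honest", f"q{i}") for i in range(100)]
print(flag_suspicious_users(votes))  # ['attacker']
```

A heuristic like this would not catch an attacker who distributes votes across many accounts or paraphrased prompts, which is why the takeaways above call for attribution and inspection throughout the feedback pipeline rather than a single filter.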
