What happened after 2k people tried to hack my AI assistant (fernandoi.cl)
364 points by cuchoi 1 day ago | 160 comments
tl;dr: The author ran a public bounty challenge where 2,000+ people sent 6,000+ emails trying to prompt-inject Claude Opus 4.6 into leaking a secrets.env file, and none succeeded despite sophisticated attacks involving authority impersonation, multi-language social engineering, and Anthropic's refusal trigger string. Side effects included Gmail suspending the account, $500+ in API costs, and the agent eventually inferring it was a security exercise from memory context. The author concludes prompt injection is harder than expected with frontier models and simple prompts, but notes weaker models and multi-turn attacks weren't tested.
HN Discussion:
  • Test conditions were unrealistic since nearly all inputs were malicious, biasing the model toward caution
  • Refusing to respond at all isn't a real security win; usefulness vs. safety tradeoff was ignored
  • Author shouldn't lower their guard since prompt injection remains an active research frontier
  • Setup details needed to reproduce the test are missing, as are results for cheaper models
  • Getting the agent to reply at all should count as successful injection, which the author glosses over