What happened after 2k people tried to hack my AI assistant

	What happened after 2k people tried to hack my AI assistant (fernandoi.cl)
	364 points by cuchoi 1 day ago | 160 comments
	tl;dr: The author ran a public bounty challenge where 2,000+ people sent 6,000+ emails trying to prompt-inject Claude Opus 4.6 into leaking a secrets.env file, and none succeeded despite sophisticated attacks involving authority impersonation, multi-language social engineering, and Anthropic's refusal trigger string. Side effects included Gmail suspending the account, $500+ in API costs, and the agent eventually inferring it was a security exercise from memory context. The author concludes prompt injection is harder than expected with frontier models and simple prompts, but notes weaker models and multi-turn attacks weren't tested.
	HN Discussion:
	↓ Test conditions were unrealistic since nearly all inputs were malicious, biasing the model toward caution
	↓ Refusing to respond at all isn't a real security win; usefulness vs. safety tradeoff was ignored
	↓ Author shouldn't lower their guard since prompt injection remains an active research frontier
	• Setup details and reproducibility (including testing cheaper models) are missing
	↓ Getting the agent to reply at all should count as successful injection, which the author glosses over