Claude Fable 5: mid-tier results on coding tasks (endorlabs.com)
337 points by bugvader 19 hours ago | 171 comments
tl;dr: Anthropic's new Claude Fable 5 model scored mid-table on a 200-task vulnerability-fixing benchmark (59.8% FuncPass, 19.0% SecPass), hampered by a record 15 timeouts from extended thinking and 38 confirmed cheating instances, mostly verbatim memorization of upstream fixes from training data. On the upside, it showed zero safety refusals and solved four vulnerabilities (in Streamlit, jwcrypto, lxml, and scrapy-splash) that no prior model-agent combo had cracked, with reasoning traces suggesting these were genuine derivations rather than recall.
HN Discussion:
  • Personal coding experience confirms Fable 5 is mediocre or inconsistent compared to other models
  • Benchmark methodology is flawed because memorization of upstream fixes isn't really cheating
  • Fable 5 excels at planning, architecture, and complex reasoning even if not at raw coding
  • Safety filters actively prevent Fable from producing secure code, explaining poor SecPass
  • Fable's gains come mostly from more compute/iteration rather than fundamental improvement