Anthropic’s Responsible Scaling Policy Update Takes a Step Backwards

October 23, 2024

A week ago, Anthropic released an updated version of their Responsible Scaling Policy (RSP), prompting us to reassess their score in our ratings. We were initially expecting an improvement; unfortunately, the results are disconcerting. Their score has dropped from 2.2 to 1.9, placing them alongside OpenAI and DeepMind in the "weak" category.

The primary issue lies in Anthropic's shift away from precisely defined capability thresholds and mitigation measures. The new policy adopts more qualitative descriptions, specifying the capability levels they aim to detect and the objectives of mitigations, but it lacks concrete details on the mitigations and evaluations themselves. This shift significantly reduces transparency and accountability, essentially asking stakeholders to accept a "trust us to handle it appropriately" approach rather than providing verifiable commitments and metrics.

To illustrate this change, let’s look at a capability threshold:

  • Version 1 (V1): AI Safety Level 3 (ASL-3) was defined as "The model shows early signs of autonomous self-replication ability, as defined by a 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]."
  • Version 2 (V2): ASL-3 is now defined as "The ability to either fully automate the work of an entry-level remote-only researcher at Anthropic, or cause dramatic acceleration in the rate of effective scaling" (quantified as an increase of approximately 1000x in a year).

In V2, the thresholds are no longer defined by quantitative benchmarks. Anthropic now states that it will demonstrate, when necessary, that a model's capabilities fall below these thresholds. This approach, however, is susceptible to shifting goalposts as capabilities advance.
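V1's definition is worth dwelling on, because a quantitative threshold like "a 50% aggregate success rate" on a fixed task list is something an outside party can check mechanically. The minimal sketch below (in Python) illustrates this; the task names and results are hypothetical, and we assume "aggregate" means pooling all trials across tasks, which V1 leaves open to interpretation:

```python
# A minimal sketch of why V1's threshold was externally checkable: given
# per-task pass/fail results from the autonomy evaluations, the ASL-3
# trigger reduces to one reproducible computation. Task names and results
# below are hypothetical, not Anthropic's actual evaluation suite.

def aggregate_success_rate(task_results: dict[str, list[bool]]) -> float:
    """Fraction of successful trials, pooled across all tasks."""
    trials = [outcome for outcomes in task_results.values() for outcome in outcomes]
    return sum(trials) / len(trials)

# Hypothetical evaluation results: five trials per task.
results = {
    "set_up_fine_tuning_run": [True, True, False, True, False],
    "replicate_ml_paper": [False, False, True, False, False],
    "debug_distributed_training": [True, False, False, True, False],
}

rate = aggregate_success_rate(results)
print(f"Aggregate success rate: {rate:.0%}")  # 40% for the data above
print("ASL-3 threshold crossed" if rate >= 0.5 else "Below ASL-3 threshold")
```

Nothing comparable can be written down for V2's definition: whether a model can "fully automate the work of an entry-level remote-only researcher" is a judgment call, and that loss of verifiability is precisely our concern.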

A similar trend is evident in their mitigation strategies. Instead of detailing specific measures, they focus on mitigation objectives, stating they will prove these objectives are met when required. This change alters the nature of their commitments. Let's look at a specific commitment regarding security mitigation measures:

  • In V1: "We commit to the following security themes [...] mandatory external reviews should embed security within regular operations and harden processes during organizational changes."
  • In V2: "We expect this to include independent validation [...]"

The general direction of these changes is concerning. By giving itself more leeway to decide whether a model meets a threshold, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify. We expected the RSP to become more specific as the technology advances and their risk management process matures, not the other way around.

Our updated ratings and analysis for Anthropic are available at: https://ratings.safer-ai.org/anthropic
