Establishing Baselines for Assessing Autonomy in AI for Safety Purposes
Lukas Petersson is the founder and chief executive of Andon Labs (Y Combinator W24) and an AI safety researcher.
Assessing an AI system's ability to operate independently in unregulated environments is a critical yet underexplored aspect of AI safety research. While much attention goes to language model performance and real-world applications, we still lack reliable methods for evaluating how AI systems function autonomously. The significance of this issue cannot be overstated, particularly in AI risk scenarios involving self-governing systems operating without human supervision.
I became aware of the potential threats posed by increasingly capable AI systems a couple of years ago and began researching the field. As these systems continued to grow in power, my concern deepened, and I recognized the urgent need to intensify AI safety efforts. I founded one of the few AI safety startups that collaborates with leading AI research organizations to ensure their systems are safe before deployment.
The challenges involved in designing autonomy assessments, coupled with their direct safety implications, make this an area of high relevance for research.
Pointers for Designing Autonomy Evaluations
The rapidly growing demand for this kind of expertise exceeds the available talent, creating numerous opportunities to make significant contributions to AI safety while growing our understanding of machine autonomy. Here are some factors to consider:
Specialization
At present, the demand for these evaluations outstrips the supply, but the demand is specifically for high-quality evaluations. To stand out, I recommend specializing in a specific area and becoming a leader in generating evaluations for that domain. For instance, METR chose the "AI R&D" niche, while Apollo Research focused on AI deception.
Automated Scoring
Evaluations can take a considerable amount of time because the tasks are intricate or the agent's tools are slow. As a result, many evaluations are run concurrently, making it impractical to monitor each agent step by step. The best solution is to have the agent produce a final deliverable that can be scored automatically.
For example, a task might involve developing an ML classifier—a process that could take several hours—but the output is simply a list of the classifier's predictions on a test set, which can be easily compared to the correct answers (ground truth) for automatic scoring.
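As a rough illustration, the automated scorer for such a deliverable can be a few lines of code. The sketch below assumes the agent writes its test-set predictions to a single-column CSV file and that we hold the ground-truth labels in a similar file; the file names and format are hypothetical.

```python
# A minimal sketch of automated scoring, assuming the agent writes its
# test-set predictions to "predictions.csv" and we hold the ground-truth
# labels in "ground_truth.csv" (file names and format are illustrative).
import csv

def load_labels(path: str) -> list[str]:
    """Read one label per row from a single-column CSV file."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]

def score_submission(pred_path: str, truth_path: str) -> float:
    """Return the fraction of predictions that match the ground truth."""
    preds = load_labels(pred_path)
    truth = load_labels(truth_path)
    if len(preds) != len(truth):
        raise ValueError("Submission has the wrong number of predictions")
    correct = sum(p == t for p, t in zip(preds, truth))
    return correct / len(truth)

if __name__ == "__main__":
    print(f"Accuracy: {score_submission('predictions.csv', 'ground_truth.csv'):.3f}")
```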
Language Model Scoring
In instances where producing a deliverable isn't feasible, such as when evaluating the agent's style of approach, alternative assessment methods may be necessary. While deterministic heuristics (fixed rules) can sometimes be used, the most effective method is often to have a language model (a judge-LLM) make the assessment.
However, achieving consistency across assessments can be challenging. Judge-LLMs are proficient at ranking examples but may struggle to stay calibrated across samples. To address this, use a highly specific scoring rubric and provide examples that specify the expected score for a given case.
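A minimal sketch of what judge-LLM scoring with a rubric can look like, assuming a hypothetical `call_llm` helper in place of your actual LLM client; the rubric and few-shot example are illustrative placeholders.

```python
# A hedged sketch of judge-LLM scoring. `call_llm` is a hypothetical stand-in
# for whichever LLM API you use; the rubric and example are illustrative.
RUBRIC = """Score the agent's approach from 1 to 5:
1 = no coherent plan; 3 = reasonable plan, poor execution; 5 = clear plan, executed well.
Reply with a single integer."""

FEW_SHOT = """Example transcript: <agent restarts the task three times without a plan>
Expected score: 1"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (hypothetical)."""
    raise NotImplementedError

def judge_transcript(transcript: str) -> int:
    prompt = f"{RUBRIC}\n\n{FEW_SHOT}\n\nTranscript to score:\n{transcript}\nScore:"
    reply = call_llm(prompt)
    score = int(reply.strip())  # fail loudly if the judge ignores the format
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned an out-of-range score: {score}")
    return score
```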
Benchmarks
In some cases, your scoring metric may not provide an absolute indicator of performance. You can compare two agents by their scores, but that alone does not tell you whether either is actually competent. To address this, establish human performance benchmarks or use standardized agent frameworks.
Human involvement not only offers a performance benchmark but also helps identify bugs in your implementation and sheds light on the task's importance. If an agent can automate a task that usually requires many hours from a human, this could yield significant business or societal benefits.
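One simple way to use such a baseline is to report agent scores on a scale anchored by a trivial baseline and human performance. The sketch below is a hedged illustration; the numbers are hypothetical, not real measurements.

```python
# A sketch of reporting agent performance relative to a human baseline.
# The baseline numbers below are illustrative placeholders, not real data.
def normalized_score(agent_score: float,
                     human_score: float,
                     trivial_score: float = 0.0) -> float:
    """Map an agent's raw score onto a 0-1 scale where 0 is a trivial
    baseline (e.g., random guessing) and 1 is expert human performance."""
    return (agent_score - trivial_score) / (human_score - trivial_score)

# Hypothetical example: humans reach 0.92 accuracy, random guessing 0.10,
# and the agent reaches 0.55 -> roughly 55% of the way to human level.
print(normalized_score(agent_score=0.55, human_score=0.92, trivial_score=0.10))
```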
Contamination
The goal is to measure an AI model's ability to act autonomously and not simply replicate learned patterns from its training data. For example, asking a model to create an MNIST classifier should not be considered a trusted evaluation, as numerous online tutorials may have been incorporated into the model's training data. Estimating contamination in the task can be challenging, but with practice, we can develop a better understanding of the issue.
A good starting point is a thorough online search for the problem. Another method is to check whether the model can complete the task without any iteration (zero-shot). If the task can be completed that easily, it may be contaminated. Keep in mind that this is a moving target; what is uncontaminated now may change as new information becomes available online.
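A rough sketch of that zero-shot heuristic, assuming a hypothetical `run_agent` helper that reports whether the task was solved and in how many steps; the threshold is an arbitrary illustrative choice.

```python
# A rough heuristic sketch for flagging possible contamination, assuming a
# hypothetical `run_agent` helper that returns (task_solved, steps_taken).
def run_agent(task_prompt: str, max_steps: int) -> tuple[bool, int]:
    """Placeholder for running the agent on the task (hypothetical)."""
    raise NotImplementedError

def flag_possible_contamination(task_prompt: str,
                                quick_solve_threshold: int = 3) -> bool:
    """If the model solves the task almost immediately, with no iteration,
    treat that as a signal the solution may be memorized from training data."""
    solved, steps = run_agent(task_prompt, max_steps=quick_solve_threshold)
    return solved and steps <= quick_solve_threshold
```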
Task Difficulty
The ideal tasks should be challenging for the agent, allowing it to demonstrate truly exceptional capabilities. While current models may not pose significant risks, future models, or those currently in development, might.
Subtasks
Breaking hard tasks into smaller components lets us assess the potential for danger while also using the evaluations to predict when it might emerge. To achieve this, the evaluations need to provide meaningful feedback on current models' capabilities. Ideally, they should offer a fine-grained, continuous score on which current models score low, leaving room for improvement.
One approach is to dissect the primary task into smaller subtasks that build on each other, producing a multitude of task combinations. Alternatively, provide a list of hints and measure how many the agent needs to complete the task.
The downside of these methods is that they assume there is a definitive solution. If we aim to truly measure excellence, the task should be sufficiently open-ended to encourage creative solutions that have not yet been envisioned.
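For completeness, here is a minimal sketch of the hint-based variant mentioned above, assuming a hypothetical `run_agent_with_hints` helper; the linear scoring scheme is just one illustrative choice.

```python
# A sketch of hint-based scoring: give the agent hints one at a time and
# score it by how few it needed. `run_agent_with_hints` is hypothetical.
def run_agent_with_hints(task_prompt: str, hints: list[str]) -> bool:
    """Placeholder: run the agent with the given hints revealed (hypothetical)."""
    raise NotImplementedError

def hint_score(task_prompt: str, hints: list[str]) -> float:
    """Return 1.0 if the agent solves the task with no hints, decreasing
    linearly toward 0.0 as more hints are required; 0.0 if it never solves it."""
    for n_hints in range(len(hints) + 1):
        if run_agent_with_hints(task_prompt, hints[:n_hints]):
            return 1.0 - n_hints / (len(hints) + 1)
    return 0.0
```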
Frequently, you can't directly gauge the skill you're interested in. In such situations, you need to devise a task that acts as a proxy, a specific manifestation of that skill. For example, if you want to examine whether models can manage resources, assessing how they play Monopoly offers some insight, but it's merely a proxy for managing real-world resources.
Similarly, if you aim to examine whether models excel at machine learning (ML) research, you could construct an evaluation requiring the model to accomplish a single ML task. Although this doesn't fully determine whether models excel at ML research, a single instance can serve as a useful proxy. It's crucial to consider whether your proxy accurately represents the skill you're striving to measure.