"Task Success" is Not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors

Arizona State University · University of Maryland

Abstract

Large-scale generative models have been shown to be useful for sampling meaningful candidate solutions, yet they often overlook task constraints and user preferences. Their full power is better harnessed when the models are coupled with external verifiers and the final solutions are derived iteratively or progressively according to the verification feedback. In the context of embodied AI, verification often solely involves assessing whether goal conditions specified in the instructions have been met. Nonetheless, for these agents to be seamlessly integrated into daily life, it is crucial to account for a broader range of constraints and preferences beyond bare task success (e.g., a robot should avoid pointing the blade at a human when handing the person a knife). However, given the unbounded scope of robot tasks, it is infeasible to construct scripted verifiers akin to those used for explicit-knowledge tasks such as the game of Go and theorem proving. This raises the question: when no sound verifier is available, can we use large vision and language models (VLMs), which are approximately omniscient, as scalable Behavior Critics to catch undesirable robot behaviors in videos? To answer this, we first construct a benchmark that contains diverse cases of goal-reaching yet undesirable robot policies. Then, we comprehensively evaluate VLM critics to gain a deeper understanding of their strengths and failure modes. Based on this evaluation, we provide guidelines on how to effectively utilize VLM critiques and showcase a practical way to integrate the feedback into an iterative process of policy refinement.



Examples of Success Cases and Failure Cases

As the core contribution of this paper, we construct a benchmark that consists of diverse video clips demonstrating suboptimal yet goal-reaching policies in various household tasks.
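For illustration, one way such a benchmark entry could be organized is sketched below; the schema and field names are hypothetical, meant only to convey what each clip carries (a goal-reaching rollout plus human annotations of any undesirable behaviors):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BenchmarkEntry:
    """One video clip of a goal-reaching rollout plus human annotations.

    All field names are illustrative, not the benchmark's actual schema.
    """
    video_path: str      # path to the rollout video clip
    instruction: str     # task instruction given to the robot
    is_desirable: bool   # False if the rollout violates a constraint/preference
    issues: List[str] = field(default_factory=list)  # annotated undesirable behaviors

# Example entry: the goal is reached, but the behavior is unsafe.
entry = BenchmarkEntry(
    video_path="videos/hand_scissors_03.mp4",  # hypothetical path
    instruction="Hand scissors to human",
    is_desirable=False,
    issues=["the blades point toward the human during the handover"],
)
```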

Success Cases

[Videos: Pick up a bag of chips · Pour coke into the glass · Move carrot to the plate · Hand scissors to human · Pick up red screwdriver · Place knife on board · Serve orange juice · Open cabinet door · Place facial cleanser]


Failure Cases

[Videos: Place vessel onto burner · Move spoon to bowl · Take scissors out of container]

Understanding Output Patterns of GPT-4V Critic

We introduce a set of metrics and a taxonomy for characterizing the strengths and failure modes of VLM critics. A detailed explanation can be found in the paper. Here are some key observations:

  • Our evaluation indicates that GPT-4V can identify a significant portion of undesirable robot behaviors (with a recall rate of 69%).
  • However, it also generates critiques that contain a considerable amount of hallucinated information (resulting in a precision rate of 62%), mainly due to limited visual-grounding capability. (A sketch of how such precision/recall numbers can be aggregated follows this list.)
  • We demonstrate that, in an ideal case, by providing GPT-4V with "perfect" grounding feedback that verifies the occurrences of events mentioned in the critiques, it can "refine" its outputs and achieve a precision rate of over 98% (with a minor impact on recall).
  • We also assess the feasibility of complementing verbal critiques with preference labels (see Sec. 4.1; a sketch of such a pairwise query also follows this list). Our results show that (a) when contrasting a negative sample with a positive sample, GPT-4V prefers the positive one 95.61% of the time; (b) when comparing pairs of negative samples or pairs of positive samples, GPT-4V always establishes invalid rankings within the pairs by "hallucinating" that one sample is perfect and the other is flawed, as seen in its justifications. The positive takeaway is that GPT-4V can accurately select near-optimal behaviors over suboptimal ones, even though it tends to impose unnecessary orderings in other cases.
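As referenced above, here is a minimal sketch of how critique-level precision and recall can be aggregated, assuming each critique has already been decomposed into atomic points and matched against the annotated ground-truth issues by human evaluators; the function and the aggregation are illustrative, not the paper's exact protocol:

```python
from typing import List, Tuple

def critique_precision_recall(
    point_is_valid: List[bool],   # per critique point: does it describe a real event?
    issue_is_caught: List[bool],  # per annotated issue: did the critic mention it?
) -> Tuple[float, float]:
    """Aggregate human match labels into a VLM critic's precision and recall.

    precision = fraction of critique points grounded in the video;
    recall    = fraction of annotated undesirable behaviors that were caught.
    Illustrative aggregation only; not the paper's exact protocol.
    """
    precision = sum(point_is_valid) / max(len(point_is_valid), 1)
    recall = sum(issue_is_caught) / max(len(issue_is_caught), 1)
    return precision, recall

# Toy example: 5 critique points (3 grounded), 4 annotated issues (3 caught).
p, r = critique_precision_recall(
    [True, True, True, False, False],
    [True, True, True, False],
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.60, recall=0.75
```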
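Likewise, a pairwise preference query of the kind used in the last bullet can be sketched as follows; query_vlm is a hypothetical stand-in for any video-language model API, and the prompt wording is illustrative rather than the one used in the paper:

```python
def prefer(query_vlm, video_a: str, video_b: str, instruction: str) -> str:
    """Ask a VLM critic which of two rollouts is preferable.

    query_vlm is a hypothetical callable wrapping any video-language model
    API; the prompt wording below is illustrative. Returns "A", "B", or "tie".
    """
    prompt = (
        f"Task instruction: {instruction}\n"
        "You are shown two robot rollouts, A and B; both may reach the goal.\n"
        "Considering safety and common user preferences, which rollout is\n"
        "preferable? Answer with exactly one of: A, B, tie. If the two are\n"
        "equally good or equally flawed, answer 'tie' rather than inventing\n"
        "a difference."
    )
    answer = query_vlm(videos=[video_a, video_b], prompt=prompt).strip()
    return answer if answer in {"A", "B", "tie"} else "tie"
```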


[Figures: recall/precision/accuracy results; error-type taxonomy]

Integrating Critiques into a Closed-loop System

While this work does not take a strong stance on how the critiques should be integrated into closed-loop policy-generation systems, we present a candidate framework on a real robot, wherein a Code-as-Policies (CaP) agent iteratively refines its policy according to VLM critiques on the rollouts. We consider five household tasks within table-top setups: (a) scissors handover; (b) lifting an opened bag; (c) bread grasping; (d) knife placing; and (e) spoon picking. The CaP agent manages to eliminate the undesirable behaviors flagged by the VLM critic in four of the five tasks.
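A minimal sketch of such a refinement loop is given below; generate_policy_code, execute_and_record, and query_vlm_critic are hypothetical stand-ins for the CaP code generator, the robot executor, and the VLM critic, so this is one candidate integration rather than the exact system used in the paper:

```python
def refine_until_acceptable(instruction, generate_policy_code,
                            execute_and_record, query_vlm_critic,
                            max_iters=5):
    """Iteratively refine a Code-as-Policies program using VLM critiques.

    The three callables are hypothetical stand-ins for the CaP code
    generator, the robot executor, and the behavior critic; this loop is
    one candidate integration, not the exact system used in the paper.
    """
    code = generate_policy_code(instruction, prior_code=None, feedback=None)
    for _ in range(max_iters):
        video = execute_and_record(code)           # roll out and film the policy
        critiques = query_vlm_critic(video, instruction)
        if not critiques:                          # nothing undesirable flagged
            break
        # Feed the critiques back so the generator can patch the policy code.
        code = generate_policy_code(instruction, prior_code=code,
                                    feedback=critiques)
    return code
```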