Judge

This guide covers the judge agent concept in Maia.

Idea

In multi-agent AI tests, a simple assertion or validation is often not enough to evaluate the result. AI agents can respond in many different ways, all of which may be correct. Because there is no guarantee that the final response will be the same from one run to the next, writing an exact assertion is nearly impossible.

This is where the Judge Agent steps in. The idea is to introduce a specialized agent that gets the whole context and history of messages. The task for such an agent is to decide if the result of the test is correct or wrong.

Requirements

Tests vary widely, so Maia lets you pass requirements to the Judge Agent. Requirements make the Judge Agent more specific, so it can focus on what matters for a given test. Each requirement describes, in free text, what is expected from the final response. For example, the requirement "The recipe is for cookies." tells the Judge Agent to check that the final response gives a recipe for cookies (not for a birthday cake, for instance).

You can pass as many requirements as you want.
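
To make the idea concrete, here is a minimal sketch of how a conversation and a list of free-text requirements could be combined into a single prompt for a judging model. The `build_judge_prompt` helper and the message dictionary keys (`sender`, `content`) are hypothetical and only illustrate the concept; this is not how Maia builds its prompt internally.

```python
from typing import Dict, List


def build_judge_prompt(history: List[Dict[str, str]], requirements: List[str]) -> str:
    """Hypothetical helper: render a conversation and its requirements as one prompt."""
    lines = ["Conversation:"]
    for message in history:
        lines.append(f"{message['sender']}: {message['content']}")
    lines.append("")
    lines.append("Requirements (each must hold for the final response):")
    # Number the requirements so the judge can refer to each one individually.
    for index, requirement in enumerate(requirements, start=1):
        lines.append(f"{index}. {requirement}")
    return "\n".join(lines)


history = [
    {"sender": "user", "content": "Give me a recipe for chocolate chip cookies."},
    {"sender": "RecipeBot", "content": "Mix flour, butter, sugar, and chocolate chips..."},
]
prompt = build_judge_prompt(history, ["The recipe is for cookies."])
print(prompt)
```

Numbering the requirements is one possible design choice: it gives the judging model a stable handle for reporting a result per requirement.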

Scoring

Because the Judge Agent is itself an AI agent, its verdict can vary slightly between runs. For debugging purposes, the whole test and each requirement are also scored. The Judge Agent therefore produces not only a success or failure result, but also a score for the overall test and for every requirement.

Note:

Soon, Maia will support configuration of scoring to let you define which score means success or failure.
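
The scoring output described in this section can be pictured as a small data structure. The sketch below is only an illustration: the `JudgeVerdict` class, its field names, and the 0 to 10 score scale are assumptions made for the example, not Maia's actual result type.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JudgeVerdict:
    """Hypothetical shape of a judge result; not Maia's actual class."""

    passed: bool
    overall_score: float  # assumed scale: 0 (worst) to 10 (best)
    requirement_scores: Dict[str, float] = field(default_factory=dict)

    def weakest_requirement(self) -> str:
        """Return the requirement with the lowest score, useful when debugging."""
        return min(self.requirement_scores, key=self.requirement_scores.get)


verdict = JudgeVerdict(
    passed=False,
    overall_score=4.0,
    requirement_scores={
        "The recipe is for cookies.": 9.0,
        "The recipe is for a birthday cake.": 1.0,
    },
)
print(verdict.weakest_requirement())
```

Per-requirement scores like these help explain why a test failed even when the overall verdict is a single pass or fail.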

Judge Agent in Framework

Below you can find various examples of how you can use the Judge Agent.

import pytest

# MaiaTest and JudgeAgent also need to be imported from the Maia framework.


class TestJudgeAgent(MaiaTest):

    def setup_agents(self):
        self.create_agent(
            name="RecipeBot",
            provider=self.get_provider("ollama"),
            system_message="You are a helpful assistant that provides recipes.",
        )

    @pytest.mark.asyncio
    async def test_judge_successful_conversation(self):
        """Tests that the JudgeAgent correctly identifies a successful conversation."""
        judge_agent = JudgeAgent(self.get_provider("ollama"))
        session = self.create_session(["RecipeBot"], judge_agent=judge_agent)

        await session.user_says("Can you give me a simple recipe for pancakes?")
        await session.agent_responds("RecipeBot")

    @pytest.mark.xfail(reason="Conversation should be judged as failure.")
    @pytest.mark.asyncio
    async def test_judge_failed_conversation(self):
        """Tests that the JudgeAgent correctly identifies a failed conversation."""
        judge_agent = JudgeAgent(self.get_provider("ollama"))
        session = self.create_session(["RecipeBot"], judge_agent=judge_agent)

        await session.user_says("What is the capital of France?")
        await session.agent_responds("RecipeBot")

    @pytest.mark.xfail(reason="Conversation should be judged as failure.")
    @pytest.mark.asyncio
    async def test_judge_with_requirements(self):
        """Tests that the JudgeAgent fails the conversation when a requirement is not met."""
        requirements = [
            "The recipe is for cookies.",
            "The recipe is for a birthday cake.",  # This should fail.
        ]
        judge_with_reqs = JudgeAgent(self.get_provider("ollama"), requirements=requirements)

        session = self.create_session(["RecipeBot"], judge_agent=judge_with_reqs)

        await session.user_says("Give me a recipe for chocolate chip cookies.")
        await session.agent_responds("RecipeBot")

Assigning a Judge Agent to a session means that the result of the test is checked at teardown, so you do not need to invoke the judge yourself.
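
The teardown behavior can be sketched in plain Python. Everything in this example is hypothetical (`SketchSession`, `SketchTest`, and the toy `mentions_recipe` judge); it only illustrates the pattern of collecting sessions during the test and judging them afterwards, under the assumption that a failed verdict surfaces as an assertion error.

```python
from typing import Callable, List


class SketchSession:
    """Hypothetical stand-in for a session that records messages."""

    def __init__(self, judge: Callable[[List[str]], bool]) -> None:
        self.judge = judge
        self.history: List[str] = []

    def record(self, message: str) -> None:
        self.history.append(message)


class SketchTest:
    """Hypothetical test base that judges every session at teardown."""

    def __init__(self) -> None:
        self.sessions: List[SketchSession] = []

    def create_session(self, judge: Callable[[List[str]], bool]) -> SketchSession:
        session = SketchSession(judge)
        self.sessions.append(session)
        return session

    def teardown(self) -> None:
        # The judge runs here, after the test body has finished.
        for session in self.sessions:
            if not session.judge(session.history):
                raise AssertionError("Judge marked the conversation as failed.")


def mentions_recipe(history: List[str]) -> bool:
    """Toy judge: passes only if some message mentions a recipe."""
    return any("recipe" in message.lower() for message in history)


test = SketchTest()
session = test.create_session(mentions_recipe)
session.record("What is the capital of France?")
session.record("The capital of France is Paris.")

try:
    test.teardown()
    outcome = "passed"
except AssertionError:
    outcome = "failed"
print(outcome)
```

In this pattern the test body contains no explicit assertion; the failure comes entirely from the teardown step.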

Note:

Please be aware that the whole conversation is passed to the AI Agent, so if you have a very long conversation, it might exceed the input token limit for your AI provider.
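
One way to guard against this is to estimate the size of the conversation before it reaches the judge. The sketch below uses a rough heuristic of about four characters per token, which is an assumption and differs between models and tokenizers; the helper names `estimate_tokens` and `fits_in_context` are hypothetical and not part of Maia.

```python
from typing import List


def estimate_tokens(messages: List[str], chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    total_chars = sum(len(message) for message in messages)
    return int(total_chars / chars_per_token) + 1


def fits_in_context(messages: List[str], token_limit: int, reserve: int = 512) -> bool:
    """Check that the conversation leaves `reserve` tokens for instructions and the reply."""
    return estimate_tokens(messages) + reserve <= token_limit


conversation = ["Give me a recipe for chocolate chip cookies."] * 200
print(estimate_tokens(conversation), fits_in_context(conversation, token_limit=4096))
```

For an accurate count, use the tokenizer that belongs to your provider's model instead of a character-based estimate.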

Visualization

You can visualize the results from "judged" test cases using the Maia Dashboard. Here is an example:

[Image: Judge result]
