How to Build a Multimodal AI Exam Proctor (Python + Next.js + LiveKit)

In this tutorial, we are going to build a real-time AI Exam Proctor.

Most voice AI agents are "blind": they can hear you, but they can't see what you are doing. This agent is different. It uses computer vision to watch a student's screen in real time, detects when they try to cheat by switching tabs, and even reads their final score off the page to congratulate them.

We will achieve this by combining LiveKit (for real-time video/audio), OpenAI GPT-4o (for vision and logic), and Anam.ai (for a realistic human avatar).

The Architecture

Our agent operates as three distinct components running in parallel (sketched below):

  1. The Eyes: The Python backend receives and buffers the user's screen share video track.

  2. The Hands: The agent uses Remote Procedure Calls (RPC) to remotely trigger the quiz popup on the user's frontend.

  3. The Brain: An asynchronous background loop captures screen frames every 2 seconds and sends them to GPT-4o to analyze for cheating or quiz completion.
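
To keep things concrete, here is the rough shape of the agent class we will end up with; each component maps to one method, and the bodies are filled in over the next steps.

Python

from livekit.agents import Agent

class ProctorAgent(Agent):
    # The Eyes -- buffer the latest frame from the student's screen share (Step 2)
    def _create_screen_stream(self, track):
        ...

    # The Hands -- RPC tool that opens the quiz popup on the frontend (Step 3)
    async def show_quiz_link(self, context):
        ...

    # The Brain -- background loop that sends frames to GPT-4o every 2 seconds (Step 4)
    async def _monitor_screen(self):
        ...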

Prerequisites

Before you start, make sure you have:

  * A LiveKit Cloud project (server URL, API key, and API secret)
  * An OpenAI API key (GPT-4o handles both the conversation and the vision analysis)
  * An Anam.ai API key (for the avatar in Step 5)
  * Python with the uv package manager, and Node.js with npm

Step 1: Project Setup

We will need two repositories running simultaneously: a Next.js frontend (for the user) and a Python backend (for the agent).

1. Initialize the Agent (Backend)

We'll start by initializing a Python project with uv and installing the LiveKit Agents SDK along with the plugins we need.

Bash

# Initialize the Python project
uv init voice-agent --bare
cd voice-agent

# Install dependencies
uv add livekit-agents livekit-plugins-openai livekit-plugins-silero livekit-plugins-anam python-dotenv

Create a .env.local file in your root directory and add your API keys:

Code snippet

LIVEKIT_URL=...
LIVEKIT_API_KEY=...
LIVEKIT_API_SECRET=...
OPENAI_API_KEY=...
ANAM_API_KEY=...
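
The agent needs these values at runtime. A minimal way to load them with python-dotenv (which we installed above) at the top of agent.py:

Python

from dotenv import load_dotenv

# Load LIVEKIT_*, OPENAI_API_KEY, and ANAM_API_KEY from .env.local
load_dotenv(".env.local")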

2. Initialize the Frontend

Clone the LiveKit Next.js starter kit to get a working room interface quickly.

Bash

git clone https://github.com/livekit/agents-starter-nextjs.git frontend
cd frontend
npm install
npm run dev
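
The frontend also needs credentials to mint access tokens for the room. Assuming the starter follows the usual LiveKit convention, create a .env.local in the frontend directory as well:

Code snippet

LIVEKIT_URL=...
LIVEKIT_API_KEY=...
LIVEKIT_API_SECRET=...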

Step 2: Building "The Eyes" (Video Plumbing)

Our first goal is to let the Python agent "see." By default, voice agents only listen to audio. We need to explicitly subscribe to the video track when a user shares their screen.

In your agent.py, we define the on_enter hook to listen for new tracks.

Python

import asyncio

from livekit import rtc
from livekit.agents import Agent, AgentSession, ChatContext, RunContext, function_tool, get_job_context
from livekit.agents.llm import ImageContent  # ChatContext, RunContext, function_tool, and ImageContent are used in later steps
from livekit.plugins import openai


class ProctorAgent(Agent):
    def __init__(self, session: AgentSession, vision_llm: openai.LLM):
        # Example instructions; adapt the prompt to your own exam flow
        super().__init__(instructions="You are a friendly exam proctor guiding a student through a short quiz.")
        self._session = session
        self._vision_llm = vision_llm        # separate LLM instance used by the monitoring loop (Step 4)
        self._latest_screen_frame = None     # most recent screen-share frame
        self._monitor_task = None            # background loop started by the quiz tool (Step 3)
        self._stream_task = None             # keep a reference so the reader task isn't garbage collected

    async def on_enter(self):
        """Subscribe to screen share tracks when the agent joins."""
        room = get_job_context().room

        @room.on("track_subscribed")
        def on_track_subscribed(track, publication, participant):
            # Filter for video tracks that are screen shares
            if (track.kind == rtc.TrackKind.KIND_VIDEO and
                publication.source == rtc.TrackSource.SOURCE_SCREENSHARE):
                self._create_screen_stream(track)

    def _create_screen_stream(self, track: rtc.Track):
        """Buffer the latest frame in a background task."""
        stream = rtc.VideoStream(track)

        async def read_stream():
            async for event in stream:
                # Always keep the most recent frame in memory
                self._latest_screen_frame = event.frame

        self._stream_task = asyncio.create_task(read_stream())

How it works: We create a background read_stream loop that constantly updates self._latest_screen_frame. This ensures that whenever our "Brain" wants to check the screen, it has the freshest image available instantly.

Step 3: Building "The Hands" (Remote Control)

We want the agent to control the exam flow. Instead of asking the user to "please open the quiz," the agent will use a tool to open it for them.

Backend: The Tool Definition

We define a @function_tool that the LLM can call. Inside, we use perform_rpc to send a signal to the frontend.

Python

@function_tool()
async def show_quiz_link(self, context: RunContext) -> str:
    """Called when the user confirms screen share. Opens the quiz popup."""

    # 1. Provide verbal feedback
    await self._session.say("Perfect! I'm setting up your quiz now.", allow_interruptions=False)

    # 2. Start the monitoring loop (The Brain)
    self._monitor_task = asyncio.create_task(self._monitor_screen())

    # 3. Send the RPC signal to the frontend (the student is the only remote participant)
    room = get_job_context().room
    student = next(iter(room.remote_participants.values()))
    await room.local_participant.perform_rpc(
        destination_identity=student.identity,
        method="frontend.showQuizLink",
        payload="",
    )
    return "Quiz link displayed."

Frontend: The RPC Handler

In your Next.js app, we need a component to listen for this signal and render the popup.

TypeScript

// RpcHandlers.tsx
"use client";

import { useEffect, useState } from "react";
import { useRoomContext } from "@livekit/components-react";
import { QuizPopup } from "./QuizPopup"; // adjust the path to wherever your popup lives

export function RpcHandlers() {
  const room = useRoomContext();
  const [showPopup, setShowPopup] = useState(false);

  useEffect(() => {
    if (!room) return;

    // Listen for the "frontend.showQuizLink" method
    room.registerRpcMethod("frontend.showQuizLink", async () => {
      setShowPopup(true);
      return "Popup displayed";
    });

    // Remove the handler when the component unmounts
    return () => {
      room.unregisterRpcMethod("frontend.showQuizLink");
    };
  }, [room]);

  if (showPopup) return <QuizPopup />;
  return null;
}
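
The QuizPopup component itself is up to you; anything that surfaces a link to your quiz page works. A minimal hypothetical version (file path, URL, and styling are placeholders) could look like this:

TypeScript

// QuizPopup.tsx -- hypothetical minimal popup; swap in your real quiz URL
export function QuizPopup() {
  return (
    <div className="fixed inset-0 flex items-center justify-center bg-black/50">
      <div className="rounded-lg bg-white p-6 text-center shadow-lg">
        <h2 className="mb-2 text-lg font-semibold">Your quiz is ready</h2>
        <a href="https://example.com/quiz" target="_blank" rel="noopener noreferrer" className="text-blue-600 underline">
          Start the quiz
        </a>
      </div>
    </div>
  );
}

Make sure <RpcHandlers /> is rendered somewhere inside the LiveKit room context (for example, next to the starter's session view); otherwise useRoomContext won't find the room.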

Step 4: Building "The Brain" (Vision Logic)

Now for the core logic. We need a loop that runs every few seconds, looks at the _latest_screen_frame, and decides if the user is cheating or finished.

We use a separate instance of openai.LLM for this loop to avoid blocking the main conversation.
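
One straightforward place to create that second instance is the entrypoint, passing it to the agent through the constructor from Step 2. A minimal sketch (the model name is an assumption):

Python

# Dedicated vision model so screen analysis never blocks the voice pipeline
vision_llm = openai.LLM(model="gpt-4o")
agent = ProctorAgent(session=session, vision_llm=vision_llm)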

Python

async def _monitor_screen(self):
    """Async loop that checks the screen every 2 seconds."""
    while True:
        await asyncio.sleep(2.0)

        if not self._latest_screen_frame:
            continue

        # Construct the vision prompt
        chat_ctx = ChatContext()
        chat_ctx.add_message(role="system", content=(
            "You are a proctor. Look at the screen and return ONLY one string:\n"
            "- 'TAB_SWITCH' if the user is NOT on the quiz page.\n"
            "- 'X out of Y' if you see a final score (e.g., '4 out of 4').\n"
            "- 'ON_QUIZ' if they are taking the quiz normally."
        ))
        chat_ctx.add_message(role="user", content=[ImageContent(image=self._latest_screen_frame)])

        # Get analysis from GPT-4o (llm.chat() returns a stream, so collect the text)
        response = ""
        async for chunk in self._vision_llm.chat(chat_ctx=chat_ctx):
            if chunk.delta and chunk.delta.content:
                response += chunk.delta.content

        # Logic Handling
        if "TAB_SWITCH" in response:
            await self._session.say("I see you switched tabs. Please go back.", add_to_chat_ctx=False)

        elif "out of" in response:
            await self._session.say(f"Congratulations! You scored {response}.", allow_interruptions=False)
            break  # Stop monitoring when finished
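
One housekeeping detail: if the student leaves before finishing, the loop would keep running. It is worth cancelling the task when the agent exits; a minimal sketch using the agent's on_exit hook:

Python

async def on_exit(self):
    """Stop the background monitoring loop when the session ends."""
    if self._monitor_task:
        self._monitor_task.cancel()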

Step 5: Adding "The Face" (Avatar)

Finally, to make the experience more immersive, we replace the standard voice with an Anam.ai avatar.

In your main entrypoint function:

Python

@server.rtc_session()
async def entrypoint(ctx: JobContext):
    session = AgentSession(...)

    # Initialize the Avatar
    avatar = anam.AvatarSession(
        persona_config=anam.PersonaConfig(
            name="Quiz Proctor",
            avatarId="your-avatar-id", 
        ),
    )
    # Start the avatar before the session
    await avatar.start(session, room=ctx.room)
    
    await session.start(...)
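
The AgentSession(...) and session.start(...) calls are elided above. One plausible way to wire them up, assuming an OpenAI STT-LLM-TTS pipeline with Silero VAD (the plugins installed in Step 1) and the dedicated vision LLM from Step 4, is the sketch below; adapt the models to your own setup.

Python

from livekit.plugins import anam, openai, silero  # plugin imports for the full entrypoint

session = AgentSession(
    stt=openai.STT(),
    llm=openai.LLM(model="gpt-4o"),
    tts=openai.TTS(),
    vad=silero.VAD.load(),
)

# ... avatar setup from above ...

vision_llm = openai.LLM(model="gpt-4o")  # separate instance for the screen monitor (Step 4)
await session.start(
    agent=ProctorAgent(session=session, vision_llm=vision_llm),
    room=ctx.room,
)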

Conclusion

We have now built a fully multimodal agent that goes far beyond simple chatbots. By decoupling the Video Stream (Eyes), Tool Execution (Hands), and Vision Logic (Brain), we created a system that feels intelligent and responsive.

You can find the full source code for this project on GitHub: github.com/jb-akp/lk-proctor

Happy coding!