How to Build a Multimodal AI Exam Proctor (Python + Next.js + LiveKit)

In this tutorial, we are going to build a real-time AI Exam Proctor.
Most voice AI agents are "blind"—they can hear you, but they can't see what you are doing. This agent is different. It uses Computer Vision to watch a student's screen in real-time, detects if they try to cheat by switching tabs, and even reads their final score off the page to congratulate them.
We will achieve this by combining LiveKit (for real-time video/audio), OpenAI GPT-4o (for vision and logic), and Anam.ai (for a realistic human avatar).
The Architecture
Our agent is built from three distinct components running in parallel:
The Eyes: The Python backend receives and buffers the user's screen share video track.
The Hands: The agent uses Remote Procedure Calls (RPC) to remotely trigger the quiz popup on the user's frontend.
The Brain: An asynchronous background loop captures screen frames every 2 seconds and sends them to GPT-4o to analyze for cheating or quiz completion.
Prerequisites
Python 3.9+ and Node.js 18+ installed.
A LiveKit Cloud account (free tier works).
An OpenAI API Key.
An Anam.ai API Key (optional, for the avatar).
Step 1: Project Setup
We will need two repositories running simultaneously: a Next.js frontend (for the user) and a Python backend (for the agent).
1. Initialize the Agent (Backend)
We'll start by creating a bare Python project with uv and adding the LiveKit Agents packages.
Bash
# Initialize the Python project
uv init voice-agent --bare
cd voice-agent
# Install dependencies
uv add livekit-agents livekit-plugins-openai livekit-plugins-silero livekit-plugins-anam python-dotenv
Create a .env.local file in your root directory and add your API keys:
Code snippet
LIVEKIT_URL=...
LIVEKIT_API_KEY=...
LIVEKIT_API_SECRET=...
OPENAI_API_KEY=...
ANAM_API_KEY=...
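Since the agent reads its keys from .env.local rather than the default .env, point python-dotenv at that file explicitly near the top of agent.py:
Python
# agent.py
from dotenv import load_dotenv

load_dotenv(".env.local")  # python-dotenv looks for ".env" by default, so name the file explicitly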
2. Initialize the Frontend
Clone the LiveKit Next.js starter kit to get a working room interface quickly.
Bash
git clone https://github.com/livekit/agents-starter-nextjs.git frontend
cd frontend
npm install
npm run dev
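The starter mints LiveKit access tokens server-side, so it needs its own environment file. Assuming the starter's default variable names (check its README if yours differ), frontend/.env.local looks like this:
Code snippet
LIVEKIT_URL=...
LIVEKIT_API_KEY=...
LIVEKIT_API_SECRET=...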
Step 2: Building "The Eyes" (Video Plumbing)
Our first goal is to let the Python agent "see." By default, voice agents only listen to audio. We need to explicitly subscribe to the video track when a user shares their screen.
In your agent.py, we define the on_enter hook to listen for new tracks.
Python
import asyncio

from livekit import rtc
from livekit.agents import Agent, AgentSession, RunContext, function_tool
from livekit.agents.llm import ChatContext, ImageContent
from livekit.plugins import openai


class ProctorAgent(Agent):
    def __init__(self, session: AgentSession, vision_llm: openai.LLM):
        super().__init__()
        self._session = session
        self._vision_llm = vision_llm      # separate LLM instance used by the monitoring loop
        self._latest_screen_frame = None   # most recent frame from the screen share
        self._monitor_task = None          # background task started once the quiz opens

    async def on_enter(self):
        """Subscribe to screen share tracks when the agent joins."""

        @self.room.on("track_subscribed")
        def on_track_subscribed(track, publication, participant):
            # Filter for video tracks that are screen shares
            if (
                track.kind == rtc.TrackKind.KIND_VIDEO
                and publication.source == rtc.TrackSource.SOURCE_SCREENSHARE
            ):
                self._create_screen_stream(track)

    def _create_screen_stream(self, track: rtc.Track):
        """Buffer the latest frame in a background task."""
        stream = rtc.VideoStream(track)

        async def read_stream():
            async for event in stream:
                # Always keep the most recent frame in memory
                self._latest_screen_frame = event.frame

        asyncio.create_task(read_stream())
How it works: We create a background read_stream loop that constantly updates self._latest_screen_frame. This ensures that whenever our "Brain" wants to check the screen, it has the freshest image available instantly.
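Full-resolution screen captures can be large, so you may want to shrink frames before handing them to the vision model. The sketch below uses livekit-agents' image helpers; the exact option names may differ between versions, so treat it as a starting point:
Python
from livekit import rtc
from livekit.agents.utils.images import encode, EncodeOptions, ResizeOptions

def frame_to_jpeg(frame: rtc.VideoFrame) -> bytes:
    """Downscale and JPEG-encode a buffered frame before sending it to the LLM."""
    return encode(
        frame,
        EncodeOptions(
            format="JPEG",
            resize_options=ResizeOptions(width=1024, height=1024, strategy="scale_aspect_fit"),
        ),
    )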
Step 3: Building "The Hands" (Remote Control)
We want the agent to control the exam flow. Instead of asking the user to "please open the quiz," the agent will use a tool to open it for them.
Backend: The Tool Definition
We define a @function_tool that the LLM can call. Inside, we use perform_rpc to send a signal to the frontend.
Python
    # Inside the ProctorAgent class
    @function_tool()
    async def show_quiz_link(self, context: RunContext) -> str:
        """Called when the user confirms screen share. Opens the quiz popup."""
        # 1. Provide verbal feedback
        await self._session.say("Perfect! I'm setting up your quiz now.", allow_interruptions=False)

        # 2. Start the monitoring loop (The Brain)
        self._monitor_task = asyncio.create_task(self._monitor_screen())

        # 3. Send the RPC signal to the frontend
        # (user_identity is the student's participant identity, captured when they joined)
        await self.room.local_participant.perform_rpc(
            destination_identity=user_identity,
            method="frontend.showQuizLink",
            payload="",
        )
        return "Quiz link displayed."
Frontend: The RPC Handler
In your Next.js app, we need a component to listen for this signal and render the popup.
TypeScript
// RpcHandlers.tsx
"use client";

import { useEffect, useState } from "react";
import { useRoomContext } from "@livekit/components-react";
import { QuizPopup } from "./QuizPopup"; // adjust to wherever your popup component lives

export function RpcHandlers() {
  const room = useRoomContext();
  const [showPopup, setShowPopup] = useState(false);

  useEffect(() => {
    if (!room) return;

    // Listen for the "frontend.showQuizLink" method
    room.registerRpcMethod("frontend.showQuizLink", async () => {
      setShowPopup(true);
      return "Popup displayed";
    });

    // Unregister when the component unmounts
    return () => room.unregisterRpcMethod("frontend.showQuizLink");
  }, [room]);

  if (showPopup) return <QuizPopup />;
  return null;
}
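Mount the component anywhere inside the room context so the hook can reach the connected Room. The starter already manages the connection for you, so the snippet below is illustrative; in practice you just drop <RpcHandlers /> into whatever component sits inside its room provider:
TypeScript
import { LiveKitRoom } from "@livekit/components-react";
import { RpcHandlers } from "./RpcHandlers";

// Illustrative wrapper: serverUrl and token come from your existing connection logic
export function ExamRoom({ serverUrl, token }: { serverUrl: string; token: string }) {
  return (
    <LiveKitRoom serverUrl={serverUrl} token={token} connect={true} audio={true}>
      <RpcHandlers />
      {/* ...video tiles, controls, avatar view... */}
    </LiveKitRoom>
  );
}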
Step 4: Building "The Brain" (Vision Logic)
Now for the core logic. We need a loop that runs every few seconds, looks at the _latest_screen_frame, and decides if the user is cheating or finished.
We use a separate instance of openai.LLM for this loop to avoid blocking the main conversation.
Python
    async def _monitor_screen(self):
        """Async loop that checks the screen every 2 seconds."""
        while True:
            await asyncio.sleep(2.0)
            if not self._latest_screen_frame:
                continue

            # Construct the vision prompt
            chat_ctx = ChatContext()
            chat_ctx.add_message(role="system", content=(
                "You are a proctor. Look at the screen and return ONLY one string:\n"
                "- 'TAB_SWITCH' if the user is NOT on the quiz page.\n"
                "- 'X out of Y' if you see a final score (e.g., '4 out of 4').\n"
                "- 'ON_QUIZ' if they are taking the quiz normally."
            ))
            chat_ctx.add_message(role="user", content=[ImageContent(image=self._latest_screen_frame)])

            # Get the analysis from GPT-4o (llm.chat() streams, so collect the chunks into a string)
            response = ""
            async for chunk in self._vision_llm.chat(chat_ctx=chat_ctx):
                if chunk.delta and chunk.delta.content:
                    response += chunk.delta.content

            # Logic handling
            if "TAB_SWITCH" in response:
                await self._session.say("I see you switched tabs. Please go back.", add_to_chat_ctx=False)
            elif "out of" in response:
                await self._session.say(f"Congratulations! You scored {response}.", allow_interruptions=False)
                break  # Stop monitoring when finished
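The loop exits cleanly once a score is detected, but it keeps running if the student simply leaves. A small addition that cancels the task when the agent exits, assuming the on_exit lifecycle hook that mirrors on_enter:
Python
    async def on_exit(self):
        """Stop the screen-monitoring loop when the session ends."""
        if self._monitor_task and not self._monitor_task.done():
            self._monitor_task.cancel()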
Step 5: Adding "The Face" (Avatar)
Finally, to make the experience more immersive, we give the agent a face by pairing its speech with an Anam.ai avatar.
In your main entrypoint function:
Python
@server.rtc_session()
async def entrypoint(ctx: JobContext):
    session = AgentSession(...)

    # Initialize the Avatar
    avatar = anam.AvatarSession(
        persona_config=anam.PersonaConfig(
            name="Quiz Proctor",
            avatarId="your-avatar-id",
        ),
    )

    # Start the avatar before the session
    await avatar.start(session, room=ctx.room)

    await session.start(...)
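With the avatar wired in, start the worker from the voice-agent directory. The commands below assume the standard LiveKit Agents CLI is hooked up in agent.py; adjust if your version wires the worker differently:
Bash
# Pre-download model files (e.g., the Silero VAD weights), then run the worker in dev mode
uv run python agent.py download-files
uv run python agent.py dev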
Conclusion
We have now built a fully multimodal agent that goes far beyond simple chatbots. By decoupling the Video Stream (Eyes), Tool Execution (Hands), and Vision Logic (Brain), we created a system that feels intelligent and responsive.
You can find the full source code for this project on GitHub: github.com/jb-akp/lk-proctor
Happy coding!