What We're Building
You call a phone number. An AI agent answers. You have a conversation. In the browser, you watch an avatar lip-sync everything the agent says in real time.
Tech stack:
VideoSDK - handles the telephony (phone calls)
Gemini 2.5 Flash Native Audio - the AI brain
Anam AI - real-time lip-syncing avatar
Prerequisites
Before writing any code you need:
VideoSDK account - buy a native phone number, set up an inbound gateway, and create a routing rule named MyTelephonyAgent
Google AI Studio API key - free tier works fine
Anam AI account - grab your API key and an avatar ID from Anam Lab
Python 3.10+
Step 1: Install Dependencies
pip install videosdk-agents==0.0.65 videosdk-plugins-google==0.0.65 videosdk-plugins-anam==0.0.65 python-dotenv==1.2.1 requests==2.31.0
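Version drift between the three videosdk packages is a common source of confusing errors, so it can help to verify the pins after installing. A minimal sketch using only the standard library (the `check_pins` helper and its exact pin set are illustrative, not part of the SDK):

```python
from importlib.metadata import version, PackageNotFoundError

# The pins from the install command above.
PINNED = {
    "videosdk-agents": "0.0.65",
    "videosdk-plugins-google": "0.0.65",
    "videosdk-plugins-anam": "0.0.65",
}

def check_pins(pins=PINNED):
    """Return a list of (package, expected, found) mismatches; empty means all good."""
    problems = []
    for pkg, expected in pins.items():
        try:
            found = version(pkg)
        except PackageNotFoundError:
            found = None  # package not installed at all
        if found != expected:
            problems.append((pkg, expected, found))
    return problems

if __name__ == "__main__":
    for pkg, expected, found in check_pins():
        print(f"{pkg}: expected {expected}, found {found}")
```

Run it once after `pip install`; no output means every package matches its pin.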
Step 2: Set Up Your .env
VIDEOSDK_AUTH_TOKEN=your_token
GOOGLE_API_KEY=your_google_api_key
ANAM_API_KEY=your_anam_api_key
ANAM_AVATAR_ID=your_anam_avatar_id
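A missing or empty variable here tends to surface later as an opaque auth error, so a fail-fast check at startup is worth a few lines. A sketch using only the standard library (the `missing_vars` helper is illustrative; in the real script `load_dotenv()` populates `os.environ` first):

```python
import os

# The four variables the agent script reads.
REQUIRED_VARS = [
    "VIDEOSDK_AUTH_TOKEN",
    "GOOGLE_API_KEY",
    "ANAM_API_KEY",
    "ANAM_AVATAR_ID",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Calling this right after `load_dotenv()` turns a cryptic runtime failure into a clear one-line message.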
Step 3: Write the Agent
import asyncio
import os
import logging
import traceback

from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob, Options
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.plugins.anam import AnamAvatar

logging.basicConfig(level=logging.INFO)
load_dotenv()


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful AI assistant that answers phone calls. Keep your responses concise and friendly.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello! I'm your real-time assistant. How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! It was great talking with you!")


async def start_session(context: JobContext):
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        api_key=os.getenv("GOOGLE_API_KEY"),
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"],
        ),
    )

    anam_avatar = AnamAvatar(
        api_key=os.getenv("ANAM_API_KEY"),
        avatar_id=os.getenv("ANAM_AVATAR_ID"),
    )

    pipeline = RealTimePipeline(model=model, avatar=anam_avatar)
    session = AgentSession(agent=MyVoiceAgent(), pipeline=pipeline)

    try:
        await context.connect()
        await session.start()
        # Keep the session alive until the worker shuts it down
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    return JobContext(room_options=RoomOptions())


if __name__ == "__main__":
    try:
        options = Options(
            agent_id="MyTelephonyAgent",
            register=True,
            max_processes=10,
            host="localhost",
            port=8081,
        )
        job = WorkerJob(entrypoint=start_session, jobctx=make_context, options=options)
        job.start()
    except Exception:
        traceback.print_exc()
How It Works
When you run the script, your agent registers with VideoSDK under the name MyTelephonyAgent. When someone calls your VideoSDK phone number, VideoSDK routes the call to your agent. The agent connects to Gemini for conversation and to Anam for the avatar simultaneously. A playground URL appears in your terminal - open it in the browser to see the avatar.
From there: you speak on the phone → Gemini processes your voice → audio response goes back to your phone → that same audio drives Anam's lip-sync in the browser.
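The key idea in that flow is that one audio stream feeds two consumers: the phone leg and the avatar's lip-sync. A conceptual sketch of that fan-out (the `FanOutSink` class is illustrative, not part of the VideoSDK pipeline API):

```python
class FanOutSink:
    """Conceptual sketch: each audio frame from Gemini is delivered to every
    downstream consumer - here, the phone leg and the Anam avatar."""

    def __init__(self, *sinks):
        self.sinks = sinks  # anything with .append(), e.g. buffers or queues

    def write(self, frame):
        # Deliver the same frame to every consumer, in order.
        for sink in self.sinks:
            sink.append(frame)
```

In the real pipeline the SDK handles this internally; the sketch just shows why the avatar's mouth and the phone audio stay in lockstep - they are literally the same frames.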
Step 4: Run It
python3 main.py
Call your VideoSDK number. When the playground URL appears in the terminal, open it in your browser and mute your browser mic to avoid feedback. Talk on the phone, watch the avatar respond.
Notes
The H264 decoder warnings in the logs at startup are normal - the first few video frames are dropped before the keyframe arrives, and decoding recovers on its own.
If audio cuts out, check your network responsiveness via the networkQuality readings in your terminal. WebRTC is sensitive to latency.
Gemini 2.5 Flash Native Audio is free tier on Google AI Studio, with a limit of ~1000 requests/day.
GitHub
Full source code: github.com/jb-akp/VideoSDK-PythonAgent