What We're Building

You call a phone number. An AI agent answers. You have a conversation. In the browser, you watch an avatar lip-sync everything the agent says in real time.

Tech stack:

  • VideoSDK - handles the telephony (phone calls)

  • Gemini 2.5 Flash Native Audio - the AI brain

  • Anam AI - real-time lip-syncing avatar

Prerequisites

Before writing any code you need:

  1. VideoSDK account - buy a native phone number, set up an inbound gateway, and create a routing rule named MyTelephonyAgent

  2. Google AI Studio API key - free tier works fine

  3. Anam AI account - grab your API key and an avatar ID from Anam Lab

  4. Python 3.10+

Step 1: Install Dependencies

pip install videosdk-agents==0.0.65 videosdk-plugins-google==0.0.65 videosdk-plugins-anam==0.0.65 python-dotenv==1.2.1 requests==2.31.0

Step 2: Set Up Your .env

VIDEOSDK_AUTH_TOKEN=your_token
GOOGLE_API_KEY=your_google_api_key
ANAM_API_KEY=your_anam_api_key
ANAM_AVATAR_ID=your_anam_avatar_id

Step 3: Write the Agent

import asyncio
import logging
import os
import traceback

from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob, Options
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.plugins.anam import AnamAvatar

logging.basicConfig(level=logging.INFO)
load_dotenv()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful AI assistant that answers phone calls. Keep your responses concise and friendly.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello! I'm your real-time assistant. How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! It was great talking with you!")

async def start_session(context: JobContext):
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        api_key=os.getenv("GOOGLE_API_KEY"),
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"],
        ),
    )

    anam_avatar = AnamAvatar(
        api_key=os.getenv("ANAM_API_KEY"),
        avatar_id=os.getenv("ANAM_AVATAR_ID"),
    )

    # Gemini handles the conversation audio; Anam lip-syncs that same audio in the browser.
    pipeline = RealTimePipeline(model=model, avatar=anam_avatar)
    session = AgentSession(agent=MyVoiceAgent(), pipeline=pipeline)

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()  # keep the session alive until the worker shuts down
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    return JobContext(room_options=RoomOptions())

if __name__ == "__main__":
    try:
        options = Options(
            agent_id="MyTelephonyAgent",  # must match the routing rule name from Prerequisites
            register=True,                # register this worker with VideoSDK
            max_processes=10,
            host="localhost",
            port=8081,
        )
        job = WorkerJob(entrypoint=start_session, jobctx=make_context, options=options)
        job.start()
    except Exception:
        traceback.print_exc()

How It Works

When you run the script, your agent registers with VideoSDK under the name MyTelephonyAgent. When someone calls your VideoSDK phone number, VideoSDK routes the call to your agent via the routing rule you created. The agent then connects to Gemini for the conversation and to Anam for the avatar simultaneously. A playground URL appears in your terminal - open it in the browser to see the avatar.

From there: you speak on the phone → Gemini processes your voice → audio response goes back to your phone → that same audio drives Anam's lip-sync in the browser.

Step 4: Run It

python3 main.py

Call your VideoSDK number. When the playground URL appears in the terminal, open it in your browser and mute your browser mic to avoid feedback. Talk on the phone, watch the avatar respond.

Notes

  • The H264 decoder warnings in the startup logs are normal - the first few video frames get dropped before a keyframe arrives, and the warnings stop on their own once one does.

  • If audio cuts out, check the networkQuality readings in your terminal for signs of a poor connection. WebRTC is sensitive to latency.

  • Gemini 2.5 Flash Native Audio is free tier on Google AI Studio, with a limit of ~1000 requests/day.
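Since latency is the usual culprit for dropped audio, a crude spot check is timing a TCP handshake to a nearby server. This is not a WebRTC diagnostic, just a rough proxy; the host in the usage comment is a placeholder, not a real VideoSDK endpoint:

```python
# Time a TCP handshake to a host - a rough proxy for network round-trip latency.
import socket
import time

def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Return how long a TCP connection to host:port takes, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close it immediately
    return (time.perf_counter() - start) * 1000.0

# Example (hypothetical host): tcp_connect_ms("your-media-server.example.com", 443)
```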
