Shipping AI Voice Agents That Resolve Calls in Under 90 Seconds

Karey Powell · July 1, 2026 · 6 min read

Putting a large language model on a live phone call sounds simple until the first real caller is three seconds into silence, wondering if anyone is there. Voice is unforgiving. There is no spinner, no "thinking…" indicator — just dead air that erodes trust with every passing moment.

The system in question qualifies inbound calls for home services brands: it answers, screens the caller through a three-question flow — what service do you need, is the property residential, are you the owner — and then either transfers a qualified lead to a live agent or politely ends the call. The target: resolve the call in under ninety seconds.

Here is the first thing shipping it taught me: you don't hit ninety seconds by making the model faster. You hit it by making the conversation shorter.

Ninety seconds is a design constraint, not a benchmark

Most of the budget was won before the caller said a word:

  • Route by the dialed number. Every brand has dedicated tracking numbers, so the DNIS identifies the brand before "hello." The webhook resolves phone number → brand config → assistant deterministically, in one lookup. No "which company are you trying to reach?" preamble — the agent answers already knowing who it is.
  • One assistant per brand, not one mega-prompt. Each brand+category combination gets its own assistant with its own system prompt, exclusion list, and FAQ bank. Shorter prompts mean faster inference, and separate assistants keep the blast radius contained when one brand's rules change.
  • Three questions, scripted. The qualification flow is a fixed sequence with explicit branch conditions. The LLM's job is to navigate the script under messy real-world input, not to improvise structure.
  • Hard caps. Every call carries maxDurationSeconds: 180, twice the ninety-second target. A capped worst case is a feature, not an admission of failure.

The conversation design is the latency optimization. Everything else is milliseconds; this is tens of seconds.
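
The number-based routing described above can be sketched roughly as follows. This is a minimal illustration, not the production webhook: the BrandConfig fields, the sample numbers, and the assistant IDs are all hypothetical, and the lookup is an in-memory map where a real system would presumably read from a config store.

```typescript
// Hypothetical shape: these field names are illustrative, not the author's schema.
interface BrandConfig {
  brandId: string;
  category: string;
  assistantId: string;
}

// Tracking numbers mapped to brand configs (made-up sample data).
const BRANDS_BY_NUMBER = new Map<string, BrandConfig>([
  ["+15550100", { brandId: "acme-plumbing", category: "plumbing", assistantId: "asst_plumbing_01" }],
  ["+15550101", { brandId: "acme-hvac", category: "hvac", assistantId: "asst_hvac_01" }],
]);

// Strip formatting so "+1 (555) 0100" and "+15550100" resolve to the same key.
function normalizeNumber(raw: string): string {
  const digits = raw.replace(/\D/g, "");
  return `+${digits}`;
}

// One deterministic lookup: dialed number -> brand config (which names the assistant).
// Unknown numbers return undefined so the caller decides what to do with them.
function resolveBrand(dialedNumber: string): BrandConfig | undefined {
  return BRANDS_BY_NUMBER.get(normalizeNumber(dialedNumber));
}
```

Because the dialed number alone selects the assistant, nothing in this path involves the model, which is what keeps it deterministic. How the real webhook treats an unrecognized number is not something the article states.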

Where the milliseconds go

The streaming stack is what you'd expect — Deepgram Nova-3 for STT with keyterm boosting on brand names, GPT-4o at temperature 0.3, ElevenLabs' turbo model with streaming-latency optimization maxed out. Two less obvious choices mattered:

  • Cap the output tokens. maxTokens: 250 isn't a cost control, it's a latency control. Every extra sentence the model writes is another second of synthesized speech the caller has to sit through.
  • Turn-taking is the hardest problem in the system. Pause-based endpointing fails badly on long-winded callers: if someone never pauses, the agent literally never gets a turn. We had an "interject politely" rule in the prompt for months that could not fire — the orchestration layer never yielded the floor, so the model never got the chance. The fix wasn't better prompting; it was swapping in smart endpointing (a LiveKit model with a tuned sigmoid wait curve) that detects a caller winding down a thought rather than waiting for dead air.

That last one generalizes: some behaviors cannot be prompted, because they live below the prompt — in the audio layer, in who holds the floor. Knowing which layer owns a problem is most of voice engineering.
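
To make the "sigmoid wait curve" idea concrete, here is one plausible reading of it, written as a small sketch. It is not LiveKit's model and not the tuned curve from this system: it assumes some upstream model supplies a probability that the caller has finished their turn, and every name and parameter below is hypothetical.

```typescript
// Hypothetical tuning knobs for mapping end-of-turn confidence to a wait time.
interface WaitCurve {
  minWaitMs: number; // wait when the model is confident the turn is over
  maxWaitMs: number; // wait when the caller is clearly mid-thought
  midpoint: number;  // probability at which the wait is halfway between the two
  steepness: number; // larger values give a sharper transition around the midpoint
}

// Returns how long to hold back before the agent takes the floor.
// Higher end-of-turn probability means a shorter wait, following a sigmoid.
function waitMs(pEndOfTurn: number, curve: WaitCurve): number {
  const p = Math.min(1, Math.max(0, pEndOfTurn)); // clamp to [0, 1]
  const s = 1 / (1 + Math.exp(-curve.steepness * (p - curve.midpoint)));
  return curve.maxWaitMs - (curve.maxWaitMs - curve.minWaitMs) * s;
}
```

The point of the shape is that the agent's patience shrinks smoothly as the caller sounds more finished, instead of depending on a fixed stretch of dead air that a long-winded caller may never produce.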

Silence is a failure mode with a budget

Pocket dials, callers who wander off, dead lines — at volume, silent calls are real minutes and real money. We handle silence in stages, and every stage speaks before it acts:

hooks: [
  {
    on: "customer.speech.timeout",
    options: { timeoutSeconds: 10, triggerResetMode: "onUserSpeech" },
    do: [{ type: "say", exact: "Hello, is anyone on the line?" }],
  },
  {
    on: "customer.speech.timeout",
    options: { timeoutSeconds: 15, triggerResetMode: "onUserSpeech" },
    do: [
      { type: "say", exact: SILENCE_TERMINATION_MESSAGE },
      { type: "tool", tool: { type: "endCall" } },
    ],
  },
],
// safety net: if the hooks fail to fire (it happens), the hard 30s
// cutoff still speaks the message instead of hanging up silently
messagePlan: { silenceTimeoutMessage: SILENCE_TERMINATION_MESSAGE },
silenceTimeoutSeconds: 30,

Ten seconds of caller silence gets a presence check. Fifteen gets a scripted goodbye and a clean hangup. And because the platform's hooks occasionally fail to trigger, a thirty-second hard cutoff backstops them — with the same spoken message, because hanging up on a customer in silence is never acceptable, even in a failure path.

Fallbacks are not optional

The same belt-and-suspenders thinking runs through every layer:

  • Voice fallback chain. If ElevenLabs is down, the call falls back first to the platform's built-in voice, then to OpenAI's. A TTS outage degrades voice quality, not availability.
  • "When in doubt, transfer." A borderline caller transferred to a human costs a few minutes of agent time. A valid lead rejected by an overconfident model costs the client real business. Every ambiguous branch (a caller who is unintelligible after two attempts, a caller insisting on a human) resolves toward a person.
  • Retry caps everywhere. Clarifying questions are capped at two per call; "we don't offer that — anything else?" retries are capped at two. An LLM in a loop with a confused caller will happily loop forever unless the script forbids it.
  • Timeouts and circuit breakers on every vendor call. The platform side wraps every voice-API request in a 30-second timeout with a circuit breaker that fast-fails after five consecutive errors, and outbound webhook delivery retries with exponential backoff behind per-endpoint breakers.

The LLM is the fast path. It is never the only path.
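
The circuit-breaker behavior mentioned in the list above can be sketched as a small state machine. This is an illustration under assumptions, not the platform's code: the article gives the threshold (five consecutive errors) but not a cooldown, so cooldownMs and the half-open probe behavior are my assumptions, and the 30-second request timeout and webhook backoff are left out.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // failureThreshold: 5 in the article. cooldownMs: an assumption, not stated there.
  // now: injectable clock so the breaker can be exercised without real timers.
  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Check this before each vendor request; false means fail fast without calling out.
  allowRequest(): boolean {
    return this.state() !== "open";
  }

  // Any success closes the breaker and resets the failure streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // Only consecutive failures count; reaching the threshold opens the breaker.
  // A failed half-open probe lands here too and restarts the cooldown.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

A caller would wrap each vendor request in an allowRequest() check, then report the result with recordSuccess() or recordFailure(). With one breaker instance per endpoint, a failing webhook target stops consuming retries without affecting the others.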

The prompt is a program

The system prompt reads less like instructions and more like a state machine written in English: exact scripted lines, explicit branch conditions, hard-terminator FAQ entries, rules about when the model may and may not invoke the transfer tool. So we treat it like code. Brand configs are JSON; a prompt builder compiles them into assistants; deploys go through a pipeline that can diff what's in version control against what's live and flag drift.

That discipline is what made iteration safe. When a client wants a new exclusion ("we don't service mobile homes"), it's a one-line config change, regenerated and redeployed in minutes — not a hand-edit to a prompt someone pasted into a dashboard six weeks ago.
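
As a rough sketch of what "config in, prompt out, diff against live" can look like: the config fields, the prompt wording, and the line-based drift check below are all hypothetical stand-ins, far simpler than a real prompt builder, and they assume the live prompt can be fetched as a string by some other means.

```typescript
// Hypothetical brand config; the real JSON schema is not shown in the article.
interface BrandPromptConfig {
  brandName: string;
  services: string[];
  exclusions: string[];
}

// Compile a config into prompt text. Same config in, same prompt out.
function buildPrompt(cfg: BrandPromptConfig): string {
  const lines = [
    `You are the phone assistant for ${cfg.brandName}.`,
    `Services offered: ${cfg.services.join(", ")}.`,
    "Exclusions (politely decline and end the call):",
    ...cfg.exclusions.map((e) => `- ${e}`),
  ];
  return lines.join("\n");
}

// Naive drift check: report lines present on one side but not the other.
// It ignores ordering and duplicate lines, which a real diff would not.
function findDrift(expected: string, live: string): string[] {
  const expectedLines = expected.split("\n");
  const liveLines = live.split("\n");
  const expectedSet = new Set(expectedLines);
  const liveSet = new Set(liveLines);
  const missing = expectedLines
    .filter((l) => !liveSet.has(l))
    .map((l) => `missing from live: ${l}`);
  const unexpected = liveLines
    .filter((l) => !expectedSet.has(l))
    .map((l) => `only in live: ${l}`);
  return [...missing, ...unexpected];
}
```

In this shape, adding an exclusion is one new array entry in the config, and an empty findDrift result means the deployed prompt still matches what version control would produce.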

Observability before trust

The agent didn't touch real traffic until we could answer three questions about any call, after the fact: what did the caller say, what did the model decide, and why. That started as structured JSON logs and end-of-call reports — transcript, outcome, duration, every webhook event. As the system grew into a platform, it became Prometheus histograms on every request path, alert rules for webhook error rates and p95 breaches, Sentry for exceptions, and Slack alerts for deploy and queue failures.
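
For a sense of what the structured logs and the p95 checks involve, here is a minimal sketch. The report fields and outcome labels are hypothetical (the article lists transcript, outcome, and duration but not a schema), and the percentile uses the simple nearest-rank method, whereas a Prometheus histogram would estimate it from buckets.

```typescript
// Hypothetical end-of-call report; field names and outcome labels are illustrative.
interface EndOfCallReport {
  callId: string;
  brandId: string;
  outcome: "transferred" | "ended_unqualified" | "ended_silence";
  durationSeconds: number;
  transcript: { role: "caller" | "agent"; text: string }[];
}

// One JSON object per line, so calls can be filtered and replayed later.
function toLogLine(report: EndOfCallReport): string {
  return JSON.stringify({ event: "call.ended", ...report });
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[Math.max(0, rank - 1)];
  if (value === undefined) throw new Error("percentile out of range");
  return value;
}
```

Running percentile(durations, 95) over a day of durationSeconds values and comparing the result with the ninety-second target is the kind of check an alert rule would automate.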

The payoff wasn't just reliability — it was iteration speed. The turn-taking fix above didn't come from intuition; it came from reading transcripts of calls that ran long and noticing the agent never spoke during monologues. Once you can see exactly where calls go sideways, improving the system becomes an empirical exercise instead of a guessing game.

What I'd tell my past self

Build the boring parts first. The model is the easy, exciting 20%. The endpointing math, the staged silence hooks, the voice fallback chain, the retry caps, the circuit breakers, the structured logs — that unglamorous 80% is the difference between a demo that wows in a meeting and a system you'd put your name on in production. And when a behavior won't budge no matter how you rewrite the prompt, stop rewriting the prompt: the problem probably lives in a layer the prompt can't see.

Tags: Vapi.ai · GPT-4o · Deepgram · ElevenLabs · Latency · Observability

Karey Powell

Staff Engineer & AI Systems Architect. 14+ years building production fintech and AI systems across the Caribbean. Currently Lead Solutions Architect at MZ Holdings.
