
How to Build a Multilingual Voice Agent Using the OpenAI Agents SDK


OpenAI’s Agents SDK has taken things up a notch with the launch of its voice agent capability, letting you create intelligent, real-time, speech-based applications. Whether you’re building a language tutor, a virtual assistant, or a support bot, this new capability unlocks a whole new level of interaction: natural, dynamic, and human-like. In this guide, we’ll break down what it is, how it works, and how you can build a multilingual voice agent yourself.

What is a voice agent?

A voice agent is a system that listens to your voice, understands what you are saying, thinks of an answer, and then responds out loud. The magic comes from a combination of speech-to-text, language model, and text-to-speech technologies.

The OpenAI Agents SDK makes this remarkably accessible through something called a VoicePipeline, a structured three-step process (a minimal sketch follows the list):

  1. Speech-to-text (STT): captures and converts your spoken words into text.
  2. Agent logic: this is your code (or your agent), which figures out the appropriate response.
  3. Text-to-speech (TTS): converts the agent’s text response into audio that is spoken out loud.
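In the Agents SDK, those three steps are bundled into a single object. As a taste of the full example later in this guide, a minimal pipeline can be sketched like this (the agent’s instructions here are just an assumption for illustration):

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# Step 2 (agent logic) lives in the agent; steps 1 and 3 use the
# pipeline's default STT and TTS models.
agent = Agent(name="Assistant", instructions="Be polite and concise.")
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

# Feed recorded audio in, stream spoken audio out (inside an async function):
#   result = await pipeline.run(AudioInput(buffer=recorded_buffer))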

Choosing the right architecture

Depending on your use case, you will want to choose one of the two main architectures supported by OpenAI:

1. Speech-to-speech architecture (multimodal)

This is the real-time audio approach, using models such as gpt-4o-realtime-preview. Instead of translating to text behind the scenes, the model processes and generates speech directly.

Why use this?

  • Low-latency, real-time interaction
  • Understanding of emotion and vocal tone
  • Natural, smooth conversation flow

Good for:

  • Language tutoring
  • Live conversation agents
  • Interactive storytelling or learning applications
Strengths | Better for
Low latency | Interactive, unstructured dialogue
Multimodal understanding (voice, tone, pauses) | Real-time engagement
Emotion-aware responses | Customer service, virtual companions

This approach makes conversations feel fluid and human, but it may need extra attention for edge cases such as logging or exact transcripts.

2. Chained architecture

The chained approach is more traditional: speech is converted to text, an LLM processes that text, and the reply is converted back into speech. The recommended models are listed below, followed by a short sketch of the chain:

  • gpt-4o-transcribe (for STT)
  • gpt-4o (for logic)
  • gpt-4o-mini-tts (for TTS)
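To make those three hops concrete, here is a minimal sketch of the chained flow using the plain OpenAI Python client with the recommended models. The file names and the voice are placeholders:

from openai import OpenAI

client = OpenAI()

# 1. STT: turn the recorded question into text.
with open("question.wav", "rb") as f:
    text = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=f
    ).text

# 2. Agent logic: let the LLM craft a reply.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": text}],
).choices[0].message.content

# 3. TTS: speak the reply back as audio.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts", voice="alloy", input=reply
)
speech.write_to_file("reply.mp3")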

Why use this?

  • You need transcripts for auditing or logging
  • You have structured workflows such as customer support or lead qualification
  • You want predictable, controllable behavior

Good for:

  • Support bots
  • Sales agents
  • Task-specific assistants
Strengths | Better for
High control and transparency | Structured workflows
Reliable text-based processing | Applications that need transcripts
Predictable outputs | Customer-facing scripted flows

This approach is easier to debug and a great starting point if you are new to voice agents.

How does the voice agent work?

We configure a VoicePipeline with a custom workflow. This workflow runs an agent, but it can also trigger special responses if you say a secret phrase.

Here is what happens when you talk:

  1. Audio streams into the voice pipeline while you speak.
  2. When you stop talking, the pipeline kicks into action.
  3. The pipeline then:
    • Transcribes your speech to text.
    • Sends the transcription to the workflow, which runs the agent’s logic.
    • Streams the agent’s response to a text-to-speech (TTS) model.
    • Plays the generated audio back to you.

It is real-time, interactive, and smart enough to react differently if you slip in a hidden phrase.
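As a rough illustration of that “secret phrase” behavior, here is a minimal custom workflow sketch. It assumes the SDK’s VoiceWorkflowBase interface, where run() receives the transcription and yields text chunks to be spoken; the trigger phrase and canned response are invented for the example:

from collections.abc import AsyncIterator

from agents import Agent, Runner
from agents.voice import VoiceWorkflowBase

class SecretPhraseWorkflow(VoiceWorkflowBase):
    """Runs an agent, but short-circuits when a hidden phrase is heard."""

    def __init__(self, agent: Agent):
        self._agent = agent

    async def run(self, transcription: str) -> AsyncIterator[str]:
        # Hypothetical trigger: respond differently to the secret phrase.
        if "open sesame" in transcription.lower():
            yield "You found the secret phrase!"
            return
        # Otherwise let the agent figure out the reply.
        result = await Runner.run(self._agent, transcription)
        yield result.final_output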

Configuring a voice pipeline

When configuring a voice pipeline, there are a few key components you can customize:

  • Workflow: the logic that runs every time new audio is transcribed. It defines how the agent processes and responds.
  • STT and TTS models: choose which speech-to-text and text-to-speech models your pipeline will use.
  • Configuration settings: this is where you adjust how your pipeline behaves (see the sketch after this list):
    • Model provider: a mapping system that links model names to actual model instances.
    • Tracing options: control whether tracing is enabled, whether audio files are uploaded, assign workflow names, trace IDs, and more.
    • Model-specific configuration: customize prompts, language preferences, and supported data types for the TTS and STT models.
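Here is a minimal sketch of how these pieces can be wired together. VoicePipelineConfig and the field names shown are based on the SDK’s voice module, but they may differ between versions, so treat them as assumptions to check against the docs:

from agents.voice import (
    SingleAgentVoiceWorkflow,
    VoicePipeline,
    VoicePipelineConfig,
)

# Assumed field names: workflow_name tags traces, and
# trace_include_sensitive_audio_data controls audio upload.
config = VoicePipelineConfig(
    workflow_name="multilingual-demo",
    trace_include_sensitive_audio_data=False,
)

pipeline = VoicePipeline(
    workflow=SingleAgentVoiceWorkflow(agent),  # `agent` defined as usual
    stt_model="gpt-4o-transcribe",
    tts_model="gpt-4o-mini-tts",
    config=config,
)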

Running a voice pipeline

To start a voice pipeline, you use the run() method. It accepts audio input in one of two ways, depending on how you are handling speech:

  • AudioInput is ideal when you already have a complete audio clip or transcription. It is perfect for cases where you know when the speaker is done, such as pre-recorded audio or push-to-talk setups. There is no need to detect live activity here.
  • StreamedAudioInput is designed for real-time dynamics. You feed it audio chunks as they are captured, and the pipeline automatically works out when to trigger the agent’s logic using something called activity detection. This is very useful for open microphones or hands-free interaction where it is not obvious when the speaker has finished. (A sketch of both styles follows this list.)
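A short sketch of both input styles. It assumes a pipeline built as above, a recorded buffer for the push-to-talk case, and a hypothetical async chunk source for the streamed one; StreamedAudioInput.add_audio is the push method I would expect, so verify it against your SDK version:

from agents.voice import AudioInput, StreamedAudioInput

async def push_to_talk(pipeline, recorded_buffer):
    # One complete clip: run the pipeline once, no activity detection needed.
    return await pipeline.run(AudioInput(buffer=recorded_buffer))

async def open_mic(pipeline, mic_chunks):
    # Real time: push chunks as they arrive; the pipeline's activity
    # detection decides when to trigger the agent's logic.
    streamed = StreamedAudioInput()
    result = await pipeline.run(streamed)
    async for chunk in mic_chunks:  # hypothetical async source of audio chunks
        await streamed.add_audio(chunk)
    return result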

Understanding the results

Once your pipeline is running, it returns a StreamedAudioResult, which lets you consume events in real time as the interaction unfolds. These events come in a few flavors (a condensed handling loop follows the list):

  • VoiceStreamEventAudio – contains chunks of audio output (that is, what the agent says).
  • VoiceStreamEventLifecycle – marks important lifecycle events, such as the start or end of a conversation turn.
  • VoiceStreamEventError – signals that something went wrong.
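The full example later in this guide consumes these events with exactly this pattern; here is a condensed version. The handle_audio playback hook is a placeholder, and the error attribute on the error event is an assumption to verify against the SDK docs:

async def consume(result):
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            handle_audio(event.data)  # placeholder: send chunk to your player
        elif event.type == "voice_stream_event_lifecycle":
            print(f"Lifecycle: {event.event}")
        elif event.type == "voice_stream_event_error":
            print(f"Error: {event.error}")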

A hands-on voice agent with the OpenAI Agents SDK

Here is a clean, well-structured walkthrough for setting up a working voice agent with the OpenAI Agents SDK, broken into clear, actionable steps. Everything stays casual and practical:

1. Set up your project directory

mkdir my_project
cd my_project

2. Create and activate a virtual environment

Create the environment:

python -m venv .venv

Activate it:

supply .venv/bin/activate

3. Install the OpenAI Agents SDK

pip install 'openai-agents[voice]'

4. Set your OpenAI API key

export OPENAI_API_KEY=sk-...

5. Clone the examples repository

git clone https://github.com/openai/openai-agents-python.git

6. Modify the example code for the Hindi agent and audio saving

Navigate to the example directory:

cd openai-agents-python/examples/voice/static

Now, edit main.py:

You will do two key things:

  1. Add a Hindi agent
  2. Enable saving the audio after playback

Replace the entire contents of main.py with the final code below:

import asyncio
import random

from agents import Agent, function_tool
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions
from agents.voice import (
    AudioInput,
    SingleAgentVoiceWorkflow,
    SingleAgentWorkflowCallbacks,
    VoicePipeline,
)

from .util import AudioPlayer, record_audio

@function_tool
def get_weather(city: str) -> str:
    print(f"[debug] get_weather called with city: {city}")
    choices = ["sunny", "cloudy", "rainy", "snowy"]
    return f"The weather in {city} is {random.choice(choices)}."

spanish_agent = Agent(
    name="Spanish",
    handoff_description="A Spanish-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Spanish.",
    ),
    model="gpt-4o-mini",
)

hindi_agent = Agent(
    name="Hindi",
    handoff_description="A Hindi-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Hindi.",
    ),
    model="gpt-4o-mini",
)

agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. If the user speaks in Spanish, handoff to the Spanish agent. If the user speaks in Hindi, handoff to the Hindi agent.",
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent, hindi_agent],
    tools=[get_weather],
)

class WorkflowCallbacks(SingleAgentWorkflowCallbacks):
    def on_run(self, workflow: SingleAgentVoiceWorkflow, transcription: str) -> None:
        print(f"[debug] on_run called with transcription: {transcription}")

async def main():
    pipeline = VoicePipeline(
        workflow=SingleAgentVoiceWorkflow(agent, callbacks=WorkflowCallbacks())
    )

    audio_input = AudioInput(buffer=record_audio())

    result = await pipeline.run(audio_input)

    # Collect all audio chunks so the full response can be saved after playback
    all_audio_chunks = []

    with AudioPlayer() as player:
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                audio_data = event.data
                player.add_audio(audio_data)
                all_audio_chunks.append(audio_data)
                print("Received audio")
            elif event.type == "voice_stream_event_lifecycle":
                print(f"Received lifecycle event: {event.event}")

    # Save the combined audio to a file
    if all_audio_chunks:
        import wave
        import os
        import time

        os.makedirs("output", exist_ok=True)
        filename = f"output/response_{int(time.time())}.wav"

        with wave.open(filename, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            # OpenAI TTS streams 16-bit PCM at 24 kHz, so save at that rate
            wf.setframerate(24000)
            # Chunks arrive as int16 numpy arrays; convert them to raw bytes
            wf.writeframes(
                b"".join(
                    chunk.tobytes() if hasattr(chunk, "tobytes") else bytes(chunk)
                    for chunk in all_audio_chunks
                )
            )

        print(f"Audio saved to {filename}")

if __name__ == "__main__":
    asyncio.run(main())

7. Run the voice agent

Make sure you are in the right directory:

cd openai-agents-python

Then, launch:

python -m examples.voice.static.important

I asked the agent two things, one in English and one in Hindi:

  1. Voice prompt: “Hello, voice agent, what is a large language model?”
  2. Voice prompt: “मुझे दिल्ली के बारे में बताओ” (“Tell me about Delhi”)

Here is the terminal output, the English response, and the Hindi response (shown as screenshots and audio clips in the original post).

More resources

Want to dig deeper? Check these out:

Also read: OpenAI Audio Models: How to Access, Features, Applications, and More

Conclusion

Building a voice agent with the OpenAI Agents SDK is far more accessible now: you no longer need to stitch together a ton of tools. Simply choose the right architecture, configure your VoicePipeline, and let the SDK do the heavy lifting.

If you are after a high-quality conversational flow, go multimodal. If you want structure and control, go chained. Either way, this technology is powerful, and it will only get better. If you build one, let me know in the comments section below.

Hi, I’m Pankaj Singh Negi – Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology that revolutionizes our lifestyle.
