This example demonstrates how to create an intelligent translation agent that goes beyond simple text translation. The agent:
  • Translates text from one language to another
  • Analyzes emotional content in the translated text
  • Selects appropriate voices based on language and emotion
  • Creates localized voices using Cartesia’s voice localization tools
  • Generates audio output with emotion-appropriate voice characteristics
The agent uses a step-by-step approach to ensure high-quality translation and voice generation, making it ideal for creating localized content that maintains the emotional tone of the original text.

Code

cookbook/01_showcase/01_agents/translation_agent/agent.py
import base64
from pathlib import Path
from textwrap import dedent

from agno.agent import Agent
from agno.db.sqlite import SqliteDb
from agno.models.openai import OpenAIResponses
from agno.tools.cartesia import CartesiaTools
from agno.utils.media import save_base64_data

AGENT_INSTRUCTIONS = dedent("""\
    Follow these steps SEQUENTIALLY to translate text and generate a localized voice note:

    1. **Identify Input**
       - Extract the text to translate from the user request
       - Identify the target language

    2. **Translate**
       - Translate the text accurately to the target language
       - Preserve the meaning and tone
       - Keep the translated text for audio generation

    3. **Analyze Emotion**
       - Analyze the emotion conveyed by the translated text
       - Categories: neutral, happy, sad, angry, excited, calm, professional
       - This will guide voice selection

    4. **Get Language Code**
       - Determine the 2-letter language code for the target language
       - Examples: 'fr' (French), 'es' (Spanish), 'de' (German), 'ja' (Japanese)

    5. **List Available Voices**
       - Call the 'list_voices' tool to get available Cartesia voices
       - Wait for the result

    6. **Select Base Voice**
       - From the list, select a voice ID that:
         a) Matches or is close to the target language
         b) Reflects the analyzed emotion
       - Note: If exact language match unavailable, select a suitable base voice

    7. **Localize Voice**
       - Call 'localize_voice' to create a language-specific voice:
         - voice_id: The selected base voice ID
         - name: Descriptive name (e.g., "French Happy Female")
         - description: Language and emotion description
         - language: Target language code from step 4
         - original_speaker_gender: Inferred or user-specified gender
       - Wait for the result and extract the new voice ID

    8. **Generate Audio**
       - Call 'text_to_speech' with:
         - transcript: The translated text from step 2
         - voice_id: The localized voice ID from step 7
       - Wait for audio generation

    9. **Return Results**
       - Provide the user with:
         - Original text
         - Translated text
         - Detected emotion
         - Language code
         - Confirmation that audio was generated

    ## Emotion-Voice Guidelines

    | Emotion | Voice Characteristics |
    |---------|----------------------|
    | Neutral | Clear, professional, moderate pace |
    | Happy | Upbeat, energetic, slightly faster |
    | Sad | Slower, softer, lower energy |
    | Angry | Stronger, more intense |
    | Excited | High energy, dynamic, faster |
    | Calm | Soothing, steady, relaxed |
    | Professional | Formal, clear, authoritative |

    ## Language Codes Reference

    - French: fr
    - Spanish: es
    - German: de
    - Italian: it
    - Portuguese: pt
    - Japanese: ja
    - Chinese: zh
    - Korean: ko
    - Russian: ru
    - Arabic: ar
""")


translation_agent = Agent(
    name="Translation Agent",
    description=(
        "Translates text, analyzes emotion, selects a suitable voice, "
        "creates a localized voice, and generates a voice note using Cartesia TTS."
    ),
    instructions=AGENT_INSTRUCTIONS,
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CartesiaTools()],
    add_datetime_to_context=True,
    # Include messages from up to the last 5 runs so follow-up requests keep context
    add_history_to_context=True,
    num_history_runs=5,
    # Allow the agent to create and update user memories, persisted in the SQLite db below
    enable_agentic_memory=True,
    markdown=True,
    db=SqliteDb(db_file="tmp/data.db"),
)


def translate_and_speak(
    text: str,
    target_language: str,
    output_path: str | None = None,
) -> dict:
    """Translate text and generate audio.

    Args:
        text: Text to translate.
        target_language: Target language name (e.g., "French", "Spanish").
        output_path: Optional path to save the audio file.

    Returns:
        Dictionary with translation results and audio path.
    """
    prompt = f"Translate '{text}' to {target_language} and create a voice note"

    response = translation_agent.run(prompt)

    result = {
        "original_text": text,
        "target_language": target_language,
        "response": str(response.content),
        "audio_path": None,
    }

    if response.audio:
        audio_content = response.audio[0].content
        base64_audio = base64.b64encode(audio_content).decode("utf-8")

        if output_path is None:
            output_dir = Path("tmp/translations")
            output_dir.mkdir(parents=True, exist_ok=True)
            # Derive a short filename suffix from the language name; this is a
            # rough heuristic, not a true ISO 639-1 code
            lang_code = target_language.lower()[:2]
            output_path = str(output_dir / f"translation_{lang_code}.mp3")

        save_base64_data(base64_data=base64_audio, output_path=output_path)
        result["audio_path"] = output_path

    return result
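
The script above defines `translate_and_speak` but never calls it, so running the file directly builds the agent and exits without output. A minimal entry point along the following lines (an illustrative addition, not part of the cookbook file) makes `python agent.py` produce a result; by default the audio is saved under tmp/translations/:

if __name__ == "__main__":
    # Illustrative example: translate a short message into French and
    # save the generated voice note (default location: tmp/translations/)
    result = translate_and_speak(
        text="Thank you so much for your help today!",
        target_language="French",
    )
    print(result["response"])
    if result["audio_path"]:
        print(f"Audio saved to {result['audio_path']}")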

Usage

1. Set up your virtual environment

   uv venv --python 3.12
   source .venv/bin/activate

2. Set your API keys

   export OPENAI_API_KEY=xxx
   export CARTESIA_API_KEY=xxx

3. Install dependencies

   uv pip install -U agno openai cartesia

4. Run the agent

   python cookbook/01_showcase/01_agents/translation_agent/agent.py