Audio Streaming
Push audio streaming (WebSocket)
The /conversation/{conversation_id}/stream endpoint accepts pushed audio streams over a WebSocket connection. Send audio as a JSON object with "type" set to "audio" and "data" containing the encoded audio content.
/conversation/{conversation_id}/stream
Authentication:
You must send your credentials before sending any audio. The server responds with Authentication failed to any message sent before authentication.
{
  "type": "auth",
  "data": {
    "userId": "...",
    "apiKey": "..."
  }
}
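For illustration, a minimal TypeScript sketch of the handshake, assuming a browser WebSocket client; the host name and the CONVERSATION_ID, USER_ID, and API_KEY constants are placeholders, not documented values:

// Connect, then authenticate before sending anything else.
// Placeholder values; substitute your own host and credentials.
const CONVERSATION_ID = "...";
const USER_ID = "...";
const API_KEY = "...";

const ws = new WebSocket(
  `wss://api.example.com/conversation/${CONVERSATION_ID}/stream`
);

ws.addEventListener("open", () => {
  // The auth message must be the first message on the socket; any other
  // message sent first is rejected with "Authentication failed".
  ws.send(
    JSON.stringify({
      type: "auth",
      data: { userId: USER_ID, apiKey: API_KEY },
    })
  );
});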
You send:
Type: audio
The /conversation/{conversation_id}/stream endpoint processes real-time audio data over the WebSocket connection. Send each audio chunk as JSON with "type" set to "audio" and "data" holding the encoded audio. In response, you may receive transcriptions or detected actions, such as creating a to-do item or scheduling a calendar event, each associated with the relevant transcript data.
{
  "type": "audio",
  "data": "{encoded audio}"
}
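As a sketch, sending one chunk might look like this in TypeScript; carrying the audio as a base64 string is an assumption here, since the exact encoding is not specified on this page:

// Send one encoded audio chunk. This sketch assumes the capture pipeline
// already produced a base64 string; adapt to whatever encoding you use.
function sendAudioChunk(ws: WebSocket, encodedChunk: string): void {
  ws.send(JSON.stringify({ type: "audio", data: encodedChunk }));
}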
Type: complete
Send this message to mark the conversation as ended when the user requests it. A conversation must be marked complete before its raw audio can be retrieved from the conversation endpoint.
{
  "type": "complete"
}
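A small TypeScript sketch of this step, wrapping the close wait in a promise; the helper name is hypothetical:

// Mark the conversation as ended, then wait for the server to finish
// post-processing and close the socket (see "You receive:" below).
function completeConversation(ws: WebSocket): Promise<void> {
  return new Promise((resolve) => {
    ws.addEventListener("close", () => resolve());
    ws.send(JSON.stringify({ type: "complete" }));
  });
}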
You receive:
After you send the complete message, the server starts post-processing and closes the connection when it finishes. Show a loading indicator and wait for the WebSocket connection to close.
Type: transcribe
The server returns transcripts with "type": "transcribe", reporting real-time transcription progress or completion.
{
  "type": "transcribe",
  "data": {
    "finalized": false,
    "transcript": "Hi, ...",
    "audioStart": 10000, // milliseconds
    "audioEnd": 20000
  }
}
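A TypeScript sketch of handling these messages; renderTranscript is a hypothetical UI helper, not part of this API:

interface TranscribeData {
  finalized: boolean;
  transcript: string;
  audioStart: number; // milliseconds
  audioEnd: number;   // milliseconds
}

declare function renderTranscript(data: TranscribeData): void;

// Register with: ws.addEventListener("message", handleTranscribe);
function handleTranscribe(event: MessageEvent): void {
  const message = JSON.parse(event.data as string);
  if (message.type === "transcribe") {
    // A non-finalized transcript for this range may still be revised by
    // later "transcribe" messages; a finalized one will not.
    renderTranscript(message.data as TranscribeData);
  }
}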
Type: transcript-beautify
The transcript-beautify message delivers a refined version of the original transcript. When you receive it, replace the transcript between audioStart and audioEnd with the beautified text, as sketched after the example below. The payload may include multiple segments to provide more contextual transcription.
{
  "type": "transcript-beautify",
  "data": {
    "transcript": "Hi, ...",
    "audioStart": 10000,
    "audioEnd": 20000,
    "segments": [
      {
        "transcript": "Hi",
        "audioStart": 10000,
        "audioEnd": 12000
      },
      {
        "transcript": ", ...",
        "audioStart": 12000,
        "audioEnd": 20000
      }
    ]
  }
}
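One way to apply the update, sketched in TypeScript; keeping the transcript as a time-ordered array of segments is an assumption of this sketch, not something the API mandates:

interface Segment {
  transcript: string;
  audioStart: number;
  audioEnd: number;
}

// Replace everything between data.audioStart and data.audioEnd with the
// beautified segments (or the whole beautified transcript if no
// per-segment breakdown is present).
function applyBeautify(
  entries: Segment[],
  data: Segment & { segments?: Segment[] }
): Segment[] {
  const kept = entries.filter(
    (e) => e.audioEnd <= data.audioStart || e.audioStart >= data.audioEnd
  );
  const replacement = data.segments ?? [
    {
      transcript: data.transcript,
      audioStart: data.audioStart,
      audioEnd: data.audioEnd,
    },
  ];
  return [...kept, ...replacement].sort((a, b) => a.audioStart - b.audioStart);
}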
Type: detect-action
The detect-action message identifies specific actions mentioned in the transcript and categorizes them as todo, calendar, or research. Each detected action includes a unique ID, a title, and metadata relevant to its type: a todo generates a task to complete, a calendar event schedules a meeting, and research triggers a query. This structured data enables automated task management based on spoken input, as sketched after the examples below.
{
  "type": "detect-action",
  "data": {
    "type": "todo",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "inner": {
      "title": "Buy a lunch",
      "body": "Go to ..."
    },
    "relate": {
      "start": 3600,
      "end": 3700,
      "transcript": "..."
    }
  }
}
{
  "type": "detect-action",
  "data": {
    "type": "calendar",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "inner": {
      "title": "Meeting with Bob at 8pm",
      "datetime": "0000-00-00T00:00:00Z"
    },
    "relate": {
      "start": 3600,
      "end": 3700,
      "transcript": "..."
    }
  }
}
{
  "type": "detect-action",
  "data": {
    "type": "research",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "inner": {
      "title": "...",
      "query": "..."
    },
    "relate": {
      "start": 3600,
      "end": 3700,
      "transcript": "..."
    }
  }
}
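A TypeScript sketch of dispatching on the action type; the three handler functions are hypothetical placeholders for your own task, calendar, and research integrations:

interface DetectedAction {
  type: "todo" | "calendar" | "research";
  id: string;
  inner: Record<string, string>;
  relate: { start: number; end: number; transcript: string };
}

declare function createTodo(title: string, body: string): void;
declare function scheduleEvent(title: string, datetime: string): void;
declare function runResearch(query: string): void;

// Route each detected action to the matching app-specific handler.
function handleDetectAction(action: DetectedAction): void {
  switch (action.type) {
    case "todo":
      createTodo(action.inner.title, action.inner.body);
      break;
    case "calendar":
      scheduleEvent(action.inner.title, action.inner.datetime);
      break;
    case "research":
      runResearch(action.inner.query);
      break;
  }
}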