Audio Streaming
Push audio streaming (WebSocket)
The /conversation/{conversation_id}/stream endpoint accepts pushed audio streams over a WebSocket connection. Send audio as a JSON object with "type" set to "audio" and "data" containing the encoded audio content.
/conversation/{conversation_id}/stream
Authentication:
You must send your credentials before sending any audio. The server responds with Authentication failed to any message sent before authentication.
{
  "type": "auth",
  "data": {
    "userId": "...",
    "apiKey": "..."
  }
}
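For illustration, a minimal TypeScript sketch of the handshake, assuming a browser WebSocket client; the host name and the CONVERSATION_ID, USER_ID, and API_KEY constants are placeholders, not documented values:

// Connect, then authenticate before sending anything else.
// Placeholder values; substitute your own host and credentials.
const CONVERSATION_ID = "...";
const USER_ID = "...";
const API_KEY = "...";

const ws = new WebSocket(
  `wss://api.example.com/conversation/${CONVERSATION_ID}/stream`
);

ws.addEventListener("open", () => {
  // The auth message must be the first message on the socket; any other
  // message sent first is rejected with "Authentication failed".
  ws.send(
    JSON.stringify({
      type: "auth",
      data: { userId: USER_ID, apiKey: API_KEY },
    })
  );
});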
You send:
Type: audio
The /conversation/{conversation_id}/stream endpoint processes real-time audio data over the WebSocket connection. Send each audio chunk as JSON with "type" set to "audio" and "data" holding the encoded audio. In response, you may receive transcriptions or detected actions, such as creating a to-do item or scheduling a calendar event, each associated with the relevant transcript data.
{
  "type": "audio",
  "data": "{encoded audio}"
}
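As a sketch, sending one chunk might look like this in TypeScript; carrying the audio as a base64 string is an assumption here, since the exact encoding is not specified on this page:

// Send one encoded audio chunk. This sketch assumes the capture pipeline
// already produced a base64 string; adapt to whatever encoding you use.
function sendAudioChunk(ws: WebSocket, encodedChunk: string): void {
  ws.send(JSON.stringify({ type: "audio", data: encodedChunk }));
}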
Type: complete
Send this message to mark the conversation as ended when the user requests it. A conversation must be marked complete before its raw audio can be retrieved from the conversation endpoint.
{
  "type": "complete"
}
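A small TypeScript sketch of this step, wrapping the close wait in a promise; the helper name is hypothetical:

// Mark the conversation as ended, then wait for the server to finish
// post-processing and close the socket (see "You receive:" below).
function completeConversation(ws: WebSocket): Promise<void> {
  return new Promise((resolve) => {
    ws.addEventListener("close", () => resolve());
    ws.send(JSON.stringify({ type: "complete" }));
  });
}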
You receive:
After you send the complete message, the server starts post-processing and closes the connection when it finishes. Show a loading indicator and wait for the WebSocket connection to close.
Type: transcribe
The server returns transcripts with "type": "transcribe", reporting real-time transcription progress or completion.
{
  "type": "transcribe",
  "data": {
    "finalized": false,
    "transcript": "Hi, ...",
    "audioStart": 10000, // milliseconds
    "audioEnd": 20000
  }
}
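A TypeScript sketch of handling these messages; renderTranscript is a hypothetical UI helper, not part of this API:

interface TranscribeData {
  finalized: boolean;
  transcript: string;
  audioStart: number; // milliseconds
  audioEnd: number;   // milliseconds
}

declare function renderTranscript(data: TranscribeData): void;

// Register with: ws.addEventListener("message", handleTranscribe);
function handleTranscribe(event: MessageEvent): void {
  const message = JSON.parse(event.data as string);
  if (message.type === "transcribe") {
    // A non-finalized transcript for this range may still be revised by
    // later "transcribe" messages; a finalized one will not.
    renderTranscript(message.data as TranscribeData);
  }
}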
Type: transcript-beautify
The transcript-beautify message delivers a refined version of the original transcript. When you receive it, replace the transcript between audioStart and audioEnd with the beautified text, as sketched after the example below. The payload may include multiple segments to provide more contextual transcription.
{
  "type": "transcript-beautify",
  "data": {
    "transcript": "Hi, ...",
    "audioStart": 10000,
    "audioEnd": 20000,
    "segments": [
      {
        "transcript": "Hi",
        "audioStart": 10000,
        "audioEnd": 12000
      },
      {
        "transcript": ", ...",
        "audioStart": 12000,
        "audioEnd": 20000
      }
    ]
  }
}
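One way to apply the update, sketched in TypeScript; keeping the transcript as a time-ordered array of segments is an assumption of this sketch, not something the API mandates:

interface Segment {
  transcript: string;
  audioStart: number;
  audioEnd: number;
}

// Replace everything between data.audioStart and data.audioEnd with the
// beautified segments (or the whole beautified transcript if no
// per-segment breakdown is present).
function applyBeautify(
  entries: Segment[],
  data: Segment & { segments?: Segment[] }
): Segment[] {
  const kept = entries.filter(
    (e) => e.audioEnd <= data.audioStart || e.audioStart >= data.audioEnd
  );
  const replacement = data.segments ?? [
    {
      transcript: data.transcript,
      audioStart: data.audioStart,
      audioEnd: data.audioEnd,
    },
  ];
  return [...kept, ...replacement].sort((a, b) => a.audioStart - b.audioStart);
}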
Type: detect-action
The detect-action message identifies specific actions mentioned in the transcript and categorizes them as todo, calendar, or research. Each detected action includes a unique ID, a title, and metadata relevant to its type: a todo generates a task to complete, a calendar event schedules a meeting, and research triggers a query. This structured data enables automated task management based on spoken input, as sketched after the examples below.
{
  "type": "detect-action",
  "data": {
    "type": "todo",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "inner": {
      "title": "Buy a lunch",
      "body": "Go to ..."
    },
    "relate": {
      "start": 3600,
      "end": 3700,
      "transcript": "..."
    }
  }
}
{
  "type": "detect-action",
  "data": {
    "type": "calendar",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "inner": {
      "title": "Meeting with Bob at 8pm",
      "datetime": "0000-00-00T00:00:00Z"
    },
    "relate": {
      "start": 3600,
      "end": 3700,
      "transcript": "..."
    }
  }
}
{
  "type": "detect-action",
  "data": {
    "type": "research",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "inner": {
      "title": "...",
      "query": "..."
    },
    "relate": {
      "start": 3600,
      "end": 3700,
      "transcript": "..."
    }
  }
}
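A TypeScript sketch of dispatching on the action type; the three handler functions are hypothetical placeholders for your own task, calendar, and research integrations:

interface DetectedAction {
  type: "todo" | "calendar" | "research";
  id: string;
  inner: Record<string, string>;
  relate: { start: number; end: number; transcript: string };
}

declare function createTodo(title: string, body: string): void;
declare function scheduleEvent(title: string, datetime: string): void;
declare function runResearch(query: string): void;

// Route each detected action to the matching app-specific handler.
function handleDetectAction(action: DetectedAction): void {
  switch (action.type) {
    case "todo":
      createTodo(action.inner.title, action.inner.body);
      break;
    case "calendar":
      scheduleEvent(action.inner.title, action.inner.datetime);
      break;
    case "research":
      runResearch(action.inner.query);
      break;
  }
}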