Configuration for turn detection. Set to null to disable it. With server VAD (voice activity detection), the model detects the start and end of speech based on audio volume and responds at the end of user speech.

interface TurnDetection {
    create_response?: boolean;
    interrupt_response?: boolean;
    prefix_padding_ms?: number;
    silence_duration_ms?: number;
    threshold?: number;
    type?: string;
}

Properties

create_response?: boolean

Whether to automatically generate a response when a VAD stop event occurs. Defaults to true.

interrupt_response?: boolean

Whether to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation set to auto) when a VAD start event occurs. Defaults to true.

prefix_padding_ms?: number

Amount of audio to include before the VAD-detected speech (in milliseconds). Defaults to 300ms.
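
To make the effect of prefix padding concrete, the sketch below computes where a captured segment would begin once the padding is applied. The helper name paddedStartSample and the 24 kHz sample rate are assumptions made for illustration; neither is part of this interface.

```typescript
// Hypothetical helper, for illustration only (not part of the API).
// Returns the sample index at which the captured segment starts once
// prefix padding is applied, clamped to the start of the buffer.
function paddedStartSample(
  detectedStartSample: number,
  prefixPaddingMs: number = 300, // documented default
  sampleRateHz: number = 24000,  // assumed sample rate
): number {
  const paddingSamples = Math.round((prefixPaddingMs * sampleRateHz) / 1000);
  return Math.max(0, detectedStartSample - paddingSamples);
}

// Speech detected at sample 12000 (0.5 s at 24 kHz): 300 ms of padding
// is 7200 samples, so the segment starts at sample 4800.
console.log(paddedStartSample(12000)); // 4800

// Speech detected very early: the start is clamped to 0.
console.log(paddedStartSample(2400)); // 0
```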

silence_duration_ms?: number

Duration of silence required to detect the end of speech (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
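
The trade-off can be illustrated with a small sketch that scans fixed-length frames, each already labelled as speech or silence, and reports where a stop would be declared. This is not the server's actual algorithm: the function findSpeechStop, the frame labelling, and the 100 ms frame length are all assumptions.

```typescript
// Hypothetical sketch, not the server's actual algorithm.
// `frames` marks each fixed-length audio frame as speech (true) or silence (false).
// Returns the index of the frame at which speech stop would be declared,
// or null if the required run of silence never occurs after speech.
function findSpeechStop(
  frames: boolean[],
  frameMs: number,
  silenceDurationMs: number = 500, // documented default
): number | null {
  const framesNeeded = Math.ceil(silenceDurationMs / frameMs);
  let heardSpeech = false;
  let silentRun = 0;
  for (let i = 0; i < frames.length; i++) {
    if (frames[i]) {
      heardSpeech = true;
      silentRun = 0; // any speech resets the silence counter
    } else if (heardSpeech) {
      silentRun += 1;
      if (silentRun >= framesNeeded) return i;
    }
  }
  return null;
}

// 100 ms frames: speech, a 300 ms pause, more speech, then 600 ms of silence.
const sampleFrames = [
  true, true, false, false, false, true, true,
  false, false, false, false, false, false,
];

// With the 500 ms default, the short pause is ignored.
console.log(findSpeechStop(sampleFrames, 100, 500)); // 11

// With 200 ms, the stop is declared inside the short pause.
console.log(findSpeechStop(sampleFrames, 100, 200)); // 3
```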

threshold?: number

Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.

type?: string

Type of turn detection. Only server_vad is currently supported.
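
As a minimal sketch of how the properties above fit together, the helper below fills in the documented defaults and rejects values outside the documented ranges. The name resolveTurnDetection is hypothetical, and treating server_vad as the default for type is an assumption, since no default is stated for that property.

```typescript
// Local copy of the interface documented above, so the sketch is self-contained.
interface TurnDetection {
  create_response?: boolean;
  interrupt_response?: boolean;
  prefix_padding_ms?: number;
  silence_duration_ms?: number;
  threshold?: number;
  type?: string;
}

// Hypothetical helper (not part of the API): applies the documented defaults
// and validates the result. A null config means turn detection is off.
function resolveTurnDetection(
  config: TurnDetection | null,
): Required<TurnDetection> | null {
  if (config === null) return null;
  const resolved: Required<TurnDetection> = {
    create_response: config.create_response ?? true,
    interrupt_response: config.interrupt_response ?? true,
    prefix_padding_ms: config.prefix_padding_ms ?? 300,
    silence_duration_ms: config.silence_duration_ms ?? 500,
    threshold: config.threshold ?? 0.5,
    type: config.type ?? "server_vad", // assumed default: the only supported value
  };
  if (resolved.threshold < 0 || resolved.threshold > 1) {
    throw new RangeError("threshold must be between 0.0 and 1.0");
  }
  if (resolved.type !== "server_vad") {
    throw new Error(`unsupported turn detection type: ${resolved.type}`);
  }
  return resolved;
}

// A noisier environment: raise the threshold and wait longer before responding.
const noisyRoom = resolveTurnDetection({ threshold: 0.7, silence_duration_ms: 800 });
console.log(noisyRoom);
```

Returning a fully populated object keeps downstream code free of undefined checks, while null stays distinct as the documented way to turn detection off.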