Generating Automatic Subtitles and Translations using Speech-to-Text
Introduction
Speech-to-Text uses automatic speech recognition (ASR) technology to convert spoken audio into text. Qencode offers a Speech-to-Text output for transcoding jobs, enabling you not only to generate subtitles and transcripts from any audio or video file, but also to automatically translate them into multiple target languages within the same workflow. This feature utilizes advanced AI models to identify human speech across a wide range of languages and produce readable text outputs, enriching the content in your library. The resulting transcripts and subtitles are packaged into JSON, SRT, VTT, or TXT files, ensuring compatibility with most common video platforms.
Generating Subtitles with STT Output
Defining the Output Format
Speech-to-Text outputs can be created by adding a speech_to_text output format to your transcoding job. To do this, use the /v1/start_encode2 method to launch a transcoding job with the output parameter set to speech_to_text. Make sure to also specify the destination where the generated Speech-to-Text output should be stored.
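For illustration, here is a minimal Python sketch that builds the speech_to_text format object for a job. The stt_format helper is hypothetical (not part of any Qencode SDK), and exactly how the format object is wrapped into the /v1/start_encode2 payload is an assumption; consult the API reference for the authoritative request shape.

```python
import json

def stt_format(destination_url, **options):
    """Build a 'speech_to_text' output format object (hypothetical helper).

    Extra keyword options (mode, language, translate_languages, ...)
    are merged into the format object unchanged.
    """
    fmt = {
        "output": "speech_to_text",
        "destination": {"url": destination_url},
    }
    fmt.update(options)
    return fmt

# Format object matching the example request below; this dict would be
# included among the job's output formats sent to /v1/start_encode2.
fmt = stt_format("s3://us-west.s3.qencode.com/yourbucket/output_folder")
print(json.dumps(fmt, indent=2))
```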
Example Request
{
"output": "speech_to_text",
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
}
}
Customizing STT Outputs
The following parameters are available for Speech-to-Text and Translation outputs, letting you control accuracy, language, which file types are generated, and file naming conventions.
| Parameter | Type | Requirement | Description |
|---|---|---|---|
| output | string | Required | Must be "speech_to_text" to enable STT + translation. |
| mode | string | Optional | One of "accuracy", "balanced", or "speed". |
| language | string | Optional | Force source language. Defaults to auto-detect language if blank. |
| destination | array[obj] | Optional | One or more storage targets. |
| start_time | float | Optional | Start of the segment to process, in seconds. |
| duration | float | Optional | Length of the segment to process, in seconds. |
| translate_languages | array[string] | Optional | Target languages to generate; max 15 entries. |
| transcript | 0/1 | Optional | Generate TXT transcript for each target language. |
| transcript_name | string | Optional | Customize the transcript file name, with the default being ‘transcript.txt’. |
| json | 0/1 | Optional | Generate JSON with timestamps for each target language. |
| json_name | string | Optional | Customize the JSON file name, with the default being ‘timestamps.json’. |
| srt | 0/1 | Optional | Generate SRT subtitles for each target language. |
| srt_name | string | Optional | Customize the SRT subtitles file name, with the default being ‘subtitles.srt’. |
| vtt | 0/1 | Optional | Generate VTT subtitles for each target language. |
| vtt_name | string | Optional | Customize the VTT subtitles file name, with the default being ‘subtitles.vtt’. |
Example Request
{
"output": "speech_to_text",
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
},
"mode": "accuracy",
"translate_languages": ["uk","de"],
"transcript": 1,
"transcript_name": "your_custom_naming.txt",
"json": 0,
"vtt": 1,
"srt": 1,
"srt_name": "your_custom_naming.srt",
"start_time": 60.0,
"duration": 30.0,
"language": "es"
}
Choosing the Mode
There are three modes available for Speech-to-Text output.
| Mode | Description |
|---|---|
| Accuracy | More accurate transcription, but takes more time to process. |
| Balanced | Balance between speed and accuracy. |
| Speed | Faster transcriptions, but less accurate in some cases. |
To select a mode, include the mode parameter in your request.
Example Request
{
"output": "speech_to_text",
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
},
"mode": "speed"
}
Controlling STT Output Types and File Names
You can choose to include the following file types in your Speech-to-Text outputs:
| File Type | Description | API params |
|---|---|---|
| Transcript | A text file containing all the speech from the input. | transcript, transcript_name |
| JSON | A JSON file containing the text with associated timestamps. | json, json_name |
| SRT | Subtitles in the SRT format. | srt, srt_name |
| VTT | Subtitles in the VTT format. | vtt, vtt_name |
You can enable or disable specific output types by setting their corresponding parameters to 1 (enabled) or 0 (disabled). By default, all STT output types are enabled: the transcript, json, srt, and vtt parameters are set to 1.
You can also control the names of the output files using the following parameters:
| Parameter | Default Name |
|---|---|
| transcript_name | transcript.txt |
| json_name | timestamps.json |
| srt_name | subtitles.srt |
| vtt_name | subtitles.vtt |
Example Request
{
"output": "speech_to_text",
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
},
"mode": "accuracy",
"transcript": 1,
"transcript_name": "your_custom_naming.txt",
"json": 0,
"vtt": 0,
"srt": 1,
"srt_name": "your_custom_naming.srt"
}
Speech-to-Text (STT) Outputs
When a Speech-to-Text (STT) job completes, the /v1/status response includes metadata about each generated output file. These outputs typically include transcripts, JSON, SRT, and VTT files.
You can control which outputs are generated and customize their file names using parameters in the STT request, as described in the Controlling STT Output Types and File Names section above.
Speech-to-Text output metadata can be found in the texts array of the /v1/status response.
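As an illustrative sketch (assuming you have already fetched and JSON-decoded the /v1/status response into a dict), the generated file names and per-output download URLs can be collected like this; the helper name is hypothetical:

```python
def collect_stt_outputs(status_response):
    """Gather (file_name, download_url) pairs from the 'texts' array
    of a decoded /v1/status response (hypothetical helper)."""
    results = []
    for text in status_response.get("texts", []):
        if text.get("error"):
            continue  # skip outputs that failed
        url = text.get("download_url")
        names = text.get("storage", {}).get("names", {})
        for file_name in names.values():
            results.append((file_name, url))
    return results
```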
Response Example
"texts": [
{
"tag": "speech_to_text-0-0-0",
"profile": null,
"user_tag": null,
"storage": {
"zip": {
"region": "us-east-1",
"bucket": "qencode-temp-us-east-1",
"host": "prod-us-east-1-7-storage-aws.qencode.com"
},
"format": "speech_to_text",
"host": "storage-aws-us-east-1.qencode.com",
"names": {
"transcript": "transcript.txt"
},
"path": "2b9e8b5fb88b2b8005d3c5d6d36f8d95/speech_to_text/1-0",
"type": "local",
"expire": "2025-10-16 14:09:39",
"timestamp": "2025-11-07 17:49:45"
},
"url": "https://storage-aws-us-east-1.qencode.com/2b9e8b5fb88b2b8005d3c5d6d36f8d95/speech_to_text/1-0",
"download_url": "https://storage-aws-us-east-1.qencode.com/2b9e8b5fb88b2b8005d3c5d6d36f8d95/speech_to_text/1-0",
"meta": {
"language": "auto",
"translate_languages": null,
"entropy_threshold": 2.7999999999999998,
"beam_size": 9,
"mode": "accuracy",
"model": "large",
"max_context": 128
},
"duration": "30.0002",
"detected_language": "en",
"size": "0.00126076",
"output_format": "speech_to_text",
"error": false,
"error_description": null,
"warnings": [],
"cost": {
"currency": "USD",
"code": 840,
"amount": "0.0025"
}
}
]
Setting a Language
Source Language
Qencode supports over 70 languages for Speech-to-Text processing, enabling you to accurately transcribe and translate spoken content from a wide range of global sources. By default, the source language is auto-detected; to override detection, set the optional language parameter (e.g., "language": "en").
Stable
The Stable languages are those that have been thoroughly tested and show consistent, high-quality recognition results across diverse audio conditions. These languages typically achieve Word Error Rate (WER) ≤ 20% and Character Error Rate (CER) ≤ 30%, indicating that the AI model performs reliably in real-world use cases.
About WER and CER
- WER (Word Error Rate) shows how accurately individual words are recognized. Lower WER means higher accuracy.
- CER (Character Error Rate) measures the proportion of characters that are inserted, deleted, or substituted in the transcription relative to the total number of characters in the reference. Lower CER means higher transcription accuracy.
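For intuition, both metrics can be computed from the Levenshtein edit distance between a reference text and a transcription hypothesis: WER operates on word tokens, CER on characters. A minimal sketch (not part of the API; shown only to make the metrics concrete):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution (0 if equal)
    return d[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```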
| Language | Code | WER | CER |
|---|---|---|---|
| English | en | 12.33% | 8.13% |
| Catalan | ca | 16.81% | 9.23% |
| Dutch | nl | 8.07% | 6.1% |
| Finnish | fi | 11.51% | 2.52% |
| French | fr | 16.68% | 9.13% |
| German | de | 10.8% | 6.27% |
| Greek | el | 15.9% | 6.13% |
| Hebrew | he | 15.97% | 5.46% |
| Indonesian | id | 12.36% | 5.57% |
| Italian | it | 10.51% | 4.86% |
| Polish | pl | 12.12% | 7.68% |
| Romanian | ro | 19.96% | 9.1% |
| Russian | ru | 5.71% | 1.88% |
| Slovenian | sl | 14.81% | 3.91% |
| Spanish | es | 9.67% | 5.49% |
| Swedish | sv-SE | 13.03% | 6.16% |
| Turkish | tr | 15.28% | 7.0% |
| Ukrainian | uk | 12.01% | 3.67% |
| Vietnamese | vi | 17.24% | 9.92% |
Beta
Beta languages are still being improved and may show less accurate results. They are suitable for testing or experimentation but may require manual review or corrections, especially for complex audio or strong accents.
| Language | Code |
|---|---|
| Afrikaans | af |
| Albanian | sq |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Bashkir | ba |
| Basque | eu |
| Belarusian | be |
| Bengali | bn |
| Breton | br |
| Bulgarian | bg |
| Chinese | zh |
| Czech | cs |
| Danish | da |
| Estonian | et |
| Galician | gl |
| Georgian | ka |
| Hindi | hi |
| Hungarian | hu |
| Icelandic | is |
| Japanese | ja |
| Kazakh | kk |
| Korean | ko |
| Lao | lo |
| Latvian | lv |
| Lithuanian | lt |
| Macedonian | mk |
| Malayalam | ml |
| Maltese | mt |
| Marathi | mr |
| Mongolian | mn |
| Nepali | ne-NP |
| Norwegian Nynorsk | nn-NO |
| Occitan | oc |
| Persian | fa |
| Portuguese | pt |
| Pushto | ps |
| Serbian | sr |
| Slovak | sk |
| Swahili | sw |
| Tamil | ta |
| Thai | th |
| Tatar | tt |
| Urdu | ur |
| Welsh | cy |
| Yiddish | yi |
Example Request
{
"output": "speech_to_text",
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder",
},
"language": "uk"
}
Targeting Specific Segments
If your video or audio file is lengthy and you only need to process specific segments, Qencode lets you precisely target those segments for speech-to-text conversion. To specify the segment, include the start_time and duration attributes in your transcoding job request. The start_time attribute determines the starting point of the segment (in seconds) from the beginning of the file, and the duration attribute specifies the length of the segment to process (also in seconds). This is especially useful when you need to focus on key events or to reduce processing time and costs.
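If you track segments as timecodes rather than raw seconds, a small conversion helper (illustrative only, not part of the API) produces the float seconds that start_time and duration expect:

```python
def to_seconds(timecode):
    """Convert 'HH:MM:SS(.fff)', 'MM:SS', or 'SS' to float seconds."""
    seconds = 0.0
    for part in timecode.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# Example: process 30 seconds starting one minute into the file.
segment = {"start_time": to_seconds("1:00"), "duration": to_seconds("30")}
```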
Example Request
{
"output": "speech_to_text",
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder",
},
"start_time":"60.0",
"duration":"30.0"
}
Adding Translations to your STT Output
To run a Speech-to-Text job with Translations, set the translate_languages parameter with the target language codes.
Example Request
{
"output": "speech_to_text",
"translate_languages": ["uk", "cs", "de"],
"destination": {
"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
}
}
In Translation jobs, output file names contain a language code suffix (e.g., SRT-UK, JSON-ES, TEXT-DE).
Output file names are stored in status.texts[INDEX].storage.names:
"names": {
"srt-cs": "subtitles-cs.srt",
"vtt-cs": "subtitles-cs.vtt",
"text-cs": "transcript-cs.txt",
"json-cs": "timestamps-cs.json",
"srt-es": "subtitles-es.srt",
"vtt-es": "subtitles-es.vtt",
"text-es": "transcript-es.txt",
"json-es": "timestamps-es.json",
"srt-nl": "subtitles-nl.srt",
"vtt-nl": "subtitles-nl.vtt",
"text-nl": "transcript-nl.txt",
"json-nl": "timestamps-nl.json"
},
Target Languages (Translation)
Provide up to 15 language codes in translate_languages, e.g.:
"translate_languages": ["uk","cs","de","es","it","fr", ...]
If you need more than 15 languages, split them across multiple jobs.
Stable
Stable translation languages provide the most reliable and natural results. These languages have been thoroughly tested and are recommended for production use when translation quality and fluency are important.
| Language | Code |
|---|---|
| Belarusian | be |
| Bulgarian | bg |
| Burmese (Myanmar) | my |
| Czech | cs |
| Danish | da |
| English | en |
| French | fr |
| German | de |
| Greek | el |
| Indonesian | id |
| Italian | it |
| Lithuanian | lt |
| Norwegian Bokmål | nb |
| Polish | pl |
| Portuguese | pt |
| Romanian | ro |
| Russian | ru |
| Serbian | sr |
| Slovak | sk |
| Slovenian | sl |
| Spanish | es |
| Swedish | sv |
| Turkish | tr |
| Ukrainian | uk |
| Vietnamese | vi |
Beta
Beta translation languages are still being optimized and may occasionally produce less accurate or less fluent results. They are best used for testing, internal reviews, or exploratory use cases where small translation inaccuracies are acceptable.
| Language | Code |
|---|---|
| Afrikaans | af |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Assamese | as |
| Azerbaijani | az |
| Basque | eu |
| Bengali | bn |
| Bosnian | bs |
| Cantonese (Yue Chinese) | yue |
| Catalan | ca |
| Chichewa (Nyanja) | ny |
| Chinese | zh |
| Croatian | hr |
| Dutch | nl |
| Estonian | et |
| Finnish | fi |
| Fulah (Fula) | ff |
| Galician | gl |
| Ganda (Luganda) | lg |
| Georgian | ka |
| Gujarati | gu |
| Hebrew | he |
| Hindi | hi |
| Hungarian | hu |
| Icelandic | is |
| Irish | ga |
| Japanese | ja |
| Javanese | jv |
| Kannada | kn |
| Kazakh | kk |
| Khmer | km |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Lao | lo |
| Latvian | lv |
| Luxembourgish | lb |
| Macedonian | mk |
| Malay | ms |
| Malayalam | ml |
| Maltese | mt |
| Marathi | mr |
| Nepali | ne |
| Norwegian Nynorsk | nn |
| Odia (Oriya) | or |
| Occitan | oc |
| Oromo | om |
| Pashto | ps |
| Persian (Farsi) | fa |
| Punjabi | pa |
| Shona | sn |
| Sindhi | sd |
| Somali | so |
| Swahili | sw |
| Tagalog (Filipino) | tl |
| Tajik | tg |
| Tamil | ta |
| Telugu | te |
| Thai | th |
| Urdu | ur |
| Uzbek | uz |
| Welsh | cy |
| Xhosa | xh |
| Yoruba | yo |
| Zulu | zu |
Notes
- translate_languages accepts an array of Language Codes.
- If you omit language, Qencode auto-detects the source language. You only need to set language if you want to force the source language.
Improving Accuracy
Aside from changing the mode, there are a few other things to take into consideration when trying to improve the accuracy of Speech-to-Text outputs. Below you can find a list of suggestions for the media you use for Speech-to-Text processing.
- Ensure the audio within your video is clear and free from background noise, such as wind, traffic, hum, music, or any other non-speech sound.
- Compressed audio formats can lose quality. Ensure your source files are exported using high-quality settings.
- Maintain consistent volume levels throughout the audio. Significant fluctuations can affect the transcription's accuracy.
- Avoid conversations with overlapping voices and other forms of overtalk.
- Whenever possible, use media with a single language throughout the recording.
Troubleshooting and Tips
- Wrong source language detected
- Override auto-detect by setting the language (source) parameter explicitly.
- 15 language limit per job
- Keep translate_languages to 15 or fewer. If you need more, split into multiple jobs.
- Expected files not generated
- Make sure the output flags you need (srt, vtt, json, and/or transcript) are set to 1.
- No files appearing at your destination
- Verify all of the following:
- Destination URL / bucket path is correct.
- Credentials (key, secret) are valid.
- You have write permissions on the destination.
- ISO language codes
- Use standard two-letter codes where available (e.g., en, de, es, it, tr, uk, cs, hy, bg, pl, ru, ar, af, az, sr).
- Per-language outputs
- Outputs are generated per target language. UI labels and filenames typically include a language suffix (e.g., myfile-en.srt, myfile-es.vtt).
- Storage Access Permissions
- If files must be publicly accessible, set the destination’s permissions accordingly.
Need More Help?
For additional information, tutorials, or support, visit the Qencode Documentation page or contact Qencode Support at support@qencode.com.