Generating Automatic Subtitles and Translations using Speech-to-Text

Introduction

Speech-to-Text uses automatic speech recognition (ASR) technology to convert spoken audio into text. Qencode offers a Speech-to-Text output for transcoding jobs, enabling you not only to generate subtitles and transcripts from any audio or video file, but also to automatically translate them into multiple target languages within the same workflow. This feature utilizes advanced AI models to identify human speech across a wide range of languages and produce readable text outputs, enriching the content in your library. The resulting transcripts and subtitles are packaged into JSON, SRT, VTT, or TXT files, ensuring compatibility with most common video platforms.

Generating Subtitles with STT Output

Defining the Output Format

Speech-to-Text outputs are created by adding a speech_to_text output format to your transcoding job. To do this, use the /v1/start_encode2 method to launch a transcoding job with the output param set to speech_to_text, and specify the destination where the generated Speech-to-Text output should be stored.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  }
}
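As a sketch of how this request might be assembled programmatically, the helper below builds the /v1/start_encode2 request body, assuming the standard Qencode query wrapper ({"query": {"source": ..., "format": [...]}}); the token exchange and HTTP call are omitted, and the source URL is hypothetical:

```python
import json

def build_stt_query(source_url, destination_url, **stt_params):
    """Build the request body for a /v1/start_encode2 Speech-to-Text job.

    Assumes the standard Qencode query structure
    ({"query": {"source": ..., "format": [...]}}); obtaining a session
    token and POSTing the body are left out of this sketch.
    """
    stt_output = {
        "output": "speech_to_text",
        "destination": {"url": destination_url},
    }
    stt_output.update(stt_params)  # e.g. mode, language, translate_languages
    return json.dumps({"query": {"source": source_url, "format": [stt_output]}})

# Hypothetical source; destination matches the example above.
body = build_stt_query(
    "https://example.com/video.mp4",
    "s3://us-west.s3.qencode.com/yourbucket/output_folder",
    mode="accuracy",
)
```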

Customizing STT Outputs

The following parameters are available for Speech-to-Text and Translations, letting you control accuracy, language, which file types are generated, and naming conventions.

Parameter | Type | Requirement | Description
output | string | Required | Must be "speech_to_text" to enable STT + translation.
mode | string | Optional | One of "accuracy", "balanced", or "speed".
language | string | Optional | Forces the source language. Defaults to auto-detection if blank.
destination | array[obj] | Optional | One or more storage targets.
start_time | float | Optional | Start of the segment to process (in seconds), to process a clip only.
duration | float | Optional | Length of the segment to process (in seconds), to process a clip only.
translate_languages | array[string] | Required for translation | Target languages to generate; max 15 entries.
transcript | 0/1 | Optional | Generate a TXT transcript for each target language.
transcript_name | string | Optional | Customize the transcript file name; defaults to 'transcript.txt'.
json | 0/1 | Optional | Generate JSON with timestamps for each target language.
json_name | string | Optional | Customize the JSON file name; defaults to 'timestamps.json'.
srt | 0/1 | Optional | Generate SRT subtitles for each target language.
srt_name | string | Optional | Customize the SRT subtitles file name; defaults to 'subtitles.srt'.
vtt | 0/1 | Optional | Generate VTT subtitles for each target language.
vtt_name | string | Optional | Customize the VTT subtitles file name; defaults to 'subtitles.vtt'.
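The constraints in this table can be checked client-side before submitting a job. This is an illustrative sketch, not part of any Qencode SDK:

```python
VALID_MODES = {"accuracy", "balanced", "speed"}
OUTPUT_FLAGS = ("transcript", "json", "srt", "vtt")

def validate_stt_params(params):
    """Return a list of problems found in an STT output dict, mirroring
    the constraints in the parameter table above."""
    problems = []
    if params.get("output") != "speech_to_text":
        problems.append('output must be "speech_to_text"')
    mode = params.get("mode")
    if mode is not None and mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    if len(params.get("translate_languages", [])) > 15:
        problems.append("translate_languages is limited to 15 entries")
    for flag in OUTPUT_FLAGS:  # 0/1 toggles for each output type
        if flag in params and params[flag] not in (0, 1):
            problems.append(f"{flag} must be 0 or 1")
    return problems
```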

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "mode": "accuracy",
  "translate_languages": ["uk","de"],
  "transcript": 1,
  "transcript_name": "your_custom_naming.txt",
  "json": 0,
  "vtt": 1,
  "srt": 1,
  "srt_name": "your_custom_naming.srt",
  "start_time": 60.0,
  "duration": 30.0,
  "language": "es"
}

Choosing the Mode

There are three modes available for Speech-to-Text output.

Mode | Description
Accuracy | More accurate transcription, but takes more time to process.
Balanced | Balance between speed and accuracy.
Speed | Faster transcriptions, but less accurate in some cases.

To select a mode, include the mode parameter in your request.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "mode": "speed"
}

Controlling STT Output Types and File Names

You can choose to include the following file types in your Speech-to-Text outputs:

File Type | Description | API params
Transcript | A text file containing all the speech from the input. | transcript, transcript_name
JSON | A JSON file containing the text with associated timestamps. | json, json_name
SRT | Subtitles in the SRT format. | srt, srt_name
VTT | Subtitles in the VTT format. | vtt, vtt_name

You can enable or disable specific output types by setting their corresponding parameters to 1 (enabled) or 0 (disabled). By default, all STT outputs are enabled: the transcript, json, srt, and vtt params are set to 1.
You can also control the names of the output files using the following parameters:

Parameter | Default Name
transcript_name | transcript.txt
json_name | timestamps.json
srt_name | subtitles.srt
vtt_name | subtitles.vtt
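Given these defaults, a small illustrative helper can predict which file names a single-language job will produce from the 0/1 flags and *_name overrides:

```python
STT_DEFAULT_NAMES = {
    "transcript": "transcript.txt",
    "json": "timestamps.json",
    "srt": "subtitles.srt",
    "vtt": "subtitles.vtt",
}

def expected_files(params):
    """List the file names a single-language STT job should produce,
    given the 0/1 enable flags and *_name overrides described above.
    All output types are enabled by default."""
    return [
        params.get(f"{kind}_name", default)
        for kind, default in STT_DEFAULT_NAMES.items()
        if params.get(kind, 1)  # every output type defaults to enabled
    ]
```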

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "mode": "accuracy",
  "transcript": 1,
  "transcript_name": "your_custom_naming.txt",
  "json": 0,
  "vtt": 0,
  "srt": 1,
  "srt_name": "your_custom_naming.srt"
}

Speech-to-Text (STT) Outputs

When a Speech-to-Text (STT) job completes, the /v1/status response includes metadata about each generated output file. These outputs typically include transcripts, JSON, SRT, and VTT files.
You can control which outputs are generated and customize their file names using the request parameters described in the Controlling STT Output Types and File Names section above.
Metadata for Speech-to-Text outputs can be found in the texts array of the /v1/status response.

Response Example

"texts": [
  {
      "tag": "speech_to_text-0-0-0",
      "profile": null,
      "user_tag": null,
      "storage": {
          "zip": {
              "region": "us-east-1",
              "bucket": "qencode-temp-us-east-1",
              "host": "prod-us-east-1-7-storage-aws.qencode.com"
          },
          "format": "speech_to_text",
          "host": "storage-aws-us-east-1.qencode.com",
          "names": {
              "transcript": "transcript.txt"
          },
          "path": "2b9e8b5fb88b2b8005d3c5d6d36f8d95/speech_to_text/1-0",
          "type": "local",
          "expire": "2025-10-16 14:09:39",
          "timestamp": "2025-11-07 17:49:45"
      },
      "url": "https://storage-aws-us-east-1.qencode.com/2b9e8b5fb88b2b8005d3c5d6d36f8d95/speech_to_text/1-0",
      "download_url": "https://storage-aws-us-east-1.qencode.com/2b9e8b5fb88b2b8005d3c5d6d36f8d95/speech_to_text/1-0",
      "meta": {
          "language": "auto",
          "translate_languages": null,
          "entropy_threshold": 2.7999999999999998,
          "beam_size": 9,
          "mode": "accuracy",
          "model": "large",
          "max_context": 128
      },
      "duration": "30.0002",
      "detected_language": "en",
      "size": "0.00126076",
      "output_format": "speech_to_text",
      "error": false,
      "error_description": null,
      "warnings": [],
      "cost": {
          "currency": "USD",
          "code": 840,
          "amount": "0.0025"
      }
  }
]
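A minimal sketch of reading this response in Python, assuming a status payload shaped like the example above:

```python
def summarize_texts(status):
    """Extract the key fields from the texts array of a /v1/status
    response (shaped like the example above)."""
    summaries = []
    for text in status.get("texts", []):
        summaries.append({
            "tag": text.get("tag"),
            "detected_language": text.get("detected_language"),
            "files": text.get("storage", {}).get("names", {}),
            "download_url": text.get("download_url"),
            "error": text.get("error", False),
        })
    return summaries
```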

Setting a Language

Source Language

Qencode supports over 70 languages for Speech-to-Text processing, enabling you to accurately transcribe and translate spoken content from a wide range of global sources. The source language is auto-detected by default; to override detection, set the optional language parameter (e.g., "language": "en").

Stable

The Stable languages are those that have been thoroughly tested and show consistent, high-quality recognition results across diverse audio conditions. These languages typically achieve Word Error Rate (WER) ≤ 20% and Character Error Rate (CER) ≤ 30%, indicating that the AI model performs reliably in real-world use cases.

About WER and CER

  • WER (Word Error Rate) shows how accurately individual words are recognized. Lower WER means higher accuracy.
  • CER (Character Error Rate) measures the proportion of characters that are inserted, deleted, or substituted in the transcription relative to the total number of characters in the reference. Lower CER means higher transcription accuracy.
Language | Code | WER | CER
English | en | 12.33% | 8.13%
Catalan | ca | 16.81% | 9.23%
Dutch | nl | 8.07% | 6.1%
Finnish | fi | 11.51% | 2.52%
French | fr | 16.68% | 9.13%
German | de | 10.8% | 6.27%
Greek | el | 15.9% | 6.13%
Hebrew | he | 15.97% | 5.46%
Indonesian | id | 12.36% | 5.57%
Italian | it | 10.51% | 4.86%
Polish | pl | 12.12% | 7.68%
Romanian | ro | 19.96% | 9.1%
Russian | ru | 5.71% | 1.88%
Slovenian | sl | 14.81% | 3.91%
Spanish | es | 9.67% | 5.49%
Swedish | sv-SE | 13.03% | 6.16%
Turkish | tr | 15.28% | 7.0%
Ukrainian | uk | 12.01% | 3.67%
Vietnamese | vi | 17.24% | 9.92%

Beta

Beta languages are still being improved and may show less accurate results. They are suitable for testing or experimentation but may require manual review or corrections, especially for complex audio or strong accents.

Language | Code
Afrikaans | af
Albanian | sq
Amharic | am
Arabic | ar
Armenian | hy
Bashkir | ba
Basque | eu
Belarusian | be
Bengali | bn
Breton | br
Bulgarian | bg
Chinese | zh
Czech | cs
Danish | da
Estonian | et
Galician | gl
Georgian | ka
Hindi | hi
Hungarian | hu
Icelandic | is
Japanese | ja
Kazakh | kk
Korean | ko
Lao | lo
Latvian | lv
Lithuanian | lt
Macedonian | mk
Malayalam | ml
Maltese | mt
Marathi | mr
Mongolian | mn
Nepali | ne-NP
Norwegian Nynorsk | nn-NO
Occitan | oc
Persian | fa
Portuguese | pt
Pushto | ps
Serbian | sr
Slovak | sk
Swahili | sw
Tamil | ta
Thai | th
Tatar | tt
Urdu | ur
Welsh | cy
Yiddish | yi

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "language": "uk"
}

Targeting Specific Segments

When your video or audio file is lengthy and you only need to process specific segments, Qencode lets you precisely target those segments for Speech-to-Text conversion. To specify a segment, include the start_time and duration attributes in your transcoding job request: start_time sets the starting point of the segment (in seconds) from the beginning of the file, and duration sets the length of the segment to process (also in seconds). This is especially useful when you need to focus on key events or reduce processing time and costs.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "start_time": 60.0,
  "duration": 30.0
}
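If your clip boundaries come as timestamps rather than seconds, an illustrative helper can convert them into the start_time and duration values shown above:

```python
def segment_params(start, end):
    """Convert "HH:MM:SS" clip boundaries into the start_time and
    duration values (in seconds) used in the STT request."""
    def to_seconds(ts):
        hours, minutes, seconds = (float(part) for part in ts.split(":"))
        return hours * 3600 + minutes * 60 + seconds
    begin = to_seconds(start)
    return {"start_time": begin, "duration": to_seconds(end) - begin}
```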

Adding Translations to your STT Output

To run a Speech-to-Text job with Translations, set the translate_languages parameter with the target language codes.

Example Request

{
  "output": "speech_to_text",
  "translate_languages": ["uk", "cs", "de"],
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  }
}

In translation jobs, output file names include a language-code suffix (e.g., subtitles-uk.srt, timestamps-es.json, transcript-de.txt).

Output file names are stored in status.texts[INDEX].storage.names:

"names": { 
  "srt-cs": "subtitles-cs.srt", 
  "vtt-cs": "subtitles-cs.vtt",
  "text-cs": "transcript-cs.txt", 
  "json-cs": "timestamps-cs.json",
  "srt-es": "subtitles-es.srt", 
  "vtt-es": "subtitles-es.vtt",
  "text-es": "transcript-es.txt",
  "json-es": "timestamps-es.json",
  "srt-nl": "subtitles-nl.srt",
  "vtt-nl": "subtitles-nl.vtt",
  "text-nl": "transcript-nl.txt",  
  "json-nl": "timestamps-nl.json"    
}, 
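On the client side, these entries can be regrouped per language with a short sketch like this:

```python
def names_by_language(names):
    """Group storage.names entries by their language-code suffix.

    Keys look like "srt-cs" or "text-es"; the result maps each language
    to its {kind: filename} entries."""
    grouped = {}
    for key, filename in names.items():
        kind, _, lang = key.partition("-")  # "srt-cs" -> ("srt", "cs")
        grouped.setdefault(lang, {})[kind] = filename
    return grouped
```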

Target Languages (Translation)

Provide up to 15 language codes in translate_languages, e.g.:

"translate_languages": ["uk","cs","de","es","it","fr", ...]

If you need more than 15 languages, split them across multiple jobs.
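The split can be sketched as simple list chunking:

```python
def split_jobs(languages, limit=15):
    """Split a target-language list into chunks of at most `limit`
    entries, one chunk per transcoding job."""
    return [languages[i:i + limit] for i in range(0, len(languages), limit)]
```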

Stable

Stable translation languages provide the most reliable and natural results. These languages have been thoroughly tested and are recommended for production use when translation quality and fluency are important.

Language | Code
Belarusian | be
Bulgarian | bg
Burmese (Myanmar) | my
Czech | cs
Danish | da
English | en
French | fr
German | de
Greek | el
Indonesian | id
Italian | it
Lithuanian | lt
Norwegian Bokmål | nb
Polish | pl
Portuguese | pt
Romanian | ro
Russian | ru
Serbian | sr
Slovak | sk
Slovenian | sl
Spanish | es
Swedish | sv
Turkish | tr
Ukrainian | uk
Vietnamese | vi

Beta

Beta translation languages are still being optimized and may occasionally produce less accurate or less fluent results. They are best used for testing, internal reviews, or exploratory use cases where small translation inaccuracies are acceptable.

Language | Code
Afrikaans | af
Amharic | am
Arabic | ar
Armenian | hy
Assamese | as
Azerbaijani | az
Basque | eu
Bengali | bn
Bosnian | bs
Cantonese (Yue Chinese) | yue
Catalan | ca
Chichewa (Nyanja) | ny
Chinese | zh
Croatian | hr
Dutch | nl
Estonian | et
Finnish | fi
Fulah (Fula) | ff
Galician | gl
Ganda (Luganda) | lg
Georgian | ka
Gujarati | gu
Hebrew | he
Hindi | hi
Hungarian | hu
Icelandic | is
Irish | ga
Japanese | ja
Javanese | jv
Kannada | kn
Kazakh | kk
Khmer | km
Korean | ko
Kurdish | ku
Kyrgyz | ky
Lao | lo
Latvian | lv
Luxembourgish | lb
Macedonian | mk
Malay | ms
Malayalam | ml
Maltese | mt
Marathi | mr
Nepali | ne
Norwegian Nynorsk | nn
Odia (Oriya) | or
Occitan | oc
Oromo | om
Pashto | ps
Persian (Farsi) | fa
Punjabi | pa
Shona | sn
Sindhi | sd
Somali | so
Swahili | sw
Tagalog (Filipino) | tl
Tajik | tg
Tamil | ta
Telugu | te
Thai | th
Urdu | ur
Uzbek | uz
Welsh | cy
Xhosa | xh
Yoruba | yo
Zulu | zu

Notes

  • translate_languages accepts an array of Language Codes.
  • If you omit language, Qencode auto-detects the source language. You only need to set language if you want to force the source language.

Improving Accuracy

Aside from changing the mode, there are a few other things to take into consideration when trying to improve the accuracy of Speech-to-Text outputs. Below you can find a list of suggestions for the media you use for Speech-to-Text processing.

  1. Ensure the audio within your video is clear and free from background noise, such as wind, traffic, hums, music, or any other non-speech sound.
  2. Compressed audio formats can lose quality. Ensure your source files are exported using high-quality settings.
  3. Maintain consistent volume levels throughout the audio. Significant fluctuations can affect the transcription's accuracy.
  4. Avoid conversations with overlapping voices and other forms of overtalk.
  5. Whenever possible, use media with a single language throughout the recording.

Troubleshooting and Tips

  • Wrong source language detected
    • Override auto-detect by setting the language (source) parameter explicitly.
  • 15 language limit per job
    • Keep translate_languages to 15 or fewer. If you need more, split into multiple jobs.
  • Expected files not generated
    • Make sure the output flags you need are set to 1: srt, vtt, json, and/or transcript.
  • No files appearing at your destination
    • Verify all of the following:
      • Destination URL / bucket path is correct.
      • Credentials (key, secret) are valid.
      • You have write permissions on the destination.
  • ISO language codes
    • Use standard two-letter codes where available (e.g., en, de, es, it, tr, uk, cs, hy, bg, pl, ru, ar, af, az, sr).
  • Per-language outputs
    • Outputs are generated per target language. UI labels and filenames typically include a language suffix (e.g., myfile-en.srt, myfile-es.vtt).
  • Storage Access Permissions
    • If files must be publicly accessible, set the destination’s permissions accordingly.

Need More Help?

For additional information, tutorials, or support, visit the Qencode Documentation page or contact Qencode Support at support@qencode.com.