Using Speech-to-Text to create Subtitles and Transcripts

Introduction

Speech-to-Text is a technology that uses automatic speech recognition to generate text from audio containing spoken language. Qencode offers a Speech-to-Text output for transcoding jobs, which uses this technology to generate subtitles and transcripts from any of your audio or video files. The output relies on advanced artificial intelligence models to identify human speech across a wide range of languages and create readable text outputs that can further enrich the content in your library. Results are packaged into transcripts and subtitles formatted for compatibility with most common video platforms (JSON, SRT, VTT, and TXT files).

Launching Speech-to-Text Transcoding

Defining the Output Format

Speech-to-Text outputs can be created by adding a 'speech_to_text' output format to your transcoding job. To do this, use the /v1/start_encode2 method to launch a transcoding job with the output attribute of a format object set to speech_to_text. Make sure to also specify the destination where the generated Speech-to-Text output should be stored.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  }
}
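
The example above shows just the format object for the job. One possible way to submit a complete job from Python is sketched below, assuming the usual Qencode token flow (access_token, create_task, then start_encode2), the requests library, and placeholder values for the API key, source URL, and bucket; consult the start_encode2 reference for the exact request encoding.

import json
import requests

API = "https://api.qencode.com"

# Exchange your API key for a session token.
token = requests.post(f"{API}/v1/access_token",
                      data={"api_key": "your_api_key"}).json()["token"]

# Create a task; the response includes the task_token used to start the job.
task_token = requests.post(f"{API}/v1/create_task",
                           data={"token": token}).json()["task_token"]

# Embed the speech_to_text output in the format list of the job query.
query = {
    "query": {
        "source": "https://example.com/path/to/video.mp4",  # placeholder source
        "format": [
            {
                "output": "speech_to_text",
                "destination": {
                    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
                }
            }
        ]
    }
}

job = requests.post(f"{API}/v1/start_encode2",
                    data={"task_token": task_token,
                          "query": json.dumps(query)}).json()
print(job)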

Choosing the Mode

There are three modes available for Speech-to-Text output.

Mode | Description
Accuracy | More accurate transcription, but takes more time to process.
Balanced | Balance between speed and accuracy.
Speed | Faster transcriptions, but less accurate in some cases.

To select the mode for Speech-to-Text processing, add the mode parameter to your request, set to accuracy, balanced, or speed.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder",
  },
  "mode": "speed"
}

Speech-to-Text Output Files

You can choose to include the following file types in your Speech-to-Text outputs:

File Type | Description
Transcript | A text file containing all the speech from the input.
JSON | A JSON file containing the text with associated timestamps.
SRT | Subtitles in the SRT format.
VTT | Subtitles in the VTT format.
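
The SRT and VTT outputs follow the standard subtitle formats, so they can be inspected or post-processed with ordinary tooling. As an illustration, here is a minimal Python sketch that reads a generated SRT file into (start, end, text) records, assuming the default subtitles.srt name described in the next section.

import re

def parse_srt(path):
    """Parse an SRT subtitles file into a list of (start, end, text) tuples."""
    with open(path, encoding="utf-8") as f:
        blocks = re.split(r"\n\s*\n", f.read().strip())
    cues = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        # Line 0 is the cue index, line 1 holds the timing, the rest is the text.
        start, _, end = lines[1].partition(" --> ")
        cues.append((start.strip(), end.strip(), " ".join(lines[2:])))
    return cues

for start, end, text in parse_srt("subtitles.srt"):
    print(f"[{start} - {end}] {text}")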

Customizing Speech-to-Text Outputs

You can customize the files and names created from Speech-to-Text processing. Below is a list of parameters that can be used to control these settings:

Parameter | Description
transcript | Toggle the generation of a transcript file. Set to 1 to enable (default) or 0 to disable.
transcript_name | Customize the transcript file name, with the default being 'transcript.txt'.
json | Toggle the generation of a JSON file with timestamps. Set to 1 to enable (default) or 0 to disable.
json_name | Customize the JSON file name, with the default being 'timestamps.json'.
srt | Toggle the generation of subtitles in SRT format. Set to 1 to enable (default) or 0 to disable.
srt_name | Customize the SRT subtitles file name, with the default being 'subtitles.srt'.
vtt | Toggle the generation of subtitles in VTT format. Set to 1 to enable (default) or 0 to disable.
vtt_name | Customize the VTT subtitles file name, with the default being 'subtitles.vtt'.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder",
  },
  "transcript": 1,
  "transcript_name": "your_custom_naming.txt",
  "json": 0,
  "vtt": 0,
  "srt": 1,
  "srt_name": "your_custom_naming.srt"
}

Setting a Language

Speech-to-Text processing with Qencode automatically tries to recognize the spoken language in your media files; however, there may be instances where more control over the selected language is required. If you want to explicitly define the target language for Speech-to-Text processing, you can do so using the optional language parameter. The table below lists the supported languages with their codes, word error rate (WER), and character error rate (CER).

Language | Code | WER | CER
English | en | 15.2% | 10.3%
Afrikaans | af | 36.3% | 11.1%
Albanian | sq | 36.3% | 11.1%
Amharic | am | 100% | 89.3%
Arabic | ar | 61.1% | 39.3%
Armenian | hy | 64.1% | 30.6%
Bashkir | ba | 96.7% | 38.7%
Basque | eu | 50.6% | 15.7%
Belarusian | be | 48.6% | 14.6%
Bengali | bn | 100% | 96.3%
Breton | br | 97.7% | 88.6%
Bulgarian | bg | 22.0% | 6.9%
Catalan | ca | 16.0% | 8.7%
Chinese | zh | 58.3% | 32.2%
Czech | cs | 32.1% | 30.5%
Danish | da | 33.6% | 25.4%
Dutch | nl | 6.9% | 4.8%
Estonian | et | 46.6% | 13.7%
Finnish | fi | 15.4% | 3.1%
French | fr | 17.4% | 10.7%
Galician | gl | 34.2% | 12.4%
German | de | 10.7% | 5.6%
Greek | el | 20.5% | 11.0%
Hebrew | he | 16.2% | 5.1%
Hindi | hi | 62.9% | 64.1%
Hungarian | hu | 31.6% | 14.5%
Icelandic | is | 63.0% | 30.6%
Indonesian | id | 14.0% | 9.1%
Italian | it | 13.2% | 6.2%
Japanese | ja | 77.5% | 35.5%
Kazakh | kk | 49.6% | 12.2%
Korean | ko | 88.4% | 100.0%
Latvian | lv | 23.6% | 7.0%
Lithuanian | lt | 53.5% | 34.5%
Macedonian | mk | 42.7% | 14.3%
Marathi | mr | 66.0% | 19.0%
Nepali | ne-NP | 72.4% | 45.1%
Norwegian | nn-NO | 40.8% | 15.2%
Persian | fa | 68.8% | 70.5%
Polish | pl | 12.4% | 7.0%
Portuguese | pt | 23.6% | 11.4%
Romanian | ro | 18.7% | 8.8%
Russian | ru | 8.1% | 3.3%
Serbian | sr | 90.0% | 74.0%
Slovak | sk | 47.3% | 47.2%
Slovenian | sl | 18.0% | 6.1%
Spanish | es | 55.7% | 53.8%
Swahili | sw | 88.4% | 51.0%
Swedish | sv-SE | 25.7% | 18.8%
Tamil | ta | 54.5% | 31.1%
Thai | th | 82.0% | 26.2%
Turkish | tr | 17.0% | 7.1%
Ukrainian | uk | 13.5% | 4.3%
Urdu | ur | 27.0% | 10.3%
Vietnamese | vi | 15.5% | 7.6%
Welsh | cy | 41.1% | 26.5%

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder",
  },
  "language": "uk"
}

Targeting Specific Segments

In instances where your video or audio file is lengthy and you are interested in processing only specific segments, Qencode allows you to precisely target those segments for Speech-to-Text conversion. To specify the segment you wish to target, include the start_time and duration attributes in your transcoding job request. The start_time attribute determines the starting point of the segment (in seconds) from the beginning of the file, and the duration attribute specifies the length of the segment to process (also in seconds). This can be especially useful when you need to highlight key events or reduce processing time and costs.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder",
  },
  "start_time":"60.0", 
  "duration":"30.0"
}
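
Note that start_time and duration are plain second values. If your notes use timecodes instead, a small helper like the hypothetical one below (not part of the Qencode API) can convert them before the request is built.

def timecode_to_seconds(timecode: str) -> float:
    """Convert an 'HH:MM:SS' or 'MM:SS' timecode into seconds."""
    seconds = 0.0
    for part in timecode.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# Target a 30-second segment starting one minute into the file.
segment = {
    "start_time": str(timecode_to_seconds("1:00")),  # "60.0"
    "duration": str(timecode_to_seconds("0:30"))     # "30.0"
}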

Improving Accuracy

Aside from changing the mode, there are a few other factors to consider when trying to improve the accuracy of Speech-to-Text outputs. Below is a list of suggestions for preparing the media you use for Speech-to-Text processing, followed by a short preprocessing sketch.

  1. Ensure the audio within your video is clear and free from background noise, such as wind, traffic, hums, music, or any other non-speech sound.
  2. Compressed audio formats can lose quality. Ensure your source files are exported using high-quality settings.
  3. Maintain consistent volume levels throughout the audio. Significant fluctuations can affect the transcription's accuracy.
  4. Avoid conversations with overlapping voices and other forms of overtalk.
  5. Whenever possible, use media with a single language throughout the recording.
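
For the first three points, some of this cleanup can be done before the job is submitted. Below is a minimal sketch that calls ffmpeg from Python, assuming ffmpeg is installed locally; the file names and filter settings are illustrative rather than Qencode requirements.

import subprocess

# Strip the video stream, downmix to a single channel, and apply loudness
# normalization so the speech sits at a consistent level, writing the result
# to an uncompressed WAV file.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",   # placeholder source file
    "-vn",                         # drop the video stream
    "-ac", "1",                    # downmix to mono
    "-af", "loudnorm",             # normalize loudness
    "clean_audio.wav"
], check=True)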