Using Speech-to-Text to create Subtitles and Transcripts
Introduction
Speech-to-Text is a technology that uses automatic speech recognition to generate text from audio containing recognizable speech. Qencode offers a Speech-to-Text output for transcoding jobs, which uses this technology to generate subtitles and transcripts from any of your audio or video files. This output utilizes advanced artificial intelligence models to identify human speech across a wide range of languages and create readable text outputs that can be used to further enrich the content in your library. The results are packaged as transcripts and subtitles, formatted for compatibility with most common video platforms (JSON, SRT, VTT, and TXT files).
Launching Speech-to-Text Transcoding
Defining the Output Format
Speech-to-Text outputs can be created by adding a speech_to_text output format to your transcoding job. To do this, use the /v1/start_encode2 method to launch a transcoding job with the format object's output set to speech_to_text. Make sure to also specify the destination where the generated Speech-to-Text output should be stored.
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  }
}
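As a rough sketch in Python, the format object above can be assembled programmatically before being submitted to /v1/start_encode2. The surrounding query/format envelope used here is an assumption based on typical Qencode transcoding requests, not something stated in this section, so verify it against the API reference:

```python
import json

# Hypothetical helper: wrap a speech_to_text format object in a
# start_encode2-style query envelope (the envelope shape is an assumption).
def build_query(destination_url):
    fmt = {
        "output": "speech_to_text",
        "destination": {"url": destination_url},
    }
    return {"query": {"format": [fmt]}}

payload = build_query("s3://us-west.s3.qencode.com/yourbucket/output_folder")
print(json.dumps(payload, indent=2))
```

Building the payload as a dict and serializing it with json.dumps avoids hand-editing raw JSON strings and the invalid-JSON mistakes (such as trailing commas) that come with them.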
Choosing the Mode
There are three modes available for Speech-to-Text output.
| Mode | Description |
| --- | --- |
| Accuracy | More accurate transcription, but takes more time to process. |
| Balanced | A balance between speed and accuracy. |
| Speed | Faster transcription, but less accurate in some cases. |
To select the mode for Speech-to-Text processing, add the mode parameter to your request.
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "mode": "speed"
}
Speech-to-Text Output Files
You can choose to include the following file types in your Speech-to-Text outputs:
| File Type | Description |
| --- | --- |
| Transcript | A text file containing all the speech from the input. |
| JSON | A JSON file containing the text with associated timestamps. |
| SRT | Subtitles in the SRT format. |
| VTT | Subtitles in the VTT format. |
Customizing Speech-to-Text Outputs
You can customize the files and names created from Speech-to-Text processing. Below is a list of parameters that can be used to control these settings:
| Parameter | Description |
| --- | --- |
| transcript | Toggle the generation of a transcript file. Set to 1 to enable (default) or 0 to disable. |
| transcript_name | Customize the transcript file name, with the default being ‘transcript.txt’. |
| json | Toggle the generation of a JSON file with timestamps. Set to 1 to enable (default) or 0 to disable. |
| json_name | Customize the JSON file name, with the default being ‘timestamps.json’. |
| srt | Toggle the generation of subtitles in SRT format. Set to 1 to enable (default) or 0 to disable. |
| srt_name | Customize the SRT subtitles file name, with the default being ‘subtitles.srt’. |
| vtt | Toggle the generation of subtitles in VTT format. Set to 1 to enable (default) or 0 to disable. |
| vtt_name | Customize the VTT subtitles file name, with the default being ‘subtitles.vtt’. |
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "transcript": 1,
  "transcript_name": "your_custom_naming.txt",
  "json": 0,
  "vtt": 0,
  "srt": 1,
  "srt_name": "your_custom_naming.srt"
}
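A small helper can make the documented defaults explicit: each toggle defaults to 1 (enabled) and each name falls back to the default file name from the table above. The function below is purely illustrative, not part of any Qencode SDK:

```python
# Illustrative only: build the file-customization portion of a
# speech_to_text format object, mirroring the documented defaults.
DEFAULT_NAMES = {
    "transcript": "transcript.txt",
    "json": "timestamps.json",
    "srt": "subtitles.srt",
    "vtt": "subtitles.vtt",
}

def output_file_params(**overrides):
    params = {}
    for key, default_name in DEFAULT_NAMES.items():
        params[key] = overrides.get(key, 1)  # toggles default to enabled (1)
        params[f"{key}_name"] = overrides.get(f"{key}_name", default_name)
    return params

# Reproduce the example request: keep transcript and SRT, drop JSON and VTT.
params = output_file_params(json=0, vtt=0, srt_name="your_custom_naming.srt")
```

Merging the returned dict into your format object keeps the request explicit about every file type, even the ones left at their defaults.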
Setting a Language
Speech-to-Text processing with Qencode automatically attempts to detect the spoken language in your media files; however, there may be instances where more control over the selected language is required. If you want to explicitly define the target language for Speech-to-Text processing, you can do so using the optional language parameter. The table below lists the supported languages and their codes, along with the measured word error rate (WER) and character error rate (CER) for each.
| Language | Code | WER | CER |
| --- | --- | --- | --- |
| English | en | 15.2% | 10.3% |
| Afrikaans | af | 36.3% | 11.1% |
| Albanian | sq | 36.3% | 11.1% |
| Amharic | am | 100% | 89.3% |
| Arabic | ar | 61.1% | 39.3% |
| Armenian | hy | 64.1% | 30.6% |
| Bashkir | ba | 96.7% | 38.7% |
| Basque | eu | 50.6% | 15.7% |
| Belarusian | be | 48.6% | 14.6% |
| Bengali | bn | 100% | 96.3% |
| Breton | br | 97.7% | 88.6% |
| Bulgarian | bg | 22.0% | 6.9% |
| Catalan | ca | 16.0% | 8.7% |
| Chinese | zh | 58.3% | 32.2% |
| Czech | cs | 32.1% | 30.5% |
| Danish | da | 33.6% | 25.4% |
| Dutch | nl | 6.9% | 4.8% |
| Estonian | et | 46.6% | 13.7% |
| Finnish | fi | 15.4% | 3.1% |
| French | fr | 17.4% | 10.7% |
| Galician | gl | 34.2% | 12.4% |
| German | de | 10.7% | 5.6% |
| Greek | el | 20.5% | 11.0% |
| Hebrew | he | 16.2% | 5.1% |
| Hindi | hi | 62.9% | 64.1% |
| Hungarian | hu | 31.6% | 14.5% |
| Icelandic | is | 63.0% | 30.6% |
| Indonesian | id | 14.0% | 9.1% |
| Italian | it | 13.2% | 6.2% |
| Japanese | ja | 77.5% | 35.5% |
| Kazakh | kk | 49.6% | 12.2% |
| Korean | ko | 88.4% | 100.0% |
| Latvian | lv | 23.6% | 7.0% |
| Lithuanian | lt | 53.5% | 34.5% |
| Macedonian | mk | 42.7% | 14.3% |
| Marathi | mr | 66.0% | 19.0% |
| Nepali | ne-NP | 72.4% | 45.1% |
| Norwegian | nn-NO | 40.8% | 15.2% |
| Persian | fa | 68.8% | 70.5% |
| Polish | pl | 12.4% | 7.0% |
| Portuguese | pt | 23.6% | 11.4% |
| Romanian | ro | 18.7% | 8.8% |
| Russian | ru | 8.1% | 3.3% |
| Serbian | sr | 90.0% | 74.0% |
| Slovak | sk | 47.3% | 47.2% |
| Slovenian | sl | 18.0% | 6.1% |
| Spanish | es | 55.7% | 53.8% |
| Swahili | sw | 88.4% | 51.0% |
| Swedish | sv-SE | 25.7% | 18.8% |
| Tamil | ta | 54.5% | 31.1% |
| Thai | th | 82.0% | 26.2% |
| Turkish | tr | 17.0% | 7.1% |
| Ukrainian | uk | 13.5% | 4.3% |
| Urdu | ur | 27.0% | 10.3% |
| Vietnamese | vi | 15.5% | 7.6% |
| Welsh | cy | 41.1% | 26.5% |
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "language": "uk"
}
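Since the behavior for an unsupported code is not specified here, it can help to validate the language parameter client-side against the table above before submitting a job. A minimal sketch (the set below is a small subset of the table, not an exhaustive constant from any SDK):

```python
# Subset of the supported language codes from the table above (illustrative).
SUPPORTED_LANGUAGES = {"en", "de", "fr", "es", "uk", "pl", "ru", "nl", "it", "pt"}

def with_language(fmt, code):
    """Return a copy of the format object with a validated language code."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return {**fmt, "language": code}

# Matches the example request: explicitly target Ukrainian.
fmt = with_language({"output": "speech_to_text"}, "uk")
```

Failing fast on a bad code is cheaper than waiting for a transcoding job to run against the wrong language.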
Targeting Specific Segments
In instances where your video or audio file is lengthy and you're only interested in processing specific segments, Qencode allows you to precisely target these segments for Speech-to-Text conversion. To specify the segment you wish to target, include the start_time and duration attributes in your transcoding job request. The start_time attribute determines the starting point of the segment (in seconds) from the beginning of the file, and the duration attribute specifies the length of the segment to process (also in seconds). This can be especially useful when you need to highlight key events or reduce processing time and costs.
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "start_time": "60.0",
  "duration": "30.0"
}
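Because the API takes a start point and a duration rather than start and end times, a tiny conversion step avoids mistakes when you know a segment's boundaries in seconds. The helper below mirrors the example request by passing the values as strings (whether numeric values are also accepted is not stated here):

```python
# Convert a [start, end] segment (in seconds) into the start_time/duration
# parameters used by the speech_to_text output.
def segment_params(start_seconds, end_seconds):
    if end_seconds <= start_seconds:
        raise ValueError("segment end must be after its start")
    return {
        "start_time": f"{start_seconds:.1f}",
        "duration": f"{end_seconds - start_seconds:.1f}",
    }

# The 30-second window from the example request: 1:00 to 1:30.
params = segment_params(60.0, 90.0)
```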
Improving Accuracy
Aside from changing the mode, there are a few other things to take into consideration when trying to improve the accuracy of Speech-to-Text outputs. Below you can find a list of suggestions for the media you use for Speech-to-Text processing.
- Ensure the audio within your video is clear and free from background noise, such as wind, traffic, hums, music, or any other non-speech sounds.
- Compressed audio formats can lose quality. Ensure your source files are exported using high-quality settings.
- Maintain consistent volume levels throughout the audio. Significant fluctuations can affect the transcription's accuracy.
- Avoid conversations with overlapping voices and other forms of overtalk.
- Whenever possible, use media with a single language throughout the recording.