Using Speech-to-Text to create Subtitles and Transcripts
Introduction
Speech-to-Text is a technology that uses automatic speech recognition to generate text from audio containing recognizable speech. Qencode offers a Speech-to-Text output for transcoding jobs, which uses this technology to generate subtitles and transcripts from any of your audio or video files. This output utilizes advanced artificial intelligence models to identify human speech across a wide range of languages and create readable text outputs that can be used to further enrich the content in your library. The results are packaged as transcripts and subtitles, formatted for compatibility with most common video platforms (JSON, SRT, VTT, and TXT files).
Launching Speech-to-Text Transcoding
Defining the Output Format
Speech-to-Text outputs can be created by adding a speech_to_text output format to your transcoding job. To do this, use the /v1/start_encode2 method to launch a transcoding job with the format object's output set to speech_to_text. Make sure to also specify the destination where the generated Speech-to-Text output should be stored.
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  }
}
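As a rough sketch in Python, the format object above can be assembled programmatically before being submitted to /v1/start_encode2. The surrounding query/format envelope used here is an assumption based on typical Qencode transcoding requests, not something stated in this section, so verify it against the API reference:

```python
import json

# Hypothetical helper: wrap a speech_to_text format object in a
# start_encode2-style query envelope (the envelope shape is an assumption).
def build_query(destination_url):
    fmt = {
        "output": "speech_to_text",
        "destination": {"url": destination_url},
    }
    return {"query": {"format": [fmt]}}

payload = build_query("s3://us-west.s3.qencode.com/yourbucket/output_folder")
print(json.dumps(payload, indent=2))
```

Building the payload as a dict and serializing it with json.dumps avoids hand-editing raw JSON strings and the invalid-JSON mistakes (such as trailing commas) that come with them.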
Choosing the Mode
There are three modes available for Speech-to-Text output.
| Mode | Description |
| --- | --- |
| Accuracy | More accurate transcription, but takes more time to process. |
| Balanced | A balance between speed and accuracy. |
| Speed | Faster transcription, but less accurate in some cases. |
To select the mode for Speech-to-Text processing, add the mode parameter to your request.
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "mode": "speed"
}
Speech-to-Text Output Files
You can choose to include the following file types in your Speech-to-Text outputs:
| File Type | Description |
| --- | --- |
| Transcript | A text file containing all the speech from the input. |
| JSON | A JSON file containing the text with associated timestamps. |
| SRT | Subtitles in the SRT format. |
| VTT | Subtitles in the VTT format. |
Customizing Speech-to-Text Outputs
You can customize the files and names created from Speech-to-Text processing. Below is a list of parameters that can be used to control these settings:
| Parameter | Description |
| --- | --- |
| transcript | Toggle the generation of a transcript file. Set to 1 to enable (default) or 0 to disable. |
| transcript_name | Customize the transcript file name, with the default being ‘transcript.txt’. |
| json | Toggle the generation of a JSON file with timestamps. Set to 1 to enable (default) or 0 to disable. |
| json_name | Customize the JSON file name, with the default being ‘timestamps.json’. |
| srt | Toggle the generation of subtitles in SRT format. Set to 1 to enable (default) or 0 to disable. |
| srt_name | Customize the SRT subtitles file name, with the default being ‘subtitles.srt’. |
| vtt | Toggle the generation of subtitles in VTT format. Set to 1 to enable (default) or 0 to disable. |
| vtt_name | Customize the VTT subtitles file name, with the default being ‘subtitles.vtt’. |
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "transcript": 1,
  "transcript_name": "your_custom_naming.txt",
  "json": 0,
  "vtt": 0,
  "srt": 1,
  "srt_name": "your_custom_naming.srt"
}
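A small helper can make the documented defaults explicit: each toggle defaults to 1 (enabled) and each name falls back to the default file name from the table above. The function below is purely illustrative, not part of any Qencode SDK:

```python
# Illustrative only: build the file-customization portion of a
# speech_to_text format object, mirroring the documented defaults.
DEFAULT_NAMES = {
    "transcript": "transcript.txt",
    "json": "timestamps.json",
    "srt": "subtitles.srt",
    "vtt": "subtitles.vtt",
}

def output_file_params(**overrides):
    params = {}
    for key, default_name in DEFAULT_NAMES.items():
        params[key] = overrides.get(key, 1)  # toggles default to enabled (1)
        params[f"{key}_name"] = overrides.get(f"{key}_name", default_name)
    return params

# Reproduce the example request: keep transcript and SRT, drop JSON and VTT.
params = output_file_params(json=0, vtt=0, srt_name="your_custom_naming.srt")
```

Merging the returned dict into your format object keeps the request explicit about every file type, even the ones left at their defaults.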
Setting a Language
Speech-to-Text processing with Qencode automatically attempts to detect the spoken language in your media files; however, there may be instances where more control over the selected language is required. If you want to explicitly define the target language for Speech-to-Text processing, you can do so using the optional language parameter. The table below lists the supported languages and their codes, along with the measured word error rate (WER) and character error rate (CER) for each.
| Language | Code | WER | CER |
| --- | --- | --- | --- |
| English | en | 15.2% | 10.3% |
| Afrikaans | af | 36.3% | 11.1% |
| Albanian | sq | 36.3% | 11.1% |
| Amharic | am | 100% | 89.3% |
| Arabic | ar | 61.1% | 39.3% |
| Armenian | hy | 64.1% | 30.6% |
| Bashkir | ba | 96.7% | 38.7% |
| Basque | eu | 50.6% | 15.7% |
| Belarusian | be | 48.6% | 14.6% |
| Bengali | bn | 100% | 96.3% |
| Breton | br | 97.7% | 88.6% |
| Bulgarian | bg | 22.0% | 6.9% |
| Catalan | ca | 16.0% | 8.7% |
| Chinese | zh | 58.3% | 32.2% |
| Czech | cs | 32.1% | 30.5% |
| Danish | da | 33.6% | 25.4% |
| Dutch | nl | 6.9% | 4.8% |
| Estonian | et | 46.6% | 13.7% |
| Finnish | fi | 15.4% | 3.1% |
| French | fr | 17.4% | 10.7% |
| Galician | gl | 34.2% | 12.4% |
| German | de | 10.7% | 5.6% |
| Greek | el | 20.5% | 11.0% |
| Hebrew | he | 16.2% | 5.1% |
| Hindi | hi | 62.9% | 64.1% |
| Hungarian | hu | 31.6% | 14.5% |
| Icelandic | is | 63.0% | 30.6% |
| Indonesian | id | 14.0% | 9.1% |
| Italian | it | 13.2% | 6.2% |
| Japanese | ja | 77.5% | 35.5% |
| Kazakh | kk | 49.6% | 12.2% |
| Korean | ko | 88.4% | 100.0% |
| Latvian | lv | 23.6% | 7.0% |
| Lithuanian | lt | 53.5% | 34.5% |
| Macedonian | mk | 42.7% | 14.3% |
| Marathi | mr | 66.0% | 19.0% |
| Nepali | ne-NP | 72.4% | 45.1% |
| Norwegian | nn-NO | 40.8% | 15.2% |
| Persian | fa | 68.8% | 70.5% |
| Polish | pl | 12.4% | 7.0% |
| Portuguese | pt | 23.6% | 11.4% |
| Romanian | ro | 18.7% | 8.8% |
| Russian | ru | 8.1% | 3.3% |
| Serbian | sr | 90.0% | 74.0% |
| Slovak | sk | 47.3% | 47.2% |
| Slovenian | sl | 18.0% | 6.1% |
| Spanish | es | 55.7% | 53.8% |
| Swahili | sw | 88.4% | 51.0% |
| Swedish | sv-SE | 25.7% | 18.8% |
| Tamil | ta | 54.5% | 31.1% |
| Thai | th | 82.0% | 26.2% |
| Turkish | tr | 17.0% | 7.1% |
| Ukrainian | uk | 13.5% | 4.3% |
| Urdu | ur | 27.0% | 10.3% |
| Vietnamese | vi | 15.5% | 7.6% |
| Welsh | cy | 41.1% | 26.5% |
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "language": "uk"
}
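Since the behavior for an unsupported code is not specified here, it can help to validate the language parameter client-side against the table above before submitting a job. A minimal sketch (the set below is a small subset of the table, not an exhaustive constant from any SDK):

```python
# Subset of the supported language codes from the table above (illustrative).
SUPPORTED_LANGUAGES = {"en", "de", "fr", "es", "uk", "pl", "ru", "nl", "it", "pt"}

def with_language(fmt, code):
    """Return a copy of the format object with a validated language code."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return {**fmt, "language": code}

# Matches the example request: explicitly target Ukrainian.
fmt = with_language({"output": "speech_to_text"}, "uk")
```

Failing fast on a bad code is cheaper than waiting for a transcoding job to run against the wrong language.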
Targeting Specific Segments
In instances where your video or audio file is lengthy and you're only interested in processing specific segments, Qencode allows you to precisely target these segments for Speech-to-Text conversion. To specify the segment you wish to target, include the start_time and duration attributes in your transcoding job request. The start_time attribute determines the starting point of the segment (in seconds) from the beginning of the file, and the duration attribute specifies the length of the segment to process (also in seconds). This can be especially useful when you need to highlight key events or reduce processing time and costs.
Example Request
{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "start_time": "60.0",
  "duration": "30.0"
}
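Because the API takes a start point and a duration rather than start and end times, a tiny conversion step avoids mistakes when you know a segment's boundaries in seconds. The helper below mirrors the example request by passing the values as strings (whether numeric values are also accepted is not stated here):

```python
# Convert a [start, end] segment (in seconds) into the start_time/duration
# parameters used by the speech_to_text output.
def segment_params(start_seconds, end_seconds):
    if end_seconds <= start_seconds:
        raise ValueError("segment end must be after its start")
    return {
        "start_time": f"{start_seconds:.1f}",
        "duration": f"{end_seconds - start_seconds:.1f}",
    }

# The 30-second window from the example request: 1:00 to 1:30.
params = segment_params(60.0, 90.0)
```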
Improving Accuracy
Aside from changing the mode, there are a few other things to take into consideration when trying to improve the accuracy of Speech-to-Text outputs. Below you can find a list of suggestions for the media you use for Speech-to-Text processing.
- Ensure the audio within your video is clear and free from background noise, such as wind, traffic, hums, music, or any other non-speech sounds.
- Compressed audio formats can lose quality. Ensure your source files are exported using high-quality settings.
- Maintain consistent volume levels throughout the audio. Significant fluctuations can affect the transcription's accuracy.
- Avoid conversations with overlapping voices and other forms of overtalk.
- Whenever possible, use media with a single language throughout the recording.