Using Speech-to-Text to create Subtitles and Transcripts

Introduction

Speech-to-Text is a technology that uses automatic speech recognition to generate text from audio that contains speech. Qencode offers a Speech-to-Text output for transcoding jobs, which uses this technology to generate subtitles and transcripts from any of your audio or video files. This output uses current artificial intelligence models to identify human speech across a wide range of languages and create readable text outputs that can further enrich the content in your library. The results are packaged as transcripts and subtitles in formats compatible with most common video platforms (JSON, SRT, VTT, and TXT files).

Launching Speech-to-Text Transcoding

Defining the Output Format

Speech-to-Text outputs can be created by adding a 'speech_to_text' output format to your transcoding job. To do this, use the /v1/start_encode2 method to launch a transcoding job with the output attribute of the format object set to speech_to_text. Make sure to also specify the destination where the generated Speech-to-Text output should be stored.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  }
}
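
The format object above is only one part of a job request. As a rough sketch of how it might be assembled in Python before being sent to /v1/start_encode2, the helper below wraps it in a larger payload. The function name build_start_encode_query, the example source URL, and the surrounding "query", "source", and "format" keys are assumptions made for illustration; check them against the API reference for your account before relying on them.

```python
import json


def build_start_encode_query(source_url: str, destination_url: str) -> dict:
    """Build a job description containing a single speech_to_text output."""
    speech_to_text_format = {
        "output": "speech_to_text",
        "destination": {"url": destination_url},
    }
    # Assumption: the format object sits in a "format" list next to "source",
    # inside a top-level "query" object. This wrapper is not shown in the text above.
    return {"query": {"source": source_url, "format": [speech_to_text_format]}}


query = build_start_encode_query(
    "https://example.com/video.mp4",  # hypothetical source file
    "s3://us-west.s3.qencode.com/yourbucket/output_folder",
)
print(json.dumps(query, indent=2))
```

Building the payload as a dictionary and serializing it with json.dumps avoids hand-written JSON mistakes such as trailing commas.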

Choosing the Mode

There are three modes available for Speech-to-Text output.

| Mode | Description |
| --- | --- |
| Balanced | Balance between speed and accuracy. This is the default. |
| Speed | Faster, at the expense of accuracy in some cases. |
| Accuracy | More accurate in some cases, at the expense of speed. |

To select the mode for Speech-to-Text processing, add the mode parameter to your request.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "mode": "speed"
}
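
If the format object is generated in code, the mode can be checked before the job is submitted. The following is a minimal sketch: the helper name with_mode is hypothetical, and the lowercase values "balanced" and "accuracy" are assumed by analogy with "speed", which is the only value shown in the example request.

```python
# Assumption: mode values are the lowercase mode names; only "speed" is
# confirmed by the example request.
ALLOWED_MODES = {"balanced", "speed", "accuracy"}


def with_mode(format_obj: dict, mode: str = "balanced") -> dict:
    """Return a copy of format_obj with a validated "mode" parameter added."""
    normalized = mode.strip().lower()
    if normalized not in ALLOWED_MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(ALLOWED_MODES)}")
    return {**format_obj, "mode": normalized}


base_format = {
    "output": "speech_to_text",
    "destination": {"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"},
}
speed_format = with_mode(base_format, "Speed")
```

Returning a copy rather than modifying the input keeps the base format reusable for several jobs with different modes.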

Speech-to-Text Output Files

You can choose to include the following file types in your Speech-to-Text outputs:

| File Type | Description |
| --- | --- |
| Transcript | A text file containing all the speech from the input. |
| JSON | A JSON file containing the text with associated timestamps. |
| SRT | Subtitles in the SRT format. |
| VTT | Subtitles in the VTT format. |

Customizing Speech-to-Text Outputs

You can customize the files and names created from Speech-to-Text processing. Below is a list of parameters that can be used to control these settings:

| Parameter | Description |
| --- | --- |
| transcript | Toggle the generation of a transcript file. Set to 1 to enable (default) or 0 to disable. |
| transcript_name | Customize the transcript file name, with the default being ‘transcript.txt’. |
| json | Toggle the generation of a JSON file with timestamps. Set to 1 to enable (default) or 0 to disable. |
| json_name | Customize the JSON file name, with the default being ‘timestamps.json’. |
| srt | Toggle the generation of subtitles in SRT format. Set to 1 to enable (default) or 0 to disable. |
| srt_name | Customize the SRT subtitles file name, with the default being ‘subtitles.srt’. |
| vtt | Toggle the generation of subtitles in VTT format. Set to 1 to enable (default) or 0 to disable. |
| vtt_name | Customize the VTT subtitles file name, with the default being ‘subtitles.vtt’. |

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "transcript": 1,
  "transcript_name": "your_custom_naming.txt",
  "json": 0,
  "vtt": 0,
  "srt": 1,
  "srt_name": "your_custom_naming.srt"
}
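
Because all four file types are enabled by default, producing only some of them means explicitly disabling the rest. The sketch below shows one way to derive the toggle and name parameters from a list of wanted files. The helper name select_outputs and its input shape are hypothetical; the parameter names it emits are the ones listed in the table above.

```python
FILE_TYPES = ("transcript", "json", "srt", "vtt")


def select_outputs(wanted: dict) -> dict:
    """Map {file_type: custom_name_or_None} to toggle and name parameters.

    File types missing from `wanted` are explicitly disabled with 0.
    """
    unknown = set(wanted) - set(FILE_TYPES)
    if unknown:
        raise ValueError(f"Unknown file types: {sorted(unknown)}")
    params = {}
    for file_type in FILE_TYPES:
        if file_type in wanted:
            params[file_type] = 1
            if wanted[file_type]:  # only set a name when one was given
                params[f"{file_type}_name"] = wanted[file_type]
        else:
            params[file_type] = 0
    return params


custom_format = {
    "output": "speech_to_text",
    "destination": {"url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"},
    **select_outputs({
        "transcript": "your_custom_naming.txt",
        "srt": "your_custom_naming.srt",
    }),
}
```

With these inputs, custom_format carries the same settings as the example request: transcript and SRT enabled with custom names, JSON and VTT disabled.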

Setting a Language

Speech-to-Text processing with Qencode automatically tries to recognize the spoken language in your media files. However, there may be instances where more control over the selected language is required. If you want to explicitly define the target language for Speech-to-Text processing, you can do so with the optional language parameter, set to one of the codes below.

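| Language | Code | Status |
| --- | --- | --- |
| English | en | Beta |
| Spanish | es | Beta |
| Afrikaans | af | Alpha |
| Arabic | ar | Alpha |
| Armenian | hy | Alpha |
| Azerbaijani | az | Alpha |
| Belarusian | be | Alpha |
| Bosnian | bs | Alpha |
| Bulgarian | bg | Alpha |
| Catalan | ca | Alpha |
| Chinese | zh | Alpha |
| Croatian | hr | Alpha |
| Czech | cs | Alpha |
| Danish | da | Alpha |
| Dutch | nl | Alpha |
| English | en | Alpha |
| Estonian | et | Alpha |
| Finnish | fi | Alpha |
| French | fr | Alpha |
| Galician | gl | Alpha |
| German | de | Alpha |
| Greek | el | Alpha |
| Hebrew | he | Alpha |
| Hindi | hi | Alpha |
| Hungarian | hu | Alpha |
| Icelandic | is | Alpha |
| Indonesian | id | Alpha |
| Italian | it | Alpha |
| Japanese | ja | Alpha |
| Kannada | kn | Alpha |
| Kazakh | kk | Alpha |
| Korean | ko | Alpha |
| Latvian | lv | Alpha |
| Lithuanian | lt | Alpha |
| Macedonian | mk | Alpha |
| Malay | ms | Alpha |
| Marathi | mr | Alpha |
| Maori | mi | Alpha |
| Nepali | ne | Alpha |
| Norwegian | no | Alpha |
| Persian | fa | Alpha |
| Polish | pl | Alpha |
| Portuguese | pt | Alpha |
| Romanian | ro | Alpha |
| Russian | ru | Alpha |
| Serbian | sr | Alpha |
| Slovak | sk | Alpha |
| Slovenian | sl | Alpha |
| Spanish | es | Alpha |
| Swahili | sw | Alpha |
| Swedish | sv | Alpha |
| Tagalog | tl | Alpha |
| Tamil | ta | Alpha |
| Thai | th | Alpha |
| Turkish | tr | Alpha |
| Ukrainian | uk | Alpha |
| Urdu | ur | Alpha |
| Vietnamese | vi | Alpha |
| Welsh | cy | Alpha |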

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "language": "uk"
}

Targeting Specific Segments

When your video or audio file is lengthy and you only want to process specific segments, Qencode allows you to target those segments for Speech-to-Text conversion. To specify the segment, include the start_time and duration attributes in your transcoding job request. The start_time attribute sets the starting point of the segment (in seconds from the beginning of the file), and the duration attribute specifies the length of the segment to process (also in seconds). This is especially useful when you need to focus on key events or reduce processing time and costs.

Example Request

{
  "output": "speech_to_text",
  "destination": {
    "url": "s3://us-west.s3.qencode.com/yourbucket/output_folder"
  },
  "start_time": "60.0",
  "duration": "30.0"
}
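
Segments are often known as start and end timestamps rather than as a start offset and a length. As an illustration, the sketch below converts a pair of HH:MM:SS timestamps into the start_time and duration values used in the request. The helper name segment_params is hypothetical, and passing the values as strings simply mirrors the example request above.

```python
def segment_params(start: str, end: str) -> dict:
    """Convert HH:MM:SS start/end timestamps to start_time and duration."""

    def to_seconds(timestamp: str) -> float:
        hours, minutes, seconds = timestamp.split(":")
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

    start_s = to_seconds(start)
    end_s = to_seconds(end)
    if end_s <= start_s:
        raise ValueError("end must be later than start")
    # The example request passes both values as strings of seconds,
    # so this sketch does the same.
    return {"start_time": str(start_s), "duration": str(end_s - start_s)}


# A 30-second segment starting one minute into the file.
params = segment_params("00:01:00", "00:01:30")
```

The resulting dictionary can be merged into the format object alongside the output and destination attributes.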

Improving Accuracy

Aside from changing the mode, there are a few other things to take into consideration when trying to improve the accuracy of Speech-to-Text outputs. Below you can find a list of suggestions for the media you use for Speech-to-Text processing.

  1. High Audio Quality: Ensure the audio within your video is clear and free from background noise such as wind, traffic, hums, music, or any other non-speech sound.
  2. Avoid Compression: Heavily compressed audio can lose quality. Ensure your source files are exported using high-quality settings.
  3. Normalized Volume: Maintain consistent volume levels throughout the audio. Significant fluctuations can affect the transcription's accuracy.
  4. Minimize Overlapping Speech: Avoid conversations with overlapping voices and other forms of overtalk.
  5. Single Language: Whenever possible, use media with a single language throughout the recording.