Re-visiting Captions and Automatic Speech Recognition results.

YouTube video in German with automatic speech recognition captions translated into English

Around this time last year (November 2021) we discussed the challenges of evaluating captions generated by Automatic Speech Recognition (ASR) in two blogs, ‘Collaboration and Captioning’ and ‘Transcripts from Captions?’.

We mentioned alternatives to word error rates, where every mistake in the captions is counted and the total is reported as a percentage. A word error rate does not take into account how much of the output is actually understandable, or whether it would make sense if it were translated or even used as a transcript.

“Word Error Rate (WER) is a common metric for measuring speech-to-text accuracy of automatic speech recognition (ASR) systems. Microsoft claims to have a word error rate of 5.1%. Google boasts a WER of 4.9%. For comparison, human transcriptionists average a word error rate of 4%.”

Does Word Error Rate Matter? HELENA CHEN Jan 2021

In her blog, Helena Chen goes on to explain how to calculate WER:

Word Error Rate = (Substitutions + Insertions + Deletions) / Number of Words Spoken

Where errors are:

  • Substitution: when a word is replaced (for example, “shipping” is transcribed as “sipping”)
  • Insertion: when a word is added that wasn’t said (for example, “hostess” is transcribed as “host is”)
  • Deletion: when a word is omitted from the transcript (for example, “get it done” is transcribed as “get done”)
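
As a rough illustration of the formula above, the sketch below counts substitutions, insertions and deletions by aligning the words that were spoken with the words in the caption. It is not taken from Helena Chen's blog: the function names `wer_counts` and `word_error_rate` are our own, and it assumes simple lower-casing and whitespace splitting, with no handling of punctuation.

```python
def wer_counts(reference, hypothesis):
    """Return (substitutions, insertions, deletions) for one minimal word alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # cost[i][j] = (total, subs, ins, dels) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # every spoken word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # every caption word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                match = cost[i - 1][j - 1]  # words agree, no error
            else:
                t, s, n, d = cost[i - 1][j - 1]
                match = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = cost[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = cost[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            cost[i][j] = min(match, insertion, deletion, key=lambda c: c[0])
    _, subs, ins, dels = cost[len(ref)][len(hyp)]
    return subs, ins, dels


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of words spoken."""
    spoken = len(reference.split())
    if spoken == 0:
        raise ValueError("reference must contain at least one word")
    return sum(wer_counts(reference, hypothesis)) / spoken


print(wer_counts("get it done", "get done"))
print(word_error_rate("get it done", "get done"))
```

With the deletion example from the list, one word out of three is missing, so the rate is one third. Note that a word split in two, such as “hostess” becoming “host is”, is scored here as two errors (typically one substitution plus one insertion), because the alignment works on whole words.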

She describes how acoustics and background noise affect the results, even though human listeners can cope with a certain amount of distraction and still understand what is being said. She highlights homophones and accents that ASR may fail on, as well as cross talk, where ASR can omit one speaker’s comments. Finally, there is the complexity of the content: as we know with STEM subjects and medicine, this may mean specialist glossaries are needed.

Improvements have been made through natural language understanding based on Artificial Intelligence (AI). However, it seems that, as well as the acoustic checks, we need to delve into the criteria for improved comprehension, and these may relate to types of error that can consistently be left out of the count in some situations. For example, a transcript of a science lecture may not need all the lecturer’s ‘ums’ and ‘aahs’, unless there is a need to add an emotional feel to the output. Their omission would be counted as word errors, yet they do not necessarily help understanding.
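
To make the point about ‘ums’ and ‘aahs’ concrete, here is a small sketch that scores the same caption twice: once against a verbatim reference, and once after filler words have been removed from both sides. The filler list, the function names and the example sentence are all our own assumptions for illustration; a real evaluation would need an agreed list of what can safely be ignored.

```python
# Hypothetical filler list; what counts as a filler is a judgement call.
FILLERS = {"um", "uh", "er", "ah", "aah"}


def tokens(text, drop_fillers=False):
    """Lower-case, strip simple punctuation and optionally remove fillers."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w and not (drop_fillers and w in FILLERS)]


def word_errors(ref, hyp):
    """Word-level edit distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    return word_errors(ref, hyp) / len(ref)


spoken = "Um, the cell divides, aah, into two cells"
caption = "the cell divides into two cells"

verbatim = wer(tokens(spoken), tokens(caption))
filtered = wer(tokens(spoken, drop_fillers=True),
               tokens(caption, drop_fillers=True))
print(verbatim, filtered)
```

In this made-up example the caption drops only the two fillers, so the verbatim score is 2 errors in 8 words (25%), while the filtered score is 0%, even though a reader gets the same information either way.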

There may also be a need to flag the amount of effort required to understand the captions. This means checking the output of automatic translation, as well as general language support.

Image by Katrin B. from Pixabay

Roof us, the dog caught the bull and bought it back to his Mrs

Dak ons, de hond ving de stier en kocht hem terug aan zijn Mrs