Voice Cloning to improve Automatic Speech Recognition

Laptop and notetaker for transcription. Image: https://pixabay.com/photos/write-transcript-laptop-seminar-2654138/

This blog explores the interesting idea that, when biases affect the accuracy of a transcription made with automatic speech recognition (ASR), you might consider using voice cloning to gather the different speech types that could improve the large language model used for the speech-to-text output. Voice cloning uses artificial intelligence (AI) with the aim of creating a voice similar to the one provided by a speaker, emulating articulation and intonation patterns. It is not the same as speech synthesis, where a user chooses from a limited number of voices for text-to-speech. In theory, once you have cloned your voice it can be fine-tuned to work with any language, and its pitch and pace can be altered for different emotions. It might sound a bit robotic at times, and lacking in emotion when you expect some excited-sounding exclamations, but at other times it can be alarmingly accurate!

However, as we have discussed in a past blog about evaluating ASR output, transcription errors may occur because of the way a person enunciates their words and the quality of their voice. If English is the language being used, different accents and dialects, ageing voices (both male and female), a lack of clarity and the speed of speaking can all increase Word Error Rates (WER). Pronunciation may also be affected by imperfect knowledge of how words are said, whether that is due to their complexity or because the speaker is working in English that is not their first language.
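As a reminder of how that metric works, the Word Error Rate is the number of word substitutions, deletions and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The short Python sketch below is purely our own illustration of that calculation, with made-up example sentences; it is not taken from any particular ASR tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard edit-distance table between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word ("weather" heard as "whether") in a five-word sentence.
print(word_error_rate("I wonder about the weather",
                      "I wonder about the whether"))  # 0.2, i.e. a 20% WER
```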

Could AI help by providing a series of voices cloned from the very types of speech that cause problems for automated transcriptions? We have been experimenting, and there are several issues to consider. The first arises when you discover that the company selling the cloned voices is situated in the United States: try your British English with limited training of the model and your voice develops an American accent, or, in the case of another company based in Berlin, a uniquely European type of accent with a hint of German. These voices can be tweaked and would probably improve if several hours of training occurred.

This leads to the second issue: the amount of dictating that has to occur to improve results. Tests with 30 minutes of recording improved the quality, but that did not solve some problems, especially when working with STEM subjects, where complex words may have been incorrectly pronounced during the training or did not match what was expected from the voice cloning company’s own previous training. So unless a word has been pronounced with an American accent, the system tends to produce an error or change the original pronunciation to one accepted in the United States but not the United Kingdom!

Finally, there is always a cost involved when longer periods of training are needed, and as a lecture is usually around 45 minutes long, it is important for any experimentation to capture a speaker for at least this period in order to cater for the changes that can occur in speech over time. For example, tiredness or even poor use of recording devices can change the quality of the output. It has also been found that the cloned voice may not be perfectly replicated if you only have short training periods of a few minutes. However, 100 sentences of general dictation, where all the words can be accurately read, can produce a good result with clear speech for a transcription. With a basic Resemble.ai account, transcripts have to be divided into 3,000-character sections, which means you need on average at least 11 sections of cloned voice for a whole lecture. Misspoken words can be changed, the phonetic alphabet allows adaptations to particular parts of words and emotion can be added, but all of this takes time, and there are several language options for different accents.
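As a rough, back-of-the-envelope check on that estimate: a 45-minute lecture delivered at around 130 to 150 words per minute comes to roughly 35,000 to 40,000 characters of transcript, which is where a figure of at least 11 or 12 sections of 3,000 characters comes from. The Python sketch below is purely our own illustration of how a transcript could be split into sections of at most 3,000 characters without breaking sentences; it is not part of the Resemble.ai service.

```python
import re

MAX_CHARS = 3000  # basic-account section limit mentioned above

def split_transcript(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split a transcript into sections of at most `limit` characters,
    keeping whole sentences together where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sections, current = [], ""
    for sentence in sentences:
        # Start a new section if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > limit:
            sections.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        sections.append(current)
    return sections

# A dummy "lecture" of about 36,000 characters splits into 12 sections here.
lecture_text = "Welcome to the lecture. " * 1500
print(len(split_transcript(lecture_text)))  # 12
```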

Key words can help us remember important points in a transcript. But can AI find them?

With all the discussion around ChatGPT and Automatic Speech Recognition (ASR), it would seem that Large Language Models (LLMs), as part of a collection of Artificial Intelligence (AI) models, could provide us with language understanding capability. Some models appear able to answer complex questions based on the amazing amounts of data or information collected from us all. We can speak or type a question into Skype and Bing using ChatGPT and the system will provide us with an uncannily appropriate answer. If we want to speed read through an article, the process of finding key words that are meant to represent the main points can also be automated, as can summarisation of content, as described in ‘ChatGPT for teachers: summarize a YouTube transcript’.

But can automated processes pick out the words that might help us to remember important points in a transcript, as they are designed to do when manually chosen[1]?

We rarely know where the ASR data comes from, and in the case of transcribed academic lectures the models used tend to be built from large generic datasets rather than customised, education-based data collections. So what if the original information the model is gathering is incorrect and the transcript has errors? What if we do not really know what we are looking for when it comes to the main points, or we develop our own set of criteria that the automatic key word process cannot support?

Perhaps it is safe to say that keywords tend to be important elements within paragraphs of text or conversations that give us clues to the main theme of an article. In an automatic process they may be missed where a two-word synonym such as ‘bride-to-be’ or ‘future wife’ is used rather than ‘fiancée’, or where a paraphrase or summary changes the meaning:

Paraphrase: A giraffe can eat up to 75 pounds of Acacia leaves and hay every day.[2]

Original: Giraffes like Acacia leaves and hay and they can consume 75 pounds of food a day.

Keywords are usually names, locations, facts and figures, and these can be pronounced in many different ways by a variety of English speakers[3]. If the system is using a process of randomly learning from its own large language model, there may be few variations in the accents and dialects, and perhaps no account of ageing voices or cultural settings. These biases have the potential to add yet more errors, which in turn affect the relevance of the key words generated through probability models.
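To see why this is hard, here is a deliberately simple, frequency-based key word extractor written as our own illustrative Python sketch; it is not the approach used by any particular LLM or ASR service. Counting words finds repeated terms, but it cannot merge ‘giraffes’ and ‘giraffe’, it drops the figure ‘75 pounds’ entirely, and it has no way of linking ‘bride-to-be’ with ‘fiancée’.

```python
from collections import Counter
import re

# A tiny list of common "stop words" to ignore; real systems use far larger lists.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "are", "can", "they", "like", "most"}

def simple_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the most frequent non-stop words as candidate key words."""
    words = re.findall(r"[a-zA-Z\-]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

transcript = ("Giraffes like Acacia leaves and hay and they can consume "
              "75 pounds of food a day. The giraffe spends most of the day eating.")
print(simple_keywords(transcript))
# ['day', 'giraffes', 'acacia', 'leaves', 'hay']: 'day' outranks 'giraffe',
# 'giraffes' and 'giraffe' are treated as unrelated, and '75 pounds' is lost.
```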

Teasing out how to improve the output from AI-generated key words is not easy due to the many variables involved. We have already looked at a series of practical metrics in our previous blog, and now we have delved into some of the other technological and human aspects that could help us to understand why automatic key wording is a challenge. A text version of the mind map below is available.

Figure 1. Mind map of key word issues including human aspects, when working with ASR transcripts. Text version available


[1] https://libguides.reading.ac.uk/reading/notemaking

[2] http://www.ocw.upj.ac.id/files/Slide-LSE-04.pdf

[3] https://aclanthology.org/2022.lrec-1.680.pdf

Immersive Reader working within Virtual Learning Environments

There are many ways that Immersive Reader can be used, and LexDis already has strategies for using this read-aloud and text support app on mobile and as a set of immersive reading tools with OneNote on Microsoft 365.

However, Ros Walker recently sent an email to the JISC Assistive Technology list about some updates that have occurred. One important point was her note about the app working with virtual learning environments, such as Blackboard Ally alternative formats, and that it is now possible in Moodle to offer an ‘Immersive Reader’ option as an alternative format for most files that are added to a Moodle course.

Uploaded file with link to Ally and the Immersive Reader icon. Image thanks to Ros Walker.

The student’s view of the Moodle course allows them to select the ‘A’ (Ally logo) at the end of the title of the file they want, as well as being presented with all the accessibility options. The University of Plymouth has provided guidance illustrating how this happens from the staff and student perspectives, as well as the accessibility checks.

Introduction to Ally and Immersive Reader for Moodle

Immersive Reader in Word highlighting part of speech, colour background changes and text style.

Ros has also been kind enough to link to her video about Immersive Reader in Word and how she has worked with PDFs to make the outcome a really useful strategy for students looking for different ways to read documents.

“If you haven’t seen the Immersive reader before, it is available in most Microsoft software and opens readings in a new window that is very clean and you can read the text aloud. (The Immersive Reader)”

Thanks to Ros Walker, University of St. Andrews

Using Google Docs and Sheets with JAWS, NVDA or most other screen readers

This strategy is not new, but it may be useful if you are using a screen reader, as there are tricks that can be missed if you are not aware of the changes needed when using Google Docs or Sheets, because everything is working in a browser such as Chrome, Edge or Firefox.

The blog post ‘Google Docs and Sheets with a Screen Reader’ comes from The Perkins School for the Blind in the USA, and Mark Babaita added an easy tip that might also help those testing the accessibility of the content within a doc or sheet:

If you hear JAWS move to a heading on the page and read that heading, you know that the virtual cursor is still active. Use Insert + Z to toggle the virtual cursor on and off.

The authors have added useful links to Freedom Scientific’s free webinars and printed resources covering a variety of topics, including using Google Drive, Docs and Sheets with JAWS and MAGic.

PDF reader in Microsoft Edge and Immersive Reader goes mobile.

We don’t usually have a collection of strategies, but in this case Alistair McNaught has posted an interesting comment on LinkedIn saying that he now uses Edge to read PDFs. As the quote below shows, the browser offers a better reading experience, not just the usual table of contents, page view and text-to-speech.

Microsoft Edge comes with a built-in PDF reader that lets you open your local pdf files, online pdf files, or pdf files embedded in web pages. You can annotate these files with ink and highlighting. This PDF reader gives users a single application to meet web page and PDF document needs. The Microsoft Edge PDF reader is a secure and reliable application that works across the Windows and macOS desktop platforms. More Microsoft Edge features

Microsoft have also updated their Immersive Reader so that it now works on iOS and Android. The following text has been taken from a post that might be useful: ‘What’s New in Microsoft Teams for Education | July 2021’.

  • Immersive Reader on iOS and Android. Immersive Reader, which uses proven customization techniques to support reading across ages and abilities, is now available for Teams iOS and Android apps. You can now hear posts and chat messages read aloud using Immersive Reader on the Teams mobile apps.
  • Access files offline on Android. The Teams mobile app on Android now allows you to access files even when you are offline or in bad network conditions. Simply select the files you need access to, and Teams will keep a downloaded version to use in your mobile app. You can find all your files that are available offline in the files section of the app. (This is already available on iOS.)
  • Teams on Android tablets. Now you can access Teams from a dedicated app from Android tablets.
  • Inline message translation in channels for iOS and Android. Inline message translation in channels lets you translate channel posts and replies into your preferred language. To translate a message, press and hold the channel post or reply and then select “Translate”. The post or reply will be translated to your UI language by default. If you want to change the translation language, go to Settings > General > Translation.”

Thank you Alistair for this update on some new strategies.

Good use of colours when thinking about Colour Deficiency

Color Oracle menu showing the colour palette and the choice of colour filter.

Bernie Jenny from Monash University in Australia has developed Color Oracle, a free colour deficiency simulator for Windows, Mac and Linux. It allows you to check your colour choices when designing any software, apps or websites.

The download uses Java and works on older operating systems as well as the latest ones, but it is important to follow the developer’s instructions for each operating system. It is very easy to use on a Windows machine, where the app sits in the system tray and can be called on at any time to test colour options by selecting an area of the screen.

Another trick when designing web pages or other documents is to view them in grey scale or print them out to test readability.
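If you want to automate that grey-scale check, the short Python sketch below is our own suggestion using the Pillow imaging library; the screenshot filenames are hypothetical and just for illustration. It saves a grey-scale copy of a screenshot or exported page image so you can judge whether the design still reads well without colour.

```python
from PIL import Image  # pip install Pillow

def greyscale_check(input_path: str, output_path: str) -> None:
    """Save a grey-scale copy of a screenshot or page image for a readability check."""
    with Image.open(input_path) as img:
        img.convert("L").save(output_path)  # "L" mode = single-channel greyscale

# Hypothetical filenames, for illustration only.
greyscale_check("webpage_screenshot.png", "webpage_screenshot_grey.png")
```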

This strategy comes thanks to Andy Eachus at the University of Huddersfield.