Transcripts from Captions?

young person looking at computer for online learning

The subject of automatic captioning continues to be debated, but Gareth Ford Williams has produced a really helpful “guide to the visual language of closed captions and subtitles” on UX Collective, described as a “user-centric guide to the editorial conventions of an accessible caption or subtitle experience.” It offers a series of tips with examples and, at the bottom of the page, several very useful links for those adding captions to videos. There is also a standard for the presentation of different types of captions across multimedia: ISO/IEC 20071-23:2018(en).

In this article, however, the focus is on transcripts, which also need further discussion. They are often used as notes gathered from a presentation, produced by lecture capture or by an online conference with automatic captioning. They may be copied from the side of the presentation, downloaded after the event, or presented to the user as a file in PDF, HTML or text format, depending on the system used. Some automated outputs provide notification of speaker changes and timings, but there are no hints as to content accuracy prior to download.

The problem is that there seem to be many different ways to measure the accuracy of automated captioning processes, which in many cases end up as transcriptions. 3PlayMedia suggest that there is a standard, saying of caption quality: “The industry standard for closed caption accuracy is 99% accuracy rate. Accuracy measures punctuation, spelling, and grammar. A 99% accuracy rate means that there is a 1% chance of error or a leniency of 15 errors total per 1,500 words”.
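To make that arithmetic concrete, accuracy here is simply the proportion of words not affected by an error. A quick sketch (the function name is ours, not 3PlayMedia's):

```python
def caption_accuracy(error_count: int, word_count: int) -> float:
    """Accuracy as the share of words without an error."""
    return 1 - error_count / word_count

# 15 errors in 1,500 words is the 99% threshold quoted above
print(f"{caption_accuracy(15, 1500):.0%}")
```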

The author of the 3PlayMedia article goes on to illustrate many other aspects of ‘quality’ that need to be addressed, but the lack of detailed standards for the full range of quality checks makes comparisons between the various offerings hard to achieve. Users are often left with several other types of error besides punctuation, spelling and grammar. The NLive project team have been looking into these challenges when considering transcriptions rather than captions, and have begun to collect a set of additional issues likely to affect understanding. So far, the list includes:

  • Number of extra words added that were not spoken
  • Number of words changed, affecting meaning – more than just grammar
  • Number of words omitted
  • Contractions – e.g. “he is” becomes “he’s”, “do not” becomes “don’t”, and “I’d” could mean either “I had” or “I would”!
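The categories above correspond closely to the insertions, substitutions and deletions counted by the standard word error rate (WER) metric. As a rough illustration – this is the generic textbook calculation, not the metric used by any particular vendor:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words (substitutions + insertions + deletions),
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion: a word was omitted
                d[i][j - 1] + 1,         # insertion: an extra word was added
                d[i - 1][j - 1] + cost,  # substitution: a word was changed
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("he is not here", "he's not here")` scores a contraction as one substitution plus one deletion against a four-word reference, giving 0.5 – which is exactly why contractions matter for meaning as well as for scores.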

The question is whether these checks could be built in automatically, to support collaborative manual checks when correcting transcriptions.

Below is a sample of the text we are working on, taken from an interview, to demonstrate the differences between three commonly used automatic captioning systems for videos.

Sample 1

So stuck. In my own research, and my own teaching. I’ve been looking at how we can do the poetry’s more effectively is one of the things so that’s more for structuring the trees, not so much technology, although technology is possible

Sample 2

so starting after my own research uh my own teaching i’ve been looking at how we can do laboratories more effectively is one of the things so that’s more for structuring laboratories not so much technology although technology is part of the laboratory

Sample 3

so stop. In my own research on my own teaching, I’ve been looking at how we can do the ball trees more effectively. Is one thing, so that’s more for structuring the voluntary is not so much technology, although technology is part little bar tree

Having looked at the sentences presented in transcript form, Professor Mike Wald pointed out that one provider of automated and human transcription services states that we should not “try to make captions verbatim, word-for-word versions of the video audio. Video transcriptions should be exact replications, but not captions.” The author of the article “YouTube Automatic Captions vs. Video Captioning Services” highlights several issues with automatic closed captioning and reasons why humans offer better outcomes. Just in case you want to learn more about the difference between a transcript and closed captions, 3PlayMedia wrote about the topic in August 2021: “Transcription vs. Captioning – What’s the Difference?”.

Collaboration and Captioning

Over the years captioning has become easier for the non-professional with guidance on many platforms including YouTube and a blog about “Free Tools & Subtitle Software to Make Your Video Captioning Process Easier” from Amara. This does not mean that we are good at it, nor does it mean that it does not take time! However, artificial intelligence (AI) and the use of speech recognition can help with the process.

Nevertheless, as Professor Mike Wald said only 11 months ago in an article titled Are universities finally complying with EU directive on accessibility? “Finding an affordable way to provide high quality captions and transcripts for recordings is proving very difficult for universities and using automatic speech recognition with students correcting any errors would appear to be a possible solution.”

The idea that there is the possibility of collaboration to improve the automated output at no cost is appealing, and we saw it happening with Synote over ten years ago! AI alongside speech recognition has improved accuracy, and Verbit advertise their offerings as “99%+ Accuracy”, but sadly do not provide prices on their website!

Meanwhile Blackboard Collaborate, as part of their ‘Ultra experience’, offers attendees the chance to collaborate on captioning when managed by a moderator, although at present Blackboard Collaborate does not include automated live captioning. There are many services that can be added to online video meeting platforms in order to support captioning, and developers can also make use of options from Google, Microsoft, IBM, and Amazon. TechRepublic describe 5 speech recognition apps that auto-caption videos on mobiles. Table 1 shows the options available in three platforms often used in higher education.

Caption Options                           | Zoom        | Microsoft Teams            | Blackboard Collaborate
Captions – automated                      | Yes         | Yes                        | Has to be added
Captions – live manual correction         | When set up | When set up                | When set up
Captions – live collaborative corrections | No          | No                         | No
Caption text colour adaptations           | Size only   | Some options               | Set sizes
Caption window resizing                   | No          | Suggested, not implemented | Set sizes
Compliance – WCAG 2.1 AA                  | Yes         | Yes                        | Yes
Table 1. Please forgive any errors made with the entries – not all versions offer the same options

If automated live captioning is used with collaborative manual intervention, it is important to ask who is checking the errors. Automated captioning is only around 60–80% accurate, depending on content complexity, audio quality and speaker enunciation. Even 3PlayMedia, in an article on “The Current State of Automatic Speech Recognition”, admits that human intervention is paramount when total accuracy is required.

The recent ‘Guidance for captioning rich media’ for Advance HE highlights the fact that the Web Content Accessibility Guidelines 2.1 (AA) require “100% accurate captioning as well as audio description.” The authors acknowledge the cost entailed, but perhaps this can be reduced as automated processes in English become more accurate, with error correction completed through expert checks. It also seems sensible to ask those who have knowledge of a subject to take more care when the initial video is created! This is suggested alongside the Advance HE good practice bullet points, such as:

“…ensure the narrator describes important visual content in rich media. The information will then feature in the captions and reduces the need for additional audio description services, benefiting everyone.”

Let’s see how far we can go with these ideas – expert correctors, proficient narrators and willing student support!

Authentication Types: what do they mean?

iris biometric scanning

You might have wondered what all those authentication types mentioned in our last blog actually mean. Some are well known, but a few are new, so it seemed sensible to give each one a definition or explanation drawn from the many sites that hold this information! The result is a somewhat random collection of links. They may not be the best available, and they are certainly not academically tried and tested, but here goes:

Knowledge: Something a person knows

  • Password – a string of characters that allows access to a computer system or service.
  • PIN – A personal identification number (PIN), or sometimes redundantly a PIN number, is a numeric (sometimes alpha-numeric) passcode used in the process of authenticating a user accessing a system.
  • Knowledge-based challenge questions – Knowledge-based authentication (KBA) is an authentication scheme in which the user is asked to answer at least one “secret” question.
  • Passphrase – A passphrase is a longer string of text that makes up a phrase or sentence.
  • Memorised swiping path – laying your finger on a screen and moving in any direction that covers the memorised characters.

Possession: Something a person has

  • Possession of a device evidenced by one time password (OTP) generated by, or received on, a device – “The password or numbers sent to for instance a phone expire quickly and can’t be reused.”
  • Possession of a device evidenced by a signature generated by a device – “hardware or software tokens generate a single-use code to use when accessing a platform.”
  • Card or device evidenced by QR code scanned from an external device – “Quick Response (QR) code used to authenticate online accounts and verify login details via mobile scan or special device.”
  • App or browser with possession evidenced by device binding – “a security chip embedded into a device or private key linking an app to a device, or the registration of the web browser linking a browser to a device”
  • Card evidenced by a card reader – “physical security systems to read a credential that allows access through access control points.”
  • Card with possession evidenced by a dynamic card security code – “Instead of having a static three- or four-digit code on the back or front of the card, dynamic CVV technology creates a new code periodically.”
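One-time passwords of the kind mentioned above are typically generated with the HOTP algorithm from RFC 4226 (its time-based variant, TOTP, is what most authenticator apps use). A minimal sketch using only the Python standard library:

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    # HMAC-SHA1 over the 8-byte big-endian counter
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks the offset
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

# With the RFC 4226 test secret, counter 0 yields "755224"
print(hotp(b"12345678901234567890", 0))
```

Each counter value produces a different code, which is why an intercepted OTP “can’t be reused”: the server only accepts codes at or just ahead of its own counter.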

Inherence: Something about the person e.g. biometrics

  • Fingerprint scanning – “When your finger rests on a surface, the ridges in your fingerprints touch the surface while the hollows between the ridges stand slightly clear of it. In other words, there are varying distances between each part of your finger and the surface below. A capacitive scanner builds up a picture of your fingerprint by measuring these distances.”
  • Voice recognition – “Voice and speech recognition are two separate biometric modalities…By measuring the sounds a user makes while speaking, voice recognition software can measure the unique biological factors that, combined, produce [the] voice.”
  • Hand & face geometry – a biometric that identifies users from the shape of their hands; in the case of Google’s MediaPipe face identification, it is a complex network of 3D facial keypoints, using artificial intelligence to analyse the results.
  • Retina & iris scanning – “both ocular-based biometric identification technologies… no person has the same iris or retina pattern”
  • Keystroke dynamics – …”keystroke dynamics don’t require an active input. Instead, keystroke dynamics analyzes the typing patterns of users; this can include typing rhythms, frequent mistakes, which shift keys they use for capitalization and pace.”
  • Angle at which device is held – “the exact angle a user holds the phone as a means of making replay attacks a lot more difficult.”
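The keystroke dynamics idea can be illustrated with a toy sketch: record the intervals between key presses and compare them against a stored profile. Everything here is invented for illustration (the feature, tolerance and matching rule are ours); real systems use far richer models of rhythm, mistakes and shift-key habits:

```python
from statistics import fmean

def flight_times(timestamps: list[float]) -> list[float]:
    """Intervals between successive key presses, one simple
    keystroke-dynamics feature."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def matches_profile(sample: list[float], profile: list[float],
                    tolerance: float = 0.05) -> bool:
    """Crude check: mean absolute deviation of the sample's flight
    times from a stored profile, against a fixed tolerance."""
    if len(sample) != len(profile):
        return False
    deviation = fmean(abs(s - p) for s, p in zip(sample, profile))
    return deviation <= tolerance
```

The appeal, as the quote above notes, is that no active input is required: the measurements fall out of typing the user was doing anyway.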

There has been a debate about which of the above should be considered under the various headings and acceptable as part of a secure multifactor authentication system. If you are interested in these processes and want more information it may be worth reading the Opinion of the European Banking Authority on the elements of strong customer authentication under PSD2. By the way PSD2 means ‘Payment Services Directive 2’ and the UK will be following the directive, but there is an extension for UK e-commerce transactions.

However, in the meantime many organisations beyond banks, shopping sites and those that hold personal data have asked users to consider multifactor authentication, including the NLive project lead and the University of Southampton, which has some helpful instructions.

LexDis introduces its News Blog

Person sitting behind a laptop trying to access a screen button with right hand.

LexDis was set up 14 years ago as a JISC project, with learner experiences becoming a series of strategies that demonstrated ways of overcoming accessibility barriers and finding innovations that support digital learning. COVID-19 has made these types of strategies even more important, and technology companies have had to provide improved built-in options in their settings to enhance access to their online offerings.

We have just secured an Innovate UK funded NLive project that is all about evaluating the outcomes of an “automated quality controlled human collaboratively edited closed-caption and live transmission system”. It is a mouthful as well as a challenge! There are added goals to sort out, including digital rights management issues, improving recording quality for streamed audio and video, making use of AI and noise cancellation algorithms, as well as personalising accessibility options. Lots to achieve in a year!

So at the moment we are planning a series of news blogs that will track the outcomes of our endeavours and we will be asking for help along the way!

When evaluating online services for their usability and accessibility it is important to think about how a system will be used. So when we started to think about the elements that might cause barriers we turned to experts in the field last year, then studied the guidelines and articles to build on the knowledge we had gained in the past.

Just last week (August 16th, 2021) a really interesting article by Gareth Ford Williams came to our notice, thanks to Steve Lee. It was all about “UX = Accessibility & Accessibility = UX”, where Gareth talked about evaluations seeming to ‘focus on guidelines rather than user outcomes’. That is what we tried to achieve with LexDis, so once again we are on that journey!

Gareth poses the following thought that we are going to hold onto as we explore ways of making it easier for students to access their online learning systems.

“If we step away from the compliance model and think of accessibility being first and foremost about people and the rich diversity we find within any audience, it starts to raise a lot of questions about what ‘good’ actually is.”

Gareth goes on to mention “10 Human Intersectional UX Obstacles within any Product or Service’s Design” and presents a series of built-in settings and strategies to support user preferences. During the coming year we will explore the challenges for an internet multimedia system and present ideas for overcoming them. Wish us luck!

PDF reader in Microsoft Edge and Immersive Reader goes mobile.

We don’t usually have a collection of strategies, but in this case Alistair McNaught has posted an interesting comment on LinkedIn that he now uses Edge to read PDFs. As the quote below shows, the browser offers better reading experiences, not just the usual table of contents, page view and text to speech.

“Microsoft Edge comes with a built-in PDF reader that lets you open your local pdf files, online pdf files, or pdf files embedded in web pages. You can annotate these files with ink and highlighting. This PDF reader gives users a single application to meet web page and PDF document needs. The Microsoft Edge PDF reader is a secure and reliable application that works across the Windows and macOS desktop platforms.”

Microsoft have also updated their Immersive Reader so that it now works on iOS and Android. The following text has been taken from a post that might be useful, ‘What’s New in Microsoft Teams for Education | July 2021’:

  • Immersive Reader on iOS and Android. Immersive Reader, which uses proven customization techniques to support reading across ages and abilities, is now available for Teams iOS and Android apps. You can now hear posts and chat messages read aloud using Immersive Reader on the Teams mobile apps.
  • Access files offline on Android. The Teams mobile app on Android now allows you to access files even when you are offline or in bad network conditions. Simply select the files you need access to, and Teams will keep a downloaded version to use in your mobile app. You can find all your files that are available offline in the files section of the app. (This is already available on iOS.)
  • Teams on Android tablets. Now you can access Teams from a dedicated app from Android tablets.
  • Inline message translation in channels for iOS and Android. Inline message translation in channels lets you translate channel posts and replies into your preferred language. To translate a message, press and hold the channel post or reply and then select “Translate”. The post or reply will be translated to your UI language by default. If you want to change the translation language, go to Settings > General > Translation.

Thank you Alistair for this update on some new strategies.