Qualitative Transcription Metric, Verbit.AI

Limor Nunu

Data Science Fellows June 2020 Cohort



  • When comparing a transcription to the “ground truth” of an audio recording, the simplest way to evaluate its quality is to count the word-level differences (substitutions, deletions, and insertions) and divide by the number of words in the ground truth. This is the Word Error Rate, or WER for short. The problem is that WER also counts differences that are irrelevant or insignificant to a human reader. Not all diffs are created equal: some matter more than others, and some are meaningless. This makes our existing quality metric (WER) very noisy. The project aims to start building a separate metric, or set of metrics, that is “human-reader-centric”, i.e. designed specifically to provide a quality score that is robust to these complications.
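
To make the noise concrete, here is a minimal sketch of a WER computation using the standard word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the production implementation used at Verbit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A capitalization-only difference is scored as a full word error.
print(word_error_rate("Hello world", "hello world"))
```

In this example the two strings read identically to a human, yet half of the words are counted as errors, which is the kind of noise the project targets.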

Challenges

  1. Understanding which differences count as irrelevant, and finding out how common they are in completed jobs.
  2. Devising methods to handle the irrelevant differences in the jobs, once the first challenge was overcome.
  3. Implementing those methods in Python code.
  4. Working out how to improve the methods after receiving the first results.

Achievements (according to KPIs)

  1. Defining changes and evaluating their scope (examples of changes: capitalization, false starts in speech, and inaudible segments).
  2. Defining methods to deal with the changes (for example, ignoring X words and ignoring X seconds).
  3. Writing functions to implement those methods.
  4. Testing how those methods affect the WER metric.
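
As a rough illustration of how such methods can be tested, the sketch below normalizes both transcripts before counting differences. It covers only capitalization and inaudible markers. The `[inaudible]` tag format, the function names `normalize` and `count_diffs`, and the use of `difflib` as a stand-in for the real diff are all assumptions, not the project's actual code.

```python
import difflib
import re

# Assumed marker format; the tag used in the real transcripts is not given here.
INAUDIBLE_TAG = re.compile(r"\[inaudible\]", re.IGNORECASE)


def normalize(text, ignore_case=True, drop_inaudible=True):
    """Return tokens with the selected 'irrelevant' changes removed."""
    if drop_inaudible:
        text = INAUDIBLE_TAG.sub(" ", text)
    if ignore_case:
        text = text.lower()
    return text.split()


def count_diffs(ref_tokens, hyp_tokens):
    """Count differing words using difflib's alignment (an approximation
    of the edit distance used by WER, chosen here for brevity)."""
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


reference = "So the results [inaudible] were sent to Verbit on Monday"
hypothesis = "so the results were sent to verbit on monday"

raw = count_diffs(reference.split(), hypothesis.split())
cleaned = count_diffs(normalize(reference), normalize(hypothesis))
print(raw, cleaned)
```

Comparing the count before and after normalization shows how much of the measured error comes from changes a human reader would not care about. Each rule sits behind its own flag so that methods can be switched on one at a time or combined.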

Future project development 

  • Test my methods on more jobs to compare the results and better understand which method works well in most cases.
  • Combine methods and test the results.
  • The project focused on only a few types of changes, so more methods need to be developed to cover additional ones.

