Data Science Fellows Projects 2019

Transcriber and Task Behavior – developing a model for reducing the cost of transcription E2E.

Project by Nitzan

Abstract

The project’s goal was to reduce transcription cost by understanding what makes some transcribers more efficient and more accurate than others. The project involved mining data from the company’s database; data cleaning and processing; feature engineering of transcriber-related and transcription-job-related features; and lastly, modeling the problem and analyzing the results. The conclusions will be used to support both business and technological decision making in the company. The model’s quality was measured on a subset of the dataset, so further testing and evaluation is required.

 

Challenges

  • How the company works – Technologies used, data types, transcription job life cycle
  • The problem – What exactly are we trying to achieve, how to measure a good transcription, how to measure efficiency, and how all the data supports answering these questions
  • Infrastructure – There was no data science team at the company, so the necessary functions and tools were created during the project
  • Data collection – Understanding the database structure, what it contains, and how to query it efficiently
  • Feature engineering – Creating meaningful features to better understand the factors contributing to quality and efficiency
  • Data quantity – The goal was to get insights from the transcriptions of a specific customer, so the dataset was rather small

 

Achievements (according to KPIs)

  • Editor efficiency prediction model
  • Metrics:
    • Average accuracy: 0.86
    • Average F1 score: 0.86
  • A full report describing the project, the development process, the features created, the model’s quality and limitations, and the final conclusions
  • A list of features affecting editors’ efficiency
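The averaged metrics above could be computed with cross-validation, as in this minimal sketch. The classifier, dataset, and fold count here are placeholders, not the project's actual setup:

```python
# Hypothetical sketch: averaging accuracy and F1 across cross-validation folds,
# as one might do for an editor-efficiency classifier. The synthetic dataset
# and random forest below stand in for the real data and model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Placeholder binary-classification dataset (e.g. efficient vs. not efficient)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(random_state=0)
scores = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "f1"])

print("Average accuracy:", round(float(np.mean(scores["test_accuracy"])), 2))
print("Average F1 score:", round(float(np.mean(scores["test_f1"])), 2))
```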

 

Further development

In order to verify the results, further investigation is required. The following next steps are proposed:

  1. Getting more data from different customers to create a more reliable model – the current model was trained on transcriptions of only one customer.
  2. Experimenting further with the features that were created, such as tf-idf vectorization, to fully understand them and their potential; creating more features if needed.
  3. Testing the findings on the editors in a controlled experiment.
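The tf-idf vectorization mentioned in step 2 turns transcription text into numeric features. A minimal sketch, with placeholder texts rather than real transcriptions:

```python
# Hypothetical sketch of tf-idf vectorization for transcription text.
# The example texts are invented placeholders, not project data.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the speaker paused before answering",
    "the answer was short and clear",
    "background noise made the answer unclear",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)  # sparse matrix: documents x vocabulary

print(tfidf.shape)  # (number of documents, vocabulary size)
```

Each row of the resulting matrix could then be used as input features alongside the transcriber-related and job-related features described above.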
