Unstructured Document Cluster Naming, BigID

Itamar Zaltsman

Data Science Fellows February 2021 Cohort



Given clusters of unstructured documents spanning various file types and topics, find a representative name for each cluster. The naming must be automatic, meaningful, and reliable, and it must run within runtime constraints.
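To make the task concrete, here is a minimal sketch of one common baseline for cluster naming. It is not the method used in this project, which the summary does not describe. It pools the documents of each cluster, scores every term by its frequency within the cluster weighted by how few clusters contain it (a cluster-level TF-IDF), and joins the top-scoring terms into a name. The function name_clusters, the parameter top_k, the stopword list, and the sample documents are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the project).
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "in", "for", "is"})


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def name_clusters(clusters, top_k=2):
    """Return {cluster_id: name} for a dict {cluster_id: [document, ...]}.

    Each name is built from the top_k terms with the highest
    cluster-level TF-IDF score.
    """
    # Pool all documents of a cluster and count its terms.
    counts = {
        cid: Counter(
            tok for doc in docs for tok in tokenize(doc) if tok not in STOPWORDS
        )
        for cid, docs in clusters.items()
    }
    n_clusters = len(counts)
    # Number of clusters in which each term occurs at least once.
    cluster_freq = Counter(tok for c in counts.values() for tok in c)

    names = {}
    for cid, c in counts.items():
        total = sum(c.values()) or 1
        scores = {
            tok: (freq / total) * math.log((1 + n_clusters) / cluster_freq[tok])
            for tok, freq in c.items()
        }
        # Sort by descending score; break ties alphabetically so the
        # result is deterministic.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        names[cid] = " ".join(ranked[:top_k])
    return names


# Hypothetical sample clusters, for illustration only.
sample = {
    "c0": ["Invoice total due payment", "Payment invoice for services"],
    "c1": ["Employee contract salary", "Salary contract terms for employee"],
}
names = name_clusters(sample)
```

Because the sketch only counts tokens, its cost grows linearly with the total amount of text, which suits a runtime-constrained setting. Whether the resulting names are meaningful and reliable would still need to be checked against human judgment.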

Challenges

  1. Data collection
  2. Model evaluation

Achievements (according to KPIs)

  1. Proof of concept (POC): the proposed solution shows that the task is feasible.
  2. Built a model that can be extended for deployment.
  3. Presented the project to the company's management.

Future project development

Gather more data to stabilize the model's performance and prepare it for deployment.

