A team of academic medical researchers needed to extract key medical information from a variety of diagnostic reports across hospitals. Some reports were scanned images, while others were in XML, PDF and Microsoft Word formats. Given the volume and variety of data involved, workflow automation and the quality of the captured information were key considerations. Furthermore, the information had to be shared with physicians in fully identified form and with researchers in de-identified form.

  • Extract keywords and medical concepts from diagnostic reports
  • Build a custom application with workflow to analyze big data
  • Automatically de-identify and share data based on HIPAA and IRB guidelines


PHEMI Central machine learning tools were used to extract keywords and SNOMED, LOINC, RxNorm and ICD codes, while capturing negation, family history, allergies, anatomy and temporal conditions.
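
To make "capturing negation" and "family history" concrete, here is a minimal rule-based sketch in Python. It is not PHEMI Central's method (which the text describes as machine learning); it swaps in a simple trigger-phrase approach for illustration. The lexicon, the labels, the trigger lists and the function name `extract_concepts` are all hypothetical, and the labels are placeholders rather than real SNOMED or ICD codes.

```python
import re

# Hypothetical lexicon: surface term -> placeholder label (NOT real terminology codes).
CONCEPTS = {
    "pneumonia": "CONCEPT_PNEUMONIA",
    "pleural effusion": "CONCEPT_PLEURAL_EFFUSION",
    "diabetes": "CONCEPT_DIABETES",
}

# Assumed trigger phrases; a production system would need far richer rules or a trained model.
NEGATION_TRIGGERS = ("no evidence of", "no", "without", "denies", "negative for")
FAMILY_TRIGGERS = ("family history of", "mother", "father", "sibling")


def _has_trigger(text, triggers):
    """Return True if any trigger phrase occurs as whole words in text."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in triggers)


def extract_concepts(report):
    """Find lexicon terms sentence by sentence and flag negation and family history."""
    findings = []
    for sentence in re.split(r"[.\n]+", report.lower()):
        for term, label in CONCEPTS.items():
            match = re.search(r"\b" + re.escape(term) + r"\b", sentence)
            if match is None:
                continue
            # Only text that precedes the term in the same sentence is checked for triggers.
            prefix = sentence[:match.start()]
            findings.append({
                "term": term,
                "label": label,
                "negated": _has_trigger(prefix, NEGATION_TRIGGERS),
                "family_history": _has_trigger(prefix, FAMILY_TRIGGERS),
            })
    return findings
```

Running `extract_concepts` on a report such as "No evidence of pneumonia. Family history of diabetes." would flag the first finding as negated and the second as family history, so neither is mistaken for a current diagnosis of the patient.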

A custom application allowed study coordinators to manage consent, data collection and validation workflows, and even track the disease relationship among family members using a pedigree tree.

Personal health information was de-identified for research use, and PHEMI’s privacy rules ensured that sensitive data was shared only with approved users: named physicians could access fully identified data, while researchers could see only de-identified data.
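
The two views described above can be sketched as a role check applied when a record is read. This is only an illustration under stated assumptions, not PHEMI's privacy rules engine: the field names, the role names and the function `view_for` are hypothetical. Dropping a few direct identifiers is also much weaker than a de-identification process that satisfies HIPAA or an IRB.

```python
# Hypothetical set of direct identifiers to hide from researchers.
IDENTIFYING_FIELDS = {"name", "health_card_number", "date_of_birth", "address"}


def view_for(record, role):
    """Return the view of a patient record that the given role may see."""
    if role == "physician":
        # Named physicians see the fully identified record (returned as a copy).
        return dict(record)
    if role == "researcher":
        # Researchers see the record with direct identifiers removed.
        return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    # Any other role is refused rather than given a partial view.
    raise PermissionError("role %r is not approved to view patient data" % role)


# Illustrative record with made-up values.
record = {
    "study_id": "S-001",
    "name": "Jane Example",
    "health_card_number": "0000-000-000",
    "date_of_birth": "1970-01-01",
    "diagnosis_label": "CONCEPT_DIABETES",
}
physician_view = view_for(record, "physician")
researcher_view = view_for(record, "researcher")
```

Denying unknown roles by default keeps the rule set closed: a new kind of user gets no access until a view is explicitly defined for it.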

Administrators could monitor access to patient data and view risk-based de-identification reports.

The system was designed to grow in scale and scope, including the addition of new data types such as large whole-genome and exome files.