
SOAP Notes Training Data at Scale for Clinical Documentation
A global communications technology provider partnered with Defined.ai to transform thousands of hours of doctor-patient conversations into high-quality transcripts and structured SOAP notes.
TL;DR
- Delivered a total of 6,000 hours of English (United States) clinical audio plus aligned transcripts as foundational healthcare data for model training and evaluation.
- Reduced word error rate (WER) from 9% to 5% through annotations from hundreds of linguists, medical experts and healthcare professionals.
- Produced 14K structured SOAP notes (Subjective, Objective, Assessment, Plan) from clinical conversations using a stringent template and expert review.
- Added structured coding as ICD-CPT outputs plus insurance tags, with specialist claim processing review for clinical accuracy and downstream usability.
- Implemented a scalable hybrid workflow using automatic speech recognition (ASR) plus multi-layer human corrections and sampling-based QA to drive WER improvements across 14,115 doctor visits/audio files.
- Ensured the dataset remained anonymized, with no personally identifiable information (PII) included, to support responsible use of medical data in model development.
Our customer: healthcare data for enterprise clinical workflows
Our customer is a large, enterprise-scale communications technology organization building AI-driven clinical documentation capabilities. They depend on high-quality, diverse healthcare data like conversational audio, accurate transcripts and structured clinical summaries.
They specifically needed large volumes of real-life, phone-recorded doctor-patient conversations to support transcription and downstream summarization into SOAP notes, using US English conventions and accurate medical terminology.
The challenge: high-accuracy outputs for medical data and documentation
Clinical audio is high-stakes: small transcription errors can cascade into incorrect patient care summaries, weak training signals and low-quality outputs. The customer set an ambitious quality objective centered on WER for the 2025 expansion dataset, with a target of ≤ 5%. Starting WER before fine-tuning and correction was approximately 9%, making the final target especially demanding at this scale.
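For readers less familiar with the metric, WER is conventionally computed as the number of word-level substitutions, deletions and insertions needed to turn a transcript into its reference, divided by the number of reference words. The Python sketch below illustrates that standard definition only; the function name, the lowercase-and-split tokenization and the sample sentences are assumptions for illustration, not the scoring setup used in this project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Tokenization here is a simple lowercase whitespace split (an assumption);
    production scoring usually applies a fuller text normalization first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of seven reference words.
reference = "patient reports chest pain for two days"
hypothesis = "patient reports chest pain for too days"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 14.3%
```

On this definition, a 5% WER means roughly one word-level error per twenty reference words, which is why a single misheard term in a short clinical utterance matters.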
In addition, transcripts weren't the only output. The project required structured clinical outputs that are difficult to generate consistently:
- Structured SOAP notes: Beyond basic summaries and treatment plans, the customer supplied a detailed template with required fields and “key information” expectations, increasing the need for domain-aware interpretation.
- Coding outputs: The customer requested ICD-CPT codes and insurance tags aligned to clinical intent and patient context (not solely reimbursement framing), so specialist review and corrections were essential.
- Privacy constraints: As with Electronic Health Records (EHRs), the input medical data needed to remain anonymized, with no PII present in the final product.
Our capabilities: medical specialists for SOAP note and ICD-CPT quality
Defined.ai combined scalable operations with human expertise to deliver enterprise-grade clinical-language datasets, including complex structured outputs.
Key capabilities applied in this engagement included:
- Hybrid production workflow. A model-first approach using ASR for initial transcription, followed by expert human corrections and iterative QA to systematically reduce WER.
- Specialist-led annotation. Dedicated medical specialists focused on the structured components—especially SOAP notes and ICD-CPT coding—while trained linguists handled transcription accuracy at scale. In total, hundreds of linguists and medical specialists contributed to the project.
- Domain-aware structuring. The team accounted for variation across clinical domains (including psychiatry/psychology-focused conversations as well as more general medical categories) to ensure consistent note structure and content coverage.
- Privacy-by-design handling. We received the dataset anonymized, and our production guidance reinforced that no patient-identifying information should appear in transcripts or structured outputs to support responsible use of medical data.
The solution: scaling doctor-patient conversations into SOAP notes with ICD-CPT codes and insurance tags
Defined.ai delivered a staged program over two years that grew in volume and complexity while maintaining a consistent focus on quality and usability.
1) Dataset expansion: from 2,000 to 6,000 hours
- Year 1. Delivered 2,000 hours of phone-recorded doctor-patient conversations, enabling transcription and subsequent creation of note-style summaries.
- Year 2. Added 4,000 hours and extended the scope to include structured SOAP notes and coding deliverables, bringing the total program volume to 6,000 hours of audio with aligned transcript outputs across 14,115 doctor-patient visits/audio files.
2) Production workflow: model-first, human-refined, QA-validated
To hit strict quality thresholds, the delivery pipeline used multiple layers of review:
- ASR produced the initial transcription to accelerate throughput and standardize first-pass formatting.
- Specialist linguists performed detailed corrections on transcripts, followed by additional passes for structured tagging and consistency.
- Sampling-based QA measured accuracy trends and triggered further correction rounds as needed to improve WER and strengthen structured-output consistency.
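The sampling-based QA step above can be pictured as a simple loop: draw a random sample of files, have reviewers count word errors against a verified reference, pool the counts into one WER estimate, and send the batch back for correction if the estimate exceeds the target. The sketch below is a minimal illustration under stated assumptions; `ReviewedFile`, `sample_files`, `pooled_wer` and `needs_another_round` are hypothetical names, and the sample size and decision rule are not taken from the actual pipeline. Only the 5% target comes from the case study.

```python
import random
from dataclasses import dataclass


@dataclass
class ReviewedFile:
    """QA result for one sampled audio file (hypothetical structure)."""
    file_id: str
    word_errors: int      # substitutions + deletions + insertions found in QA
    reference_words: int  # word count of the verified reference transcript


def sample_files(file_ids, sample_size, seed=0):
    """Draw a reproducible random sample of file IDs for QA review."""
    ids = list(file_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


def pooled_wer(reviewed):
    """Pool errors and words across files so long files weigh more than short ones."""
    total_words = sum(f.reference_words for f in reviewed)
    if total_words == 0:
        raise ValueError("QA sample contains no reference words")
    return sum(f.word_errors for f in reviewed) / total_words


def needs_another_round(reviewed, target_wer=0.05):
    """Return True when the sampled WER estimate is above the target."""
    return pooled_wer(reviewed) > target_wer


# Made-up numbers for illustration only.
batch = [ReviewedFile("visit-001", 9, 100), ReviewedFile("visit-002", 4, 120)]
print(needs_another_round(batch))
```

Pooling errors over all sampled words, rather than averaging per-file WER, keeps very short recordings from dominating the estimate; either choice is defensible and this one is an assumption of the sketch.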
3) Structured outputs tailored to real clinical use
Beyond “clean transcripts”, the dataset included structured artifacts designed to be directly useful in clinical automation:
- SOAP notes: Produced as structured summaries aligned to the customer’s detailed template requirements (not simply generic notes).
- ICD-CPT coding and insurance tags: Added coding labels commonly used by insurance companies for reimbursement and clinical categorization, with expert review to align tags to patient-centric clinical meaning.
- Anonymization assurance: Our workflow maintained anonymized inputs throughout the production lifecycle.
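To make the shape of such a structured record concrete, the sketch below models one visit as a Python dataclass with the four SOAP sections plus coding fields, and a small completeness check. The customer's actual template is not described in this case study, so every field name here (`visit_id`, `icd_codes`, `cpt_codes`, `insurance_tags`) and the `missing_sections` helper are hypothetical, and the sample values are placeholders rather than real clinical content or real codes.

```python
from dataclasses import dataclass, field


@dataclass
class SoapNoteRecord:
    """One structured note per doctor-patient visit (hypothetical schema)."""
    visit_id: str
    subjective: str   # what the patient reports
    objective: str    # observations and measurements
    assessment: str   # clinician's evaluation
    plan: str         # next steps and treatment
    icd_codes: list = field(default_factory=list)       # diagnosis codes
    cpt_codes: list = field(default_factory=list)       # procedure codes
    insurance_tags: list = field(default_factory=list)  # free-form tags


def missing_sections(note: SoapNoteRecord) -> list:
    """Return the names of SOAP sections that are empty or whitespace-only."""
    sections = ("subjective", "objective", "assessment", "plan")
    return [name for name in sections if not getattr(note, name).strip()]


# Placeholder content only; no real patient data or codes.
note = SoapNoteRecord(
    visit_id="visit-001",
    subjective="<patient-reported symptoms>",
    objective="<observations>",
    assessment="",
    plan="<follow-up plan>",
    icd_codes=["<ICD code>"],
)
print(missing_sections(note))  # prints ['assessment']
```

A check like this catches only structurally incomplete notes; whether the content is clinically correct still depends on expert review of the kind described above.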
The results: WER-driven quality gains for SOAP notes and large-scale healthcare data
The customer received a large, structured medical care dataset designed to increase transcription accuracy, support reliable summarization and accelerate healthcare system AI initiatives for improved patient quality of life.
Measured and operational outcomes included:
- Scale delivered. 6,000 hours of phone-recorded doctor-patient conversations with aligned transcripts in US English.
- Structured deliverables. 14K corresponding structured SOAP notes plus ICD-CPT coding outputs (including insurance tags) suitable for downstream clinical documentation and analytics workflows.
- Expert workforce. Hundreds of linguists and medical specialists contributed to transcription correction, structured note generation and coding QA.
- Privacy maintained. The delivered medical data remained 100% anonymized, supporting responsible use in development and evaluation.
- Reduced WER. We reduced WER from a baseline of approximately 9% to 5%, a relative reduction of roughly 44%, meeting the target of ≤ 5%.