SynthABA Quality Assurance Report

Version: 1.0.0 | Date: April 2026 | Classification: Technical -- For Data Scientists and ML Engineers


Table of Contents

  1. Methodology Overview
  2. Quality Gates
  3. PHI Detection
  4. Schema Validity
  5. Clinical Plausibility
  6. Deduplication
  7. Demographic Distribution
  8. Inter-Rater Reliability
  9. VLayer Compliance Pipeline
  10. Known Limitations
  11. Recommendations for Consumers

1. Methodology Overview

SynthABA enforces a 10-gate quality pipeline. Every batch of synthetic clinical records must pass all 10 gates before it is eligible for release. No partial passes are permitted -- a single gate failure rejects the entire batch.

In addition to automated gates, licensed clinicians (BCBAs, SLPs, OTs, and psychologists) review a stratified sample of each batch using a standardized 5-dimension rubric. Review scores and inter-rater reliability metrics are recorded in the batch metadata and available to consumers.

The pipeline operates on the principle of defense in depth: template-based generation makes certain classes of errors (e.g., PHI leakage) structurally unlikely, but every gate is enforced regardless, providing independent verification at each stage.

Pipeline Architecture

Generation (Claude API + Pydantic templates)
    |
    v
Gate 1: Schema Validation -----> Gate 2: Completeness Check
    |                                  |
    v                                  v
Gate 3: Clinical Consistency     Gate 4: Deduplication
    |                                  |
    v                                  v
Gate 5: PHI Detection            Gate 6: Demographic Balance
    |                                  |
    v                                  v
Gate 7: Vocabulary Compliance    Gate 8: Edge Case Coverage
    |                                  |
    v                                  v
Gate 9: Train/Val/Test Split     Gate 10: Version Control
    |
    v
Clinician Review (5-dimension rubric)
    |
    v
VLayer Passport Issuance --> Release

2. Quality Gates

Every batch is evaluated against these 10 gates sequentially. A gate failure halts the pipeline and triggers either automatic retry (Gates 1-2) or manual review (Gates 3-10).

| Gate | Name | Threshold | Description |
|------|------|-----------|-------------|
| 1 | Schema Validation | 100% valid | Every record validated against its Pydantic v2 schema. Type constraints, field-level min/max, closed enums, and array size limits are enforced. |
| 2 | Completeness | 100% required fields | All required fields present and non-null. Optional fields may be absent but must conform to type when present. |
| 3 | Clinical Consistency | 0 violations | Cross-field consistency: severity-hours alignment, CPT-provider match, age-appropriate diagnoses and interventions. |
| 4 | Deduplication | < 2% near-duplicates | TF-IDF cosine similarity > 0.95 flags a pair as near-duplicate. Batch must have < 2% duplicate rate. |
| 5 | PHI Detection | 0 findings | 163+ regex patterns covering all 18 HIPAA Safe Harbor identifiers. Zero-tolerance: any finding = automatic batch rejection. |
| 6 | Demographic Balance | Within tolerance | Sex ratio within tolerance of ASD prevalence (4:1 male:female per CDC MMWR 2023). Age, severity, insurer, and setting distributions audited against targets. |
| 7 | Vocabulary Compliance | 100% compliant | All enum fields use closed vocabulary sets: behavior functions, severity levels, CPT codes, diagnosis codes, insurer IDs. |
| 8 | Edge Case Coverage | Minimum thresholds | >= 5% severe cases, >= 5% with comorbidities, >= 5% telehealth, >= 5% bilingual (Spanish). |
| 9 | Train/Val/Test Split | Correct proportions | 70/15/15 split, stratified by severity, deterministic seed for reproducibility. |
| 10 | Version Control | Present on all records | generator_version and template_version tracked in every record. Batch ID and SHA-256 hash recorded. |
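The halt-on-first-failure behavior described above can be sketched as follows. This is a minimal illustration, not the SynthABA implementation: `run_gates`, `GateOutcome`, and the `GateCheck` callable type are hypothetical names, and each gate is assumed to be a function that takes the batch and returns a boolean.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical types for illustration only.
Record = dict
GateCheck = Callable[[Sequence[Record]], bool]


@dataclass
class GateOutcome:
    passed: bool
    failed_gate: Optional[int] = None  # 1-based gate number, None when all gates pass
    action: Optional[str] = None       # "retry" or "manual_review"


def run_gates(batch: Sequence[Record], gates: Sequence[GateCheck]) -> GateOutcome:
    """Evaluate gates in order and stop at the first failure."""
    for number, check in enumerate(gates, start=1):
        if not check(batch):
            # Per the text: Gates 1-2 retry automatically, Gates 3-10 go to manual review.
            action = "retry" if number <= 2 else "manual_review"
            return GateOutcome(False, number, action)
    return GateOutcome(True)
```

Because evaluation stops at the first failing gate, later gates never run on a batch that has already failed, which matches the sequential description.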

Gate Pass/Fail Summary (Reference Batch)

| Gate | Result | Metric |
|------|--------|--------|
| 1 -- Schema Validation | PASS | 100.0% (500/500 records valid) |
| 2 -- Completeness | PASS | 100.0% (0 null required fields) |
| 3 -- Clinical Consistency | PASS | 0 violations detected |
| 4 -- Deduplication | PASS | 0.4% duplicate rate (threshold: < 2%) |
| 5 -- PHI Detection | PASS | 0 findings across 163 patterns |
| 6 -- Demographic Balance | PASS | All distributions within tolerance |
| 7 -- Vocabulary Compliance | PASS | 100.0% closed vocabulary adherence |
| 8 -- Edge Case Coverage | PASS | Severe: 24.8%, Comorbid: 18.2%, Telehealth: 5.4%, Bilingual: 6.0% |
| 9 -- Train/Val/Test Split | PASS | 70.0/15.0/15.0 (stratified, seed=42) |
| 10 -- Version Control | PASS | generator_version=1.0.0 on all records |


3. PHI Detection

Design Philosophy

SynthABA's generation pipeline never ingests, transforms, or references real patient data. Records are generated from Pydantic templates and an AI language model's general clinical knowledge. PHI detection is therefore a verification layer, not a remediation layer -- it independently checks the design expectation that no PHI appears in generated records, rather than cleaning up PHI after the fact.

Pattern Coverage

The PHI scanner applies 163+ regex patterns covering all 18 HIPAA Safe Harbor identifier categories:

| # | Identifier Category | Pattern Examples | Status |
|---|---------------------|------------------|--------|
| 1 | Names | First/last name dictionaries, salutation patterns | Covered |
| 2 | Geographic data (sub-state) | Street addresses, ZIP codes, city names | Covered |
| 3 | Dates (except year) | MM/DD/YYYY, YYYY-MM-DD, month-day combos, DOB patterns | Covered |
| 4 | Telephone numbers | (XXX) XXX-XXXX, XXX-XXX-XXXX, international formats | Covered |
| 5 | Fax numbers | Fax-prefixed phone patterns | Covered |
| 6 | Email addresses | RFC 5322 compliant patterns | Covered |
| 7 | Social Security Numbers | XXX-XX-XXXX, XXXXXXXXX | Covered |
| 8 | Medical Record Numbers | MRN-prefixed alphanumeric, hospital ID patterns | Covered |
| 9 | Health plan beneficiary numbers | Policy/member ID patterns | Covered |
| 10 | Account numbers | Numeric sequences with account prefixes | Covered |
| 11 | Certificate/license numbers | State license, DEA, NPI patterns | Covered |
| 12 | Vehicle identifiers | VIN patterns | Covered |
| 13 | Device identifiers | UDI/serial number patterns | Covered |
| 14 | Web URLs | HTTP/HTTPS URL patterns | Covered |
| 15 | IP addresses | IPv4, IPv6 patterns | Covered |
| 16 | Biometric identifiers | Fingerprint, retinal scan references | Covered |
| 17 | Photographs | Image file reference patterns | Covered |
| 18 | Other unique identifiers | Catchall patterns for unclassified PII | Covered |

Policy

  • Zero-tolerance: Any PHI finding triggers automatic batch rejection
  • No manual override: Rejected batches must be regenerated, not patched
  • Audit trail: Every scan result (including clean passes) is logged with timestamp and pattern version
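
To make the zero-tolerance policy concrete, here is a small sketch of a regex scan followed by an all-or-nothing batch decision. The three patterns are simplified illustrations written for this example; they are not taken from the production set of 163+ patterns, and the names `PHI_PATTERNS`, `scan_text`, and `batch_is_releasable` are hypothetical.

```python
import re

# Three illustrative patterns only; the production pattern set is not reproduced here.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for every pattern hit."""
    findings = []
    for category, pattern in PHI_PATTERNS.items():
        findings.extend((category, m.group(0)) for m in pattern.finditer(text))
    return findings


def batch_is_releasable(records: list[str]) -> bool:
    """Zero-tolerance: a single finding in any record rejects the whole batch."""
    return all(not scan_text(record) for record in records)
```

Under this policy a rejected batch is regenerated rather than patched, so the sketch returns only a pass/fail decision and makes no attempt to redact matches.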

4. Schema Validity

Target: 100%

Every record is validated against its document-type-specific Pydantic v2 schema before inclusion in the batch. SynthABA supports 25 document types across 4 disciplines:

| Discipline | Document Types | Count |
|---|---|---|
| ABA | SOAP Notes, ABC Data, Treatment Goals, Insurance Auth, Crisis Plans, Supervision Notes, Discharge Summaries, Progress Reports, Session Data, Assessment Sections | 10 |
| Psychotherapy | Psychotherapy Notes, Psychiatric Evals, Mental Status Exams, Safety Plans, Group Therapy Notes, Family Therapy Notes, Psychological Testing, Substance Abuse Assessments | 8 |
| Speech-Language Pathology | Evaluations, Session Notes, Progress Reports, Treatment Plans | 4 |
| Occupational Therapy | Evaluations, Session Notes, Treatment Plans | 3 |

Constraints Enforced

| Constraint Type | Examples |
|---|---|
| Type enforcement | string, integer, float, boolean, array[T], enum |
| String bounds | min_length / max_length on narrative fields (e.g., SOAP subjective: 50-2000 chars) |
| Numeric bounds | ge / le on quantitative fields (e.g., frequency: 0-500, percentage: 0-100) |
| Closed enums | severity, sex, age_band, insurer, setting, diagnosis_codes, cpt_code |
| Array size limits | min_items / max_items on list fields (e.g., behaviors_observed: 1-10 items) |
| Required vs. optional | Required fields enforced as non-null; optional fields validated when present |

Retry Strategy

Records that fail schema validation are retried once at a lower generation temperature (0.3 vs. default 0.7). If the retry also fails, the record is dropped and a replacement is generated. The batch metadata records the number of retries and drops.
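
The retry rule can be sketched as below. The two temperatures come from the text; everything else is assumed for illustration: `generate` stands in for the generation call, `is_valid` for schema validation, and `stats` for the retry/drop counters kept in batch metadata.

```python
from typing import Callable, Optional

DEFAULT_TEMPERATURE = 0.7  # default generation temperature, from the text
RETRY_TEMPERATURE = 0.3    # lower temperature used for the single retry


def generate_valid_record(
    generate: Callable[[float], dict],  # hypothetical: temperature -> raw record
    is_valid: Callable[[dict], bool],   # hypothetical: schema validation result
    stats: dict,
) -> Optional[dict]:
    """Try once at the default temperature, then once at the retry temperature."""
    record = generate(DEFAULT_TEMPERATURE)
    if is_valid(record):
        return record
    stats["retries"] = stats.get("retries", 0) + 1
    record = generate(RETRY_TEMPERATURE)
    if is_valid(record):
        return record
    # Both attempts failed: count a drop; the caller generates a replacement.
    stats["drops"] = stats.get("drops", 0) + 1
    return None
```

Returning `None` on a drop leaves the decision about how to generate the replacement to the caller, which keeps the counters in one place.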


5. Clinical Plausibility

Clinical plausibility checks ensure that generated records reflect realistic clinical practice. These are cross-field consistency rules that go beyond schema validation.

5.1 Severity-Hours Consistency

Treatment hours must align with clinical severity. The following ranges are enforced:

| Severity | Authorized Hours/Week | Rationale |
|---|---|---|
| Mild (Level 1) | 5--20 hours | Lower intensity; focus on skill acquisition and parent training |
| Moderate (Level 2) | 15--30 hours | Moderate intensity; direct therapy with behavior reduction |
| Severe (Level 3) | 25--40 hours | High intensity; comprehensive ABA with crisis protocols |

Overlapping ranges reflect real-world clinical variability. A record with severity: mild and authorized_hours_weekly: 35 would be flagged and rejected.
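
A minimal sketch of this rule, assuming inclusive bounds and lowercase severity labels (`mild`, `moderate`, `severe`); the function name `hours_consistent` is hypothetical.

```python
# Ranges copied from the table above; inclusive bounds are an assumption.
HOURS_BY_SEVERITY = {
    "mild": (5, 20),
    "moderate": (15, 30),
    "severe": (25, 40),
}


def hours_consistent(severity: str, authorized_hours_weekly: float) -> bool:
    """True when the authorized weekly hours fall inside the range for the severity."""
    low, high = HOURS_BY_SEVERITY[severity]
    return low <= authorized_hours_weekly <= high
```

With these ranges, `hours_consistent("mild", 35)` is `False`, matching the rejected example, while 20 hours is valid for both mild and moderate because the ranges overlap.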

5.2 CPT-Provider Alignment

Billing codes must correspond to the correct provider credential:

| CPT Code | Expected Provider | Service Description |
|---|---|---|
| 97151 | BCBA | Behavior identification assessment |
| 97152 | BCBA / RBT (under supervision) | Behavior identification supporting assessment |
| 97153 | RBT | Adaptive behavior treatment by protocol |
| 97154 | RBT (group) | Group adaptive behavior treatment |
| 97155 | BCBA | Adaptive behavior treatment with modification |
| 97156 | BCBA | Family adaptive behavior treatment guidance |
| 97157 | BCBA (group) | Multiple family group treatment guidance |
| 97158 | BCBA | Group behavior treatment with modification |

A record billing 97153 with provider_role: BCBA (instead of RBT) would be flagged as inconsistent.
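
One way to express this check is a lookup from CPT code to the set of acceptable roles. The mapping below is a simplification of the table: it ignores the group and supervision qualifiers, and the names `EXPECTED_ROLES` and `cpt_provider_consistent` are hypothetical.

```python
# Expected credential per CPT code, simplified from the table above.
# 97152 accepts either role; group/supervision qualifiers are not modelled.
EXPECTED_ROLES = {
    "97151": {"BCBA"},
    "97152": {"BCBA", "RBT"},
    "97153": {"RBT"},
    "97154": {"RBT"},
    "97155": {"BCBA"},
    "97156": {"BCBA"},
    "97157": {"BCBA"},
    "97158": {"BCBA"},
}


def cpt_provider_consistent(cpt_code: str, provider_role: str) -> bool:
    """True when the provider role is allowed to bill the given CPT code."""
    # Codes outside the closed vocabulary are treated as inconsistent here.
    return provider_role in EXPECTED_ROLES.get(cpt_code, set())
```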

5.3 Age-Appropriate Diagnoses

Diagnosis codes are validated against the patient's age band:

  • F84.0 (Autistic disorder): Valid for all age bands
  • F84.5 (Asperger syndrome): Primarily used for age bands 6-11 and older (historical diagnosis, less common in younger children)
  • F70/F71 (Intellectual disability): Valid for all age bands, but F71 (moderate) constrained to align with severity
  • F90.0/F90.2 (ADHD): Valid as comorbidity for age bands 4-5 and older

5.4 Quantitative Data Ranges

| Metric | Valid Range | Clinical Context |
|---|---|---|
| Behavior frequency | 0--500 per session | High frequencies possible for stereotypy; low for aggression |
| Percentage correct (trials) | 0--100% | Mastery typically >= 80% |
| Duration (minutes) | 0 -- session length | Cannot exceed session duration |
| Interval recording (%) | 0--100% | Partial/whole interval or momentary time sampling |
| Rate (per minute) | 0.0--50.0 | High rates for rapid behaviors like hand-flapping |


6. Deduplication

Method

  1. Text extraction: All free-text fields (narratives, observations, recommendations) are concatenated per record
  2. TF-IDF vectorization: Scikit-learn TfidfVectorizer with English stop words removed, max 10,000 features
  3. Cosine similarity matrix: Pairwise cosine similarity computed across all records in the batch
  4. Flagging: Pairs with cosine similarity > 0.95 are flagged as near-duplicates
  5. Deduplication: The later-generated record in each flagged pair is removed
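
The steps above can be illustrated with a dependency-free stand-in. This sketch does not use scikit-learn: it tokenizes on whitespace, skips stop-word removal and the 10,000-feature cap, and its weights will not match `TfidfVectorizer` exactly. It is meant only to show the flag-and-drop logic; all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts: list[str]) -> list[dict[str, float]]:
    """Plain TF-IDF with smoothed IDF; whitespace tokens, no stop words, no cap."""
    docs = [Counter(text.lower().split()) for text in texts]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def near_duplicates(texts: list[str], threshold: float = 0.95) -> set[int]:
    """Indices to drop: the later record of each pair with similarity > threshold."""
    vectors = tfidf_vectors(texts)
    drop = set()
    for i, j in combinations(range(len(texts)), 2):
        if cosine(vectors[i], vectors[j]) > threshold:
            drop.add(j)  # j > i, so j is the later-generated record
    return drop
```

The pairwise loop is quadratic in batch size, which is acceptable for a sketch; a production implementation would more likely compute the similarity matrix with sparse matrix operations.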

Targets

| Metric | Threshold | Typical Result |
|---|---|---|
| Near-duplicate rate | < 2.0% | 0.2% -- 0.8% |
| Similarity threshold | > 0.95 | Fixed |
| Feature dimensionality | <= 10,000 | Fixed |

Why Duplicate Rates Are Low

SynthABA achieves low duplicate rates through:

  • Seeded random parameter variation: Each record uses a unique combination of age band, severity, insurer, setting, and clinical parameters
  • Temperature-controlled generation: Default temperature of 0.7 provides lexical diversity while maintaining clinical accuracy
  • Template variation: Multiple prompt templates per document type, randomly selected

7. Demographic Distribution

Demographic balance targets reduce the risk that models trained on SynthABA data inherit systematic bias from skewed distributions. Each distribution below is audited against a target share and a tolerance.

7.1 Age Distribution

| Age Band | Target | Tolerance | Rationale |
|---|---|---|---|
| 2--3 | 15% | +/- 3% | Early intervention population |
| 4--5 | 25% | +/- 3% | Peak diagnosis age (CDC, 2023) |
| 6--11 | 35% | +/- 3% | Largest treatment population |
| 12--17 | 20% | +/- 3% | Adolescent ABA services |
| 18+ | 5% | +/- 2% | Adult services (emerging) |
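
A tolerance audit of this kind can be sketched as follows. The targets and tolerances are taken from the age table; the band labels (`"2-3"`, `"18+"`, and so on) and the function name `out_of_tolerance` are assumptions, and tolerances are read as absolute percentage points.

```python
from collections import Counter

# Age-band targets and tolerances from the table above, as fractions.
# The band label strings are hypothetical.
AGE_TARGETS = {
    "2-3": (0.15, 0.03),
    "4-5": (0.25, 0.03),
    "6-11": (0.35, 0.03),
    "12-17": (0.20, 0.03),
    "18+": (0.05, 0.02),
}


def out_of_tolerance(values: list[str], targets: dict) -> dict[str, float]:
    """Return {category: observed_share} for categories outside target +/- tolerance."""
    if not values:
        raise ValueError("cannot audit an empty batch")
    counts = Counter(values)
    violations = {}
    for category, (target, tolerance) in targets.items():
        observed = counts.get(category, 0) / len(values)
        # Small epsilon so shares exactly on the boundary are not flagged by float error.
        if abs(observed - target) > tolerance + 1e-9:
            violations[category] = observed
    return violations
```

The same function applies to the sex, severity, and setting tables by passing a different targets dictionary.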

7.2 Sex Distribution

| Sex | Target | Tolerance | Rationale |
|---|---|---|---|
| Male | 72% | +/- 5% | ASD prevalence ratio ~4:1 male:female (CDC MMWR, 2023) |
| Female | 28% | +/- 5% | Historically underdiagnosed; 28% reflects current prevalence data |

7.3 Severity Distribution

| Severity | Target | Tolerance | Rationale |
|---|---|---|---|
| Mild (Level 1) | 30% | +/- 5% | Largest clinical population by volume |
| Moderate (Level 2) | 45% | +/- 5% | Most common treatment intensity |
| Severe (Level 3) | 25% | +/- 5% | Higher support needs; critical for model generalization |

7.4 Insurer Distribution

| Insurer | Target | Rationale |
|---|---|---|
| UnitedHealthcare (UHC) | ~17% | Major national payer |
| Blue Cross Blue Shield (BCBS) | ~17% | Major national payer |
| Aetna | ~17% | Major national payer |
| Cigna | ~17% | Major national payer |
| TRICARE | ~16% | Military/veteran families |
| Medicaid FL | ~16% | State Medicaid program |

Insurer distribution is approximately uniform to avoid payer-specific bias. Each insurer has unique authorization and documentation requirements that are reflected in the generated records.

7.5 Service Setting Distribution

| Setting | Target | Tolerance | Rationale |
|---|---|---|---|
| Clinic | 50% | +/- 5% | Most common ABA service setting |
| Home | 30% | +/- 5% | Second most common; different documentation patterns |
| School | 15% | +/- 3% | School-based ABA services |
| Telehealth | 5% | +/- 2% | Growing modality post-COVID; distinct documentation |


8. Inter-Rater Reliability

Review Protocol

Licensed clinicians evaluate a stratified sample (minimum 10% of batch) across 5 dimensions using a standardized rubric.

5-Dimension Review Rubric

| Dimension | Description | Scale |
|---|---|---|
| Terminology Accuracy | Clinical terms used correctly and consistently (e.g., "extinction burst" vs. "tantrum increase") | 1--5 |
| Clinical Plausibility | Case presentation is realistic; symptoms, behaviors, and interventions align logically | 1--5 |
| Assessment Logic | Functional behavior assessments, diagnostic reasoning, and data interpretation are sound | 1--5 |
| Plan Appropriateness | Treatment plans, goals, and recommended interventions are clinically appropriate for the case | 1--5 |
| Documentation Standards | Formatting, structure, and completeness meet industry documentation standards | 1--5 |

Agreement Metrics

| Metric | Method | Target | Interpretation (Landis & Koch, 1977) |
|---|---|---|---|
| Cohen's Kappa | 2 reviewers, per dimension | >= 0.61 | Substantial agreement |
| Fleiss' Kappa | 3+ reviewers, per dimension | >= 0.61 | Substantial agreement |
| Weighted Kappa | Ordinal scores (1--5 scale) | >= 0.61 | Accounts for degree of disagreement |

Interpretation Scale

| Kappa Range | Agreement Level |
|---|---|
| 0.81 -- 1.00 | Almost perfect |
| 0.61 -- 0.80 | Substantial |
| 0.41 -- 0.60 | Moderate |
| 0.21 -- 0.40 | Fair |
| 0.00 -- 0.20 | Slight |
| < 0.00 | Poor (less than chance) |
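
As a worked illustration of the two-reviewer case, the sketch below computes unweighted Cohen's kappa, (p_o - p_e) / (1 - p_e), and maps it onto the scale above. The weighted and Fleiss variants are not shown. The function names are hypothetical, and values that fall between the printed ranges (for example 0.205) are assigned to the next band up, which is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of records")
    n = len(rater_a)
    # Observed agreement: share of records with identical scores.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[score] * freq_b[score] for score in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa: float) -> str:
    """Map a kappa value to the agreement level in the table above."""
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return label
    return "Almost perfect"
```

For example, with scores `[5, 4, 4, 3, 5, 2, 4, 4, 3, 5]` and `[5, 4, 3, 3, 5, 2, 4, 5, 3, 5]`, observed agreement is 0.80 and chance agreement is 0.27, giving kappa = 0.53 / 0.73, roughly 0.73, which is "Substantial" and meets the 0.61 target.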

Review Workflow

  1. Sampling: Records stratified by severity (mild/moderate/severe) to ensure coverage
  2. Blinding: Reviewers do not see each other's scores during evaluation
  3. Scoring: Each reviewer scores all 5 dimensions independently (1--5 scale)
  4. Reconciliation: Any dimension whose Kappa falls below 0.61 triggers a reconciliation round
  5. Reporting: Final Kappa scores, mean dimension scores, and reconciliation notes are included in batch metadata

9. VLayer Compliance Pipeline

VLayer provides an immutable audit trail for every released batch. The compliance pipeline has 4 stages:

Pipeline Stages

Stage 1: Ingest         Stage 2: PHI Scan       Stage 3: Synthetic       Stage 4: Release
                                                  Validation
+-----------+          +-----------+          +-----------+          +-----------+
| Receive   |  ---->   | Run 163+  |  ---->   | Verify    |  ---->   | Issue     |
| batch     |          | PHI regex |          | synthetic |          | VLayer    |
| from      |          | patterns  |          | flag +    |          | passport  |
| generator |          |           |          | schema    |          |           |
+-----------+          +-----------+          +-----------+          +-----------+

VLayer Passport Schema

Every released batch receives a VLayer passport containing:

| Field | Type | Description |
|---|---|---|
| asset_id | string (UUID) | Unique identifier for the dataset asset |
| passport_id | string (UUID) | Unique identifier for this passport issuance |
| phi_findings | integer | Number of PHI detections (must be 0 for release) |
| synthetic_verified | boolean | Whether all records pass synthetic verification (must be true) |
| schema_valid | boolean | Whether all records pass schema validation (must be true) |
| batch_sha256 | string | SHA-256 hash of the complete batch contents for tamper detection |
| issued_at | string (ISO 8601) | Timestamp of passport issuance |
| generator_version | string | Pipeline version that produced the batch |
| gate_results | object | Pass/fail status for all 10 quality gates |
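
A consumer-side sanity check on a passport might look like the sketch below. The field names come from the table; the internal shape of `gate_results` is not documented here, so the sketch assumes a mapping from gate identifier to the string `"PASS"` or `"FAIL"`. The function name `passport_problems` is hypothetical.

```python
REQUIRED_FIELDS = {
    "asset_id", "passport_id", "phi_findings", "synthetic_verified",
    "schema_valid", "batch_sha256", "issued_at", "generator_version",
    "gate_results",
}


def passport_problems(passport: dict) -> list[str]:
    """Return a list of reasons the passport does not describe a releasable batch."""
    missing = sorted(REQUIRED_FIELDS - set(passport))
    if missing:
        return [f"missing field: {name}" for name in missing]
    problems = []
    if passport["phi_findings"] != 0:
        problems.append(f"phi_findings is {passport['phi_findings']}, must be 0")
    for flag in ("synthetic_verified", "schema_valid"):
        if passport[flag] is not True:
            problems.append(f"{flag} must be true")
    # Assumption: gate_results maps a gate identifier to "PASS" or "FAIL".
    for gate, result in passport["gate_results"].items():
        if result != "PASS":
            problems.append(f"gate {gate} did not pass")
    return problems
```

An empty list means every release condition stated in the table holds; it says nothing about whether the data files match `batch_sha256`, which is a separate check.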

Tamper Detection

The batch_sha256 hash covers the concatenation of all data files (train.json, validation.json, test.json), the batch metadata, and the quality report. Any modification to released data will produce a different hash, enabling consumers to verify batch integrity independently:

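import hashlib
from pathlib import Path

# Files covered by batch_sha256, hashed in sorted (alphabetical) order.
BATCH_FILES = sorted(["train.json", "validation.json", "test.json",
                      "batch_metadata.json", "quality_report.json"])

def verify_batch(batch_dir: str, expected_sha256: str) -> bool:
    """Verify batch integrity against the VLayer passport hash."""
    hasher = hashlib.sha256()
    for filename in BATCH_FILES:
        # Stream each file in 1 MiB chunks so large splits are not loaded whole.
        with open(Path(batch_dir) / "data" / filename, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                hasher.update(chunk)
    # Hex digests are case-insensitive; normalize before comparing.
    return hasher.hexdigest() == expected_sha256.strip().lower()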

10. Known Limitations

Transparency about dataset limitations is essential for responsible use. The following limitations are documented to help consumers make informed decisions.

| Limitation | Impact | Mitigation |
|---|---|---|
| Adult ASD (18+) | 5% representation; less clinical depth than pediatric cases | Weight sampling or oversample if adult cases are your primary use case |
| Rare comorbidities | Rett syndrome, Fragile X, Angelman syndrome not explicitly modeled | Supplement with domain-specific data if rare comorbidity recognition is required |
| Bilingual (Spanish) | Available but model has stronger English performance; clinical terminology may be less varied | Review Spanish samples with native-speaking clinician before use in production |
| TRICARE documentation | TRICARE-specific authorization and documentation requirements are underrepresented | Cross-reference with TRICARE ABA policy manuals for compliance-critical applications |
| Newer discipline templates | Psychotherapy, SLP, and OT templates have fewer clinician review hours than ABA templates | Check clinician_review_hours in batch metadata; ABA templates have ~3x more review time |
| Temporal patterns | No longitudinal patient trajectories; each record is independent | Not suitable for time-series or treatment outcome prediction without additional engineering |
| Geographic variation | Insurer requirements based primarily on Florida Medicaid and national payers | State-specific Medicaid rules outside Florida may not be accurately represented |


11. Recommendations for Consumers

Data Selection

  • Use provided splits: Train/val/test files are pre-split (70/15/15), stratified by severity, and deterministically seeded. Respect these splits to ensure reproducible benchmarks.
  • Filter by case_type: For discipline-specific model training, filter to the relevant document types rather than training on the full 25-type corpus.
  • Check quality_score: Batch metadata includes a quality_score field (0.0--1.0). We recommend using batches with quality_score > 0.85 for production model training.

Demographic Auditing

  • Review the demographic_audit section in batch metadata before training. Verify that the distribution matches your use case.
  • If your application disproportionately serves a specific subpopulation (e.g., female patients, severe cases), consider rebalancing the training set accordingly.

Schema Validation

  • Independently validate records against the JSON Schema files included in each batch (documentation/schemas/). Do not rely solely on SynthABA's validation -- verify with your own tooling.

Version Tracking

  • Pin your training data to a specific generator_version. Different versions may produce records with different field distributions or narrative styles.
  • Record the batch_sha256 from the VLayer passport in your experiment tracking system for full reproducibility.

Combining with Real Data

  • SynthABA data is designed to bootstrap models, not replace real clinical data entirely. For production deployment, fine-tune on validated real-world data when available.
  • Use SynthABA for pre-training or data augmentation, then evaluate on held-out real data to measure transfer performance.

References

  • CDC MMWR (2023). Prevalence and Characteristics of Autism Spectrum Disorder Among Children Aged 8 Years -- Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2020. MMWR Surveillance Summaries, 72(SS-2), 1--14.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159--174.
  • HIPAA Privacy Rule, 45 CFR Section 164.514(b)(2). Safe Harbor method for de-identification.
  • SynthABA HIPAA Position Paper, v1.0 (April 2026).
  • SynthABA Data Dictionary, v1.0.0.
  • SynthABA Technical Integration Guide, v1.0.0.

This report is generated as part of the SynthABA quality assurance process and accompanies every released dataset batch. For questions, contact the SynthABA engineering team.