SynthABA Quality Assurance Report

Version: 1.0.0 | Date: April 2026 | Classification: Technical -- For Data Scientists and ML Engineers

Methodology Overview
Quality Gates
PHI Detection
Schema Validity
Clinical Plausibility
Deduplication
Demographic Distribution
Inter-Rater Reliability
Automated Compliance Pipeline
Known Limitations
Recommendations for Consumers
Simula Pipeline Enhancements

1. Methodology Overview

SynthABA enforces an 8-gate quality pipeline; per-gate results are sealed in each document's provenance passport. Every batch of synthetic clinical records must pass all 8 gates before it is eligible for release. No partial passes are permitted -- a single gate failure rejects the entire batch. Alongside the gates, the quality ledger records two informational entries -- train/val/test split proportions and version-control stamps -- which document batch accounting metadata and are not pass/fail gates (see Section 2).

In addition to automated gates, clinical templates and reference materials are reviewed by Board Certified Behavior Analysts using a standardized 5-dimension rubric. The scope of human validation applied to any given document is declared in its provenance passport (human_validation.scope) -- never implied.

The pipeline operates on the principle of defense in depth: template-based generation makes certain classes of errors (e.g., PHI leakage) structurally unlikely, but every gate is enforced regardless, providing independent verification at each stage.

Pipeline Architecture

Generation (Claude API + Pydantic templates)
    |
    v
Gate 1: Schema Validation -----> Gate 2: Completeness Check
    |                                  |
    v                                  v
Gate 3: Clinical Consistency     Gate 4: Deduplication
    |                                  |
    v                                  v
Gate 5: PHI Detection            Gate 6: Demographic Balance
    |                                  |
    v                                  v
Gate 7: Vocabulary Compliance    Gate 8: Edge Case Coverage
    |
    v
Ledger entries (informational): Train/Val/Test Split, Version Control
    |
    v
Clinician Review (templates & gold-set; scope declared in passport)
    |
    v
Provenance Passport Sealing + Batch Manifest --> Release

2. Quality Gates

Every batch is evaluated against these 8 gates sequentially. A gate failure halts the pipeline and triggers either automatic retry (Gates 1-2) or manual review (Gates 3-8).

| Gate | Name | Threshold | Description |
|------|------|-----------|-------------|
| 1 | Schema Validation | 100% valid | Every record validated against its Pydantic v2 schema. Type constraints, field-level min/max, closed enums, and array size limits are enforced. |
| 2 | Completeness | 100% required fields | All required fields present and non-null. Optional fields may be absent but must conform to type when present. |
| 3 | Clinical Consistency | 0 violations | Cross-field consistency: severity-hours alignment, CPT-provider match, age-appropriate diagnoses and interventions. |
| 4 | Deduplication | < 2% near-duplicates | TF-IDF cosine similarity > 0.95 flags a pair as near-duplicate. Batch must have < 2% duplicate rate. |
| 5 | PHI Detection | 0 findings | 9-pattern pre-screen + validity-aware content verifier (phi-content-1.0.0); coverage per Safe Harbor category detailed in Section 3. Zero-tolerance: any critical finding = automatic batch rejection. |
| 6 | Demographic Balance | Within tolerance | Sex ratio within tolerance of ASD prevalence (4:1 male:female per CDC MMWR 2023). Age, severity, insurer, and setting distributions audited against targets. |
| 7 | Vocabulary Compliance | 100% compliant | All enum fields use closed vocabulary sets: behavior functions, severity levels, CPT codes, diagnosis codes, insurer IDs. |
| 8 | Edge Case Coverage | Minimum thresholds | >= 5% severe cases, >= 5% with comorbidities, >= 5% telehealth, >= 5% bilingual (Spanish). |

Informational Ledger Entries

Two additional entries are recorded in the quality ledger alongside the gates. They document batch accounting metadata -- they are informational, not pass/fail gates:

| Entry | Name | Recorded |
|-------|------|----------|
| A | Train/Val/Test Split | 70/15/15 split, stratified by severity, deterministic seed for reproducibility. |
| B | Version Control | generator_version and template_version tracked in every record. Batch ID and SHA-256 hash recorded. |
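A stratified, seeded split of the kind recorded in entry A can be sketched as follows. This is a minimal illustration, not SynthABA's splitter: the function name `stratified_split`, the dict-shaped records, and the per-stratum rounding are assumptions; only the 70/15/15 ratios, the severity stratification, and the seed value 42 come from this report.

```python
import random
from collections import defaultdict

def stratified_split(records, key="severity", ratios=(0.70, 0.15, 0.15), seed=42):
    """Split records into train/val/test, stratified by `key`, reproducibly."""
    rng = random.Random(seed)  # local RNG: same seed gives the same split
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[rec[key]].append(rec)

    train, val, test = [], [], []
    for stratum in sorted(by_stratum):  # sorted for a stable iteration order
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Because rounding happens per stratum, the overall proportions can deviate slightly from 70/15/15 on small batches.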

Gate Pass/Fail Summary (Reference Batch)

| Gate | Result | Metric |
|------|--------|--------|
| 1 -- Schema Validation | PASS | 100.0% (500/500 records valid) |
| 2 -- Completeness | PASS | 100.0% (0 null required fields) |
| 3 -- Clinical Consistency | PASS | 0 violations detected |
| 4 -- Deduplication | PASS | 0.4% duplicate rate (threshold: < 2%) |
| 5 -- PHI Detection | PASS | 0 findings (9-pattern pre-screen + phi-content-1.0.0 content verifier) |
| 6 -- Demographic Balance | PASS | All distributions within tolerance |
| 7 -- Vocabulary Compliance | PASS | 100.0% closed vocabulary adherence |
| 8 -- Edge Case Coverage | PASS | Severe: 24.8%, Comorbid: 18.2%, Telehealth: 5.4%, Bilingual: 6.0% |
| A -- Train/Val/Test Split | RECORDED (informational) | 70.0/15.0/15.0 (stratified, seed=42) |
| B -- Version Control | RECORDED (informational) | generator_version=1.0.0 on all records |

3. PHI Detection

Design Philosophy

SynthABA's generation pipeline never ingests, transforms, or references real patient data. Records are generated from Pydantic templates and an AI language model's general clinical knowledge. PHI detection is therefore a verification layer, not a remediation layer: it independently checks that this structural safeguard holds and that no PHI-like content appears in generated records.

Pattern Coverage

The delivery-path PHI verification runs two automated layers on every document: a 9-pattern quality-gate pre-screen (Gate 5) and the validity-aware content verifier phi-content-1.0.0 (6 rules). Together they cover the highest-signal HIPAA Safe Harbor identifier categories today; the remaining categories are structurally prevented by template-based generation (no real data enters the pipeline), and dedicated detectors for them are on the roadmap as part of the broader 163+ pattern deep-scan expansion. What ran on any given document is sealed in its provenance passport.

| # | Identifier Category | Detection Today | Status |
|---|---------------------|-----------------|--------|
| 1 | Names | Common given-name + surname combinations (pre-screen) | Covered today (pre-screen) |
| 2 | Geographic data (sub-state) | Street addresses (pre-screen); address + state + ZIP (verifier) | Covered today |
| 3 | Dates (except year) | MM/DD/YYYY and full-month dates (pre-screen) | Covered today (pre-screen) |
| 4 | Telephone numbers | Formatted US numbers (both layers); verifier excludes the reserved fictional 555-01XX range | Covered today |
| 5 | Fax numbers | -- | Roadmap (deep scan) |
| 6 | Email addresses | Address patterns (pre-screen); real-provider-domain check (verifier) | Covered today |
| 7 | Social Security Numbers | XXX-XX-XXXX (pre-screen); structural validity check (verifier) | Covered today |
| 8 | Medical Record Numbers | MRN-prefixed patterns (both layers) | Covered today |
| 9 | Health plan beneficiary numbers | -- | Roadmap (deep scan) |
| 10 | Account numbers | -- | Roadmap (deep scan) |
| 11 | Certificate/license numbers | Labeled, Luhn-valid NPI (verifier) | Covered today (NPI) |
| 12 | Vehicle identifiers | -- | Roadmap (deep scan) |
| 13 | Device identifiers | -- | Roadmap (deep scan) |
| 14 | Web URLs | -- | Roadmap (deep scan) |
| 15 | IP addresses | -- | Roadmap (deep scan) |
| 16 | Biometric identifiers | -- | Roadmap (deep scan) |
| 17 | Photographs | -- | Roadmap (deep scan) |
| 18 | Other unique identifiers | -- | Roadmap (deep scan) |
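To make "validity-aware" concrete, the sketch below shows three checks of the kind the table describes: SSN structural validity, the NPI Luhn check, and the reserved fictional phone range. These are not the phi-content-1.0.0 rules themselves; the function names are hypothetical, and the logic relies only on publicly documented conventions (SSA numbering rules, the NPI check digit computed with the 80840 prefix, and the 555-0100 to 555-0199 range reserved for fiction).

```python
import re

def ssn_is_structurally_valid(ssn: str) -> bool:
    """True if an XXX-XX-XXXX string could be an issued SSN."""
    m = re.fullmatch(r"(\d{3})-(\d{2})-(\d{4})", ssn)
    if not m:
        return False
    area, group, serial = m.groups()
    # Areas 000, 666, and 900-999 are never issued.
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"

def npi_is_luhn_valid(npi: str) -> bool:
    """True if a 10-digit NPI passes the Luhn check with the 80840 prefix."""
    if not re.fullmatch(r"\d{10}", npi):
        return False
    total = 0
    for i, ch in enumerate(reversed("80840" + npi)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

def phone_is_reserved_fictional(phone: str) -> bool:
    """True for NNN-555-01XX numbers, the range reserved for fictional use."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    return len(digits) == 10 and digits[3:6] == "555" and digits[6:8] == "01"
```

Under this reading, a pattern match only becomes a finding when the value could be real: a structurally impossible SSN or a 555-01XX phone number is treated as an obvious placeholder. That interpretation is an assumption about how the verifier uses such checks.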

Policy

Zero-tolerance: Any PHI finding triggers automatic batch rejection
No manual override: Rejected batches must be regenerated, not patched
Audit trail: Every scan result (including clean passes) is logged with timestamp and pattern version

4. Schema Validity

Target: 100%

Every record is validated against its document-type-specific Pydantic v2 schema before inclusion in the batch. SynthABA supports 25 document types across 4 disciplines:

| Discipline | Document Types | Count |
|---|---|---|
| ABA | SOAP Notes, ABC Data, Treatment Goals, Insurance Auth, Crisis Plans, Supervision Notes, Discharge Summaries, Progress Reports, Session Data, Assessment Sections | 10 |
| Psychotherapy | Psychotherapy Notes, Psychiatric Evals, Mental Status Exams, Safety Plans, Group Therapy Notes, Family Therapy Notes, Psychological Testing, Substance Abuse Assessments | 8 |
| Speech-Language Pathology | Evaluations, Session Notes, Progress Reports, Treatment Plans | 4 |
| Occupational Therapy | Evaluations, Session Notes, Treatment Plans | 3 |

Constraints Enforced

| Constraint Type | Examples |
|---|---|
| Type enforcement | string, integer, float, boolean, array[T], enum |
| String bounds | min_length / max_length on narrative fields (e.g., SOAP subjective: 50-2000 chars) |
| Numeric bounds | ge / le on quantitative fields (e.g., frequency: 0-500, percentage: 0-100) |
| Closed enums | severity, sex, age_band, insurer, setting, diagnosis_codes, cpt_code |
| Array size limits | min_length / max_length on list fields, exported as minItems / maxItems in JSON Schema (e.g., behaviors_observed: 1-10 items) |
| Required vs. optional | Required fields enforced as non-null; optional fields validated when present |

Retry Strategy

Records that fail schema validation are retried once at a lower generation temperature (0.3 vs. default 0.7). If the retry also fails, the record is dropped and a replacement is generated. The batch metadata records the number of retries and drops.
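The retry-then-replace flow can be sketched as below. The `generate` and `validate` callables, the `stats` dict, and the `max_attempts` safety cap are hypothetical stand-ins for the real generator and schema validator; only the two temperatures and the retry-once, drop, and replace behaviour come from the description above.

```python
DEFAULT_TEMPERATURE = 0.7
RETRY_TEMPERATURE = 0.3

def generate_with_retry(generate, validate, stats):
    """Generate one record; retry once at lower temperature; drop on second failure."""
    record = generate(DEFAULT_TEMPERATURE)
    if validate(record):
        return record
    stats["retries"] += 1
    record = generate(RETRY_TEMPERATURE)  # single retry at the lower temperature
    if validate(record):
        return record
    stats["drops"] += 1  # dropped; the caller generates a replacement
    return None

def fill_batch(n, generate, validate, max_attempts=1000):
    """Keep generating until n valid records exist; return records and counters."""
    stats = {"retries": 0, "drops": 0}
    batch, attempts = [], 0
    while len(batch) < n and attempts < max_attempts:
        attempts += 1
        record = generate_with_retry(generate, validate, stats)
        if record is not None:
            batch.append(record)
    return batch, stats
```

The returned `stats` dict corresponds to the retry and drop counts that the batch metadata records.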

5. Clinical Plausibility

Clinical plausibility checks ensure that generated records reflect realistic clinical practice. These are cross-field consistency rules that go beyond schema validation.

5.1 Severity-Hours Consistency

Treatment hours must align with clinical severity. The following ranges are enforced:

| Severity | Authorized Hours/Week | Rationale |
|---|---|---|
| Mild (Level 1) | 5--20 hours | Lower intensity; focus on skill acquisition and parent training |
| Moderate (Level 2) | 15--30 hours | Moderate intensity; direct therapy with behavior reduction |
| Severe (Level 3) | 25--40 hours | High intensity; comprehensive ABA with crisis protocols |

Overlapping ranges reflect real-world clinical variability. A record with severity: mild and authorized_hours_weekly: 35 would be flagged and rejected.
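As a minimal sketch, the rule reduces to a range lookup. The lowercase severity keys `moderate` and `severe` and the function name are assumptions (the report only shows `mild` as a literal value); the hour ranges are copied from the table.

```python
# Inclusive weekly-hour ranges per severity, taken from the table above.
HOURS_BY_SEVERITY = {
    "mild": (5, 20),
    "moderate": (15, 30),
    "severe": (25, 40),
}

def hours_consistent(severity: str, authorized_hours_weekly: float) -> bool:
    """True if the authorized hours fall inside the range for the severity level."""
    low, high = HOURS_BY_SEVERITY[severity]
    return low <= authorized_hours_weekly <= high

print(hours_consistent("mild", 35))      # False: flagged, as in the example above
print(hours_consistent("moderate", 20))  # True: 20 hours sits in the mild/moderate overlap
```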

5.2 CPT-Provider Alignment

Billing codes must correspond to the correct provider credential:

| CPT Code | Expected Provider | Service Description |
|---|---|---|
| 97151 | BCBA | Behavior identification assessment |
| 97152 | BCBA / RBT (under supervision) | Behavior identification supporting assessment |
| 97153 | RBT | Adaptive behavior treatment by protocol |
| 97154 | RBT (group) | Group adaptive behavior treatment |
| 97155 | BCBA | Adaptive behavior treatment with modification |
| 97156 | BCBA | Family adaptive behavior treatment guidance |
| 97157 | BCBA (group) | Multiple family group treatment guidance |
| 97158 | BCBA | Group behavior treatment with modification |

A record billing 97153 with provider_role: BCBA (instead of RBT) would be flagged as inconsistent.
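The alignment check can be sketched as a lookup from CPT code to permitted roles. The mapping is transcribed from the table; the function name, the dict-shaped record, and the message format are illustrative assumptions.

```python
# Permitted provider roles per CPT code, transcribed from the table above.
ALLOWED_ROLES_BY_CPT = {
    "97151": {"BCBA"},
    "97152": {"BCBA", "RBT"},
    "97153": {"RBT"},
    "97154": {"RBT"},
    "97155": {"BCBA"},
    "97156": {"BCBA"},
    "97157": {"BCBA"},
    "97158": {"BCBA"},
}

def cpt_provider_violations(record: dict) -> list:
    """Return a list of violation messages (empty when the record is consistent)."""
    code, role = record["cpt_code"], record["provider_role"]
    allowed = ALLOWED_ROLES_BY_CPT.get(code)
    if allowed is None:
        return [f"unknown CPT code {code}"]
    if role not in allowed:
        return [f"CPT {code} expects one of {sorted(allowed)}, got {role}"]
    return []
```

Returning a list rather than a boolean lets a Gate 3 style check aggregate violations across several rules before deciding the batch outcome.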

5.3 Age-Appropriate Diagnoses

Diagnosis codes are validated against the patient's age band:

F84.0 (Autistic disorder): Valid for all age bands
F84.5 (Asperger syndrome): Primarily used for age bands 6-11 and older (historical diagnosis, less common in younger children)
F70/F71 (Intellectual disability): Valid for all age bands, but F71 (moderate) constrained to align with severity
F90.0/F90.2 (ADHD): Valid as comorbidity for age bands 4-5 and older

5.4 Quantitative Data Ranges

| Metric | Valid Range | Clinical Context |
|---|---|---|
| Behavior frequency | 0--500 per session | High frequencies possible for stereotypy; low for aggression |
| Percentage correct (trials) | 0--100% | Mastery typically >= 80% |
| Duration (minutes) | 0 -- session length | Cannot exceed session duration |
| Interval recording (%) | 0--100% | Partial/whole interval or momentary time sampling |
| Rate (per minute) | 0.0--50.0 | High rates for rapid behaviors like hand-flapping |

6. Deduplication

Method

Text extraction: All free-text fields (narratives, observations, recommendations) are concatenated per record
TF-IDF vectorization: Scikit-learn TfidfVectorizer with English stop words removed, max 10,000 features
Cosine similarity matrix: Pairwise cosine similarity computed across all records in the batch
Flagging: Pairs with cosine similarity > 0.95 are flagged as near-duplicates
Deduplication: The later-generated record in each flagged pair is removed
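The five steps above can be sketched without dependencies as follows. This is a plain-Python stand-in for scikit-learn's TfidfVectorizer, not the production code: the tokenizer, the tiny stop-word list, and the function names are assumptions, and the 10,000-feature cap is omitted. The smoothed idf, ln((1 + n) / (1 + df)) + 1, and the L2 normalization mirror scikit-learn's defaults. List order is assumed to be generation order.

```python
import math
import re
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "with", "for"}  # illustrative only

def tfidf_vectors(texts):
    """Return one L2-normalized {term: weight} vector per text."""
    docs = [Counter(t for t in re.findall(r"[a-z0-9]+", text.lower())
                    if t not in STOP_WORDS) for text in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def near_duplicates(texts, threshold=0.95):
    """Index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    vecs = tfidf_vectors(texts)
    return [(i, j) for i, j in combinations(range(len(vecs)), 2)
            if cosine(vecs[i], vecs[j]) > threshold]

def deduplicate(texts, threshold=0.95):
    """Drop the later record of every flagged pair."""
    drop = {j for _, j in near_duplicates(texts, threshold)}
    return [t for k, t in enumerate(texts) if k not in drop]
```

The pairwise comparison is quadratic in batch size, which is acceptable for batches of a few hundred records; larger corpora would need a sparse matrix product or approximate nearest-neighbour search.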

Targets

| Metric | Threshold | Typical Result |
|---|---|---|
| Near-duplicate rate | < 2.0% | 0.2% -- 0.8% |
| Similarity threshold | > 0.95 | Fixed |
| Feature dimensionality | <= 10,000 | Fixed |

Why Low Duplicate Rates

SynthABA achieves low duplicate rates through:

Seeded random parameter variation: Each record uses a unique combination of age band, severity, insurer, setting, and clinical parameters
Temperature-controlled generation: Default temperature of 0.7 provides lexical diversity while maintaining clinical accuracy
Template variation: Multiple prompt templates per document type, randomly selected

7. Demographic Distribution

Demographic balance ensures that models trained on SynthABA data do not inherit systematic bias from skewed distributions.

7.1 Age Distribution

| Age Band | Target | Tolerance | Rationale |
|---|---|---|---|
| 2--3 | 15% | +/- 3% | Early intervention population |
| 4--5 | 25% | +/- 3% | Peak diagnosis age (CDC, 2023) |
| 6--11 | 35% | +/- 3% | Largest treatment population |
| 12--17 | 20% | +/- 3% | Adolescent ABA services |
| 18+ | 5% | +/- 2% | Adult services (emerging) |

7.2 Sex Distribution

| Sex | Target | Tolerance | Rationale |
|---|---|---|---|
| Male | 72% | +/- 5% | ASD prevalence ratio ~4:1 male:female (CDC MMWR, 2023) |
| Female | 28% | +/- 5% | Historically underdiagnosed; 28% reflects current prevalence data |

7.3 Severity Distribution

| Severity | Target | Tolerance | Rationale |
|---|---|---|---|
| Mild (Level 1) | 30% | +/- 5% | Largest clinical population by volume |
| Moderate (Level 2) | 45% | +/- 5% | Most common treatment intensity |
| Severe (Level 3) | 25% | +/- 5% | Higher support needs; critical for model generalization |

7.4 Insurer Distribution

| Insurer | Target | Rationale |
|---|---|---|
| UnitedHealthcare (UHC) | ~17% | Major national payer |
| Blue Cross Blue Shield (BCBS) | ~17% | Major national payer |
| Aetna | ~17% | Major national payer |
| Cigna | ~17% | Major national payer |
| TRICARE | ~16% | Military/veteran families |
| Medicaid FL | ~16% | State Medicaid program |

Insurer distribution is approximately uniform to avoid payer-specific bias. Each insurer has unique authorization and documentation requirements that are reflected in the generated records.

7.5 Service Setting Distribution

| Setting | Target | Tolerance | Rationale |
|---|---|---|---|
| Clinic | 50% | +/- 5% | Most common ABA service setting |
| Home | 30% | +/- 5% | Second most common; different documentation patterns |
| School | 15% | +/- 3% | School-based ABA services |
| Telehealth | 5% | +/- 2% | Growing modality post-COVID; distinct documentation |
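A target-plus-tolerance audit of this kind can be sketched as below, using the service-setting targets as the example. The lowercase category labels, the function name, and the small floating-point epsilon are assumptions; the same function would apply to the age, sex, and severity targets with a different `targets` dict.

```python
from collections import Counter

# setting: (target share, absolute tolerance), from the service-setting table
SETTING_TARGETS = {
    "clinic": (0.50, 0.05),
    "home": (0.30, 0.05),
    "school": (0.15, 0.03),
    "telehealth": (0.05, 0.02),
}

def out_of_tolerance(values, targets):
    """Return {category: observed share} for categories outside tolerance.

    `values` must be a non-empty list with one label per record.
    """
    counts = Counter(values)
    total = len(values)
    failures = {}
    for category, (target, tolerance) in targets.items():
        share = counts[category] / total
        # The epsilon keeps exact-boundary shares from failing on float rounding.
        if abs(share - target) > tolerance + 1e-9:
            failures[category] = share
    return failures
```

An empty result means every audited category is within tolerance; any entry names the category and its observed share, which is the information a reviewer needs when Gate 6 fails.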

8. Inter-Rater Reliability

Review Protocol

The protocol below defines how clinical review is scored when it runs. Today, Board Certified Behavior Analysts review clinical templates and gold-set reference materials against this rubric; per-batch document review is on the roadmap, and each delivered document's provenance passport declares its human_validation status accordingly.

5-Dimension Review Rubric

| Dimension | Description | Scale |
|---|---|---|
| Terminology Accuracy | Clinical terms used correctly and consistently (e.g., "extinction burst" vs. "tantrum increase") | 1--5 |
| Clinical Plausibility | Case presentation is realistic; symptoms, behaviors, and interventions align logically | 1--5 |
| Assessment Logic | Functional behavior assessments, diagnostic reasoning, and data interpretation are sound | 1--5 |
| Plan Appropriateness | Treatment plans, goals, and recommended interventions are clinically appropriate for the case | 1--5 |
| Documentation Standards | Formatting, structure, and completeness meet industry documentation standards | 1--5 |

Agreement Metrics

| Metric | Method | Target | Interpretation (Landis & Koch, 1977) |
|---|---|---|---|
| Cohen's Kappa | 2 reviewers, per dimension | >= 0.61 | Substantial agreement |
| Fleiss' Kappa | 3+ reviewers, per dimension | >= 0.61 | Substantial agreement |
| Weighted Kappa | Ordinal scores (1--5 scale) | >= 0.61 | Accounts for degree of disagreement |

Interpretation Scale

| Kappa Range | Agreement Level |
|---|---|
| 0.81 -- 1.00 | Almost perfect |
| 0.61 -- 0.80 | Substantial |
| 0.41 -- 0.60 | Moderate |
| 0.21 -- 0.40 | Fair |
| 0.00 -- 0.20 | Slight |
| < 0.00 | Poor (less than chance) |
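For the two-reviewer case, unweighted Cohen's Kappa and the interpretation scale can be sketched as follows. The function names are illustrative, and the handling of values that fall between the table's printed bands (e.g., 0.205) is an assumption. Weighted and Fleiss' Kappa need different formulas and are not shown.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's Kappa for two equally long lists of category labels."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

def agreement_level(kappa):
    """Map a Kappa value to its Landis & Koch label."""
    if kappa < 0.0:
        return "Poor"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"
```

For example, two reviewers who agree on 8 of 10 scores with a chance agreement of 0.29 obtain Kappa = (0.80 - 0.29) / (1 - 0.29) ≈ 0.72, which meets the 0.61 target.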

Review Workflow

Sampling: Records stratified by severity (mild/moderate/severe) to ensure coverage
Blinding: Reviewers do not see each other's scores during evaluation
Scoring: Each reviewer scores all 5 dimensions independently (1--5 scale)
Reconciliation: Any dimension whose Kappa falls below 0.61 triggers a reconciliation round
Reporting: Final Kappa scores, mean dimension scores, and reconciliation notes are included in batch metadata

9. Automated Compliance Pipeline

The automated compliance pipeline provides an immutable audit trail for every released batch. It has 4 stages:

Pipeline Stages

Stage 1: Ingest         Stage 2: PHI Scan       Stage 3: Synthetic       Stage 4: Release
                                                  Validation
+-----------+          +-----------+          +-----------+          +-----------+
| Receive   |  ---->   | Run       |  ---->   | Verify    |  ---->   | Seal      |
| batch     |          | version-  |          | synthetic |          | passports |
| from      |          | stamped   |          | flag +    |          | + batch   |
| generator |          | PHI scan  |          | schema    |          | manifest  |
+-----------+          +-----------+          +-----------+          +-----------+

Release Artifacts

Every released batch ships with one provenance passport per document (gate results, PHI verdict, model versions, SHA-256 chain of custody) and a batch-level MANIFEST.passport.json binding them together:

| Field | Type | Description |
|---|---|---|
| manifest_version | string | Manifest format version (1.0) |
| dataset_id | string | Unique identifier for the released batch |
| created_at | string (ISO 8601) | Timestamp of manifest creation |
| documents | array | One entry per document: document_id + passport_sha256 |
| manifest_sha256 | string | Self-hash of the manifest for tamper detection |

Tamper Detection

The manifest_sha256 self-hash covers the full manifest with the hash field nulled; each passport_sha256 covers its document's sealed passport. Any modification to released passports or the manifest produces a different hash. Every delivery includes a standalone verify_passport.py script, so consumers can verify every document, passport, and the manifest independently with stock Python 3 -- no vendor tooling required:

python3 verify_passport.py <dataset_dir>
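The self-hash idea can be illustrated with the sketch below. It is not the shipped verify_passport.py: the canonical serialization used here (sorted keys, compact separators, UTF-8) is an assumption, and the real script defines the authoritative byte layout. Only the "hash the manifest with the hash field nulled" rule comes from the description above.

```python
import hashlib
import json

def manifest_self_hash(manifest: dict) -> str:
    """SHA-256 of the manifest with manifest_sha256 set to null."""
    unsigned = dict(manifest, manifest_sha256=None)  # shallow copy, hash field nulled
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    canonical = json.dumps(unsigned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def manifest_is_intact(manifest: dict) -> bool:
    """True if the stored self-hash matches a fresh recomputation."""
    return manifest.get("manifest_sha256") == manifest_self_hash(manifest)
```

Nulling the field before hashing is what makes a self-hash possible: the stored value cannot influence its own input, so changing any other field changes the recomputed hash.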

10. Known Limitations

Transparency about dataset limitations is essential for responsible use. The following limitations are documented to help consumers make informed decisions.

| Limitation | Impact | Mitigation |
|---|---|---|
| Adult ASD (18+) | 5% representation; less clinical depth than pediatric cases | Weight sampling or oversample if adult cases are your primary use case |
| Rare comorbidities | Rett syndrome, Fragile X, Angelman syndrome not explicitly modeled | Supplement with domain-specific data if rare comorbidity recognition is required |
| Bilingual (Spanish) | Available but model has stronger English performance; clinical terminology may be less varied | Review Spanish samples with native-speaking clinician before use in production |
| TRICARE documentation | TRICARE-specific authorization and documentation requirements are underrepresented | Cross-reference with TRICARE ABA policy manuals for compliance-critical applications |
| Newer discipline templates | Psychotherapy, SLP, and OT templates have fewer clinician review hours than ABA templates | Check clinician_review_hours in batch metadata; ABA templates have ~3x more review time |
| Temporal patterns | No longitudinal patient trajectories; each record is independent | Not suitable for time-series or treatment outcome prediction without additional engineering |
| Geographic variation | Insurer requirements based primarily on Florida Medicaid and national payers | State-specific Medicaid rules outside Florida may not be accurately represented |

11. Recommendations for Consumers

Data Selection

Use provided splits: Train/val/test files are pre-split (70/15/15), stratified by severity, and deterministically seeded. Respect these splits to ensure reproducible benchmarks.
Filter by case_type: For discipline-specific model training, filter to the relevant document types rather than training on the full 25-type corpus.
Check quality_score: Batch metadata includes a quality_score field (0.0--1.0). We recommend using batches with quality_score > 0.85 for production model training.

Demographic Auditing

Review the demographic_audit section in batch metadata before training. Verify that the distribution matches your use case.
If your application disproportionately serves a specific subpopulation (e.g., female patients, severe cases), consider rebalancing the training set accordingly.

Schema Validation

Independently validate records against the JSON Schema files included in each batch (documentation/schemas/). Do not rely solely on SynthABA's validation -- verify with your own tooling.

Version Tracking

Pin your training data to a specific generator_version. Different versions may produce records with different field distributions or narrative styles.
Record the manifest_sha256 from MANIFEST.passport.json in your experiment tracking system for full reproducibility.

Combining with Real Data

SynthABA data is designed to bootstrap models, not replace real clinical data entirely. For production deployment, fine-tune on validated real-world data when available.
Use SynthABA for pre-training or data augmentation, then evaluate on held-out real data to measure transfer performance.

References

CDC MMWR (2023). Prevalence and Characteristics of Autism Spectrum Disorder Among Children Aged 8 Years -- Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2020. MMWR Surveillance Summaries, 72(SS-2), 1--14.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159--174.
HIPAA Privacy Rule, 45 CFR Section 164.514(b)(2). Safe Harbor method for de-identification.
SynthABA HIPAA Position Paper, v1.0 (April 2026).
SynthABA Data Dictionary, v1.0.0.
SynthABA Technical Integration Guide, v1.0.0.

This report is generated as part of the SynthABA quality assurance process and accompanies every released dataset batch. For questions, contact the SynthABA engineering team.

12. Simula Pipeline Enhancements

Scope: Simula validation pipeline — runs on the internal corpus (generated_documents); integration with the delivery path is on the roadmap. Documents delivered to customers today are produced by the delivery pipeline summarized in the table below, and what actually ran on each document is sealed in its provenance passport.

Current delivery pipeline

| Stage | Component | Passport status per delivered document |
|-------|-----------|----------------------------------------|
| Generator | Claude Sonnet 4.6 | executed -- exact model version sealed |
| Quality gates | 8 automated gates | executed -- per-gate results sealed |
| PHI verification | Automated PHI content verifier | executed -- verdict sealed |
| Provenance | Sealed per-document passport, SHA-256 chain of custody | always written |
| Dual-critic + semantic judge (Simula) | Internal corpus only | not_run -- roadmap for delivery path |

Beyond the gates described above, SynthABA's Simula pipeline applies the four principles of Google Research's Simula framework (TMLR, April 2026) to address failure modes that single-LLM validation cannot catch.

12.1 Dual-Critic Verification

Each document is reviewed in parallel by two independent AI clinicians with adversarial prompts:

Critic A ("quality review"): evaluates the document for clinical accuracy and lists strengths.
Critic B ("adversarial review"): actively looks for errors with severity critical, major, or minor.

When Critic B reports errors, a semantic judge (separate model, separate call) reconciles whether Critic A's identified strengths actually address the errors. The judge outputs one boolean per error, named addressed.

Rejection policy:

Any critical error that is not addressed → reject.
More than 2 major errors unaddressed → reject.
Minor errors only → approve.
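The rejection policy can be written as a small decision function. The error dict shape and the function name are hypothetical. The policy as stated does not say what happens with one or two unaddressed major errors; this sketch assumes they are approved, which should be checked against the actual pipeline.

```python
def dual_critic_verdict(errors):
    """Apply the rejection policy to Critic B's errors.

    Each error is a dict with a 'severity' ('critical', 'major', 'minor')
    and the judge's boolean 'addressed' flag.
    """
    unaddressed = [e for e in errors if not e["addressed"]]
    if any(e["severity"] == "critical" for e in unaddressed):
        return "reject"
    if sum(e["severity"] == "major" for e in unaddressed) > 2:
        return "reject"
    # Assumption: minor-only cases and up to two unaddressed major errors pass.
    return "approve"
```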

This counters the sycophancy bias of single-LLM validation, where one model reviews in one pass and the reviewer is aligned with the generator. Independent adversarial review surfaces errors that the generator's own review would let through.

12.2 Taxonomic Coverage

Every generation job samples its clinical scenarios from a 113-node taxonomy covering seven categories for ABA: behaviors, interventions, assessments, comorbidities, settings, phases, and barriers. Sampling is weighted by rarity so that rare-but-clinically-important combinations (active SIB with comorbid epilepsy, pica with caregiver inconsistency barrier, etc.) get proportionally more attention than the mode.

Each batch ships with a coverage report: how many unique taxonomy nodes the dataset touched, broken down by category. Coverage ratio is the headline metric — a dataset that covers 80 of 113 nodes is categorically different from one that covers 20, even at the same record count.

12.3 Elo-Calibrated Complexity

Every document receives two complexity signals:

An absolute score (1-10) from a scorer LLM that reasons about the rare factors present in the case.
An Elo rating that starts at 1500 + (absolute - 5) × 33 and is updated via pairwise comparisons against other documents in the corpus.

The Elo rating is updated nightly by a calibration cron job: two documents are sampled, the LLM picks which is clinically more complex, and both ratings shift. Over hundreds of pairwise matches, Elo converges on a relative ranking that is more reliable than any single absolute score, because it is grounded in direct comparison rather than one scorer's possibly miscalibrated judgment.

Consumers filter edge cases with a simple predicate: complexity_elo > 1700.
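The seeding formula and a pairwise update can be sketched as below. The seed formula is the one given above; the update uses the standard Elo expected-score formula, and the K-factor of 32 and the 400-point scale are assumptions, since this report does not state the values the calibration job uses.

```python
def initial_elo(absolute_score: int) -> float:
    """Seed rating from the 1-10 absolute complexity score."""
    return 1500 + (absolute_score - 5) * 33

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32):
    """Standard Elo update after one pairwise comparison; returns both new ratings."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

print(initial_elo(10))                 # 1665
print(elo_update(1500, 1500, True))    # (1516.0, 1484.0)
```

Note that the highest possible seed rating is 1665 (absolute score 10), so under this formula a document can exceed the 1700 filter threshold only after winning pairwise comparisons.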

12.4 Fail-Closed Judge Behavior

If the semantic judge call fails for any reason — network error, malformed JSON, timeout — the pipeline treats every error reported by Critic B as unaddressed. This always-reject-on-failure policy means that a transient infrastructure issue can cause a false rejection and an extra regeneration, but never a false approval. The downstream user never receives a document that was approved by default.

Rejected documents are automatically regenerated up to max_regenerations_per_doc times (default: 2) before the slot is given up.
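Fail-closed handling and the regeneration budget can be sketched as follows. The `call_judge`, `generate`, and `review` callables and the list-of-booleans judge output are hypothetical interfaces; the behaviour that comes from the text is "any judge failure marks every error unaddressed" and "regenerate up to max_regenerations_per_doc times (default 2)".

```python
def judge_addressed_flags(call_judge, errors, strengths):
    """Return one 'addressed' boolean per error; all False if the judge fails."""
    try:
        flags = call_judge(errors, strengths)
        well_formed = (isinstance(flags, list)
                       and len(flags) == len(errors)
                       and all(isinstance(f, bool) for f in flags))
        if not well_formed:
            raise ValueError("malformed judge output")
        return flags
    except Exception:
        # Fail closed: a timeout, network error, or bad payload never
        # counts as evidence that an error was addressed.
        return [False] * len(errors)

def fill_slot(generate, review, max_regenerations_per_doc=2):
    """Try the first generation plus the allowed regenerations.

    Returns (document, attempt) on approval, or (None, attempts_used)
    when the slot is given up.
    """
    attempts = max_regenerations_per_doc + 1
    for attempt in range(1, attempts + 1):
        doc = generate()
        if review(doc) == "approve":
            return doc, attempt
    return None, attempts
```

With the default budget, a slot gets at most three generations in total; the returned attempt number corresponds to the attempt field in the provenance block (1 means a first-try pass).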

12.5 Provenance Trail

Every Simula-approved document carries a full provenance block in its metadata:

taxonomy_path — list of sampled nodes with category, slug, rarity.
complexity — absolute score, rare-factors list, Elo rating.
dual_critic — Critic A verdict + strengths, Critic B verdict + errors, rejection reason if any, list of unaddressed errors.
model_versions — generator, critics, complexity scorer, judge model IDs.
generated_at — ISO timestamp.
attempt — regeneration count (1 = first-try pass).

This provenance is queryable from the SynthABA API and persisted alongside every document in Supabase Postgres for full auditability.
