How SynthABA Implements Google's Simula Framework for Clinical Synthetic Data

A technical breakdown of how we adapted Google Research's Simula framework — taxonomic sampling, dual-critic verification, and Elo-based complexity calibration — to generate production-grade synthetic ABA clinical documents.

Simón Franco

In April 2026, Google Research and EPFL published Simula: A Reasoning-First Framework for Synthetic Data Generation in TMLR. It's the framework behind ShieldGemma, FunctionGemma, and MedGemma — the data pipeline that lets Google train specialist variants of Gemma at scale.

SynthABA is an implementation of Simula's four principles specialized for behavioral health. This post walks through how each principle translates to our production pipeline — taxonomic sampling, meta-prompt diversification, dual-critic verification, and Elo-based complexity calibration — and why those choices matter when the downstream model you're training is going to make clinical decisions.

If you're training an AI scribe, a claims-adjudication model, or a clinical-reasoning assistant on ABA, SLP, OT, or psychotherapy documents, skip to the code at the end. Otherwise, here's the long version.

Why synthetic data is the bottleneck in clinical AI

Every serious clinical AI product hits the same wall: you can't put your hands on enough real documents to train a model, and you can't get HIPAA clearance to use the ones that exist. The current workaround — scraping academic papers, asking GPT-4 for "a typical BIP" — produces training data that's fine on easy cases and broken on the hard ones.

What breaks your model in production isn't the straightforward kindergartner with ASD Level 1 who learns PECS in four sessions. It's the 12-year-old with ASD Level 3, comorbid epilepsy, active SIB with protective equipment contingency, and a home environment with inconsistent caregivers. Those cases are 15% of your caseload and 80% of your clinical errors. Your training data needs to cover them.

Simula's entire design is oriented around this long tail: generate data that spans the full taxonomy of the domain, not just the mode.

Principle 1 — Global Diversification via hierarchical taxonomy

The first thing Simula does is refuse to generate from prompts alone. Instead, it samples from an explicit taxonomy of the domain and diversifies globally across that taxonomy.

In SynthABA, the taxonomy is a Postgres table with 113 active nodes across four disciplines (ABA, SLP, OT, psych), organized by category: behaviors, interventions, assessments, comorbidities, settings, phases, barriers. Each node carries clinical metadata — ICD-10 codes, BACB task list references, age ranges, VB-MAPP hints, function annotations — and a rarity score (1-10) used to weight sampling.

create table taxonomy_nodes (
  id uuid primary key,
  domain text,           -- 'aba' | 'slp' | 'ot' | 'psych'
  category text,         -- 'behavior' | 'intervention' | ...
  slug text,
  clinical_metadata jsonb,
  sampling_weight numeric,
  rarity smallint,       -- 1 (common) to 10 (very rare)
  compatible_with text[],
  excludes text[]
);

The compatible_with and excludes arrays encode a compatibility graph. Extinction procedures, for example, are excluded from any document involving eye-gouging SIB: applying extinction alone to a topography with permanent-injury risk is a BACB ethics violation, and our sampler won't produce that combination.
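A minimal sketch of how an excludes check might gate a candidate node during sampling. The `Node` class, the `is_compatible` helper, and the slugs are hypothetical, and the check is assumed to be symmetric (either node may list the other); the post does not say how compatible_with is evaluated, so it is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    slug: str
    category: str
    excludes: set[str] = field(default_factory=set)


def is_compatible(candidate: Node, selected: list[Node]) -> bool:
    """Return False if the candidate and any already-selected node exclude each other."""
    for node in selected:
        if node.slug in candidate.excludes or candidate.slug in node.excludes:
            return False
    return True


# Hypothetical slugs, for illustration only.
sib = Node("sib-eye-gouging", "behavior")
extinction = Node("extinction-alone", "intervention", excludes={"sib-eye-gouging"})
dra = Node("dra", "intervention")

print(is_compatible(extinction, [sib]))  # extinction is blocked for this topography
print(is_compatible(dra, [sib]))         # an unrelated intervention passes
```

Checking both directions means the exclusion only has to be recorded on one of the two nodes.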

Sampling is a weighted random walk through the category priority order (behavior → assessment → comorbidity → intervention → setting → phase → barrier). The weighting function is a Gaussian centered on the target complexity, so a job configured for complexity 8 will preferentially sample high-rarity behaviors and comorbidities, while complexity 3 will preferentially sample common topographies. A floor of 0.2 on the weights preserves diversity: no node ever becomes impossible to sample.
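The weighting can be sketched as follows, assuming the Gaussian is evaluated at each node's rarity score and the 0.2 floor is applied to the result. The width `SIGMA` and the function names are assumptions; the post does not give them.

```python
import math
import random

WEIGHT_FLOOR = 0.2  # from the post: no weight drops below this
SIGMA = 2.0         # assumption: the Gaussian width is not stated in the post


def node_weight(rarity: int, target_complexity: float) -> float:
    """Gaussian in (rarity - target), clamped from below by the floor."""
    gauss = math.exp(-((rarity - target_complexity) ** 2) / (2 * SIGMA ** 2))
    return max(WEIGHT_FLOOR, gauss)


def sample_node(nodes: list[dict], target_complexity: float, rng=random) -> dict:
    """Pick one node from a category, weighted by closeness to the target."""
    weights = [node_weight(n["rarity"], target_complexity) for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


# Hypothetical behavior nodes.
behaviors = [
    {"slug": "elopement", "rarity": 3},
    {"slug": "pica", "rarity": 7},
    {"slug": "sib-eye-gouging", "rarity": 9},
]
picked = sample_node(behaviors, target_complexity=8, rng=random.Random(0))
```

With these assumed values, a rarity-1 node at target complexity 8 has a raw Gaussian weight near zero, and the floor lifts it to 0.2 so it can still be drawn occasionally.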

This is why taxonomic coverage is the right metric, not embedding-space diversity. You can measure it exactly: if a dataset touched 89 of the 107 active ABA nodes, its coverage ratio is 0.83. Ship that number with every delivery.
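Computing the ratio is a set intersection. A short sketch, with hypothetical slugs standing in for the real taxonomy:

```python
def coverage_ratio(active_slugs: set[str], sampled_slugs: set[str]) -> float:
    """Fraction of active taxonomy nodes that appear at least once in a dataset."""
    if not active_slugs:
        raise ValueError("taxonomy has no active nodes")
    return len(active_slugs & sampled_slugs) / len(active_slugs)


# Placeholder slugs reproducing the 89-of-107 example.
active = {f"aba-node-{i}" for i in range(107)}
touched = {f"aba-node-{i}" for i in range(89)}

ratio = coverage_ratio(active, touched)
print(round(ratio, 2))  # 0.83
```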

Principle 2 — Local Diversification via meta-prompts

Simula's second step addresses a subtler failure mode: even when your taxonomic coverage is wide, the documents within a node tend to collapse into a single mode. Every SIB/head-banging note sounds the same. Every DTT session starts with the same preference assessment.

The fix is a prompt_variants table indexed on each taxonomy node, with a handful of pre-generated meta-prompts that diversify across orthogonal axes — age, setting, severity, family context, history of failed interventions, caregiver barriers. At generation time, the orchestrator samples a variant and populates the template.

After each run, we update the variant's approval_rate based on whether downstream critics accepted the documents it produced. Variants that consistently produce rejected documents get retired. This is basic bandit-style selection, but it keeps the prompt library healthy without manual curation.
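A rough sketch of that bookkeeping, assuming approval_rate is a running fraction of accepted documents. The retirement threshold, the minimum sample size, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

MIN_TRIALS = 20     # hypothetical: don't judge a variant on too few documents
RETIRE_BELOW = 0.5  # hypothetical retirement threshold


@dataclass
class PromptVariant:
    variant_id: str
    approved: int = 0
    total: int = 0
    retired: bool = False

    @property
    def approval_rate(self) -> float:
        return self.approved / self.total if self.total else 0.0


def record_run(variant: PromptVariant, approved_docs: int, total_docs: int) -> None:
    """Fold one run's critic outcomes into the variant and retire it if it underperforms."""
    variant.approved += approved_docs
    variant.total += total_docs
    if variant.total >= MIN_TRIALS and variant.approval_rate < RETIRE_BELOW:
        variant.retired = True


weak = PromptVariant("sib-headbanging-v3")
record_run(weak, approved_docs=5, total_docs=10)  # too few trials to decide
record_run(weak, approved_docs=2, total_docs=10)  # 7/20 approved, retired
```

The minimum-trials guard keeps a variant from being retired on one unlucky batch.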

Principle 3 — Elo-based complexity calibration

Absolute complexity scores are noisy. A Claude call asked "rate this 1-10" on a session note will give you a 7, and asked again will give you an 8. Over thousands of documents, that drift makes your complexity tier labels unreliable — which matters when you're selling a "premium edge case" tier at 3-5× standard pricing.

Simula's fix is pairwise comparison. We do absolute scoring at generation time (cheap, good enough for initial bucketing) and map the score to a starting Elo rating via a simple linear transform. Then, as a background task, a calibration job samples random pairs of approved documents and asks: which of these is more clinically complex?

Each pairwise comparison updates both documents' Elo ratings using the standard chess formula (K=32). After a few hundred comparisons, the Elo ordering is vastly more robust than the absolute scores it was initialized from. Documents with genuinely rare clinical elements — eye-gouging with protective equipment, pica with medical sequelae, multi-system coordination — rise to the top. Documents that looked complex but aren't drift down.

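def elo_update(rating_a, rating_b, winner, k=32):
    # winner is 'a' or 'b'; any other value is scored as a tie
    # Expected score for A under the standard Elo logistic curve
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if winner == 'a' else 0.0 if winner == 'b' else 0.5
    # B's actual and expected scores are the complements of A's
    new_a = round(rating_a + k * (score_a - expected_a))
    new_b = round(rating_b + k * ((1 - score_a) - (1 - expected_a)))
    return new_a, new_b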
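To show how the update might sit inside the background calibration job, here is a sketch under stated assumptions. The linear transform constants in `initial_elo` are hypothetical, `judge` stands in for the pairwise LLM comparison, and `elo_update` is repeated from above so the block runs on its own.

```python
import random


def initial_elo(score: float, base: float = 1000.0, per_point: float = 100.0) -> float:
    """Map an absolute 1-10 score to a starting rating (hypothetical constants)."""
    return base + per_point * score


def elo_update(rating_a, rating_b, winner, k=32):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if winner == 'a' else 0.0 if winner == 'b' else 0.5
    new_a = round(rating_a + k * (score_a - expected_a))
    new_b = round(rating_b + k * ((1 - score_a) - (1 - expected_a)))
    return new_a, new_b


def calibrate(ratings: dict, judge, n_pairs: int, rng=None) -> None:
    """Sample random pairs and update both ratings in place after each comparison."""
    rng = rng or random.Random()
    doc_ids = list(ratings)
    for _ in range(n_pairs):
        doc_a, doc_b = rng.sample(doc_ids, 2)
        winner = judge(doc_a, doc_b)  # returns 'a', 'b', or 'tie'
        ratings[doc_a], ratings[doc_b] = elo_update(ratings[doc_a], ratings[doc_b], winner)


# Toy judge: compares a known "true" complexity instead of calling a model.
true_complexity = {"note-1": 3, "bip-7": 9}
ratings = {doc: initial_elo(5) for doc in true_complexity}  # both start at 1500


def judge(a, b):
    return 'a' if true_complexity[a] > true_complexity[b] else 'b'


calibrate(ratings, judge, n_pairs=1, rng=random.Random(0))
print(ratings)  # {'note-1': 1484, 'bip-7': 1516}
```

Both documents start from the same absolute score, and a single comparison is enough to separate them by 32 points.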

Publishing the Elo rating on every document isn't just for our pricing tiers. It's for the buyer — an ML engineer at a claims-adjudication startup can filter the dataset on complexity_elo > 1700 and get exactly the edge cases their model is failing on.
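On the buyer side, that filter could look like the sketch below. It assumes the documents arrive as JSON Lines with complexity_elo as a top-level field; the file layout, the `doc_id` field, and the `load_edge_cases` helper are assumptions, not a documented delivery format.

```python
import json


def load_edge_cases(jsonl_text: str, min_elo: int = 1700) -> list[dict]:
    """Keep only documents whose Elo complexity is strictly above the threshold."""
    docs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [d for d in docs if d.get("complexity_elo", 0) > min_elo]


# Hypothetical sample records.
sample = "\n".join([
    json.dumps({"doc_id": "a", "complexity_elo": 1650}),
    json.dumps({"doc_id": "b", "complexity_elo": 1700}),
    json.dumps({"doc_id": "c", "complexity_elo": 1820}),
])
edge_cases = load_edge_cases(sample)
print([d["doc_id"] for d in edge_cases])  # ['c']
```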

Principle 4 — Dual-Critic verification (the anti-sycophancy design)

This is the part of Simula that most deserves the engineering time. Single-critic verification (asking an LLM "is this document accurate?") systematically fails because LLMs are trained to be helpful and tend to agree with the premise of the question. If you ask "is this accurate?", the model looks for reasons it's accurate. If you ask "is this inaccurate?", the model looks for reasons it's inaccurate.

Simula's fix is to ask both, independently, and reconcile.

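import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
CRITIC_MODEL = "<critic-model-id>"  # placeholder: the critic model is not named in this post

async def run_critics(document: str):
    # The critical pattern: parallel, independent calls
    critic_a, critic_b = await asyncio.gather(
        client.messages.create(
            model=CRITIC_MODEL, max_tokens=1024,  # both required by the Messages API; 1024 is illustrative
            system="You are a BCBA-D. Is this document CLINICALLY ACCURATE? List strengths.",
            messages=[{"role": "user", "content": document}],
        ),
        client.messages.create(
            model=CRITIC_MODEL, max_tokens=1024,
            system="You are a BCBA-D. Is this document CLINICALLY INACCURATE? Assume errors exist. List them.",
            messages=[{"role": "user", "content": document}],
        ),
    )
    return critic_a, critic_b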

Critic A sees only the document and the "find strengths" framing. Critic B sees only the document and the "find errors" framing. Neither sees the other's output — that's the property that breaks sycophancy. If you chain them (Critic B sees Critic A's reasoning), Critic B will defer.

Reconciliation is a third Claude call, this time to Haiku 4.5, which is cheap and fast enough that the ~1-2 seconds of latency disappears into the overall generation time. The judge gets both A's strengths and B's errors, and for each error answers a single constrained question: is this addressed by any of A's strengths? It returns a boolean array the same length as B's errors. Any unaddressed critical error, or more than two unaddressed major errors, triggers rejection and regeneration. If the judge call fails or returns a malformed response, the pipeline fails closed: all errors are treated as unaddressed, because in clinical data it's always safer to over-reject and regenerate than to under-reject and ship something defective.
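The rejection rule and the fail-closed behavior can be sketched as a pure function over the judge's output. The `should_reject` name, the error dict shape, and the "minor" severity label are assumptions; the thresholds come from the paragraph above.

```python
def should_reject(errors: list[dict], addressed, max_unaddressed_major: int = 2) -> bool:
    """Decide whether a document is rejected and sent back for regeneration.

    errors:    Critic B's errors, each a dict with a 'severity' key
               ('critical', 'major', or 'minor').
    addressed: the judge's boolean array, or None if the judge call failed.
    """
    well_formed = (
        isinstance(addressed, list)
        and len(addressed) == len(errors)
        and all(isinstance(flag, bool) for flag in addressed)
    )
    if not well_formed:
        # Fail closed: treat every error as unaddressed.
        addressed = [False] * len(errors)

    unaddressed = [e for e, ok in zip(errors, addressed) if not ok]
    critical = sum(e["severity"] == "critical" for e in unaddressed)
    major = sum(e["severity"] == "major" for e in unaddressed)
    return critical > 0 or major > max_unaddressed_major


errors = [{"severity": "critical"}, {"severity": "major"}]
print(should_reject(errors, [True, False]))  # False: only one major error left unaddressed
print(should_reject(errors, None))           # True: judge failed, so the critical error counts
```

Keeping the decision separate from the API call makes the fail-closed path easy to unit-test without a model in the loop.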

In our benchmarks against single-critic, dual-critic catches ~3.2× more clinically meaningful errors on adversarial test cases — the ones we seeded with deliberately incorrect function/intervention mismatches.

In the most important category — life-threatening safety errors like applying extinction to eye-gouging without a safety plan — single-critic missed 9 of 12 seeded cases. Dual-critic missed 1.

What ships with every dataset

When you buy a SynthABA dataset, you get three things:

  1. The JSON documents themselves.
  2. A complete provenance trail per document: the taxonomy path sampled, the absolute and Elo complexity scores, both critic outputs, the VLayer HIPAA scan results, and the model versions used at each step.
  3. An embeddable Web Component — <synthaba-document-viewer> — that renders each document with its provenance inline. Drop it into any frontend (React, Vue, plain HTML) and your team can audit the data before you commit to training on it.

The viewer was inspired by OpenAI's Euphony, which solves the same problem for Harmony conversation logs — turning raw JSON provenance into something an engineer can read at a glance.

Why we built this instead of buying it

Google isn't going to ship Simula as a product. It's a research framework they use internally to train Gemma variants. The cleanest path for us was to adopt the principles — taxonomic sampling, meta-prompt diversification, dual-critic verification, Elo calibration — and specialize them for the domain where we have deep clinical expertise: behavioral health.

The tech wasn't the differentiator. The clinical taxonomy was.

Try it

SynthABA datasets are currently available for ABA (session notes, BIPs, FBAs, VB-MAPP reports, treatment plans) with SLP, OT, and psychotherapy in rolling release. Standard tier covers complexity 1-6; premium edge-case tier covers complexity 7-10 with guaranteed rare-factor coverage.

Request a sample dataset →


Questions about the implementation? Reach out at sf@synthaba.com.