SynthABA Technical Integration Guide
Version 1.0.0 | For ML Engineers and Data Engineers
Table of Contents
- Quick Start
- API Reference
- Authentication
- Document Types Catalog
- Data Format & Schema
- Schema Validation
- Train/Val/Test Splits
- Quality Metrics
- Batch Output Structure
- Integration Examples
- Rate Limits & Timeouts
- Versioning
- Support
Quick Start
Generate your first batch of synthetic ABA clinical records in under 30 seconds:
```bash
curl -X POST https://synthaba-production.up.railway.app/generate \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{"case_type": "soap_note", "count": 5, "language": "en"}'
```
The response includes the generated records inline along with quality metrics, split information, and batch provenance. Every record is validated against its Pydantic schema before it leaves the pipeline.
To verify the service is running:
```bash
curl https://synthaba-production.up.railway.app/health
# {"status": "ok", "generator_ready": true}
```
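The same Quick Start call can be made from Python using only the standard library. This is a hedged sketch: the endpoint, headers, and body fields come from this guide, while the helper names (`build_request`, `generate`) are illustrative:

```python
import json
import urllib.request

API_URL = "https://synthaba-production.up.railway.app/generate"

def build_request(case_type: str, count: int, language: str = "en") -> dict:
    """Assemble the request body documented in the API reference."""
    if not 1 <= count <= 2000:
        raise ValueError("count must be between 1 and 2000")
    return {"case_type": case_type, "count": count, "language": language}

def generate(api_key: str, case_type: str, count: int) -> dict:
    body = json.dumps(build_request(case_type, count)).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )
    # Large batches can take minutes; use a generous timeout.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

# Usage (requires a live API key):
# batch = generate("your-api-key", "soap_note", 5)
# print(batch["batch_id"], batch["quality_score"])
```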
API Reference
POST /generate
Generate a batch of synthetic clinical records.
Request Body
| Field | Type | Required | Default | Description |
|-------------|--------|----------|------------|--------------------------------------------------|
| case_type | string | Yes | -- | One of the 25 supported document types (see catalog below) |
| count | int | Yes | -- | Number of records to generate (1 -- 2000) |
| language | string | No | "en" | Language code: en (English) or es (Spanish) |
| client_id | string | No | "api" | Identifier for your application or pipeline |
| order_id | string | No | null | Your internal order or job reference |
| target | string | No | "cloud" | Deployment target hint |
| priority | string | No | "normal" | Processing priority: normal, high, or urgent |
Response Body
| Field | Type | Description |
|---------------------|---------|------------------------------------------------------------|
| batch_id | string | Unique identifier for this batch (UUID) |
| path | string | Server-side path to the batch output directory |
| records_generated | int | Number of records successfully generated |
| errors | int | Number of records that failed generation |
| success_rate | float | Ratio of successful records (0.0 -- 1.0) |
| quality_score | float | Overall quality score from the 10-gate pipeline (0.0 -- 1.0)|
| quality_passed | bool | Whether the batch passed all quality gates |
| splits | object | Record counts per split: {"train": N, "validation": N, "test": N} |
| ready_for_vlayer | bool | Whether the batch meets VLayer v2 compliance standards |
| elapsed_seconds | float | Wall-clock time for generation |
| cases | array | Full array of generated records (train + validation + test combined) |
Error Codes
| Code | Meaning |
|------|---------------------------------------------------------------|
| 400 | Invalid input -- bad case_type, count out of range, etc. |
| 401 | Invalid or missing API key |
| 500 | Internal generation error (include batch_id in support tickets) |
| 503 | Generator not ready -- the model is still initializing |
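Of these codes, only 503 is transient (the generator is still initializing), so it is the only one worth retrying automatically. A minimal retry sketch using the standard library; the policy function and constants are illustrative, not part of the SynthABA client:

```python
import time
import urllib.error
import urllib.request

# 503 means the generator is still initializing and is safe to retry;
# 400/401/500 indicate problems a retry will not fix.
RETRYABLE_CODES = {503}

def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry policy: only retryable codes, and only while attempts remain."""
    return status in RETRYABLE_CODES and attempt < max_attempts

def post_with_retry(req: urllib.request.Request, max_attempts: int = 5):
    for attempt in range(1, max_attempts + 1):
        try:
            return urllib.request.urlopen(req, timeout=300)
        except urllib.error.HTTPError as e:
            if not should_retry(e.code, attempt, max_attempts):
                raise  # non-retryable, or attempts exhausted
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```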
GET /health
Check whether the API and generator are online.
Response
```json
{
  "status": "ok",
  "generator_ready": true
}
```
If generator_ready is false, the server is still loading. Retry after a few seconds.
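The readiness check can be wrapped in a small helper before kicking off generation. A sketch, assuming the `/health` response shape shown above; `is_ready` is an illustrative name:

```python
import json

def is_ready(health_body: str) -> bool:
    """Parse a /health response body and report whether generation can start."""
    payload = json.loads(health_body)
    return payload.get("status") == "ok" and payload.get("generator_ready") is True

# Polling sketch (pair with any HTTP client; `fetch_health` is hypothetical):
# while not is_ready(fetch_health("https://synthaba-production.up.railway.app/health")):
#     time.sleep(5)
```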
Web API Routes (Vercel)
The SynthABA commerce layer runs on Vercel and provides these endpoints:
| Route | Method | Description |
|---------------------------|--------|------------------------------------------------------------|
| /api/create-checkout | POST | Creates a Stripe checkout session for purchasing a dataset |
| /api/webhook | POST | Stripe webhook handler -- triggers generation, upload, and email delivery |
| /api/download/[id] | GET | Returns a signed download URL (24-hour expiry) for a completed batch |
| /api/request-sample | POST | Generates a free 5-record sample (rate limited per email) |
These routes are for the web storefront. If you are integrating SynthABA data into an ML pipeline, use the Railway API directly.
Authentication
Railway API (Generation)
Pass your API key in the X-API-Key header on every request:
```bash
curl -H "X-API-Key: your-api-key" \
  https://synthaba-production.up.railway.app/generate ...
```
If no GENERATOR_API_KEY is configured on the server, the endpoint runs in open dev mode and accepts any request. In production, an invalid or missing key returns HTTP 401.
Web API (Commerce)
The Vercel web routes use Stripe for payment authentication. No separate API key is needed -- Stripe session tokens handle authorization for checkout and download flows.
Document Types Catalog
SynthABA generates 25 clinical document types across four disciplines.
ABA -- Applied Behavior Analysis (10 types)
| Type Key | Description |
|-----------------------|---------------------------------------------------------------|
| soap_note | Subjective/Objective/Assessment/Plan session documentation |
| abc_data | Antecedent-Behavior-Consequence data collection records |
| treatment_goals | Individualized treatment goals with measurable objectives |
| insurance_auth | Insurance authorization requests with medical necessity |
| crisis_plan | Behavioral crisis intervention and de-escalation plans |
| supervision_note | BCBA supervision session documentation |
| discharge_summary | Treatment discharge summaries with outcome data |
| progress_report | Periodic progress reports for payors and families |
| session_data | Quantitative session data (trial counts, duration, frequency) |
| assessment_section | Functional behavior assessment sections |
Psychotherapy / Mental Health (8 types)
| Type Key | Description |
|-------------------------------|-----------------------------------------------------|
| psychotherapy_note | Individual psychotherapy session notes |
| psychiatric_eval | Psychiatric evaluation and diagnostic assessment |
| mental_status_exam | Mental status examination documentation |
| safety_plan | Safety planning for at-risk individuals |
| group_therapy_note | Group therapy session documentation |
| family_therapy_note | Family therapy session documentation |
| psychological_testing | Psychological testing reports and interpretations |
| substance_abuse_assessment | Substance use disorder assessment documentation |
Speech-Language Pathology (4 types)
| Type Key | Description |
|-----------------------|-----------------------------------------------------|
| slp_evaluation | Speech-language pathology initial evaluation |
| slp_session_note | SLP treatment session documentation |
| slp_progress_report | SLP periodic progress reports |
| slp_treatment_plan | SLP treatment plan with goals and objectives |
Occupational Therapy (3 types)
| Type Key | Description |
|---------------------|-------------------------------------------------------|
| ot_evaluation | Occupational therapy initial evaluation |
| ot_session_note | OT treatment session documentation |
| ot_treatment_plan | OT treatment plan with goals and objectives |
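For client-side validation before calling /generate, the 25 type keys above can be collected into a lookup set. The keys are taken verbatim from the catalog; the constant and helper names are illustrative:

```python
# The 25 document type keys from the catalog, grouped by discipline.
ABA_TYPES = {"soap_note", "abc_data", "treatment_goals", "insurance_auth",
             "crisis_plan", "supervision_note", "discharge_summary",
             "progress_report", "session_data", "assessment_section"}
MH_TYPES = {"psychotherapy_note", "psychiatric_eval", "mental_status_exam",
            "safety_plan", "group_therapy_note", "family_therapy_note",
            "psychological_testing", "substance_abuse_assessment"}
SLP_TYPES = {"slp_evaluation", "slp_session_note", "slp_progress_report",
             "slp_treatment_plan"}
OT_TYPES = {"ot_evaluation", "ot_session_note", "ot_treatment_plan"}

ALL_TYPES = ABA_TYPES | MH_TYPES | SLP_TYPES | OT_TYPES  # 25 keys total

def validate_case_type(case_type: str) -> str:
    """Fail fast locally instead of spending a request on a 400 response."""
    if case_type not in ALL_TYPES:
        raise ValueError(f"Unknown case_type: {case_type!r}")
    return case_type
```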
Data Format & Schema
Format
All records are JSON. Each batch produces three split files (train.json, validation.json, test.json), each containing a JSON array of record objects.
Base Envelope
Every record, regardless of document type, contains these base fields:
| Field | Type | Description |
|---------------------|----------|------------------------------------------------------|
| case_id | string | Unique UUID for this record |
| case_type | string | Document type key (e.g., soap_note) |
| language | string | en or es |
| difficulty | string | basic, intermediate, advanced, or expert |
| patient_context | object | De-identified patient demographics (see below) |
| generated_by | string | Model identifier used for generation |
| generator_version | string | Pipeline version (e.g., 1.0.0) |
| synthetic | bool | Always true -- marks the record as synthetic |
Patient Context Object
The patient_context field is present on every record and follows a fixed schema:
| Field | Type | Values / Range |
|----------------------------|---------------|-----------------------------------------------------|
| age_band | string (enum) | 2-3, 4-5, 6-11, 12-17, 18+ |
| sex | string (enum) | male, female |
| diagnosis_codes | array[string] | ICD-10 codes: F84.0, F84.1, F84.5, F90.2, F90.0, F70, F71, F80.1, F88 |
| severity | string (enum) | mild, moderate, severe |
| comorbidities | array[string] | Free text (e.g., "speech delay", "sensory processing") |
| insurer | string (enum) | uhc, bcbs, aetna, cigna, tricare, medicaid_fl |
| authorized_hours_weekly | float | Weekly authorized treatment hours |
| months_in_treatment | int | Duration of treatment in months |
| setting | string (enum) | clinic, home, school, telehealth |
Type-Specific Fields
Each document type adds fields specific to its clinical purpose. For example, a soap_note includes subjective, objective, assessment, plan, session_type, cpt_code, provider_role, and behaviors_observed. An abc_data record includes antecedent, behavior, consequence, function, and frequency fields.
Refer to the JSON Schema files in documentation/schemas/ within each batch for the complete field specification per type.
Schema Validation
Every batch includes exported JSON Schema files. You can validate records independently using either jsonschema or Pydantic directly.
Using jsonschema (Python)
```python
import json

import jsonschema

# Load the schema for your document type
with open("batch_xxx/documentation/schemas/soap_note_schema.json") as f:
    schema = json.load(f)

# Load records
with open("batch_xxx/data/train.json") as f:
    records = json.load(f)

# Validate each record
for i, record in enumerate(records):
    try:
        jsonschema.validate(record, schema)
    except jsonschema.ValidationError as e:
        print(f"Record {i} failed validation: {e.message}")
```
Using Pydantic (Python)
If you have access to the SynthABA template classes:
```python
from templates.soap_note import SyntheticSOAPNote

# Validate a single record
record = {"case_id": "...", "case_type": "soap_note", ...}
validated = SyntheticSOAPNote(**record)  # raises ValidationError on invalid data
```
Schema Export
The pipeline uses Pydantic v2 with model_json_schema() to export JSON Schema files. All fields have type constraints, max_length limits, and ge/le bounds where applicable. Enums are exported as {"enum": [...]} in the JSON Schema.
Train/Val/Test Splits
Every batch is automatically split into three partitions:
| Split | Percentage | File |
|--------------|------------|----------------------------|
| Train | 70% | data/train.json |
| Validation | 15% | data/validation.json |
| Test | 15% | data/test.json |
Split Properties
- Deterministic: Uses seed=42 for reproducible splits. The same input records will always produce the same partition.
- Stratified by severity: Records are shuffled while maintaining the overall demographic distribution across splits (mild, moderate, severe proportions are preserved).
- No leakage: Index-based partitioning guarantees zero overlap between splits.
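These three properties can be illustrated with a minimal re-implementation: a fixed-seed shuffle followed by index-based partitioning within each severity stratum. This is a hypothetical sketch of the described behavior, not the actual SynthABA splitter module:

```python
import random
from collections import defaultdict

def stratified_split(records, seed=42, ratios=(0.70, 0.15, 0.15)):
    """Deterministic, severity-stratified, leakage-free 70/15/15 split (sketch)."""
    by_severity = defaultdict(list)
    for r in records:
        by_severity[r["patient_context"]["severity"]].append(r)

    splits = {"train": [], "validation": [], "test": []}
    rng = random.Random(seed)  # fixed seed => same partition every run
    for stratum in by_severity.values():
        rng.shuffle(stratum)
        n = len(stratum)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        # Index-based slicing guarantees zero overlap between splits.
        splits["train"].extend(stratum[:n_train])
        splits["validation"].extend(stratum[n_train:n_train + n_val])
        splits["test"].extend(stratum[n_train + n_val:])
    return splits
```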
Loading Splits
```python
import json

def load_batch(batch_dir: str) -> dict:
    """Load all three splits from a batch directory."""
    splits = {}
    for split_name in ["train", "validation", "test"]:
        path = f"{batch_dir}/data/{split_name}.json"
        with open(path) as f:
            splits[split_name] = json.load(f)
    return splits

batch = load_batch("batch_xxx")
print(f"Train: {len(batch['train'])} records")
print(f"Validation: {len(batch['validation'])} records")
print(f"Test: {len(batch['test'])} records")
```
Quality Metrics
Every batch passes through a 10-gate quality pipeline before delivery. The quality report is saved to quality/quality_report.json and quality/quality_report_full.json.
The 10 Quality Gates
| Gate | Name | What It Checks | Pass Threshold |
|------|-------------------------|----------------------------------------------------------------|------------------|
| 1 | Schema Validation | Every record validates against its Pydantic model | < 2% reject rate |
| 2 | Completeness | At least one substantial text field is populated | >= 95% complete |
| 3 | Clinical Consistency | Age-hours, severity-goals, CPT-provider role rules hold | < 10% reject rate |
| 4 | Deduplication | No near-duplicates (TF-IDF cosine similarity > 0.95) | 0 duplicates |
| 5 | PHI Leak Detection | Zero matches against 10+ PHI patterns (SSN, phone, email, etc.)| 0 findings |
| 6 | Demographic Balance | Sex distribution within 15% of target (72% male / 28% female) | < 15% deviation |
| 7 | Vocabulary Consistency | Behavioral functions use closed vocabulary only | 0 invalid terms |
| 8 | Edge Case Coverage | Minimum representation of severe, comorbid, telehealth cases | >= 5% severe, >= 10% comorbid, >= 5% telehealth |
| 9 | Train/Val/Test Split | Splits are applied by the splitter module | Always passes |
| 10 | Version Control | Generator and template versions are recorded | Always passes |
Quality Report Fields
The quality/quality_report.json file contains:
```json
{
  "all_gates_passed": true,
  "overall_quality_score": 0.98,
  "total_records": 100,
  "records_passed": 98,
  "records_failed": 2,
  "metrics": {
    "total_records": 100,
    "valid_schema_rate": 1.0,
    "duplicate_rate": 0.0,
    "contradiction_rate": 0.02,
    "missing_critical_fields_rate": 0.0,
    "vocabulary_compliance_rate": 1.0,
    "internal_consistency_score": 0.97,
    "split_leakage_check": true,
    "edge_case_coverage": { ... },
    "clinical_plausibility_proxy": 0.98
  },
  "demographic_audit": { ... },
  "gate_results": { ... },
  "detailed_failures": [ ... ]
}
```
Key Metrics Explained
| Metric | Description |
|--------------------------------|------------------------------------------------------------------|
| valid_schema_rate | Fraction of records passing Pydantic validation (target: 1.0) |
| duplicate_rate | Fraction of record pairs with cosine similarity > 0.95 (target: 0.0) |
| contradiction_rate | Fraction of records failing clinical consistency rules |
| clinical_plausibility_proxy | Aggregate clinical consistency score (1.0 - contradiction_rate) |
| internal_consistency_score | Composite of age-hours, severity-hours, and CPT-provider checks |
| split_leakage_check | Boolean confirming zero overlap between train/val/test |
Demographic Audit
The demographic_audit section (also saved separately as quality/demographic_audit.json) breaks down record distribution across:
- Age group: 2-3, 4-5, 6-11, 12-17, 18+
- Sex: male, female
- Diagnosis profile: ICD-10 code distribution
- Severity band: mild, moderate, severe
- Setting: clinic, home, school, telehealth
- Payer context: uhc, bcbs, aetna, cigna, tricare, medicaid_fl
- Intervention mix: FCT, DRA, DTT, NET, PRT, token economy, etc.
- Behavior categories: escape, attention, tangible, sensory
- Language: en, es
Each category shows both raw count and percentage of total records.
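Converting the raw counts back into percentages for your own reporting is a one-liner. A sketch; the helper name is illustrative, and the exact key layout of demographic_audit.json is an assumption (check your own batch's file):

```python
import json

def audit_percentages(counts: dict) -> dict:
    """Convert a category's raw counts into percentages of the total."""
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}

# Usage sketch -- the "severity_band" key is assumed, not documented:
# with open("batch_abc123/quality/demographic_audit.json") as f:
#     audit = json.load(f)
# print(audit_percentages(audit["severity_band"]))
```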
Batch Output Structure
Every batch is written to a self-contained directory:
```text
batch_{id}/
├── data/
│   ├── train.json                # 70% of records
│   ├── validation.json           # 15% of records
│   └── test.json                 # 15% of records
├── documentation/
│   ├── datasheet.yaml            # Gebru et al. Datasheets for Datasets
│   ├── healthsheet.yaml          # Google FAccT Healthsheet
│   ├── data_card.yaml            # Google PAIR Data Card
│   ├── nutrition_label.yaml      # Dataset Nutrition Label
│   └── schemas/                  # JSON Schema files per document type
├── quality/
│   ├── quality_report.json       # Summary quality metrics
│   ├── quality_report_full.json  # Full gate-by-gate results
│   └── demographic_audit.json    # Demographic distribution breakdown
├── provenance/
│   ├── generation_manifest.yaml  # What was generated, when, by whom
│   ├── prompt_config.yaml        # Prompt templates used
│   ├── ontology_version.yaml     # Clinical vocabulary version
│   ├── pipeline_version.txt      # Generator version string
│   ├── policy_applied.yaml       # Data governance policies applied
│   └── vlayer_passport.json      # VLayer v2 compliance passport
├── compliance/
│   ├── audit_log.jsonl           # Immutable audit trail
│   ├── clinical_review/          # Clinical review artifacts
│   └── test_evidence/            # Automated test results
├── raw_batch.json                # All records before splitting
└── manifest.json                 # Batch metadata and file inventory
```
What Goes Where
- data/: The files you load into your ML pipeline. Three pre-split JSON files ready for training.
- documentation/: Human-readable dataset documentation following published standards. Useful for model cards, IRB submissions, and internal audits.
- quality/: Machine-readable quality metrics. Parse quality_report.json to gate your pipeline -- reject batches below your quality threshold.
- provenance/: Full lineage tracking. Every batch records exactly which prompt templates, clinical ontologies, and pipeline versions produced it.
- compliance/: Audit trail and review artifacts for regulatory compliance.
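Before ingesting a batch, it is worth checking that the layout above actually arrived intact. A sketch that verifies a core subset of the files; the constant and helper names are illustrative, and you may want to extend the list for your own pipeline:

```python
from pathlib import Path

# Core files from the batch layout; extend as needed for your pipeline.
EXPECTED_FILES = [
    "data/train.json",
    "data/validation.json",
    "data/test.json",
    "quality/quality_report.json",
    "quality/demographic_audit.json",
    "provenance/pipeline_version.txt",
    "manifest.json",
]

def missing_files(batch_dir: str) -> list:
    """Return the expected batch files that are absent from batch_dir."""
    base = Path(batch_dir)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]

# Usage sketch:
# if missing := missing_files("batch_abc123"):
#     raise FileNotFoundError(f"Incomplete batch: {missing}")
```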
Integration Examples
Python -- Load and Iterate
```python
import json
from pathlib import Path

def load_batch(batch_dir: str) -> dict:
    """Load a SynthABA batch into a dict of splits."""
    batch_path = Path(batch_dir)
    splits = {}
    for split in ["train", "validation", "test"]:
        with open(batch_path / "data" / f"{split}.json") as f:
            splits[split] = json.load(f)
    return splits

# Load the batch
batch = load_batch("batch_abc123")

# Iterate over training records
for record in batch["train"]:
    case_type = record["case_type"]
    case_id = record["case_id"]
    severity = record["patient_context"]["severity"]
    print(f"[{case_type}] {case_id} -- severity: {severity}")
```
Python -- Filter by Demographics
```python
# Get all severe cases from the training set
severe_train = [
    r for r in batch["train"]
    if r["patient_context"]["severity"] == "severe"
]
print(f"Severe training cases: {len(severe_train)}")

# Get all telehealth cases
telehealth = [
    r for r in batch["train"]
    if r["patient_context"]["setting"] == "telehealth"
]

# Get Spanish-language records
spanish = [r for r in batch["train"] if r["language"] == "es"]
```
Python -- Quality Gate Check
```python
import json

with open("batch_abc123/quality/quality_report.json") as f:
    qr = json.load(f)

# Reject batches that fail quality gates
if not qr["all_gates_passed"]:
    print(f"Batch failed quality gates. Score: {qr['overall_quality_score']}")
    for gate_name, result in qr["gate_results"].items():
        if not result.get("passed"):
            print(f"  FAILED: {gate_name} -- {result}")
    raise ValueError("Batch did not pass quality gates")

# Check specific thresholds
metrics = qr["metrics"]
assert metrics["valid_schema_rate"] >= 0.99, "Schema validity too low"
assert metrics["duplicate_rate"] < 0.01, "Too many duplicates"
assert metrics["clinical_plausibility_proxy"] >= 0.90, "Clinical plausibility too low"
```
TypeScript -- Generate via API
```typescript
interface GenerateRequest {
  case_type: string;
  count: number;
  language?: "en" | "es";
  client_id?: string;
  order_id?: string;
}

interface GenerateResponse {
  batch_id: string;
  records_generated: number;
  errors: number;
  success_rate: number;
  quality_score: number;
  quality_passed: boolean;
  splits: { train: number; validation: number; test: number };
  ready_for_vlayer: boolean;
  elapsed_seconds: number;
  cases: Record<string, unknown>[];
}

async function generateBatch(
  apiKey: string,
  request: GenerateRequest
): Promise<GenerateResponse> {
  const response = await fetch(
    "https://synthaba-production.up.railway.app/generate",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": apiKey,
      },
      body: JSON.stringify(request),
    }
  );
  if (!response.ok) {
    const error = await response.json();
    throw new Error(`Generation failed (${response.status}): ${error.detail}`);
  }
  return response.json();
}

// Usage
const result = await generateBatch("your-api-key", {
  case_type: "soap_note",
  count: 100,
  language: "en",
});
console.log(`Generated ${result.records_generated} records`);
console.log(`Quality score: ${result.quality_score}`);
console.log(`Splits: ${JSON.stringify(result.splits)}`);
```
HuggingFace Datasets Integration
```python
import json

from datasets import Dataset

# Load a SynthABA batch into a HuggingFace Dataset
with open("batch_abc123/data/train.json") as f:
    train_records = json.load(f)

# Flatten patient_context into top-level columns for easier filtering
flat_records = []
for r in train_records:
    flat = {**r}
    ctx = flat.pop("patient_context", {})
    for k, v in ctx.items():
        flat[f"patient_{k}"] = str(v) if isinstance(v, list) else v
    flat_records.append(flat)

ds = Dataset.from_list(flat_records)
print(ds)
# Dataset({
#     features: ['case_id', 'case_type', 'language', 'difficulty', ...],
#     num_rows: 70
# })

# Filter and map
severe_ds = ds.filter(lambda x: x["patient_severity"] == "severe")
```
Rate Limits & Timeouts
| Parameter | Value |
|--------------------|------------------------------------------------|
| Max batch size | 2,000 records per request |
| Generation timeout | Up to 5 minutes for large batches |
| Recommended batch | 100 -- 500 records for optimal performance |
| Sample endpoint | 5 records max, rate limited per email address |
Recommendations
- For datasets over 2,000 records, issue multiple sequential requests and merge the results client-side.
- Set HTTP client timeouts to at least 300 seconds (5 minutes) for large batches.
- For production pipelines, use batch sizes of 100--500 records. This balances throughput with per-request reliability.
- Monitor elapsed_seconds in the response to calibrate your pipeline scheduling.
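The first recommendation (splitting a large order into sequential requests) reduces to computing per-request counts and concatenating the returned `cases` arrays. A sketch; `chunk_counts` is an illustrative helper, and the `generate_batch` call in the usage comment is hypothetical:

```python
def chunk_counts(total: int, batch_size: int = 500) -> list:
    """Split a large order into per-request counts within the API limits."""
    if not 1 <= batch_size <= 2000:
        raise ValueError("batch_size must be between 1 and the 2,000-record limit")
    counts = []
    remaining = total
    while remaining > 0:
        take = min(batch_size, remaining)
        counts.append(take)
        remaining -= take
    return counts

# Usage sketch: one /generate call per count, merging results client-side.
# all_cases = []
# for n in chunk_counts(5000):
#     resp = generate_batch(api_key, {"case_type": "soap_note", "count": n})
#     all_cases.extend(resp["cases"])
```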
Versioning
Every record and every batch tracks version information for full reproducibility.
Version Fields in Records
| Field | Example | Description |
|---------------------|-----------|-------------------------------------|
| generator_version | 1.0.0 | SynthABA pipeline version |
| template_version | 1.0.0 | Schema template version |
| generated_by | claude-sonnet | Model used for content generation |
Version Files in Batches
- provenance/pipeline_version.txt -- Plain text pipeline version
- provenance/ontology_version.yaml -- Clinical vocabulary version
- provenance/generation_manifest.yaml -- Full generation metadata
Compatibility
When the pipeline version changes, the JSON Schema may gain new fields. New fields are always additive (existing fields are never removed or renamed). Pin to a specific generator_version in your data loading code if you need strict schema stability:
```python
expected_version = "1.0.0"
for record in records:
    assert record["generator_version"] == expected_version, (
        f"Unexpected version: {record['generator_version']}"
    )
```
Support
- Technical support: support@synthaba.com
- API issues: Always include the batch_id from the response in your support request.
- Bug reports: Include the full response body (or at minimum batch_id, quality_score, and any error messages).
- Schema questions: Reference the JSON Schema files in documentation/schemas/ within your batch -- they are the authoritative field specification.