SynthABA Bias and Fairness Report
Version: 1.0
Date: April 2026
Prepared by: SynthABA / FPI Enterprises, Inc.
1. Purpose
This report provides transparent disclosure of the demographic distributions, known biases, limitations, and mitigation strategies within SynthABA's synthetic clinical datasets. It is intended for ethics committees, governance reviewers, compliance officers, and any organization evaluating SynthABA data for use in machine learning, research, or product development.
SynthABA is committed to responsible AI practices. Synthetic data is not bias-free by default -- it reflects the design choices, clinical templates, and parameter distributions selected by its creators. This report documents those choices openly so that consumers can make informed decisions about how to use the data and what supplementary measures may be needed.
2. Demographic Distributions
The following tables detail the demographic distributions present in SynthABA datasets alongside real-world reference data for context.
2.1 Sex Distribution
| Sex | SynthABA Distribution | Real-World Reference |
|-----|----------------------|---------------------|
| Male | 72% | CDC MMWR 2023: ASD diagnosed at 3.8:1 male-to-female ratio (~79%) |
| Female | 28% | CDC MMWR 2023: ~21% of ASD diagnoses |
Notes: SynthABA's distribution is intentionally slightly more balanced than raw CDC prevalence data to improve model generalization for female patients, who are historically underdiagnosed. The 72/28 split represents a compromise between reflecting real-world prevalence and mitigating known diagnostic bias against females.
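The ~79% reference figure follows from the 3.8:1 ratio, since a ratio of r:1 implies a male share of r / (r + 1). The short check below reproduces that arithmetic; the helper name `male_share_from_ratio` is hypothetical and used only for illustration.

```python
def male_share_from_ratio(male_to_female: float) -> float:
    """Convert a male-to-female ratio (3.8 for 3.8:1) into the male share of diagnoses."""
    return male_to_female / (male_to_female + 1.0)


cdc_male_share = male_share_from_ratio(3.8)   # share implied by the CDC ratio
synth_male_share = 0.72                       # SynthABA share from the table above

print(f"CDC-implied male share: {cdc_male_share:.1%}")
shift = (cdc_male_share - synth_male_share) * 100
print(f"Shift toward female records: {shift:.1f} percentage points")
```

With these inputs, the 72/28 split moves roughly seven percentage points of records from male to female relative to the CDC-implied share.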
2.2 Age Distribution
| Age Band | SynthABA Distribution | Real-World Reference |
|----------|----------------------|---------------------|
| 2-3 years | 15% | Early intervention period; growing but still underserved |
| 4-5 years | 25% | Peak diagnosis age; high service utilization |
| 6-11 years | 35% | School-age peak; highest volume of ABA services |
| 12-17 years | 20% | Adolescent services; moderate utilization |
| 18+ years | 5% | Adult services; historically underserved, growing demand |
Notes: The distribution reflects typical ABA service utilization patterns in the United States, where school-age children represent the largest service population. The 18+ band is intentionally conservative; see Section 3 for discussion.
2.3 Severity Distribution
| Severity Level | SynthABA Distribution | DSM-5 Reference |
|---------------|----------------------|-----------------|
| Mild (Level 1 -- Requiring Support) | 30% | Estimated 30-40% of ASD diagnoses |
| Moderate (Level 2 -- Requiring Substantial Support) | 45% | Estimated 35-45% of ASD diagnoses |
| Severe (Level 3 -- Requiring Very Substantial Support) | 25% | Estimated 15-25% of ASD diagnoses |
Notes: SynthABA slightly overrepresents severe cases relative to some prevalence estimates to ensure adequate training data for models that must handle complex presentations. Severity levels map to DSM-5 ASD support levels.
2.4 Diagnosis Code Distribution
| Diagnosis Code | Description | Representation |
|---------------|-------------|----------------|
| F84.0 | Autism Spectrum Disorder | Primary (dominant) |
| F90.0, F90.1, F90.2, F90.9 | ADHD (various presentations) | Common comorbidity |
| F70, F71 | Mild and Moderate Intellectual Disability | Moderate comorbidity |
| F80.1 | Expressive Language Disorder | Common comorbidity |
| F80.2 | Mixed Receptive-Expressive Language Disorder | Present |
| F41.1 | Generalized Anxiety Disorder | Present |
| F84.5 | Asperger Syndrome (legacy code) | Minimal (historical records only) |
Notes: F84.0 (ASD) is the dominant diagnosis, consistent with the Dataset's focus on ABA services. Comorbid conditions are included at clinically realistic rates to ensure models can handle multi-diagnosis presentations.
2.5 Insurer Distribution
| Insurer | SynthABA Distribution | Notes |
|---------|----------------------|-------|
| UnitedHealthcare (UHC) | ~17% | Roughly equal distribution |
| Blue Cross Blue Shield (BCBS) | ~17% | Roughly equal distribution |
| Aetna | ~17% | Roughly equal distribution |
| Cigna | ~17% | Roughly equal distribution |
| TRICARE | ~16% | Military/veteran families |
| Medicaid (Florida) | ~16% | Public payer |
Notes: SynthABA uses an approximately equal distribution across six major payers. This is an intentional design choice to prevent insurer-specific model bias, though it does not reflect real-world payer market share, which varies significantly by state and region.
2.6 Service Setting Distribution
| Setting | SynthABA Distribution | Real-World Reference |
|---------|----------------------|---------------------|
| Clinic | 50% | Most common ABA service delivery setting |
| Home | 30% | Second most common; significant for early intervention |
| School | 15% | Growing but logistically constrained |
| Telehealth | 5% | Expanded post-COVID; still a minority of total hours |
2.7 Language
| Language | Status |
|----------|--------|
| English | Primary; all records available |
| Spanish | Secondary; available on request for select document types |
3. Known Biases and Limitations
3.1 Adult Underrepresentation
Adults aged 18+ constitute only 5% of the Dataset. Adult ASD services are a growing segment, but adults have historically been underserved and underrepresented in clinical data. SynthABA's distribution reflects historical service utilization patterns rather than projected future demand. Organizations training models for adult ASD populations should be aware of this limitation and consider supplementary data sources.
3.2 Racial and Ethnic Dimensions Not Modeled
SynthABA intentionally does not model race or ethnicity. Records use age bands, diagnosis codes, and service settings as demographic dimensions -- not race, ethnicity, or national origin. This design choice was made to prevent the encoding of racial bias into synthetic records, which could propagate through downstream models. However, this means:
- The Dataset cannot be used to train race-aware models
- Models trained exclusively on SynthABA data will have no exposure to racial/ethnic health disparities
- Fairness testing across racial subgroups is not possible with SynthABA data alone
3.3 No Geographic Dimension
All records use generic service settings (clinic, home, school, telehealth) without geographic identifiers. There is no state, region, ZIP code, or urban/rural classification. This prevents geographic bias but limits the Dataset's utility for location-sensitive analyses such as regional service availability or state-specific regulatory compliance.
3.4 Insurer Distribution Does Not Reflect Market Reality
The approximately equal distribution across six insurers is a deliberate simplification. In practice, insurer market share varies dramatically by state, with Medicaid dominating in some regions and commercial payers in others. Models trained on SynthABA data may underperform when encountering heavily skewed real-world payer mixes.
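One way a consumer might compensate for the flat payer mix is to reweight records toward the mix expected in deployment. The sketch below assumes the approximate shares from Section 2.5 and a target mix invented purely for illustration. The function name `payer_weights` and the payer keys are hypothetical, not SynthABA field names.

```python
# Approximate SynthABA payer shares from Section 2.5 (the report gives them as "~").
SYNTH_PAYER_SHARE = {
    "UHC": 0.17, "BCBS": 0.17, "Aetna": 0.17, "Cigna": 0.17,
    "TRICARE": 0.16, "Medicaid": 0.16,
}


def payer_weights(target_share: dict[str, float]) -> dict[str, float]:
    """Per-record sample weight for each payer: target share / SynthABA share."""
    total = sum(target_share.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"target shares must sum to 1.0, got {total:.4f}")
    return {
        payer: target_share.get(payer, 0.0) / synth_share
        for payer, synth_share in SYNTH_PAYER_SHARE.items()
    }


# Hypothetical Medicaid-heavy deployment mix; these numbers are NOT real market data.
example_target = {
    "UHC": 0.10, "BCBS": 0.15, "Aetna": 0.05,
    "Cigna": 0.05, "TRICARE": 0.05, "Medicaid": 0.60,
}
weights = payer_weights(example_target)
for payer, weight in weights.items():
    print(f"{payer}: {weight:.2f}")
```

The resulting weights can be passed as per-sample weights to most training libraries. Reweighting changes the effective payer mix but adds no new information about payer-specific documentation patterns.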
3.5 Newer Discipline Templates
Psychotherapy, Speech-Language Pathology (SLP), and Occupational Therapy (OT) templates were developed more recently than the core ABA templates. As a result:
- These templates have undergone fewer rounds of clinician review
- Terminology and documentation conventions may be slightly less refined in early batches
- Quality is expected to converge with ABA template quality as clinician feedback is incorporated
4. Mitigation Strategies
SynthABA employs the following strategies to detect, measure, and mitigate bias in its datasets:
4.1 Pipeline-Level Controls
- Gate 6 -- Demographic Balance Check: Every batch is validated against target demographic distributions at Gate 6 of the quality pipeline. Batches that deviate beyond acceptable thresholds are flagged for rebalancing before release.
- Gate 8 -- Edge Case Coverage: Edge cases -- including severe presentations, comorbid diagnoses, telehealth settings, and bilingual records -- are explicitly enforced at Gate 8. A minimum representation threshold ensures these cases are never absent from a released batch.
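Consumers who want to run a comparable balance check on their own extracts could start from something like the sketch below. It is not the Gate 6 implementation: the tolerance value, the function name `balance_violations`, and the age-band labels are assumptions, and only the target shares come from Section 2.2.

```python
from collections import Counter

# Target shares from Section 2.2; the band labels are illustrative.
TARGET_AGE_SHARE = {"2-3": 0.15, "4-5": 0.25, "6-11": 0.35, "12-17": 0.20, "18+": 0.05}
TOLERANCE = 0.03  # assumed absolute tolerance; the report does not publish thresholds


def balance_violations(values, target, tolerance=TOLERANCE):
    """Return {category: (observed_share, target_share)} for categories outside tolerance."""
    if not values:
        raise ValueError("batch is empty")
    counts = Counter(values)
    n = len(values)
    violations = {}
    for category, expected in target.items():
        observed = counts.get(category, 0) / n
        if abs(observed - expected) > tolerance:
            violations[category] = (observed, expected)
    # Categories that appear in the batch but have no target are always reported.
    for category in counts.keys() - target.keys():
        violations[category] = (counts[category] / n, 0.0)
    return violations


# A deliberately skewed 100-record batch.
batch = ["2-3"] * 15 + ["4-5"] * 25 + ["6-11"] * 45 + ["12-17"] * 14 + ["18+"] * 1
for band, (observed, expected) in balance_violations(batch, TARGET_AGE_SHARE).items():
    print(f"{band}: observed {observed:.0%}, target {expected:.0%}")
```

The same function works for any single categorical dimension (sex, severity, setting) by swapping in the matching target table.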
4.2 Generation-Level Controls
- ParameterVariator: The generation engine uses a ParameterVariator module that ensures systematic variation across all demographic dimensions. This prevents clustering or over-concentration in any single demographic profile.
- Template Diversity: Multiple template variants exist for each document type and clinical context, reducing the risk of formulaic records that could introduce structural bias.
4.3 Review-Level Controls
- Clinician Review: Licensed clinicians from multiple disciplines (BCBAs, SLPs, OTs, psychologists) review samples from every batch. Review specifically checks for discipline-specific terminology errors, clinical implausibility, and unintended demographic patterns.
- Inter-Rater Reliability: Multiple reviewers assess overlapping samples to ensure consistency in quality and bias assessments.
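This report does not state which agreement statistic is used for inter-rater reliability. As an illustration only, the sketch below computes Cohen's kappa, a common choice for two reviewers assigning categorical labels to the same samples. The reviewer labels are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented review outcomes for eight overlapping samples.
reviewer_1 = ["pass", "pass", "flag", "pass", "flag", "pass", "pass", "flag"]
reviewer_2 = ["pass", "pass", "flag", "flag", "flag", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.3f}")
```

Here the reviewers agree on six of eight samples (75%), but kappa is lower because much of that agreement would be expected by chance given how often each reviewer assigns "pass".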
5. Recommendations for Consumers
Organizations using SynthABA data should consider the following recommendations to ensure fair and responsible use:
5.1 Race-Aware Models
If training models that must account for racial or ethnic disparities in ASD diagnosis, treatment access, or outcomes, supplement SynthABA data with diverse real-world data that includes racial and ethnic demographics. SynthABA alone cannot support race-aware modeling.
5.2 Adult ASD Use Cases
For applications focused on adult ASD populations, note that the 18+ age band represents only 5% of the Dataset. Consider augmenting with additional adult-focused data or oversampling the 18+ records during training to prevent underperformance for this population.
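A minimal sketch of the oversampling option, assuming records are dictionaries with an `age_band` field (a hypothetical field name) and that a 20% adult share is the goal (an arbitrary example value):

```python
import random


def oversample(records, key, group, target_share, seed=0):
    """Resample `group` with replacement until it is roughly `target_share` of the result."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    minority = [r for r in records if r[key] == group]
    majority = [r for r in records if r[key] != group]
    if not minority:
        raise ValueError(f"no records found for group {group!r}")
    # Solve m / (m + len(majority)) = target_share for the minority count m.
    needed = round(target_share * len(majority) / (1 - target_share))
    extra = max(0, needed - len(minority))
    rng = random.Random(seed)  # fixed seed keeps the resample reproducible
    return majority + minority + rng.choices(minority, k=extra)


# Toy dataset mirroring the 95% / 5% split described above.
records = [{"age_band": "6-11"}] * 95 + [{"age_band": "18+"}] * 5
resampled = oversample(records, key="age_band", group="18+", target_share=0.20)
adults = sum(r["age_band"] == "18+" for r in resampled)
print(f"{adults} of {len(resampled)} records are 18+ ({adults / len(resampled):.1%})")
```

Oversampling repeats existing adult records rather than adding new clinical variety, so it reduces imbalance during training without substituting for additional adult-focused data.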
5.3 Non-US Use Cases
SynthABA uses US-centric clinical terminology, insurer conventions, CPT/HCPCS billing codes, and documentation standards. Organizations operating outside the United States should evaluate whether these conventions are compatible with their local regulatory and clinical requirements.
5.4 Model Monitoring
After training on SynthABA data, monitor model performance across demographic subgroups (age bands, severity levels, service settings) to detect any disparities that may have been introduced or amplified during training. Standard fairness metrics (equalized odds, demographic parity, calibration) are recommended.
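As a rough illustration of subgroup monitoring, the sketch below computes per-group selection rate, true positive rate, and false positive rate for a binary classifier, then reports the largest gap between groups. The function names and the toy labels are assumptions. Calibration is omitted because it requires predicted probabilities rather than hard labels.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, TPR, and FPR for binary (0/1) labels and predictions."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["pred_pos"] += pred
        if truth == 1:
            s["pos"] += 1
            s["tp"] += pred
        else:
            s["neg"] += 1
            s["fp"] += pred
    return {
        group: {
            "selection_rate": s["pred_pos"] / s["n"],
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
            "fpr": s["fp"] / s["neg"] if s["neg"] else float("nan"),
        }
        for group, s in stats.items()
    }


def max_gap(rates, metric):
    """Largest between-group difference for one metric, ignoring undefined (NaN) values."""
    values = [r[metric] for r in rates.values() if r[metric] == r[metric]]
    return max(values) - min(values) if values else float("nan")


# Toy labels for two age bands; invented for illustration.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["6-11"] * 4 + ["18+"] * 4

rates = group_rates(y_true, y_pred, groups)
demographic_parity_gap = max_gap(rates, "selection_rate")
equalized_odds_gap = max(max_gap(rates, "tpr"), max_gap(rates, "fpr"))
print(f"demographic parity gap: {demographic_parity_gap:.2f}")
print(f"equalized odds gap: {equalized_odds_gap:.2f}")
```

A gap of zero means the groups are treated identically on that metric. What counts as an acceptable gap depends on the application and is not specified in this report.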
5.5 Intended Use Boundaries
SynthABA data is designed for model training, testing, benchmarking, and research -- not for direct clinical decision-making. Models trained on synthetic data should undergo additional validation with real-world data before deployment in clinical environments.
6. Commitment
SynthABA commits to the following ongoing practices to advance fairness and reduce bias in its synthetic datasets:
- Transparent Reporting: Publishing updated demographic distributions and bias analyses with each major release. This report will be versioned alongside the Dataset.
- Adult Representation Expansion: Increasing the 18+ age band representation in future releases as adult ABA services expand and more clinical templates become available for this population.
- Insurer and Geographic Expansion: Adding more insurers, state-specific Medicaid variants, and geographic dimensions in future versions to better reflect the diversity of real-world ABA service delivery.
- Community Feedback: Accepting and acting on bias-related feedback from customers, clinicians, researchers, and the broader ABA community. Reports of bias concerns can be submitted to the SynthABA team for investigation and remediation.
- Discipline Parity: Investing in additional clinician review cycles for newer discipline templates (psychotherapy, SLP, OT) to achieve quality parity with core ABA templates.
- Annual Review: Conducting a comprehensive bias and fairness review at least annually, incorporating new research, updated prevalence data, and lessons learned from customer feedback.
This report reflects SynthABA's best understanding of its dataset characteristics as of the publication date. Demographic distributions and mitigation strategies may evolve across releases. Consumers are encouraged to review the latest version of this report for the most current information.