SynthABA HIPAA Position Paper
Synthetic Clinical Data and HIPAA Compliance
Prepared by: SynthABA, Inc.
Date: April 2026
Version: 1.0
Classification: Public — For Distribution to Prospective Buyers and Their Counsel
Executive Summary
SynthABA generates synthetic clinical documentation entirely from structured templates and AI language models. No real patient data is used as input at any stage of the generation pipeline, and no real patient data appears in any output. This paper sets forth SynthABA's position that its synthetic datasets do not constitute Protected Health Information (PHI) as defined under the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and its implementing regulations, and therefore do not trigger HIPAA obligations for organizations that acquire, store, or process them.
1. What is Synthetic Data?
Synthetic data is information that is artificially generated rather than collected from real-world events or individuals. It is critical to distinguish synthetic data from two related but fundamentally different concepts:
- De-identified data begins as real patient data and undergoes a process of removing or obscuring the 18 categories of identifiers specified under the HIPAA Safe Harbor method (45 CFR Section 164.514(b)(2)). De-identified data is derived from real records and carries residual re-identification risk that must be managed.
- Anonymized data similarly begins as real data and applies statistical or technical transformations to prevent re-identification. Anonymization may use techniques such as k-anonymity, l-diversity, or differential privacy, but the source remains real patient records.
SynthABA's approach is fundamentally different from both. SynthABA's records are generated from clinical templates (defined as Pydantic schemas with field-level constraints) and an AI language model's general clinical knowledge. At no point in the generation pipeline does SynthABA access, ingest, transform, or reference any individual patient's medical records, insurance claims, or personally identifiable information. The data is fabricated from first principles, not derived from any real-world dataset.
This distinction is material: de-identification and anonymization are risk-reduction processes applied to real PHI, whereas SynthABA's generation process never involves PHI at any stage.
2. HIPAA Safe Harbor Analysis
The HIPAA Privacy Rule establishes two methods for rendering health information "not individually identifiable" under 45 CFR Section 164.514: the Expert Determination method (Section 164.514(b)(1)) and the Safe Harbor method (Section 164.514(b)(2)).
The Safe Harbor method requires the removal of the following 18 categories of identifiers:
| # | Identifier Category | SynthABA Status |
|---|---------------------|-----------------|
| 1 | Names | Never present. Patient context uses no real names. |
| 2 | Geographic data smaller than state | Never present. No addresses are generated. |
| 3 | All elements of dates (except year) directly related to an individual, including birth date, admission date, discharge date, date of death, and all ages over 89 | Never present. Age bands (e.g., "3-5 years") are used, not specific dates of birth. |
| 4 | Telephone numbers | Never present. Not part of any template schema. |
| 5 | Fax numbers | Never present. Not part of any template schema. |
| 6 | Email addresses | Never present. Not part of any template schema. |
| 7 | Social Security Numbers | Never present. Not part of any template schema. |
| 8 | Medical Record Numbers | Never present. Synthetic batch IDs are used instead. |
| 9 | Health plan beneficiary numbers | Never present. Insurer category (e.g., "Medicaid") is used, not policy numbers. |
| 10 | Account numbers | Never present. Not part of any template schema. |
| 11 | Certificate/license numbers | Never present. Not part of any template schema. |
| 12 | Vehicle identifiers and serial numbers | Never present. Not part of any template schema. |
| 13 | Device identifiers and serial numbers | Never present. Not part of any template schema. |
| 14 | Web URLs | Never present. Not part of any template schema. |
| 15 | IP addresses | Never present. Not part of any template schema. |
| 16 | Biometric identifiers | Never present. Not part of any template schema. |
| 17 | Full-face photographs and comparable images | Never present. SynthABA generates text only. |
| 18 | Any other unique identifying number, characteristic, or code | Never present. No real-world identifiers are referenced. |
The Safe Harbor standard presupposes a starting point of real health information from which identifiers are removed. SynthABA's data never contained any of these identifiers because the data is generated, not de-identified. The Safe Harbor analysis is presented here not because SynthABA data requires de-identification, but to demonstrate that even if one were to apply the Safe Harbor framework as an analytical lens, no identifiers are present.
Additionally, SynthABA has no actual knowledge that the information generated could be used, alone or in combination with other information, to identify any individual, satisfying the residual-knowledge requirement of 45 CFR Section 164.514(b)(2)(ii).
3. Generation Methodology
SynthABA's data generation pipeline operates as follows:
1. Template Definition. Clinical templates define the structure of each document type (e.g., session notes, assessment reports, treatment plans). Templates are implemented as Pydantic schemas with explicit field constraints governing data types, value ranges, and enumerated options.
2. Patient Context Generation. A synthetic patient context is generated algorithmically for each record. This context includes randomized parameters such as age band (e.g., "3-5 years"), sex, diagnostic codes (e.g., F84.0), severity level, and insurer category (e.g., "Medicaid," "Commercial"). These parameters are drawn from clinically plausible distributions, not from any patient database.
3. AI Content Generation. An AI language model (Anthropic Claude Sonnet) generates clinical narrative content using the template structure and synthetic patient context as inputs. The model draws on general clinical knowledge acquired during its pretraining phase, not on any patient-specific data source.
4. No Real Patient Data at Any Stage. No real patient records are used as input, training data, fine-tuning data, few-shot examples, or reference material at any stage of the pipeline. No real patient database is accessed.
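The template-and-context steps above can be sketched as follows. This is an illustrative sketch only: it uses standard-library dataclasses in place of the Pydantic schemas described here, and every enumeration (age bands, severities, insurers) is a hypothetical stand-in, not SynthABA's actual constraint set.

```python
import random
from dataclasses import dataclass

# Hypothetical enumerations standing in for field-level schema constraints.
AGE_BANDS = ["3-5 years", "6-9 years", "10-13 years"]
SEXES = ["female", "male"]
DIAGNOSIS_CODES = ["F84.0"]
SEVERITIES = ["level 1", "level 2", "level 3"]
INSURERS = ["Medicaid", "Commercial"]


@dataclass(frozen=True)
class PatientContext:
    """Synthetic patient parameters; the schema has no identifier fields at all."""
    age_band: str
    sex: str
    diagnosis_code: str
    severity: str
    insurer_category: str

    def __post_init__(self):
        # Enforce enumerated-option constraints, analogous to schema validation.
        assert self.age_band in AGE_BANDS
        assert self.sex in SEXES
        assert self.diagnosis_code in DIAGNOSIS_CODES
        assert self.severity in SEVERITIES
        assert self.insurer_category in INSURERS


def generate_context(rng: random.Random) -> PatientContext:
    """Draw each parameter from its enumeration; no database is consulted."""
    return PatientContext(
        age_band=rng.choice(AGE_BANDS),
        sex=rng.choice(SEXES),
        diagnosis_code=rng.choice(DIAGNOSIS_CODES),
        severity=rng.choice(SEVERITIES),
        insurer_category=rng.choice(INSURERS),
    )
```

Because every field is constrained to an enumeration and no field can hold a name, date of birth, or identifier, the context object structurally cannot carry PHI.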
Distinction from other synthetic data providers. Some companies in the synthetic data space (e.g., Syntegra, MDClone) train generative models directly on real patient datasets and produce synthetic records that preserve the statistical properties of the source population. SynthABA does not follow this approach. SynthABA's AI model possesses general clinical knowledge from its broad pretraining corpus but has no access to any patient database, electronic health record system, or claims dataset. This is a material distinction: SynthABA's outputs cannot "leak" real patient information because no real patient information enters the system.
4. PHI Detection Pipeline (Defense in Depth)
Although no PHI should be present in SynthABA's output (because none is used as input), SynthABA implements a defense-in-depth PHI detection pipeline as an additional safeguard. Every generated record is scanned against 163+ regular expression patterns designed to detect potential PHI, including:
- Social Security Numbers (XXX-XX-XXXX and variant patterns)
- Phone numbers (10-digit, formatted, international variants)
- Email addresses (standard RFC 5322 patterns)
- Physical addresses (street number + street name patterns, ZIP codes)
- Medical Record Numbers (common MRN formats)
- Dates of birth (MM/DD/YYYY, YYYY-MM-DD, and natural language variants)
- Real names (matched against a database of common given names and surnames)
- Insurance policy numbers (common payer ID formats)
- IP addresses (IPv4 and IPv6 patterns)
- Financial account numbers (credit card, bank account patterns)
- Vehicle and device identifiers
- 150+ additional patterns covering edge cases and format variants
Zero-tolerance policy. Any record that produces even a single PHI pattern match is automatically rejected and excluded from the dataset. There is no manual override for PHI findings. Records are not remediated; they are discarded entirely and regenerated.
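The scan-and-reject logic can be sketched as below. The pattern set here is a small illustrative subset, not the 163+ production patterns, and the function names are hypothetical.

```python
import re

# Illustrative subset of PHI detection patterns (production uses 163+).
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date_of_birth": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_record(text: str) -> list[str]:
    """Return the name of every pattern that matched; an empty list means clean."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)]


def accept_record(text: str) -> bool:
    """Zero tolerance: a single pattern match rejects the record outright."""
    return not scan_record(text)
```

Rejected records would then be discarded and regenerated rather than remediated, matching the policy described above.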
5. VLayer Compliance Pipeline
Every batch of synthetic records passes through a four-step compliance pipeline before release:
1. Ingest. The batch enters a quarantine state. No data is accessible to downstream consumers until all compliance steps are completed.
2. PHI Scan. The 163+ pattern deep scan described in Section 4 is executed against every record in the batch. The pass criterion is zero findings across the entire batch.
3. Synthetic Validation. Records are evaluated for clinical plausibility, internal consistency, and adherence to template constraints. This step ensures that the data is useful for its intended purpose (ML training, benchmarking, product testing) while confirming it does not inadvertently replicate real clinical scenarios with sufficient specificity to identify an individual.
4. Release. A compliance passport is generated for the batch, including a SHA-256 cryptographic hash of the dataset contents, timestamps for each pipeline step, and pass/fail status for each check. No batch is released without passing every preceding check.
The VLayer pipeline produces an auditable compliance record for every dataset shipped to buyers.
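The compliance passport issued at release might be assembled along these lines. The passport field names and helper functions shown here are hypothetical illustrations of the described artifact (SHA-256 content hash, per-step status, timestamp), not VLayer's actual format.

```python
import hashlib
import json
from datetime import datetime, timezone


def batch_hash(records: list[str]) -> str:
    """SHA-256 over a canonical serialization of the batch contents."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def compliance_passport(records: list[str], step_results: dict[str, bool]) -> dict:
    """Assemble the auditable release artifact for a batch."""
    return {
        "sha256": batch_hash(records),          # ties the passport to exact contents
        "steps": step_results,                  # pass/fail for each pipeline step
        "passed": all(step_results.values()),   # release gate: every check must pass
        "issued_at": datetime.now(timezone.utc).isoformat(),
    }
```

Because the hash is computed over the dataset contents, a buyer can recompute it on receipt and verify that the delivered batch is byte-for-byte the one that passed the pipeline.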
6. Clinical Review
In addition to automated compliance checks, licensed clinicians review samples from every batch. Reviewers include Board Certified Behavior Analysts (BCBAs), Speech-Language Pathologists (SLPs), Occupational Therapists (OTs), and licensed psychologists, selected based on the clinical domain of the generated records.
Reviewers evaluate samples against a standardized 10-item quality checklist. One checklist item is specifically: "Zero identifiable information (no PHI)." This item requires the reviewer to confirm that the sample contains no information that could reasonably be used to identify a real individual.
Clinical review provides a human-in-the-loop verification layer that complements the automated PHI detection pipeline, addressing edge cases that pattern matching alone may not capture (e.g., highly specific clinical narratives that might inadvertently describe a recognizable case).
7. Legal Position
Based on the foregoing analysis, SynthABA states the following legal positions:
7.1. SynthABA synthetic datasets are not Protected Health Information as defined by 45 CFR Section 160.103. PHI is defined as individually identifiable health information that is created or received by a covered entity or business associate and relates to the past, present, or future physical or mental health condition of an individual, the provision of health care to an individual, or payment for health care. SynthABA data does not relate to any individual because it is not created from, derived from, or linked to any real person's health information.
7.2. SynthABA synthetic datasets do not require a Business Associate Agreement (BAA) between SynthABA and its buyers. Because the data does not constitute PHI, the HIPAA business associate provisions (45 CFR Section 160.103, definition of "business associate") are not triggered by the receipt, storage, or processing of SynthABA data.
7.3. SynthABA synthetic datasets do not require de-identification certification under HIPAA. De-identification under 45 CFR Section 164.514 is a process applied to PHI to render it no longer individually identifiable. SynthABA data was never individually identifiable health information and therefore does not require de-identification.
7.4. Organizations that acquire SynthABA synthetic datasets may use them for machine learning training, benchmarking, software testing, and product development without incurring HIPAA compliance obligations specific to the data itself. Standard organizational HIPAA obligations (e.g., those arising from the organization's own handling of real patient data) are unaffected and remain the organization's responsibility.
Note: This position paper represents SynthABA's analysis and is provided for informational purposes. Organizations should consult their own legal counsel regarding their specific HIPAA compliance obligations.
8. Attestation
SynthABA hereby attests that its synthetic clinical datasets:
- Are generated entirely from AI models and clinical templates — no real patient records are used as input, training data, or reference material at any stage of the generation pipeline;
- Do not contain, derive from, or reference any real patient's protected health information — all patient context parameters (age band, sex, diagnosis, severity, insurer category) are generated algorithmically from clinically plausible distributions;
- Are scanned with 163+ PHI detection patterns under a zero-tolerance policy — any record producing a pattern match is automatically rejected and regenerated; and
- Are reviewed by licensed clinicians for any inadvertent identifiable information — clinical reviewers confirm zero identifiable information as part of a standardized quality checklist applied to every batch.
References
- 45 CFR Section 164.514(b) — Standards for De-identification of Protected Health Information (Safe Harbor and Expert Determination methods). Available at: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-E/section-164.514
- 45 CFR Section 160.103 — Definitions, including Protected Health Information, Covered Entity, and Business Associate. Available at: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-160/subpart-A/section-160.103
- U.S. Department of Health and Human Services, Office for Civil Rights. "Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule." Originally published November 2012; updated September 2022. Available at: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
- National Institute of Standards and Technology (NIST). "NIST Special Publication 800-188: De-Identifying Government Datasets." December 2016 (2nd draft). Available at: https://csrc.nist.gov/publications/detail/sp/800-188/draft
This document is provided for informational and compliance review purposes. It does not constitute legal advice. Organizations evaluating SynthABA datasets for use in regulated environments should consult qualified legal counsel regarding their specific obligations under HIPAA and applicable state privacy laws.