SynthABA HIPAA Position Paper
Synthetic Clinical Data and HIPAA Compliance
Prepared by: SynthABA, Inc.
Date: April 2026
Version: 1.0
Classification: Public — For Distribution to Prospective Buyers and Their Counsel
Executive Summary
SynthABA generates synthetic clinical documentation entirely from structured templates and AI language models. No real patient data is used as input at any stage of the generation pipeline, and no real patient data appears in any output. This paper sets forth SynthABA's position that its synthetic datasets do not constitute Protected Health Information (PHI) as defined under the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and its implementing regulations, and therefore do not trigger HIPAA obligations for organizations that acquire, store, or process them.
1. What is Synthetic Data?
Synthetic data is information that is artificially generated rather than collected from real-world events or individuals. It is critical to distinguish synthetic data from two related but fundamentally different concepts:
- De-identified data begins as real patient data and undergoes a process of removing or obscuring the 18 categories of identifiers specified under the HIPAA Safe Harbor method (45 CFR Section 164.514(b)(2)). De-identified data is derived from real records and carries residual re-identification risk that must be managed.
- Anonymized data similarly begins as real data and applies statistical or technical transformations to prevent re-identification. Anonymization may use techniques such as k-anonymity, l-diversity, or differential privacy, but the source remains real patient records.
SynthABA's approach is fundamentally different from both. SynthABA's records are generated from clinical templates (defined as Pydantic schemas with field-level constraints) and an AI language model's general clinical knowledge. At no point in the generation pipeline does SynthABA access, ingest, transform, or reference any individual patient's medical records, insurance claims, or personally identifiable information. The data is fabricated from first principles, not derived from any real-world dataset.
This distinction is material: de-identification and anonymization are risk-reduction processes applied to real PHI, whereas SynthABA's generation process never involves PHI at any stage.
2. HIPAA Safe Harbor Analysis
The HIPAA Privacy Rule establishes two methods for rendering health information "not individually identifiable" under 45 CFR Section 164.514: the Expert Determination method (Section 164.514(b)(1)) and the Safe Harbor method (Section 164.514(b)(2)).
The Safe Harbor method requires the removal of the following 18 categories of identifiers:
| # | Identifier Category | SynthABA Status |
|---|---------------------|-----------------|
| 1 | Names | Never present. Patient context uses no real names. |
| 2 | Geographic data smaller than state | Never present. No addresses are generated. |
| 3 | All elements of dates (except year) directly related to an individual, including birth date, admission date, discharge date, date of death, and all ages over 89 | Never present. Age bands (e.g., "3-5 years") are used, not specific dates of birth. |
| 4 | Telephone numbers | Never present. Not part of any template schema. |
| 5 | Fax numbers | Never present. Not part of any template schema. |
| 6 | Email addresses | Never present. Not part of any template schema. |
| 7 | Social Security Numbers | Never present. Not part of any template schema. |
| 8 | Medical Record Numbers | Never present. Synthetic batch IDs are used instead. |
| 9 | Health plan beneficiary numbers | Never present. Insurer category (e.g., "Medicaid") is used, not policy numbers. |
| 10 | Account numbers | Never present. Not part of any template schema. |
| 11 | Certificate/license numbers | Never present. Not part of any template schema. |
| 12 | Vehicle identifiers and serial numbers | Never present. Not part of any template schema. |
| 13 | Device identifiers and serial numbers | Never present. Not part of any template schema. |
| 14 | Web URLs | Never present. Not part of any template schema. |
| 15 | IP addresses | Never present. Not part of any template schema. |
| 16 | Biometric identifiers | Never present. Not part of any template schema. |
| 17 | Full-face photographs and comparable images | Never present. SynthABA generates text only. |
| 18 | Any other unique identifying number, characteristic, or code | Never present. No real-world identifiers are referenced. |
The Safe Harbor standard presupposes a starting point of real health information from which identifiers are removed. SynthABA's data never contained any of these identifiers because the data is generated, not de-identified. The Safe Harbor analysis is presented here not because SynthABA data requires de-identification, but to demonstrate that even if one were to apply the Safe Harbor framework as an analytical lens, no identifiers are present.
Additionally, SynthABA has no actual knowledge that the information generated could be used, alone or in combination with other information, to identify any individual, satisfying the residual-knowledge requirement of 45 CFR Section 164.514(b)(2)(ii).
3. Generation Methodology
SynthABA's data generation pipeline operates as follows:
1. Template Definition. Clinical templates define the structure of each document type (e.g., session notes, assessment reports, treatment plans). Templates are implemented as Pydantic schemas with explicit field constraints governing data types, value ranges, and enumerated options.
2. Patient Context Generation. A synthetic patient context is generated algorithmically for each record. This context includes randomized parameters such as age band (e.g., "3-5 years"), sex, diagnostic codes (e.g., F84.0), severity level, and insurer category (e.g., "Medicaid," "Commercial"). These parameters are drawn from clinically plausible distributions, not from any patient database.
3. AI Content Generation. An AI language model (Anthropic Claude Sonnet) generates clinical narrative content using the template structure and synthetic patient context as inputs. The model draws on general clinical knowledge acquired during its pretraining phase, not on any patient-specific data source.
4. No Real Patient Data at Any Stage. No real patient records are used as input, training data, fine-tuning data, few-shot examples, or reference material at any stage of the pipeline. No real patient database is accessed.
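The template-and-context steps above can be sketched as follows. This is an illustrative sketch only: it uses standard-library dataclasses in place of the Pydantic schemas described here, and every enumeration (age bands, severities, insurers) is a hypothetical stand-in, not SynthABA's actual constraint set.

```python
import random
from dataclasses import dataclass

# Hypothetical enumerations standing in for field-level schema constraints.
AGE_BANDS = ["3-5 years", "6-9 years", "10-13 years"]
SEXES = ["female", "male"]
DIAGNOSIS_CODES = ["F84.0"]
SEVERITIES = ["level 1", "level 2", "level 3"]
INSURERS = ["Medicaid", "Commercial"]


@dataclass(frozen=True)
class PatientContext:
    """Synthetic patient parameters; the schema has no identifier fields at all."""
    age_band: str
    sex: str
    diagnosis_code: str
    severity: str
    insurer_category: str

    def __post_init__(self):
        # Enforce enumerated-option constraints, analogous to schema validation.
        assert self.age_band in AGE_BANDS
        assert self.sex in SEXES
        assert self.diagnosis_code in DIAGNOSIS_CODES
        assert self.severity in SEVERITIES
        assert self.insurer_category in INSURERS


def generate_context(rng: random.Random) -> PatientContext:
    """Draw each parameter from its enumeration; no database is consulted."""
    return PatientContext(
        age_band=rng.choice(AGE_BANDS),
        sex=rng.choice(SEXES),
        diagnosis_code=rng.choice(DIAGNOSIS_CODES),
        severity=rng.choice(SEVERITIES),
        insurer_category=rng.choice(INSURERS),
    )
```

Because every field is constrained to an enumeration and no field can hold a name, date of birth, or identifier, the context object structurally cannot carry PHI.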
Distinction from other synthetic data providers. Some companies in the synthetic data space (e.g., Syntegra, MDClone) train generative models directly on real patient datasets and produce synthetic records that preserve the statistical properties of the source population. SynthABA does not follow this approach. SynthABA's AI model possesses general clinical knowledge from its broad pretraining corpus but has no access to any patient database, electronic health record system, or claims dataset. This is a material distinction: SynthABA's outputs cannot "leak" real patient information because no real patient information enters the system.
4. PHI Detection Pipeline (Defense in Depth)
Although no PHI should be present in SynthABA's output (because none is used as input), SynthABA implements a defense-in-depth PHI detection pipeline as an additional safeguard. Every generated record is scanned against 163+ regular expression patterns designed to detect potential PHI, including:
- Social Security Numbers (XXX-XX-XXXX and variant patterns)
- Phone numbers (10-digit, formatted, international variants)
- Email addresses (standard RFC 5322 patterns)
- Physical addresses (street number + street name patterns, ZIP codes)
- Medical Record Numbers (common MRN formats)
- Dates of birth (MM/DD/YYYY, YYYY-MM-DD, and natural language variants)
- Real names (matched against a database of common given names and surnames)
- Insurance policy numbers (common payer ID formats)
- IP addresses (IPv4 and IPv6 patterns)
- Financial account numbers (credit card, bank account patterns)
- Vehicle and device identifiers
- 150+ additional patterns covering edge cases and format variants
Zero-tolerance policy. Any record that produces even a single PHI pattern match is automatically rejected and excluded from the dataset. There is no manual override for PHI findings. Records are not remediated; they are discarded entirely and regenerated.
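The scan-and-reject logic can be sketched as below. The pattern set here is a small illustrative subset, not the 163+ production patterns, and the function names are hypothetical.

```python
import re

# Illustrative subset of PHI detection patterns (production uses 163+).
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date_of_birth": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_record(text: str) -> list[str]:
    """Return the name of every pattern that matched; an empty list means clean."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)]


def accept_record(text: str) -> bool:
    """Zero tolerance: a single pattern match rejects the record outright."""
    return not scan_record(text)
```

Rejected records would then be discarded and regenerated rather than remediated, matching the policy described above.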
5. VLayer Compliance Pipeline
Every batch of synthetic records passes through a four-step compliance pipeline before release:
1. Ingest. The batch enters a quarantine state. No data is accessible to downstream consumers until all compliance steps are completed.
2. PHI Scan. The 163+ pattern deep scan described in Section 4 is executed against every record in the batch. The pass criterion is zero findings across the entire batch.
3. Synthetic Validation. Records are evaluated for clinical plausibility, internal consistency, and adherence to template constraints. This step ensures that the data is useful for its intended purpose (ML training, benchmarking, product testing) while confirming it does not inadvertently replicate real clinical scenarios with sufficient specificity to identify an individual.
4. Release. A compliance passport is generated for the batch, including a SHA-256 cryptographic hash of the dataset contents, timestamps for each pipeline step, and pass/fail status for each check. No batch is released without passing every preceding check.
The VLayer pipeline produces an auditable compliance record for every dataset shipped to buyers.
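The compliance passport issued at release might be assembled along these lines. The passport field names and helper functions shown here are hypothetical illustrations of the described artifact (SHA-256 content hash, per-step status, timestamp), not VLayer's actual format.

```python
import hashlib
import json
from datetime import datetime, timezone


def batch_hash(records: list[str]) -> str:
    """SHA-256 over a canonical serialization of the batch contents."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def compliance_passport(records: list[str], step_results: dict[str, bool]) -> dict:
    """Assemble the auditable release artifact for a batch."""
    return {
        "sha256": batch_hash(records),          # ties the passport to exact contents
        "steps": step_results,                  # pass/fail for each pipeline step
        "passed": all(step_results.values()),   # release gate: every check must pass
        "issued_at": datetime.now(timezone.utc).isoformat(),
    }
```

Because the hash is computed over the dataset contents, a buyer can recompute it on receipt and verify that the delivered batch is byte-for-byte the one that passed the pipeline.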
6. Clinical Review
In addition to automated compliance checks, licensed clinicians review samples from every batch. Reviewers include Board Certified Behavior Analysts (BCBAs), Speech-Language Pathologists (SLPs), Occupational Therapists (OTs), and licensed psychologists, selected based on the clinical domain of the generated records.
Reviewers evaluate samples against a standardized 10-item quality checklist. One checklist item is specifically: "Zero identifiable information (no PHI)." This item requires the reviewer to confirm that the sample contains no information that could reasonably be used to identify a real individual.
Clinical review provides a human-in-the-loop verification layer that complements the automated PHI detection pipeline, addressing edge cases that pattern matching alone may not capture (e.g., highly specific clinical narratives that might inadvertently describe a recognizable case).
7. Legal Position
Based on the foregoing analysis, SynthABA states the following legal positions:
7.1. SynthABA synthetic datasets are not Protected Health Information as defined by 45 CFR Section 160.103. PHI is defined as individually identifiable health information that is created or received by a covered entity or business associate and relates to the past, present, or future physical or mental health condition of an individual, the provision of health care to an individual, or payment for health care. SynthABA data does not relate to any individual because it is not created from, derived from, or linked to any real person's health information.
7.2. SynthABA synthetic datasets do not require a Business Associate Agreement (BAA) between SynthABA and its buyers. Because the data does not constitute PHI, the HIPAA business associate provisions (45 CFR Section 160.103, definition of "business associate") are not triggered by the receipt, storage, or processing of SynthABA data.
7.3. SynthABA synthetic datasets do not require de-identification certification under HIPAA. De-identification under 45 CFR Section 164.514 is a process applied to PHI to render it no longer individually identifiable. SynthABA data was never individually identifiable health information and therefore does not require de-identification.
7.4. Organizations that acquire SynthABA synthetic datasets may use them for machine learning training, benchmarking, software testing, and product development without incurring HIPAA compliance obligations specific to the data itself. Standard organizational HIPAA obligations (e.g., those arising from the organization's own handling of real patient data) are unaffected and remain the organization's responsibility.
Note: This position paper represents SynthABA's analysis and is provided for informational purposes. Organizations should consult their own legal counsel regarding their specific HIPAA compliance obligations.
8. Attestation
SynthABA hereby attests that its synthetic clinical datasets:
- Are generated entirely from AI models and clinical templates — no real patient records are used as input, training data, or reference material at any stage of the generation pipeline;
- Do not contain, derive from, or reference any real patient's protected health information — all patient context parameters (age band, sex, diagnosis, severity, insurer category) are generated algorithmically from clinically plausible distributions;
- Are scanned with 163+ PHI detection patterns under a zero-tolerance policy — any record producing a pattern match is automatically rejected and regenerated; and
- Are reviewed by licensed clinicians for any inadvertent identifiable information — clinical reviewers confirm zero identifiable information as part of a standardized quality checklist applied to every batch.
References
- 45 CFR Section 164.514(b) — Standards for De-identification of Protected Health Information (Safe Harbor and Expert Determination methods). Available at: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-E/section-164.514
- 45 CFR Section 160.103 — Definitions, including Protected Health Information, Covered Entity, and Business Associate. Available at: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-160/subpart-A/section-160.103
- U.S. Department of Health and Human Services, Office for Civil Rights. "Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule." Originally published November 2012; updated September 2022. Available at: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
- National Institute of Standards and Technology (NIST). "NIST Special Publication 800-188: De-Identifying Government Datasets." December 2016 (2nd draft). Available at: https://csrc.nist.gov/publications/detail/sp/800-188/draft
This document is provided for informational and compliance review purposes. It does not constitute legal advice. Organizations evaluating SynthABA datasets for use in regulated environments should consult qualified legal counsel regarding their specific obligations under HIPAA and applicable state privacy laws.