JSON Schema Tutorial: How to Validate JSON Data in Production
A payment webhook came in with status set to an array instead of a string.
Our deserializer didn't crash. Our database accepted it. Our reconciliation job processed 14 hours of transactions before someone noticed half of them were silently miscategorized.
That bug shipped because we had no schema validation at the boundary. The webhook came from a third party we trusted. Trust is not a validation strategy.
I'm a senior engineer with 5+ years of API and integration work: a healthcare platform with HL7 integrations, an eCommerce stack handling Stripe, PayPal, and Google Pay webhooks, and a few enterprise systems that ran on data feeds from vendors who treated their JSON contracts as "guidelines."
JSON Schema is the single most useful tool I've found for keeping bad data out of a system. This is what I've actually learned using it in production: the patterns that work, the keywords that bite, and the parts of the spec that everyone skips and later regrets.
If you've moved past "validate types in the controller" and you want a real contract layer, this is for you.
Why I Use JSON Schema
It is not about types. TypeScript and Pydantic give you types. JSON Schema gives you something different: a runtime contract that is also a portable artifact.
I reach for it in five places:
- API request and response validation. The schema lives next to the endpoint and rejects bad data before any handler sees it.
- Webhook payload validation. Third parties change shapes without warning. A schema catches it the moment they do.
- Configuration files. Catches "you typo'd a field name in production.yaml" at startup, not at 3 a.m.
- Test fixtures. Generate fixtures from a schema instead of hand-rolling JSON that drifts from reality.
- LLM output validation. When you ask Claude or GPT for structured output, schema validation is the guardrail that turns "usually a JSON object" into "always a JSON object or fail loudly."
The artifact part matters. A schema is just JSON. You can store it in a repo, version it, share it across services, generate documentation from it, and use the same schema in TypeScript, Python, Go, and Rust. Try doing that with a Pydantic model.
If you need a sandbox to format and inspect a sample payload while you work on the schema, the StackConvert JSON editor and JSON diff tool are what I keep open in the next tab.
Anatomy of a Schema
The smallest useful production schema looks like this:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://stackconvert.com/schemas/order.json",
  "type": "object",
  "required": ["id", "total", "currency"],
  "additionalProperties": false,
  "properties": {
    "id": { "type": "string", "format": "uuid" },
    "total": { "type": "number", "minimum": 0, "maximum": 1000000 },
    "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
    "createdAt": { "type": "string", "format": "date-time" }
  }
}
Six lines that almost nobody writes correctly the first time:
- $schema: declares which draft you are using. Always specify it. Validators behave differently across drafts, and silent compatibility is worse than loud failure.
- $id: a stable identifier so other schemas can reference this one. Treat it like a contract URL.
- type: the root type. Most production schemas root at object.
- required: property names that must be present. Order does not matter.
- additionalProperties: false: the most important line in the schema.
- properties: per-key validation rules.
The defaults catch developers off guard. If you write { "type": "object" } with no additionalProperties, every JSON object on earth validates. That is rarely what you want, and it is the single biggest reason teams say "we have JSON Schema" but still ship bugs caused by unexpected fields.
Set additionalProperties: false by default. Loosen it only when you have a written reason.
allOf, anyOf, oneOf: The Composition Trap
This is where I see the most bugs in code review.
- allOf means "must match all of these." It is intersection. Use it for inheritance-like patterns, where a "premium order" is an order plus extra fields.
- anyOf means "must match at least one." A validator is free to stop at the first match, so it will not tell you when more than one alternative fits.
- oneOf means "must match exactly one." If your input matches two of the alternatives, validation fails. That is the entire point, and it surprises people every time.
A real story. On the healthcare integration, we accepted patient identifiers in three formats: a 10-digit medical record number, a UUID, or a federated FHIR identifier. The original schema used anyOf. It worked, until a bug in the upstream system started emitting strings that matched two formats at once.
We did not notice for a week. Both alternatives accepted the input, and the downstream code assumed a single canonical format. Records ended up linked to the wrong patients in a non-trivial number of cases.
Switching to oneOf exposed the upstream bug immediately. The schema started rejecting ambiguous IDs at the boundary, which is exactly what we wanted.
The rule I follow now:
- Use anyOf when the alternatives are genuinely disjoint and overlap is impossible.
- Use oneOf when overlap means a bug. It is a stricter and more honest contract.
- Use allOf for composition: base shape plus extensions.
If you cannot prove the alternatives are disjoint, default to oneOf.
Format Keywords (and Why They Don't Always Run)
format is JSON Schema's escape hatch for things type cannot express. It covers date-time, email, uri, uuid, ipv4, ipv6, and a handful of others.
The footgun: in many validators, format is informational by default. It does not actually validate.
In Ajv you have to install ajv-formats and explicitly enable it. In Python's jsonschema, you have to pass format_checker=Draft202012Validator.FORMAT_CHECKER. If you do not, "format": "email" is a comment, not a check.
This bit me on a webhook integration. We had format: "email" on a customer email field and assumed it would reject not-an-email. It did not. The malformed string flowed straight through to a marketing platform, which choked, which paged the team.
Explicitly enable format validation, or write a custom keyword. Never assume the validator is checking what you wrote.
Validation in Node with Ajv
Ajv is the standard in the Node ecosystem. It compiles schemas into specialized validation functions, which is the single biggest reason it is fast.
// The default export targets draft-07; for 2020-12 schemas, import Ajv2020 from 'ajv/dist/2020'
import Ajv from 'ajv'
import addFormats from 'ajv-formats'
// allErrors: report every violation; strict: reject unknown keywords and formats
const ajv = new Ajv({ allErrors: true, strict: true })
// Registers the standard formats (email, uuid, date-time, ...)
addFormats(ajv)
const schema = {
  type: 'object',
  required: ['email', 'age'],
  additionalProperties: false,
  properties: {
    email: { type: 'string', format: 'email' },
    age: { type: 'integer', minimum: 0, maximum: 150 }
  }
}
// Compile once at module load and reuse the compiled function for every request
const validate = ajv.compile(schema)
export function handleSignup(payload) {
  if (!validate(payload)) {
    return { ok: false, errors: validate.errors }
  }
  return { ok: true, data: payload }
}
Two things to internalize:
- Compile once. ajv.compile is the slow part. Do it at module load, not per request. I have seen production services cut CPU usage by 40% after moving compile calls out of the request handler.
- allErrors: true returns every violation, not just the first. You almost always want this in API responses, otherwise clients fix one error at a time and curse you.
Default Ajv error messages look like must have required property 'email'. Fine for logs, useless in API responses. Install ajv-errors or write a small translation layer that maps error.instancePath and error.keyword into something a client can read.
Validation in Python
The de facto standard is jsonschema, a third-party package (pip install jsonschema), not part of Python's standard library. The pattern is the same: build a validator once, validate many.
from jsonschema import Draft202012Validator
schema = {
    "type": "object",
    "required": ["email", "age"],
    "additionalProperties": False,
    "properties": {
        "email": {"type": "string", "format": "email"},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
    },
}
# Build the validator once at import time and reuse it for every call.
validator = Draft202012Validator(
    schema,
    format_checker=Draft202012Validator.FORMAT_CHECKER,
)
def validate_signup(payload):
    # iter_errors yields every violation instead of stopping at the first one.
    errors = list(validator.iter_errors(payload))
    if errors:
        return {"ok": False, "errors": [e.message for e in errors]}
    return {"ok": True, "data": payload}
iter_errors returns all violations. Use it. The default validate() raises a single ValidationError, which is the wrong shape for API responses.
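One caveat: jsonschema is slower than Ajv. If you are validating a high-volume stream, consider swapping to fastjsonschema, which compiles the schema into Python code when you call fastjsonschema.compile(), so make that call once at import time. I have seen 10x throughput gains on tight ingestion paths. Check draft support before you switch: as far as I know, fastjsonschema targets drafts 04, 06, and 07, not 2020-12.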
Schema-Driven API Contract Testing
This is where JSON Schema actually earns its keep. Treat the schema as the contract, and use it in three places:
- Server-side request validation. Reject bad input at the boundary.
- Server-side response validation in tests. Make sure your handlers actually return the shape they promise. Schema-validate every response in your integration tests.
- Client-side test fixtures. Generate fake data that matches the schema so tests do not drift from production reality.
Real example from the eCommerce platform. Every payment provider sent webhooks with a different shape: Stripe nested everything under data.object, PayPal had a top-level resource, Google Pay used its own envelope. We wrote three schemas, one per provider, and validated every incoming webhook against the matching schema before a single line of business logic ran.
When Stripe added a new field, our schema rejected it until we updated the schema explicitly. That is not a bug, that is the system working. We got an alert, looked at the new field, decided whether to accept it, and updated the schema as a versioned commit. Opt-in evolution instead of accidental coupling.
If you maintain side-by-side YAML and JSON configs, you can validate both: convert with the YAML to JSON tool, then run the same schema against the JSON output. One contract, two formats.
Pitfalls That Have Bitten Me
1. additionalProperties defaults to true
I have said this twice already because it matters. The default accepts unknown fields silently. Set additionalProperties: false everywhere unless you specifically want pass-through.
2. Compiling schemas inside handlers
Compile at startup. Cache the validator. If your hot path calls ajv.compile() or builds a Python Validator per request, you are paying the slow path on every request for no reason.
3. Default error messages are not user-facing
must have required property 'foo' is fine for logs, terrible for API responses. Either install a translator (ajv-errors in Node, jsonschema-errors in Python) or write your own mapping from (keyword, path) to a readable sentence.
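If you write your own mapping, it can be very small. The sketch below is plain Python with no validator dependency; the function name, templates, and wording are all hypothetical. With jsonschema you would feed it error.validator, error.absolute_path, and error.validator_value.

```python
# Hypothetical sentence templates keyed by schema keyword.
TEMPLATES = {
    "type": "{field} has the wrong type; expected {limit}.",
    "minimum": "{field} must be at least {limit}.",
    "maximum": "{field} must be at most {limit}.",
    "enum": "{field} must be one of: {limit}.",
    "format": "{field} is not a valid {limit}.",
    "required": "{field} is missing a required field.",
    "additionalProperties": "{field} contains a field that is not allowed.",
}


def _show(value):
    """Render a schema value for humans; lists become comma-separated text."""
    if isinstance(value, (list, tuple)):
        return ", ".join(str(item) for item in value)
    return str(value)


def to_client_error(keyword, path, limit=None):
    """Map a (keyword, path) pair, plus the keyword's schema value, to a sentence."""
    field = ".".join(str(part) for part in path) or "request body"
    template = TEMPLATES.get(keyword, "{field} is invalid.")
    return template.format(field=field, limit=_show(limit))


print(to_client_error("minimum", ["age"], 0))
# prints: age must be at least 0.
print(to_client_error("enum", ["order", "currency"], ["USD", "EUR"]))
# prints: order.currency must be one of: USD, EUR.
```

Unknown keywords fall back to a generic sentence, so a new schema rule never leaks a raw validator message to clients.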
4. $ref resolution silently fails
When schemas reference each other, the validator needs to resolve them. Cross-file refs work, but you have to load every dependent schema into the same validator instance with addSchema (Ajv) or a RefResolver (jsonschema; deprecated since 4.18 in favor of a referencing.Registry passed as registry=). Forgetting this gives you "can't resolve reference" errors at runtime, often only on certain code paths that nobody tests.
5. Strict mode catches real bugs
Ajv's strict mode flags unknown keywords, ambiguous patterns, and useless schema constructs. Turn it on. Every complaint it surfaces is a real bug or a real misunderstanding of the spec.
6. Don't validate twice
If you validate at the API boundary, do not re-validate in your service layer. Two validators is two sources of truth for the contract, and they always drift. Validate at the boundary, trust the data inside.
Key Takeaways
- Schema validation is your boundary defense. Catch bad data at the door.
- additionalProperties: false is the most important line. Set it explicitly.
- Use oneOf when overlap means a bug. anyOf only when alternatives are provably disjoint.
- Format validation is opt-in. Enable it explicitly.
- Compile schemas once. Validate many.
- Treat schemas as contracts. Server validates input, tests validate output, fixtures match the schema.
Closing Thoughts
JSON Schema is older than most of the code I work with, and it shows. The syntax is verbose. The default error messages need work. The draft inconsistencies still trip me up sometimes.
But it is the only validation tool I trust at scale. It is portable, language-agnostic, and turns "we will validate this somewhere" into "the contract is checked into git."
Most of my JSON Schema scars came from skipping the unglamorous parts: setting additionalProperties: false, enabling format validation, picking oneOf over anyOf when overlap matters. None of those are exciting. All of them prevent bugs that cost real money.
If you are already using JSON Schema, the upgrade path is usually about turning on stricter checks, not writing more schemas. If you are not using it yet, start at one boundary (a webhook, a public API, a config file) and grow from there.
The blast radius of bad input is always larger than you think.
What is the worst piece of "valid" JSON that ever made it into your production database?