Human History According to AI
A machine-generated chronicle of 5,226 years of human civilization
A Python daemon and Claude API pipeline generating a structured JSON corpus of all 5,226 years of recorded human history — with sources, confidence levels, and documented gaps.
“History is written by the victors — but a structured knowledge corpus must account for everyone else.”
The Challenge
Structured, machine-readable historical knowledge at scale does not exist. Encyclopedic resources are written for human readers and resist programmatic querying. No standardized format captures the confidence levels, causal relationships, and documented gaps that rigorous historical reasoning requires. Building a 5,226-year corpus manually would take decades; doing it carelessly with AI would produce confident-sounding noise.
The Approach
Designed the ICCRA schema — a JSON format requiring source citations, confidence levels (confirmed, probable, approximate, traditional, legendary), causal relationships, and explicit geographic gap declarations for every year. Built a Python async daemon that calls Claude Sonnet 4.6 via the Anthropic API, validates output against the schema, tracks progress in an append-only ledger, and recovers from failures without data loss. Achieved a 99% cost reduction through direct API integration with batch processing, making a 5,226-year generation run economically viable.
What Is This?
Human History According to AI is an autonomous research daemon that generates a structured, machine-readable chronicle of human civilization — year by year, from 2025 CE back to approximately 3200 BCE. Each of the 5,226 years receives its own JSON file, populated by Claude Sonnet 4.6 through the Anthropic API.
This is not a narrative history. It is a knowledge corpus designed for graph databases, timelines, adversarial review, and further AI reasoning. Every event cites its sources, declares its confidence level, and surfaces disconfirming evidence where it exists.
Why Build This?
Historical knowledge is abundant but unstructured. Existing resources — encyclopedias, academic papers, Wikipedia — are written for human readers: narrative, discursive, and difficult to query at scale. This project asks a different question: what does a structured, machine-readable substrate of human history look like?
The answer is the ICCRA schema — a JSON format that captures events, causal relationships, geographic coverage, confidence levels, and explicit declarations of what we don't know. The geographic gaps field is not an afterthought: it is a deliberate acknowledgment that the documentary record is not evenly distributed across the world's populations.
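A single year file under this schema might look like the following sketch. The field names here are illustrative assumptions for the sake of example, not the published ICCRA specification:

```python
import json

# Hypothetical ICCRA-style year record; field names are illustrative,
# not the authoritative schema.
year_record = {
    "year": -3200,
    "events": [
        {
            "summary": "Early cuneiform record-keeping develops in Sumer",
            "confidence": "approximate",   # one of the five declared tiers
            "sources": ["archaeological: Uruk IV tablets"],
            "causes": [],                  # causal links to other events
        }
    ],
    # Required field: where the documentary record is thin or absent.
    "geographic_gaps": [
        "Sub-Saharan Africa outside the Nile valley",
        "Pre-contact Americas",
    ],
}

# Each year serializes to its own JSON file.
serialized = json.dumps(year_record, indent=2)
```

Because every record carries `confidence`, `sources`, and `geographic_gaps`, downstream consumers can validate presence of these fields mechanically before ingestion.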
Architecture
The system is a Python async daemon that orchestrates API calls, tracks progress in an append-only ledger file, validates output against the ICCRA schema, and recovers gracefully from failures. A Next.js 16 frontend provides an interactive timeline visualization of the generated data.
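The append-only ledger pattern described above can be sketched as follows. File name and record shape are assumptions for illustration; the real daemon's internals are not shown here:

```python
import json
import os

LEDGER_PATH = "progress.ledger.jsonl"  # hypothetical file name

def completed_years(path=LEDGER_PATH):
    """Read the append-only ledger and return the set of finished years."""
    done = set()
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                entry = json.loads(line)
                if entry.get("status") == "ok":
                    done.add(entry["year"])
    return done

def record_completion(year, path=LEDGER_PATH):
    """Append one record per finished year; never rewrite earlier lines."""
    with open(path, "a") as fh:
        fh.write(json.dumps({"year": year, "status": "ok"}) + "\n")

# Resumable loop: reverse chronological, skipping anything already done.
done = completed_years()
for year in range(2025, 2020, -1):  # truncated range for illustration
    if year in done:
        continue
    # ... call the model, validate against the schema, write the JSON file ...
    record_completion(year)
```

Because the ledger is append-only, a crashed or interrupted run resumes by replaying the file; no in-place state mutation means no partially written progress to repair.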
The project achieved a 99% cost reduction by migrating from a third-party orchestration layer to direct Anthropic API calls with batch processing — a change that made sustained, long-running generation economically viable.
Confidence Levels
Every historical claim in the corpus is tagged with one of five confidence levels:
- Confirmed: Primary sources, physical evidence, multiple independent attestations.
- Probable: Strong circumstantial or secondary evidence.
- Approximate: General scholarly consensus, imprecise dating.
- Traditional: Preserved in cultural memory but not independently verified.
- Legendary: Mythological or folkloric — included for completeness, clearly flagged.
This tiered system means the corpus can be queried by epistemic quality, not just by date or region.
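Querying by epistemic quality then reduces to an ordered filter over the five tiers. This sketch assumes event records carry a `confidence` field as described above:

```python
# Hypothetical event records tagged with the five confidence tiers.
events = [
    {"summary": "Moon landing", "year": 1969, "confidence": "confirmed"},
    {"summary": "Founding of Rome by Romulus", "year": -753,
     "confidence": "legendary"},
    {"summary": "Reign of an early dynastic king", "year": -2900,
     "confidence": "traditional"},
]

# Rank the tiers so queries can express "at least this reliable".
TIER_RANK = {"confirmed": 4, "probable": 3, "approximate": 2,
             "traditional": 1, "legendary": 0}

def at_least(events, tier):
    """Return events whose confidence meets or exceeds the given tier."""
    floor = TIER_RANK[tier]
    return [e for e in events if TIER_RANK[e["confidence"]] >= floor]

print([e["summary"] for e in at_least(events, "probable")])
# A "probable"-or-better query keeps only the confirmed moon landing here.
```

The same ranking composes with date or region filters, so a consumer can ask, for example, for only confirmed-or-probable events in a given millennium.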
Geographic Gaps
One of the schema's most important fields is the explicit declaration of geographic coverage gaps. For any given year, the daemon is required to state which regions and populations are under-documented — not because the history didn't happen, but because the surviving record doesn't capture it. This is an acknowledgment built into the structure of the data rather than hidden in caveats.
Progress
As of April 2026, the daemon has completed 1,160 of 5,226 years — roughly 22% of the full corpus. Years are processed in reverse chronological order, so recent history (where sources are most abundant and verifiable) came first. The generated data is available on GitHub and structured for direct ingestion into graph databases or timeline tools.
The project is open-source. Contributions, schema critiques, and adversarial review of generated content are welcome.
Lessons Learned
- AI-generated historical content requires explicit confidence signaling built into the data schema. Without declared uncertainty levels, outputs read as authoritative when they should not be.
- Geographic coverage gaps are a feature, not a bug. Making them a required schema field forces honest accounting of what the documentary record does and does not capture.
- An append-only ledger is the right pattern for long-running AI generation jobs — it makes the process resumable, auditable, and cost-predictable without complex state management.
- A 99% cost reduction from direct API access versus third-party orchestration layers changes what is economically viable in AI research projects. Infrastructure choices have research-scope consequences.
- Reverse chronological processing is the right order — start with the years where you can verify quality before committing to the ancient record where verification is harder.