OSINT tool that mines historical contacts (email, phone numbers) for a domain out of web.archive.org snapshots. Built for investigations where the current site no longer shows contact details (or shows different ones) but earlier versions are preserved in the archive.
Full docs: docs/USAGE.md · docs/ARCHITECTURE.md.
Quick start
pip install -e .
# Default scan: 5-minute timeout, contact-URL filter on.
kronikier theranos.com
# How did one specific page evolve across snapshots?
kronikier --single-url https://www.theranos.com/contact-us
# Long-running thorough scan, every URL the host has.
kronikier wirecard.com --exhaustive
# Batch a list of targets.
kronikier --targets-file targets.txt
The first ever invocation runs a one-time latency calibration (~3-5 s, cached for 14 days). Every run writes a CSV next to the current directory and prints a summary table to the terminal:
Contact First seen Last seen
───────────────────────── ────────── ──────────
email [email protected] 2014-02-08 2015-02-24
phone +1 855-843-7200 2014-09-02 2015-11-18
CSV saved: ./theranos.com_20260529_200437.csv
[*] Cache: 3 hit / 0 miss (saved 3 fetches on web.archive.org)
That's it for getting started. See docs/USAGE.md for the full CLI reference, batch-input format, snapshot-cache controls, and troubleshooting.
What you get
Every run produces:
- A monospace summary table on the terminal — first/last seen timestamps per contact, ordered chronologically.
- A CSV file with nine columns:
kind, value, value_human, value_raw, first_seen, last_seen, sightings_count, first_archive_url, last_archive_url. UTF-8 with BOM so Excel opens non-ASCII content cleanly. - A live contact feed in the terminal (
+ email …,+ phone …) as results stream in.-vadds the snapshot date + URL beneath each one.
Phone numbers are reconstructed, not copied. The
valuecolumn is a normalised E.164 form produced by libphonenumber from whatever digits and punctuation appeared on the page — there's a guess about country code, trunk prefix, and grouping involved. Most of the time it's right, but on weird formats (truncated numbers, ambiguous regions, OCR-style typos) it can land on a plausible-but-wrong number. If a phone in thevaluecolumn looks off, always cross-checkvalue_rawto see exactly how it appeared on the page — that's the ground truth.
--json switches the table to a machine-readable JSON object on stdout;
CSV is still written and the path is reported on stderr.
Empty results print No contacts found. and skip the CSV entirely.
When this is the right tool
The Wayback Machine is often the only source of authoritative owner information when a site:
- is dead;
- was re-registered and the contacts scrubbed;
- changed editorial team and removed the previous people in charge;
- surfaced in a dubious story and the phone numbers vanished overnight.
For everything else (a live site whose contacts page works), just open the live site.
Snapshot cache
Wayback snapshots are immutable, so rerunning the same scan can answer
without spending more IA bytes. The CLI keeps an on-disk cache, on by
default, at ~/.cache/kronikier/snapshots/. One HTML file per
snapshot, browsable on disk, grouped by host.
Disable with --no-cache. Wipe with --clear-cache.
Full layout, env-var override, and best-effort semantics are in docs/USAGE.md.
Rate limits and ethics
- Default
--rate 4(4 req/sec to wayback) is polite. Don't raise it without a reason — the archive serves everyone. - All data extracted is already public. The tool doesn't bypass paywalls, crawler blocks, or any access control.
- Contacts in the archive may belong to people no longer associated
with the domain. The timeline columns (
first_seen/last_seen) are there so you can judge what was current when. - Use only within the scope of legal investigations.
OSINT precedents
The "recover contacts from archives" technique is documented across many published investigations. A few:
- John Carreyrou / WSJ — Theranos. Material from archived
theranos.comused as evidence after the company shut down. - Brian Krebs (KrebsOnSecurity). Multiple investigations of malware and fraud site operators identified via archived contact emails on early versions of their landing pages.
- OCCRP shell-company investigations. "Contact" pages of shell sites routinely name the human beneficiary before being scrubbed.
- Bellingcat. Applied the same technique across multiple investigations.
The end-to-end test suite (pytest -m e2e) hits the live wayback for
theranos.com and enron.com and asserts that the canonical contacts
of both still surface — the OSINT use case is the regression test.

Comments