E.S.T.E.R.O.I.D.E.S. (acronym)
- E — Entities (identification of entities, aliases, emails, IPs).
- S — Signals (capture of digital footprints and metadata).
- T — Targeted (focused on specific objectives).
- E — Extraction (automated extraction from web sources).
- R — Reconnaissance (recon and footprinting).
- O — Open-source (the nature of the OSINT engine).
- I — Intelligence (processing and correlation of data).
- D — Data (massive ingestion of unstructured records).
- E — Engine (the central engine that orchestrates queries).
- S — Scraper (automated and persistent collection).
Open-source intelligence (OSINT) aggregator and correlation engine
inspired by Palantir, Bellingcat, Maltego, and Citizen Lab workflows.
A pure open-source re-imagining of the original fucklantir /
osint_palantir toolchain, with a much bigger source catalogue, a
proper knowledge graph, structured parsers, and a multi-backend LLM
analyst.
No payloads. No active scanning. Just 99+ free public OSINT sources, fanned out in parallel, fused into a single intelligence picture.
+--------------------------+
query "example.com" | Estorides Orchestrator | -> STIX 2.1 bundle
---------------------> | - async fanout | -> MISP event JSON
| - 99+ free sources | -> GraphML for Gephi
| - structured parsers | -> JSONL for training
| - entity resolution |
| - knowledge graph |
| - ontology engine | <- OFAC SDN cross-check
| - MITRE ATT&CK mapper | <- technique auto-tagging
| - SSRF guard | <- blocklist at egress
| - audit log + RL | <- per-IP trail
| - multi-LLM analyst | <- BLUF / tactical / system
+--------------------------+
|
v
Web UI: map / graph / timeline / results
Architecture highlights (state-level)
Estorides is structured around small, single-responsibility registries so adding a new source, backend, inferer, or feed never requires touching the central orchestrator. The five plug-in surfaces are:
| Surface | Decorator | File | Used for |
|---|---|---|---|
| Source parsers | @register_parser("name") |
estorides_core/parsers.py |
Translate raw HTTP into structured dicts |
| LLM backends | @register("name") |
estorides_llm/manager.py |
Add an LLM provider (ollama, openai, …) |
| Relationship inferers | @register_inferer("source") |
estorides_core/relationship_inference.py |
Source -> graph edges |
| Real-time feeds | subclass Feed |
estorides_core/feeds.py |
Map layers (quakes, fires, news) |
| Encrypted exporters | estorides_export.encryption |
estorides_export/encryption.py |
STIX/MISP + age encryption |
What you get that the original does not
| Capability | Original | Estorides |
|---|---|---|
| Number of free OSINT sources | ~20 | 99 |
| Intelligence categories | 6 | 12 |
| HTTP fanout model | sequential | async |
| Retries + backoff + circuit breaker | basic | yes |
| Response cache (SQLite) | none | yes |
| Per-source parsers | none | 50+ |
| Entity extraction (IP, domain, CVE…) | regex only | structured + dedup |
| Knowledge graph | none | NetworkX + GraphML |
| STIX 2.1 / MISP export | none | yes |
| Multi-LLM (Ollama / OpenAI / Anthropic) | Ollama only | 4 backends + stub |
| Map (geolocation results) | PyVista 3D | Leaflet 2D |
| Force-directed graph view | none | D3.js |
| API key handling | none | per-source env vars |
| Paid source support | none | flag-based opt-in |
| OFAC SDN sanctions cross-check | none | ontology engine |
| MITRE ATT&CK technique auto-tagging | none | ~40 techniques |
| SSRF / private-NW egress guard | none | RFC1918 + cloud IMDS blocked |
| Audit log (per request, append-only) | none | JSONL with IP+query+latency |
| Per-IP rate limit (sliding window) | none | default 30/min, env-tunable |
| Encrypted export (age) | none | opt-in via ?key=age1… |
| Real-time feed layers | none | earthquakes + fires + news |
| Encrypted export (age) | none | opt-in via ?key=age1… |
What v1.1 adds on top
| Capability | v1.0 | v1.1 (this) |
|---|---|---|
| Persistent graph (Cypher queries) | NetworkX dump | Kùzu embedded DB, cross-run joins |
| Run persistence | JSONL append | SQLite cases with FK observations/entities |
| Cross-feed entity resolver | none | Wikidata + OFAC + IP-API + NVD via intel_resolver |
| Fuzzy entity clustering | exact dedup | difflib SequenceMatcher, 0.85 threshold, aliases surfaced |
| Extra OSINT endpoints (keyless) | 99 YAML sources | +7 (BGP, MAC, phone, GitHub, leaks, CISA KEV, malware C2) |
| Read-only Cypher endpoint | none | /api/intel/graph?q=... with write-keyword guard |
| Case history UI | none | Cases tab + full-entity inspector |
v1.1 architecture
+--------------------------+
query "example.com" | Estorides Orchestrator | -> STIX 2.1 / MISP / GraphML / JSON
---------------------> | + async fanout |
| + 99 free sources |
| + 7 Osiris-style probes |
| + SSRF guard + audit |
| + ontology engine |
| + MITRE ATT&CK mapper |
| + multi-LLM analyst |
| + cross-feed resolver | <- Wikidata SPARQL + OFAC + IP-API + NVD
| + fuzzy entity cluster | <- difflib SequenceMatcher
+-----------+--------------+
|
+----------------+-----------------+--------------------+
v v v
+------------------+ +------------------+ +------------------+
| Kùzu graph DB | | SQLite case store| | In-memory NX |
| (Cypher queries) | | (FK observations)| | (per-run working)|
| 99 node labels | | search by entity | | per-run edges |
| 9 REL types | | search by query | | |
+------------------+ +------------------+ +------------------+
^ ^
+---------/api/intel/resolve-------+
+---------/api/cases/...-----------+
v1.1 API additions
| Endpoint | Purpose |
|---|---|
GET /api/cases?q=<substr>&type=<qtype> |
List past runs. Searchable by query substring. |
GET /api/cases/<id>?full=1 |
Replay a case. full=1 includes observations + entities. |
DELETE /api/cases/<id> |
Drop a case. |
GET /api/intel/resolve?type=<t>&id=<v> |
Cross-feed resolution. type is one of ip, domain, company, person, country, cve, btc_address, eth_address. |
GET /api/intel/graph?q=<cypher> |
Read-only Cypher against the Kùzu graph. Mutations (CREATE/MERGE/SET/DELETE) are rejected. |
GET /api/intel/stats |
One-glance dashboard: case count, Kùzu node/edge counts, resolver cache size. |
GET /api/osiris/bgp?query=<ip|ASxxxxx> |
BGP / ASN lookup via bgpview.io. |
GET /api/osiris/mac?mac=00:1A:... |
MAC OUI vendor via macvendors.co. |
GET /api/osiris/phone?number=+14155552671 |
Phone geolocation (NANP area code → lat/lng). |
GET /api/osiris/github?user=torvalds |
GitHub user + 5 most recent repos. |
GET /api/osiris/leaks?email=... |
XposedOrNot breach analytics (more detail than HIBP). |
GET /api/osiris/cisa-kev?limit=10&days=30 |
CISA Known Exploited Vulnerabilities, recent window. |
GET /api/osiris/malware?limit=200 |
Feodo Tracker + URLhaus active C2, geolocated. |
v1.1 install
pip install -r requirements.txt
The only new required dep is kuzu>=0.11. The orchestrator falls
back to in-memory NetworkX if Kùzu is not importable, but a persistent
cross-run graph only happens with Kùzu present.
Quickstart
1. Install (no extra packages needed; the project uses Flask + NetworkX + requests)
cd estorides
python3 -m pip install flask networkx requests pyyaml
Optional, for a real LLM:
# pick one
ollama serve && ollama pull llama3.1:8b
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export OPENROUTER_API_KEY=sk-or-...
2. CLI
# 99 sources, 12 categories
python3 estorides_cli.py status
# run a query (free sources only)
python3 estorides_cli.py run 8.8.8.8
# enable sources that need an API key
python3 estorides_cli.py run [email protected] --include-paid
# only a subset of sources
python3 estorides_cli.py run example.com \
--only-sources crt_sh_certificates,shodan_internetdb,ipapi_free
# export the latest run as STIX 2.1 or MISP
python3 estorides_cli.py stix --out my_bundle.json
python3 estorides_cli.py misp --out my_event.json
3. Web UI
python3 estorides_cli.py serve --port 5050
# open http://127.0.0.1:5050
UI features:
- 2D map (Leaflet) of every geolocated result
- D3.js force-directed knowledge graph (drag, zoom, hover)
- Timeline of source acquisition
- Source results panel with per-source parsed output
- Filterable entity list
- LLM analysis with backend / model badge
- One-click export: STIX 2.1, MISP, GraphML, JSON
Graph intelligence (Maltego-style)
The Graph canvas turns observations into an interactive intelligence workbench:
- Clusters — nodes are grouped into communities (translucent hulls) and coloured by cluster. Inter-cluster links are dashed/highlighted; click one to see a cross-reference tooltip explaining how two clusters relate (the bridge entities + relation).
- Click to enrich — left-click a node to resolve it (cross-feed + VirusTotal relationships) and merge the new nodes/links into both the graph and the map. Each new node is itself clickable, so exploration is recursive.
- Intelligence tiers — every node carries an auto-computed level shown
as a coloured ring:
data→information(≥2 corroborating sources) →intelligence(cross-cluster / resolved) →counter-intelligence(sanction / threat / VirusTotal-malicious). Override any node's level from the right-click menu or the inspector (persists in the browser). - Transforms — right-click a node (or use the side inspector panel) for transforms grouped by tier: data → information → intelligence → counter-intelligence.
VirusTotal
VirusTotal is integrated both as a source (vt_ip, vt_domain,
vt_file) and as the relationship engine behind graph expansion
(resolved domains/IPs, communicating/dropped files, contacted infra).
It needs a free API key; without it VirusTotal stays inactive and the
rest of the platform is unaffected:
export VT_API_KEY=... # https://www.virustotal.com/gui/my-apikey
99 sources, 12 categories
- DNS Intelligence (9) - Google DoH, Cloudflare DoH, HackerTarget, crt.sh, Cert Spotter, RDAP, DNS Dumpster, host search
- IP & Infrastructure (13) - ip-api, ipinfo, ipapi.co, ipwho.is, Shodan InternetDB, GreyNoise, ipwhois, Robtex, RDAP, AS lookup, AbuseIPDB, MAC OUI, RIPE Stat
- Web Intelligence (10) - urlscan, Wayback CDX, Wayback availability, HTTP headers, whois, geoip, traceroute, nping, Microlink, Google cache
- Social Media (13) - GitHub, Reddit, Mastodon, Keybase, HackerNews, Telegram, Pinterest, WordPress, Medium, DEV.to
- Threat Intelligence (13) - ThreatFox, URLhaus, payloads, PhishTank, OpenPhish, OTX (+passive domain/IP, no key), MalwareBazaar, Feodo, SSLBL, Emerging Threats, blocklist.de
- Breach Intelligence (6) - HIBP breaches, HIBP pastes, Phonebook email, Phonebook domain, DeHashed, IntelligenceX
- Geolocation (5) - Nominatim search + reverse, OpenWeather geocoding, TimeZoneDB, Wikidata
- Knowledge (12) - Wikipedia, summary, DuckDuckGo IA, OpenAlex, Crossref, arXiv, GitHub advisories, NVD CVE, cve.circl, ExploitDB, Reddit subreddit search
- Wireless (5) - WiGLE, IEEE OUI, OpenSky, MarineTraffic, N2YO
- Blockchain (5) - blockchain.info (balance + tx), Blockstream, Ethplorer, mempool.space
- Paste & Leaks (4) - psbdmp, GitHub gist search, TGStat, LeakCheck
- Visual (4) - ScreenshotMachine, Microlink, TinEye, EXIF
Sources are addons: one YAML file per source, organised into category
subdirectories under sources/ (lazyaddons-style). The loader recurses, so
add a new source by dropping sources/<NN_category>/<name>.yaml — no central
registry to edit. Grouped multi-document files still load if present. Point
ESTORIDES_SOURCES_DIR at another tree to use your own addon set. The schema
is documented at the top of estorides_core/source_loader.py.
sources/
01_dns/
dns_google.yaml
crt_sh_certificates.yaml
02_ip_infra/
shodan_internetdb.yaml
...
tools/split_sources.py migrates legacy grouped files into this layout.
Architecture
sources/ one YAML per addon, grouped by category dir (99 addons)
estorides_core/
config.py every tunable (env-overridable)
source_loader.py registry, validation, lookup
async_client.py aiohttp + circuit breaker + SQLite cache
parsers.py 50+ structured parsers (ipapi, dns_json, crtsh…)
entity_extraction.py regex-based entity finder with dedup
knowledge_graph.py NetworkX MultiDiGraph + GraphML export
orchestrator.py glues everything, infers higher-level relations
estorides_llm/
manager.py multi-backend LLM (Ollama → OpenRouter → Anthropic → OpenAI → stub)
estorides_export/
stix.py STIX 2.1 bundle export
misp.py MISP event JSON export
estorides_cli.py argparse CLI
estorides_web.py Flask app
templates/index.html UI
static/{css,js}/estorides.* UI styles + D3 controller
Tips for a real run
- Start with the free-tier sources (default) — that is 80+ endpoints.
- Set
ESTORIDES_PARALLEL=16for faster fanout. - Set
ESTORIDES_TIMEOUT=20if your network is slow. - Disable paid sources you don't have keys for by setting
ESTORIDES_DISABLE_BACKENDS=openai,anthropic(or by leaving--include-paidoff in the CLI). - The SQLite cache lives in
data/estorides_cache.sqlite— delete it to force fresh fetches. - The LLM stage needs a generative model. Ollama auto-selects an
installed model if
ESTORIDES_OLLAMA_MODELis missing, but an embedding-only model (e.g.*:e2b) returns no text and the run falls back to the stub.ollama pull llama3.1:8bfor real analysis.
Performance knobs (bounds that keep a run from stalling)
| Env var | Default | What it caps |
|---|---|---|
ESTORIDES_DEADLINE via --deadline |
30s | hard wall-clock cap for the whole fanout |
ESTORIDES_ENTITY_MAX_SCAN |
120000 | chars scanned per response (huge crt.sh/wayback dumps) |
ESTORIDES_ENTITY_MAX_PER_TYPE |
750 | entities kept per type per source |
ESTORIDES_KG_MAX_COOCCUR |
30 | entities per source in the co-occurrence clique (O(n²) guard) |
ESTORIDES_LLM_REQUEST_TIMEOUT |
12s | per-call LLM HTTP timeout (no orphaned threads) |
Passive recon & operator OPSEC (bug-bounty mode)
For attack-surface scoping the two things that matter are: never let the target observe a probe attributable to your recon window, and never let a queried broker tie the lookups back to your real IP. Estorides enforces both at the engine level.
Contact classification
Every source declares how its traffic reaches the target:
contact |
Meaning | In --passive-only? |
|---|---|---|
none (default) |
Only a third-party DB / resolver / CT log is hit; the target sees nothing | kept |
broker |
A third party actively probes the target on your behalf (ping, traceroute, header fetch) | excluded |
active |
The engine connects to the target's own infrastructure directly | excluded |
An unknown/typo class is treated as active, so a passive-only run can
never be silently widened. --passive-only is enforced even for an
explicit --only-sources list. Sources that log your lookups also carry
logs_queries: true (surfaced in status).
# scope a domain without ever touching its infrastructure
python3 estorides_cli.py discover example.com --passive-only --out-json surface.json
python3 estorides_cli.py run example.com --passive-only
Egress anonymisation
Route every outbound request through a proxy so brokers never see your
real IP. SOCKS (Tor) needs aiohttp_socks; HTTP/HTTPS proxies work with
stock aiohttp and a comma-separated pool rotates per request.
python3 estorides_cli.py run example.com --passive-only --tor
python3 estorides_cli.py run example.com --proxy socks5://127.0.0.1:9050
export ESTORIDES_HTTP_PROXY_POOL="http://p1:8080,http://p2:8080"
Fail-closed: if a SOCKS proxy is requested without aiohttp_socks
installed, the client refuses to run rather than fall back to a
deanonymising direct connection. When proxying, the SSRF guard's local
DNS-resolution leg is skipped (ESTORIDES_PROXY_REMOTE_DNS=1, default) so
your resolver never learns which targets you are investigating — the
literal-host guard still runs and the exit node resolves the name.
Scope classification
Turn a discovered surface into in/out-of-scope flat lists you can pipe into the active phase. Out-of-scope always wins, so an excluded asset is never targeted by accident.
python3 estorides_cli.py scope \
--assets surface.json \
--scope program_scope.txt \
--out scope_result.json \
--flat-dir ./scope_out
# -> scope_out/in_scope_hosts.txt, in_scope_ips.txt, unknown.txt
Rules file grammar (one per line, # comments; a ## out-of-scope
divider separates the two lists):
*.example.com wildcard host suffix (apex + subdomains)
api.example.com exact host
192.0.2.0/24 CIDR (IPv4 or IPv6)
re:^staging-[0-9]+\.ex regex (prefix re:)
## out-of-scope
blog.example.com
192.0.2.200/32
Hard rules
- This is a passive intelligence tool. It does not probe, exploit, or interact with the target beyond what the public sources allow.
- All API keys stay in environment variables; they are never written to disk.
- Respect the rate limits of the upstream services. The circuit breaker will back off automatically when a host starts returning errors.
- Output is for legitimate OSINT, threat intelligence, journalism, academic research, and defensive security work.
Security & operations
| Concern | Control | Where |
|---|---|---|
| Outbound to RFC1918 / loopback / cloud IMDS | SSRF guard runs on every URL before fetch (allowlist override via ESTORIDES_ALLOWED_HOSTS) |
estorides_core/ssrf_guard.py |
| Web DoS / scraping | Sliding-window per-IP rate limit (default 30/min; tune via ESTORIDES_RATE_LIMIT) |
estorides_core/audit.py |
| Compliance trail | Append-only JSONL audit log of every API call (timestamp, IP, query, sources, status, latency) at data/audit.jsonl |
estorides_core/audit.py |
| Adversarial input | validate_query() rejects empty, oversize, control-char, bidi-override, and unsupported-type queries; bidi is rejected outright rather than silently stripped |
estorides_core/validation.py |
| API key leakage | Keys read from env at call time, never logged, never written to disk | estorides_core/orchestrator.py (_resolve_auth) |
| Encrypted report delivery | age (https://age-encryption.org) opt-in via ?key=age1… on the export endpoint; graceful fallback to plaintext when age is missing |
estorides_export/encryption.py |
| Trusting X-Forwarded-For | Only honoured when ESTORIDES_TRUST_PROXY=1 is set explicitly |
estorides_web.py |
| Target observing a recon probe | Per-source contact class; --passive-only (or ESTORIDES_PASSIVE_ONLY=1) keeps only none |
estorides_core/source_loader.py, orchestrator._select_sources |
| Broker tying lookups to operator IP | Egress proxy/Tor (--proxy/--tor, ESTORIDES_HTTP_PROXY[_POOL]); fail-closed on missing SOCKS lib |
estorides_core/async_client.py |
| DNS leak of investigated targets | Local resolution skipped when proxying (ESTORIDES_PROXY_REMOTE_DNS=1, default); literal-host guard still runs |
estorides_core/async_client.py |
Intelligence features
Ontology engine — OFAC SDN cross-check
estorides_core/ontology.py loads the OpenSanctions OFAC SDN list
(CC-BY 4.0) once, indexes it by normalised name + alias, and stamps
every observation with {sanctioned, hits, fields}. The LLM analyst
stage then writes a "SANCTIONED — OFAC SDN match on …" line into the
brief so sanctions exposure is impossible to miss in the report.
Index characteristics:
- ~7 MB, low-tens-of-thousands of entries
- 24h lazy refresh
- Single-flight: concurrent first-loads share one fetch
- Best-effort disk cache at
data/ontology_sdn.json - Stale-on-error: keeps the previous snapshot if a refresh fails
MITRE ATT&CK auto-tagging
estorides_core/mitre_attack.py maps every observation to the
ATT&CK techniques it might support, by both source-keyed table
(40+ techniques across the threat-intel, breach, and web sources)
and keyword scan (catches malware families: mimikatz, cobalt
strike, lockbit, …). Aggregated techniques are exposed at the top
of the orchestrator result as result.mitre.techniques.
Real-time feeds
estorides_core/feeds.py ships three keyless feeds that the map
UI can layer on top of OSINT results:
| Feed | Source | Refresh | Notes |
|---|---|---|---|
| Earthquakes | USGS M2.5+ GeoJSON | 10 min | Always on |
| Fires | NASA FIRMS VIIRS_NOAA20_NRT CSV | 30 min | Requires ESTORIDES_FIRMS_KEY |
| News | GDELT 2.0 article list | 15 min | Coords unavailable; surfaces at (0,0) |
Endpoint: GET /api/feeds?bbox=min_lon,min_lat,max_lon,max_lat&no_cache=1.
LLM prompt flavours
estorides_llm/intelligence_prompts.py ships three prompt styles:
system— the default Palantir-grade analyst with BLUF + confidence-graded findings.bluf— single-paragraph BLUF only, for time-critical briefs.tactical— adds THREAT PICTURE + COA-1/2/3 + IMMEDIATE ACTION.
Backend priority is configurable: ESTORIDES_BACKEND_PRIORITY=openai,ollama
or via the LLMManager constructor.
Tests
# All tests, ~10s
python3 _validate.py
# Individual suites
python3 _test_ssrf.py # 20 SSRF cases
python3 _test_validation.py # 16 input-validation cases
python3 _test_feeds.py # 3 real-time feeds
python3 _test_encryption.py # age encryption + graceful degradation
python3 _test_routes.py # Flask route table
python3 _multi_test.sh # end-to-end: 5 query types through the orchestrator
The validator exits 0 only when every check passes. CI runners can
grep FAIL to surface regressions.
Comments