grisuno/estorides: open-source intelligence (OSINT) aggregator and correlation engine

E.S.T.E.R.O.I.D.E.S. (acronym)

E — Entities (identification of entities, aliases, emails, IPs).
S — Signals (capture of digital footprints and metadata).
T — Targeted (focused on specific objectives).
E — Extraction (automated extraction from web sources).
R — Reconnaissance (recon and footprinting).
O — Open-source (the nature of the OSINT engine).
I — Intelligence (processing and correlation of data).
D — Data (massive ingestion of unstructured records).
E — Engine (the central engine that orchestrates queries).
S — Scraper (automated and persistent collection).

Open-source intelligence (OSINT) aggregator and correlation engine inspired by Palantir, Bellingcat, Maltego, and Citizen Lab workflows. A pure open-source re-imagining of the original fucklantir / osint_palantir toolchain, with a much bigger source catalogue, a proper knowledge graph, structured parsers, and a multi-backend LLM analyst.

No payloads. No active scanning. Just 99+ free public OSINT sources, fanned out in parallel, fused into a single intelligence picture.

                       +--------------------------+
   query "example.com" |  Estorides Orchestrator  |   -> STIX 2.1 bundle
---------------------> |  - async fanout          |   -> MISP event JSON
                       |  - 99+ free sources      |   -> GraphML for Gephi
                       |  - structured parsers    |   -> JSONL for training
                       |  - entity resolution     |
                       |  - knowledge graph       |
                       |  - ontology engine       |   <- OFAC SDN cross-check
                       |  - MITRE ATT&CK mapper   |   <- technique auto-tagging
                       |  - SSRF guard            |   <- blocklist at egress
                       |  - audit log + RL        |   <- per-IP trail
                       |  - multi-LLM analyst     |   <- BLUF / tactical / system
                       +--------------------------+
                                 |
                                 v
                          Web UI: map / graph / timeline / results
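The async fanout step in the diagram can be pictured roughly as below. This is a minimal sketch, not the project's orchestrator: `fan_out`, the two fake fetchers, and the `parallel` / `deadline` parameters (standing in for `ESTORIDES_PARALLEL` and the run deadline) are assumptions made for illustration.

```python
import asyncio


async def fan_out(query, sources, parallel=16, deadline=30.0):
    """Run every source coroutine concurrently, capped by a semaphore.

    `sources` maps a source name to an async callable taking the query.
    All names here are hypothetical, not the Estorides API.
    """
    sem = asyncio.Semaphore(parallel)

    async def one(name, fetch):
        async with sem:
            try:
                return name, {"ok": True, "data": await fetch(query)}
            except Exception as exc:  # one failing source must not sink the run
                return name, {"ok": False, "error": str(exc)}

    tasks = [asyncio.create_task(one(n, f)) for n, f in sources.items()]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:  # enforce the wall-clock cap on stragglers
        task.cancel()
    return dict(task.result() for task in done)


async def _fake_ok(query):
    await asyncio.sleep(0.01)
    return f"record for {query}"


async def _fake_fail(query):
    raise RuntimeError("upstream 500")


results = asyncio.run(fan_out("example.com", {"a": _fake_ok, "b": _fake_fail}))
```

Errors are captured per source, so a single broken upstream shows up as a failed entry rather than aborting the whole query.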

Architecture highlights (state-level)

Estorides is structured around small, single-responsibility registries so adding a new source, backend, inferer, or feed never requires touching the central orchestrator. The five plug-in surfaces are:

Surface	Decorator	File	Used for
Source parsers	`@register_parser("name")`	`estorides_core/parsers.py`	Translate raw HTTP into structured dicts
LLM backends	`@register("name")`	`estorides_llm/manager.py`	Add an LLM provider (ollama, openai, …)
Relationship inferers	`@register_inferer("source")`	`estorides_core/relationship_inference.py`	Source -> graph edges
Real-time feeds	subclass `Feed`	`estorides_core/feeds.py`	Map layers (quakes, fires, news)
Encrypted exporters	`estorides_export.encryption`	`estorides_export/encryption.py`	STIX/MISP + age encryption

What you get that the original does not

Capability	Original	Estorides
Number of free OSINT sources	~20	99
Intelligence categories	6	12
HTTP fanout model	sequential	async
Retries + backoff + circuit breaker	basic	yes
Response cache (SQLite)	none	yes
Per-source parsers	none	50+
Entity extraction (IP, domain, CVE…)	regex only	structured + dedup
Knowledge graph	none	NetworkX + GraphML
STIX 2.1 / MISP export	none	yes
Multi-LLM (Ollama / OpenAI / Anthropic)	Ollama only	4 backends + stub
Map (geolocation results)	PyVista 3D	Leaflet 2D
Force-directed graph view	none	D3.js
API key handling	none	per-source env vars
Paid source support	none	flag-based opt-in
OFAC SDN sanctions cross-check	none	ontology engine
MITRE ATT&CK technique auto-tagging	none	~40 techniques
SSRF / private-NW egress guard	none	RFC1918 + cloud IMDS blocked
Audit log (per request, append-only)	none	JSONL with IP+query+latency
Per-IP rate limit (sliding window)	none	default 30/min, env-tunable
Real-time feed layers	none	earthquakes + fires + news
Encrypted export (age)	none	opt-in via `?key=age1…`
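The SSRF / private-network egress row in the table can be illustrated with a literal-host check. This is a sketch under stated assumptions: `is_blocked_literal` is a hypothetical name, and it only covers literal IPs and `localhost`, not the DNS-resolution leg a full guard would also need.

```python
import ipaddress
from urllib.parse import urlsplit

# 169.254.169.254 is the link-local metadata (IMDS) address used by major clouds.
_IMDS = ipaddress.ip_address("169.254.169.254")


def is_blocked_literal(url: str) -> bool:
    """Return True when the URL's host is a literal address that must not be fetched.

    Hostnames pass through here; a real guard would also resolve them
    and re-check every resolved address.
    """
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host at all: refuse
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not a literal IP
    return ip == _IMDS or ip.is_private or ip.is_loopback or ip.is_link_local
```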

What v1.1 adds on top

Capability	v1.0	v1.1 (this)
Persistent graph (Cypher queries)	NetworkX dump	Kùzu embedded DB, cross-run joins
Run persistence	JSONL append	SQLite cases with FK observations/entities
Cross-feed entity resolver	none	Wikidata + OFAC + IP-API + NVD via `intel_resolver`
Fuzzy entity clustering	exact dedup	`difflib` SequenceMatcher, 0.85 threshold, aliases surfaced
Extra OSINT endpoints (keyless)	99 YAML sources	+7 (BGP, MAC, phone, GitHub, leaks, CISA KEV, malware C2)
Read-only Cypher endpoint	none	`/api/intel/graph?q=...` with write-keyword guard
Case history UI	none	Cases tab + full-entity inspector
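The fuzzy entity clustering row names `difflib` SequenceMatcher with a 0.85 threshold. A minimal sketch of that idea follows; the greedy first-match strategy, the `cluster_names` function, and the sample names are assumptions, not the project's algorithm.

```python
from difflib import SequenceMatcher


def cluster_names(names, threshold=0.85):
    """Greedy single-pass clustering.

    Each name joins the first cluster whose canonical (first-seen) name is
    at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # each item: {"canonical": str, "aliases": [str, ...]}
    for name in names:
        key = name.casefold().strip()
        for cluster in clusters:
            canonical = cluster["canonical"].casefold()
            if SequenceMatcher(None, key, canonical).ratio() >= threshold:
                if name != cluster["canonical"] and name not in cluster["aliases"]:
                    cluster["aliases"].append(name)
                break
        else:
            clusters.append({"canonical": name, "aliases": []})
    return clusters


clusters = cluster_names(["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex"])
```

With these inputs "ACME Corp." is surfaced as an alias of "Acme Corp", while "Acme Corporation" scores about 0.72 against it and stays separate. A plain ratio threshold is conservative with long suffixes, which is worth knowing before lowering it.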

v1.1 architecture

                       +--------------------------+
   query "example.com" |  Estorides Orchestrator  |   -> STIX 2.1 / MISP / GraphML / JSON
---------------------> |  + async fanout          |
                       |  + 99 free sources       |
                       |  + 7 Osiris-style probes |
                       |  + SSRF guard + audit     |
                       |  + ontology engine       |
                       |  + MITRE ATT&CK mapper   |
                       |  + multi-LLM analyst     |
                       |  + cross-feed resolver   |   <- Wikidata SPARQL + OFAC + IP-API + NVD
                       |  + fuzzy entity cluster  |   <- difflib SequenceMatcher
                       +-----------+--------------+
                                   |
                  +----------------+-----------------+--------------------+
                  v                                  v                    v
        +------------------+              +------------------+    +------------------+
        | Kùzu graph DB    |              | SQLite case store|    | In-memory NX     |
        | (Cypher queries) |              | (FK observations)|    | (per-run working)|
        | 99 node labels   |              | search by entity |    | per-run edges    |
        | 9 REL types      |              | search by query  |    |                  |
        +------------------+              +------------------+    +------------------+
                  ^                                  ^
                  +---------/api/intel/resolve-------+
                  +---------/api/cases/...-----------+

v1.1 API additions

Endpoint	Purpose
`GET /api/cases?q=<substr>&type=<qtype>`	List past runs. Searchable by query substring.
`GET /api/cases/<id>?full=1`	Replay a case. `full=1` includes observations + entities.
`DELETE /api/cases/<id>`	Drop a case.
`GET /api/intel/resolve?type=<t>&id=<v>`	Cross-feed resolution. `type` is one of `ip`, `domain`, `company`, `person`, `country`, `cve`, `btc_address`, `eth_address`.
`GET /api/intel/graph?q=<cypher>`	Read-only Cypher against the Kùzu graph. Mutations (`CREATE`/`MERGE`/`SET`/`DELETE`) are rejected.
`GET /api/intel/stats`	One-glance dashboard: case count, Kùzu node/edge counts, resolver cache size.
`GET /api/osiris/bgp?query=<ip\|ASxxxxx>`	BGP / ASN lookup via `bgpview.io`.
`GET /api/osiris/mac?mac=00:1A:...`	MAC OUI vendor via `macvendors.co`.
`GET /api/osiris/phone?number=+14155552671`	Phone geolocation (NANP area code → lat/lng).
`GET /api/osiris/github?user=torvalds`	GitHub user + 5 most recent repos.
`GET /api/osiris/leaks?email=...`	XposedOrNot breach analytics (more detail than HIBP).
`GET /api/osiris/cisa-kev?limit=10&days=30`	CISA Known Exploited Vulnerabilities, recent window.
`GET /api/osiris/malware?limit=200`	Feodo Tracker + URLhaus active C2, geolocated.
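The write-keyword guard on `/api/intel/graph` could look something like the sketch below. The function name `is_read_only` and the choice to ignore keywords inside quoted strings are assumptions; only the four blocked keywords come from the table above.

```python
import re

# Keywords named in the endpoint description; a real guard may block more.
_WRITE_KEYWORDS = ("CREATE", "MERGE", "SET", "DELETE")
_WRITE_RE = re.compile(r"\b(" + "|".join(_WRITE_KEYWORDS) + r")\b", re.IGNORECASE)
# Single- or double-quoted string literals, with backslash escapes.
_STRING_RE = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")


def is_read_only(cypher: str) -> bool:
    """Return False if a write keyword appears outside string literals."""
    without_strings = _STRING_RE.sub("''", cypher)
    return _WRITE_RE.search(without_strings) is None
```

Stripping string literals first avoids rejecting a harmless query such as one that filters on the value `'SET'`. A keyword filter is still a coarse control and is best paired with a read-only database connection where the engine offers one.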

v1.1 install

pip install -r requirements.txt

The only new dependency is kuzu>=0.11. If Kùzu cannot be imported, the orchestrator falls back to in-memory NetworkX, but the persistent cross-run graph is only available with Kùzu installed.

Quickstart

1. Install (small dependency set: Flask, NetworkX, requests, PyYAML)

cd estorides
python3 -m pip install flask networkx requests pyyaml

Optional, for a real LLM:

# pick one backend
ollama serve &              # serve blocks the shell, so background it (or use a second terminal)
ollama pull llama3.1:8b
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export OPENROUTER_API_KEY=sk-or-...

2. CLI

# 99 sources, 12 categories
python3 estorides_cli.py status

# run a query (free sources only)
python3 estorides_cli.py run 8.8.8.8

# enable sources that need an API key
python3 estorides_cli.py run user@example.com --include-paid

# only a subset of sources
python3 estorides_cli.py run example.com \
    --only-sources crt_sh_certificates,shodan_internetdb,ipapi_free

# export the latest run as STIX 2.1 or MISP
python3 estorides_cli.py stix --out my_bundle.json
python3 estorides_cli.py misp --out my_event.json

3. Web UI

python3 estorides_cli.py serve --port 5050
# open http://127.0.0.1:5050

UI features:

2D map (Leaflet) of every geolocated result
D3.js force-directed knowledge graph (drag, zoom, hover)
Timeline of source acquisition
Source results panel with per-source parsed output
Filterable entity list
LLM analysis with backend / model badge
One-click export: STIX 2.1, MISP, GraphML, JSON

Graph intelligence (Maltego-style)

The Graph canvas turns observations into an interactive intelligence workbench:

Clusters — nodes are grouped into communities (translucent hulls) and coloured by cluster. Inter-cluster links are dashed/highlighted; click one to see a cross-reference tooltip explaining how two clusters relate (the bridge entities + relation).
Click to enrich — left-click a node to resolve it (cross-feed + VirusTotal relationships) and merge the new nodes/links into both the graph and the map. Each new node is itself clickable, so exploration is recursive.
Intelligence tiers — every node carries an auto-computed level shown as a coloured ring: data → information (≥2 corroborating sources) → intelligence (cross-cluster / resolved) → counter-intelligence (sanction / threat / VirusTotal-malicious). Override any node's level from the right-click menu or the inspector (persists in the browser).
Transforms — right-click a node (or use the side inspector panel) for transforms grouped by tier: data → information → intelligence → counter-intelligence.
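The tier ladder described above can be sketched as a small function. The `Node` fields and the rule that the most severe matching tier wins are assumptions for illustration; the README only states the conditions for each tier.

```python
from dataclasses import dataclass


@dataclass
class Node:
    sources: int = 1              # number of corroborating sources
    cross_cluster: bool = False   # bridges two clusters
    resolved: bool = False        # enriched by the cross-feed resolver
    sanctioned: bool = False
    threat: bool = False
    vt_malicious: bool = False


def tier(node: Node) -> str:
    """Return the node's level, checking from most to least severe."""
    if node.sanctioned or node.threat or node.vt_malicious:
        return "counter-intelligence"
    if node.cross_cluster or node.resolved:
        return "intelligence"
    if node.sources >= 2:
        return "information"
    return "data"
```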

VirusTotal

VirusTotal is integrated both as a source (vt_ip, vt_domain, vt_file) and as the relationship engine behind graph expansion (resolved domains/IPs, communicating/dropped files, contacted infra). It needs a free API key; without it VirusTotal stays inactive and the rest of the platform is unaffected:

export VT_API_KEY=...   # https://www.virustotal.com/gui/my-apikey

99 sources, 12 categories

DNS Intelligence (9) - Google DoH, Cloudflare DoH, HackerTarget, crt.sh, Cert Spotter, RDAP, DNS Dumpster, host search
IP & Infrastructure (13) - ip-api, ipinfo, ipapi.co, ipwho.is, Shodan InternetDB, GreyNoise, ipwhois, Robtex, RDAP, AS lookup, AbuseIPDB, MAC OUI, RIPE Stat
Web Intelligence (10) - urlscan, Wayback CDX, Wayback availability, HTTP headers, whois, geoip, traceroute, nping, Microlink, Google cache
Social Media (13) - GitHub, Reddit, Mastodon, Keybase, HackerNews, Telegram, Pinterest, WordPress, Medium, DEV.to
Threat Intelligence (13) - ThreatFox, URLhaus, payloads, PhishTank, OpenPhish, OTX (+passive domain/IP, no key), MalwareBazaar, Feodo, SSLBL, Emerging Threats, blocklist.de
Breach Intelligence (6) - HIBP breaches, HIBP pastes, Phonebook email, Phonebook domain, DeHashed, IntelligenceX
Geolocation (5) - Nominatim search + reverse, OpenWeather geocoding, TimeZoneDB, Wikidata
Knowledge (12) - Wikipedia, summary, DuckDuckGo IA, OpenAlex, Crossref, arXiv, GitHub advisories, NVD CVE, cve.circl, ExploitDB, Reddit subreddit search
Wireless (5) - WiGLE, IEEE OUI, OpenSky, MarineTraffic, N2YO
Blockchain (5) - blockchain.info (balance + tx), Blockstream, Ethplorer, mempool.space
Paste & Leaks (4) - psbdmp, GitHub gist search, TGStat, LeakCheck
Visual (4) - ScreenshotMachine, Microlink, TinEye, EXIF

Sources are addons: one YAML file per source, organised into category subdirectories under sources/ (lazyaddons-style). The loader recurses, so add a new source by dropping sources/<NN_category>/<name>.yaml — no central registry to edit. Grouped multi-document files still load if present. Point ESTORIDES_SOURCES_DIR at another tree to use your own addon set. The schema is documented at the top of estorides_core/source_loader.py.

sources/
  01_dns/
    dns_google.yaml
    crt_sh_certificates.yaml
  02_ip_infra/
    shodan_internetdb.yaml
  ...

tools/split_sources.py migrates legacy grouped files into this layout.
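The recursive discovery step can be sketched with `pathlib`. `ESTORIDES_SOURCES_DIR` is the override named above; the function name `discover_source_files` and the sorted load order are assumptions, and YAML parsing and validation are deliberately left out.

```python
import os
from pathlib import Path


def discover_source_files(default_dir="sources"):
    """Recursively list *.yaml addon files, sorted for a stable load order."""
    root = Path(os.environ.get("ESTORIDES_SOURCES_DIR", default_dir))
    return sorted(p for p in root.rglob("*.yaml") if p.is_file())
```

Sorting full paths means the numeric category prefixes (`01_dns`, `02_ip_infra`, ...) also determine load order, which keeps runs reproducible.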

Architecture

sources/                  one YAML per addon, grouped by category dir (99 addons)
estorides_core/
    config.py             every tunable (env-overridable)
    source_loader.py      registry, validation, lookup
    async_client.py       aiohttp + circuit breaker + SQLite cache
    parsers.py            50+ structured parsers (ipapi, dns_json, crtsh…)
    entity_extraction.py  regex-based entity finder with dedup
    knowledge_graph.py    NetworkX MultiDiGraph + GraphML export
    orchestrator.py       glues everything, infers higher-level relations
estorides_llm/
    manager.py            multi-backend LLM (Ollama → OpenRouter → Anthropic → OpenAI → stub)
estorides_export/
    stix.py               STIX 2.1 bundle export
    misp.py               MISP event JSON export
estorides_cli.py          argparse CLI
estorides_web.py          Flask app
templates/index.html      UI
static/{css,js}/estorides.*  UI styles + D3 controller

Tips for a real run

Start with the free-tier sources (default) — that is 80+ endpoints.
Set ESTORIDES_PARALLEL=16 for faster fanout.
Set ESTORIDES_TIMEOUT=20 if your network is slow.
Disable LLM backends you don't have keys for by setting ESTORIDES_DISABLE_BACKENDS=openai,anthropic. Paid OSINT sources stay off unless you pass --include-paid in the CLI.
The SQLite cache lives in data/estorides_cache.sqlite — delete it to force fresh fetches.
The LLM stage needs a generative model. Ollama auto-selects an installed model if ESTORIDES_OLLAMA_MODEL is unset, but an embedding-only model (e.g. *:e2b) returns no text and the run falls back to the stub. Run ollama pull llama3.1:8b for real analysis.

Performance knobs (bounds that keep a run from stalling)

Env var	Default	What it caps
`ESTORIDES_DEADLINE` (or `--deadline`)	30s	hard wall-clock cap for the whole fanout
`ESTORIDES_ENTITY_MAX_SCAN`	120000	chars scanned per response (huge crt.sh/wayback dumps)
`ESTORIDES_ENTITY_MAX_PER_TYPE`	750	entities kept per type per source
`ESTORIDES_KG_MAX_COOCCUR`	30	entities per source in the co-occurrence clique (O(n²) guard)
`ESTORIDES_LLM_REQUEST_TIMEOUT`	12s	per-call LLM HTTP timeout (no orphaned threads)

Passive recon & operator OPSEC (bug-bounty mode)

For attack-surface scoping the two things that matter are: never let the target observe a probe attributable to your recon window, and never let a queried broker tie the lookups back to your real IP. Estorides enforces both at the engine level.

Contact classification

Every source declares how its traffic reaches the target:

`contact`	Meaning	In `--passive-only`?
`none` (default)	Only a third-party DB / resolver / CT log is hit; the target sees nothing	kept
`broker`	A third party actively probes the target on your behalf (ping, traceroute, header fetch)	excluded
`active`	The engine connects to the target's own infrastructure directly	excluded

An unknown/typo class is treated as active, so a passive-only run can never be silently widened. --passive-only is enforced even for an explicit --only-sources list. Sources that log your lookups also carry logs_queries: true (surfaced in status).
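The selection rules above (default `none`, unknown treated as `active`, passive-only applied even on an explicit list) can be sketched as follows. The dict shape of a source and the function names are assumptions, not the project's `_select_sources`.

```python
PASSIVE = "none"
KNOWN_CLASSES = {"none", "broker", "active"}


def contact_class(source: dict) -> str:
    """Missing field defaults to 'none'; an unknown value is treated as 'active'."""
    value = source.get("contact", PASSIVE)
    return value if value in KNOWN_CLASSES else "active"


def select_sources(sources, passive_only=False, only=None):
    """Apply the explicit source list first, then the passive-only filter on top."""
    chosen = [s for s in sources if only is None or s["name"] in only]
    if passive_only:
        chosen = [s for s in chosen if contact_class(s) == PASSIVE]
    return [s["name"] for s in chosen]


catalogue = [
    {"name": "crt_sh"},                             # no field: passive
    {"name": "traceroute", "contact": "broker"},
    {"name": "http_headers", "contact": "active"},
    {"name": "typo_src", "contact": "pasive"},      # typo: handled as active
]
```

Treating a misspelled class as `active` is the fail-safe direction: a typo can only shrink a passive-only run, never widen it.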

# scope a domain without ever touching its infrastructure
python3 estorides_cli.py discover example.com --passive-only --out-json surface.json
python3 estorides_cli.py run example.com --passive-only

Egress anonymisation

Route every outbound request through a proxy so brokers never see your real IP. SOCKS (Tor) needs aiohttp_socks. HTTP/HTTPS proxies work with stock aiohttp, and a comma-separated pool rotates per request.

python3 estorides_cli.py run example.com --passive-only --tor
python3 estorides_cli.py run example.com --proxy socks5://127.0.0.1:9050
export ESTORIDES_HTTP_PROXY_POOL="http://p1:8080,http://p2:8080"

Fail-closed: if a SOCKS proxy is requested without aiohttp_socks installed, the client refuses to run rather than fall back to a deanonymising direct connection. When proxying, the SSRF guard's local DNS-resolution leg is skipped (ESTORIDES_PROXY_REMOTE_DNS=1, default) so your resolver never learns which targets you are investigating — the literal-host guard still runs and the exit node resolves the name.
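The fail-closed behaviour can be sketched as a pre-flight check. `check_proxy` and its `socks_lib_available` override are hypothetical; the only project detail used is that SOCKS support depends on `aiohttp_socks` being importable.

```python
import importlib.util
from urllib.parse import urlsplit

SOCKS_SCHEMES = {"socks4", "socks4a", "socks5", "socks5h"}


def check_proxy(proxy_url, socks_lib_available=None):
    """Return the proxy kind, or raise instead of silently going direct."""
    scheme = urlsplit(proxy_url).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme in SOCKS_SCHEMES:
        if socks_lib_available is None:
            # Probe for the optional dependency without importing it.
            socks_lib_available = importlib.util.find_spec("aiohttp_socks") is not None
        if not socks_lib_available:
            raise RuntimeError(
                "SOCKS proxy requested but aiohttp_socks is not installed; refusing to run"
            )
        return "socks"
    raise ValueError(f"unsupported proxy scheme: {scheme!r}")
```

Raising here, before any request is built, is what keeps a missing library from degrading into a direct connection.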

Scope classification

Turn a discovered surface into in/out-of-scope flat lists you can pipe into the active phase. Out-of-scope always wins, so an excluded asset is never targeted by accident.

python3 estorides_cli.py scope \
    --assets surface.json \
    --scope program_scope.txt \
    --out scope_result.json \
    --flat-dir ./scope_out
# -> scope_out/in_scope_hosts.txt, in_scope_ips.txt, unknown.txt

Rules file grammar (one per line, # comments; a ## out-of-scope divider separates the two lists):

*.example.com            wildcard host suffix (apex + subdomains)
api.example.com          exact host
192.0.2.0/24             CIDR (IPv4 or IPv6)
re:^staging-[0-9]+\.ex   regex (prefix re:)

## out-of-scope
blog.example.com
192.0.2.200/32
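A minimal classifier for this grammar might look like the sketch below. It assumes the first whitespace-separated token on each line is the rule (so regex rules cannot contain spaces) and that the divider is matched case-insensitively; `parse_rules`, `matches`, and `classify` are hypothetical names.

```python
import ipaddress
import re


def parse_rules(text):
    """Split a rules file into (in_scope, out_of_scope) rule lists."""
    in_scope, out_scope = [], []
    current = in_scope
    for raw in text.splitlines():
        line = raw.strip()
        if line.lower().startswith("## out-of-scope"):
            current = out_scope  # everything after the divider is excluded
            continue
        if not line or line.startswith("#"):
            continue
        current.append(line.split()[0])  # drop any trailing annotation
    return in_scope, out_scope


def matches(rule, asset):
    """Test one asset (host or IP string) against one rule."""
    if rule.startswith("re:"):
        return re.search(rule[3:], asset) is not None
    try:
        network = ipaddress.ip_network(rule, strict=False)
    except ValueError:
        network = None
    if network is not None:
        try:
            return ipaddress.ip_address(asset) in network
        except ValueError:
            return False  # asset is a hostname, rule is a CIDR
    if rule.startswith("*."):
        apex, host = rule[2:].lower(), asset.lower()
        return host == apex or host.endswith("." + apex)
    return asset.lower() == rule.lower()


def classify(asset, in_scope, out_scope):
    """Out-of-scope is checked first, so an exclusion always wins."""
    if any(matches(r, asset) for r in out_scope):
        return "out"
    if any(matches(r, asset) for r in in_scope):
        return "in"
    return "unknown"
```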

Hard rules

This is a passive intelligence tool. It does not probe, exploit, or interact with the target beyond what the public sources allow.
All API keys stay in environment variables; they are never written to disk.
Respect the rate limits of the upstream services. The circuit breaker will back off automatically when a host starts returning errors.
Output is for legitimate OSINT, threat intelligence, journalism, academic research, and defensive security work.
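The automatic back-off mentioned in the rate-limit rule can be pictured as a per-host circuit breaker. This is a generic sketch of the pattern; the thresholds, the `CircuitBreaker` class, and its method names are assumptions rather than the behaviour of `estorides_core/async_client.py`.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive errors for a host, then allow
    a trial request again once `cooldown` seconds have passed."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self._failures = {}   # host -> consecutive failure count
        self._opened_at = {}  # host -> time the breaker opened

    def allow(self, host):
        opened = self._opened_at.get(host)
        if opened is None:
            return True
        return self.clock() - opened >= self.cooldown  # half-open trial

    def record(self, host, ok):
        if ok:
            self._failures.pop(host, None)
            self._opened_at.pop(host, None)
            return
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        if count >= self.max_failures:
            self._opened_at[host] = self.clock()
```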

Security & operations

Concern	Control	Where
Outbound to RFC1918 / loopback / cloud IMDS	SSRF guard runs on every URL before fetch (allowlist override via `ESTORIDES_ALLOWED_HOSTS`)	`estorides_core/ssrf_guard.py`
Web DoS / scraping	Sliding-window per-IP rate limit (default 30/min; tune via `ESTORIDES_RATE_LIMIT`)	`estorides_core/audit.py`
Compliance trail	Append-only JSONL audit log of every API call (timestamp, IP, query, sources, status, latency) at `data/audit.jsonl`	`estorides_core/audit.py`
Adversarial input	`validate_query()` rejects empty, oversize, control-char, bidi-override, and unsupported-type queries; bidi is rejected outright rather than silently stripped	`estorides_core/validation.py`
API key leakage	Keys read from env at call time, never logged, never written to disk	`estorides_core/orchestrator.py` (`_resolve_auth`)
Encrypted report delivery	`age` (https://age-encryption.org) opt-in via `?key=age1…` on the export endpoint; graceful fallback to plaintext when `age` is missing	`estorides_export/encryption.py`
Trusting X-Forwarded-For	Only honoured when `ESTORIDES_TRUST_PROXY=1` is set explicitly	`estorides_web.py`
Target observing a recon probe	Per-source `contact` class; `--passive-only` (or `ESTORIDES_PASSIVE_ONLY=1`) keeps only `none`	`estorides_core/source_loader.py`, `orchestrator._select_sources`
Broker tying lookups to operator IP	Egress proxy/Tor (`--proxy`/`--tor`, `ESTORIDES_HTTP_PROXY[_POOL]`); fail-closed on missing SOCKS lib	`estorides_core/async_client.py`
DNS leak of investigated targets	Local resolution skipped when proxying (`ESTORIDES_PROXY_REMOTE_DNS=1`, default); literal-host guard still runs	`estorides_core/async_client.py`
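The per-IP sliding-window limit in the table can be sketched with a deque of timestamps per client. The default of 30 requests per 60 seconds mirrors the documented default; the class and its injectable clock are illustrative assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit=30, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        """Return True and record the hit if `ip` is under its limit."""
        now = self.clock()
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Unlike a fixed per-minute counter, the window slides with each request, so a burst straddling a minute boundary cannot get double the allowance.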

Intelligence features

Ontology engine — OFAC SDN cross-check

estorides_core/ontology.py loads the OpenSanctions OFAC SDN list (CC-BY 4.0) once, indexes it by normalised name + alias, and stamps every observation with {sanctioned, hits, fields}. The LLM analyst stage then writes a "SANCTIONED — OFAC SDN match on …" line into the brief so sanctions exposure is impossible to miss in the report.

Index characteristics:

~7 MB, low-tens-of-thousands of entries
24h lazy refresh
Single-flight: concurrent first-loads share one fetch
Best-effort disk cache at data/ontology_sdn.json
Stale-on-error: keeps the previous snapshot if a refresh fails
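The lazy-refresh and stale-on-error behaviour in the list above can be sketched as a small wrapper around a fetch function. `LazySnapshot` is a hypothetical name, and single-flight locking and the disk cache are omitted to keep the example short.

```python
import time


class LazySnapshot:
    """Reload at most once per `ttl` seconds; keep the old data if a reload fails."""

    def __init__(self, fetch, ttl=24 * 3600, clock=time.time):
        self.fetch = fetch
        self.ttl = ttl
        self.clock = clock
        self._data = None
        self._loaded_at = None

    def get(self):
        now = self.clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self.ttl
        if expired:
            try:
                self._data = self.fetch()
                self._loaded_at = now
            except Exception:
                if self._data is None:
                    raise  # nothing to fall back to on the very first load
        return self._data
```

In this sketch a failed refresh leaves the timestamp untouched, so the next call retries immediately; a production version would likely add a retry delay.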

MITRE ATT&CK auto-tagging

estorides_core/mitre_attack.py maps every observation to the ATT&CK techniques it might support, using both a source-keyed table (40+ techniques across the threat-intel, breach, and web sources) and a keyword scan (catches malware families: mimikatz, cobalt strike, lockbit, …). Aggregated techniques are exposed at the top of the orchestrator result as result.mitre.techniques.
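The two-pronged tagging can be sketched as a table lookup plus a substring scan. The technique IDs below are real ATT&CK IDs, but the specific source keys and keyword-to-technique pairings are illustrative assumptions, not the project's tables.

```python
# Illustrative tables only; the project's real mappings are larger and may differ.
SOURCE_TECHNIQUES = {
    "phishtank": ["T1566"],      # Phishing
    "hibp_breaches": ["T1078"],  # Valid Accounts
}
KEYWORD_TECHNIQUES = {
    "mimikatz": ["T1003"],       # OS Credential Dumping
    "cobalt strike": ["T1071"],  # Application Layer Protocol
    "lockbit": ["T1486"],        # Data Encrypted for Impact
}


def tag_observation(source: str, text: str) -> list:
    """Union of source-keyed techniques and keyword hits, sorted by ID."""
    found = set(SOURCE_TECHNIQUES.get(source, []))
    lowered = text.lower()
    for keyword, techniques in KEYWORD_TECHNIQUES.items():
        if keyword in lowered:
            found.update(techniques)
    return sorted(found)
```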

Real-time feeds

estorides_core/feeds.py ships three keyless feeds that the map UI can layer on top of OSINT results:

Feed	Source	Refresh	Notes
Earthquakes	USGS M2.5+ GeoJSON	10 min	Always on
Fires	NASA FIRMS VIIRS_NOAA20_NRT CSV	30 min	Requires `ESTORIDES_FIRMS_KEY`
News	GDELT 2.0 article list	15 min	Coords unavailable; surfaces at (0,0)

Endpoint: GET /api/feeds?bbox=min_lon,min_lat,max_lon,max_lat&no_cache=1.
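Parsing and applying the `bbox` parameter could be done roughly as below. The function names are hypothetical, and the sketch does not handle boxes that cross the antimeridian.

```python
def parse_bbox(raw: str):
    """Parse 'min_lon,min_lat,max_lon,max_lat' into four floats."""
    parts = [float(p) for p in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("bbox needs exactly four comma-separated numbers")
    min_lon, min_lat, max_lon, max_lat = parts
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("bbox minimums must not exceed maximums")
    return min_lon, min_lat, max_lon, max_lat


def in_bbox(lon, lat, bbox):
    """True when the point lies inside the box, edges included."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
```

Note that items placed at (0,0), as the News feed does, fall outside most regional boxes and would be filtered out by a check like this.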

LLM prompt flavours

estorides_llm/intelligence_prompts.py ships three prompt styles:

system — the default Palantir-grade analyst with BLUF + confidence-graded findings.
bluf — single-paragraph BLUF only, for time-critical briefs.
tactical — adds THREAT PICTURE + COA-1/2/3 + IMMEDIATE ACTION.

Backend priority is configurable: ESTORIDES_BACKEND_PRIORITY=openai,ollama or via the LLMManager constructor.
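A fallback chain driven by these settings might look like the following sketch. The default order is taken from the Architecture listing; `backend_order`, `analyse`, the callable-per-backend shape, and the choice to try only the listed backends when a priority is set are assumptions about behaviour the README does not spell out.

```python
import os

DEFAULT_PRIORITY = ["ollama", "openrouter", "anthropic", "openai", "stub"]


def _csv(value):
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def backend_order(env=os.environ):
    """Preferred backends first, disabled ones removed, stub always last."""
    preferred = _csv(env.get("ESTORIDES_BACKEND_PRIORITY", ""))
    disabled = set(_csv(env.get("ESTORIDES_DISABLE_BACKENDS", "")))
    order = preferred or list(DEFAULT_PRIORITY)
    order = [b for b in order if b not in disabled and b != "stub"]
    return order + ["stub"]


def analyse(prompt, backends, env=os.environ):
    """Try each backend in order; an error or empty reply moves to the next."""
    for name in backend_order(env):
        call = backends.get(name)
        if call is None:
            continue
        try:
            text = call(prompt)
        except Exception:
            continue
        if text:
            return name, text
    return "stub", "(no LLM backend available)"
```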

Tests

# All tests, ~10s
python3 _validate.py

# Individual suites
python3 _test_ssrf.py        # 20 SSRF cases
python3 _test_validation.py  # 16 input-validation cases
python3 _test_feeds.py       # 3 real-time feeds
python3 _test_encryption.py  # age encryption + graceful degradation
python3 _test_routes.py      # Flask route table
bash _multi_test.sh          # end-to-end: 5 query types through the orchestrator

The validator exits 0 only when every check passes. CI runners can grep FAIL to surface regressions.
