VoidAccess

A self-hosted OSINT platform for dark web threat intelligence.

Automate the entire investigation workflow from query refinement to relationship mapping in 13 autonomous pipeline steps.

The OSINT Powerhouse

Commercial threat intelligence platforms often charge prohibitive annual fees for capabilities that can be run on private hardware. VoidAccess democratizes high-end dark web intelligence by providing an automated, end-to-end workflow:

Query Refinement: Intelligent search term optimization using LLMs.
Multilingual Search: Deep-web fan-out across English, Russian, and Chinese engines.
Entity Extraction: Autonomous identification of wallets, IOCs, PGP keys, and more.
Relationship Mapping: Dynamic graph generation from extracted data co-occurrence.
Structured Export: STIX 2.1, MISP, Sigma, and CSV support.

Visual Walkthrough

1. Intuitive Dashboard

Start investigations with a clean, dark-themed interface designed for high-stakes research.

2. Intelligent Scoping

Refine queries and select investigation depth with precision.

3. Real-time Pipeline Tracking

Monitor the 13-step autonomous pipeline as it crawls and extracts intelligence.

4. Interactive Graph Intelligence

Explore connections between entities, onion sites, and threat actors in a dynamic, high-contrast graph.

5. Comprehensive Intel Reports

Get structured summaries and actionable artifacts once the scan completes.

How It Works (The 13-Step Pipeline)

VoidAccess handles the complexity of dark web research through a rigorous sequence:

1. LLM Query Refinement: Optimizes search terms for .onion engine indexing.
2. Parallel Collection: Queries 16+ Tor search engines in parallel, alongside paste sites (Pastebin, dpaste, paste.ee), GitHub, GitLab, and curated RSS security feeds.
3. Intelligence Filtering: LLM filters noise, keeping only relevant intelligence pages.
4. Multi-Source Enrichment: Pulls from AlienVault OTX, abuse.ch, ransomware.live, CISA KEV, Shodan, GreyNoise, AbuseIPDB, Feodo Tracker, C2IntelFeeds, and more, running in parallel with collection.
5. Recursive .onion Discovery: Discovers hidden links via seed URL crawling.
6. Vector Cache Check: Avoids redundant scraping for recently visited pages (24h TTL).
7. Tor-Routed Scraping: Safely fetches page content with a 1MB safety cap.
8. Persistence: Stores new content in the local vector cache.
9. Intelligence Merging: Combines scraped and enriched data for processing.
10. Advanced Extraction: Regex, NER, and LLM-based entity identification.
11. Historical Cross-Referencing: Validates data against seed datasets.
12. Graph Construction: Builds relationship nodes based on co-occurrence.
13. Final Intelligence Summary: LLM generates a structured technical briefing.
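
The graph construction step links entities by co-occurrence. The sketch below shows one plausible reading of that idea using only the standard library: two entities get an edge whenever they appear on the same page, and the edge weight counts how many pages they share. The function name, the `pages` structure, and the entity labels are illustrative assumptions, not VoidAccess internals (the project itself builds its graph with NetworkX).

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(pages):
    """Count how often each pair of entities appears on the same page.

    `pages` maps a page identifier to the entities extracted from it.
    Returns a Counter keyed by sorted (entity_a, entity_b) tuples.
    """
    edges = Counter()
    for entities in pages.values():
        # Deduplicate and sort so each unordered pair has one canonical key.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical extraction output: page id -> entities found on that page.
pages = {
    "page-1": ["actor:alice", "btc:wallet-1", "cve:CVE-2024-0001"],
    "page-2": ["actor:alice", "btc:wallet-1"],
    "page-3": ["cve:CVE-2024-0001"],
}

edges = build_cooccurrence_edges(pages)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b} (weight {weight})")
```

Pairs that share more pages end up with heavier edges, which is a common way to rank relationships before rendering them.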

What It Extracts

The extraction pipeline identifies these entity types:

| Category | Examples |
| --- | --- |
| Cryptocurrency | Bitcoin, Ethereum, Monero wallet addresses |
| Network Indicators | IPv4 addresses, .onion URLs, domains, email addresses, PGP keys |
| File Indicators | MD5, SHA1, SHA256 hashes |
| Vulnerabilities | CVE numbers, MITRE ATT&CK techniques |
| Threat Actors | Actor handles, malware families, ransomware group names |
| Paste Sites | Pastebin, Ghostbin, Rentry, and similar links |
| People/Orgs | Named persons, organization names, locations |
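
Several of these entity types have a fixed shape, so a regex pass can catch them before NER or an LLM is involved. The following is a minimal sketch of such a pass for a handful of types. The patterns and the `extract_entities` helper are assumptions written for illustration; they are not the project's actual extractor and deliberately ignore defanged indicators such as `192[.]0[.]2[.]10`.

```python
import re

# Illustrative patterns only; a production extractor needs many more cases.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "ipv4": re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "onion_v3": re.compile(r"\b[a-z2-7]{56}\.onion\b"),
}


def extract_entities(text):
    """Return {entity_type: sorted unique matches} for every pattern that hits."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = sorted({m.group(0) for m in pattern.finditer(text)})
        if matches:
            found[name] = matches
    return found


sample = (
    "C2 at 192.0.2.10 exploits CVE-2024-12345; "
    "dropper md5 d41d8cd98f00b204e9800998ecf8427e"
)
print(extract_entities(sample))
```

The word-boundary anchors keep the 32-character MD5 pattern from matching inside a longer SHA256 string.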

Parallel collection sources (run alongside Tor search):

Paste sites — Pastebin, dpaste, paste.ee, Rentry
GitHub — code search and repository READMEs
GitLab — code search and project pages
RSS security feeds — 20 curated feeds (Krebs, BleepingComputer, Talos, Mandiant, CrowdStrike, Unit 42, CISA, and more)
Curated .onion seed catalogue — 31 vetted seeds across 8 categories, scored per query

Enrichment and quality sources (19 total):

AlienVault OTX — threat pulses and malware families
MalwareBazaar — malware samples and signatures
ThreatFox — recent IOC feed
URLhaus — malicious URL database
ransomware.live — ransomware group tracking and leak-site seeds
CISA KEV — known exploited vulnerabilities catalog
Shodan InternetDB — passive vulnerability signatures
VirusTotal — file hash AV detection ratio (API key required)
GreyNoise — suppresses known benign scanner IPs from results (API key required)
AbuseIPDB — community IP abuse reports; 1,000 checks/day free
Feodo Tracker + C2IntelFeeds — confirmed C2 IPs for 6 major frameworks; no key required
crt.sh — certificate transparency logs; subdomain enumeration; free
URLScan.io — live domain scan data and malicious verdicts
Wayback Machine — historical domain snapshots for taken-down infrastructure
Hybrid Analysis — behavioral sandbox verdict and AV detection ratio for file hashes
HaveIBeenPwned — breach history for email addresses (paid API key)
EmailRep — email reputation scoring and disposable detection
CIRCL PDNS + RDAP — passive DNS history and WHOIS registration data; free
BlockCypher + Etherscan — blockchain wallet balance and transaction graph

Export formats:

STIX 2.1 — bundles with indicators, threat actors, malware objects
MISP JSON — events with galaxies for direct import
Sigma rules — auto-generated detection rules from extracted IOCs
CSV — flat entity dumps for spreadsheet analysis
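
To give a sense of what a STIX 2.1 export contains, here is a minimal sketch that wraps one extracted IPv4 address in an indicator object and a bundle. It uses only the standard library and follows the required indicator properties from the STIX 2.1 specification; the helper names and the `name` text are assumptions, and VoidAccess's own export module may structure its bundles differently.

```python
import json
import uuid
from datetime import datetime, timezone


def stix_timestamp(dt):
    """Format a datetime as a UTC STIX timestamp with millisecond precision."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def ipv4_indicator(ip, now):
    """Build a STIX 2.1 indicator dict for a single IPv4 address."""
    ts = stix_timestamp(now)
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
        "name": f"Extracted IPv4 {ip}",  # illustrative label
        "pattern": f"[ipv4-addr:value = '{ip}']",
        "pattern_type": "stix",
        "valid_from": ts,
    }


def make_bundle(objects):
    """Wrap STIX objects in a bundle."""
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}


bundle = make_bundle([ipv4_indicator("192.0.2.10", datetime.now(timezone.utc))])
print(json.dumps(bundle, indent=2))
```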

LLM & Enrichment Ecosystem

Supported LLM Providers

| Provider | Models | Notes |
| --- | --- | --- |
| OpenRouter | DeepSeek, Llama 3.3, Claude Haiku | Recommended default; free models available |
| Groq | Llama 3.3, Llama 3.1 | Fast inference; free tier |
| OpenAI | GPT-4o Mini | API key required |
| Anthropic | Claude Haiku | Haiku is the tested default; other models work via manual override. |
| Google Gemini | Gemini 1.5 Flash, 2.5 Pro | Free tier via AI Studio |
| Ollama | Any local model | Air-gapped; no API key needed |

The default is DeepSeek via OpenRouter — fast and strong on technical security content. With free-tier LLMs (Groq free, OpenRouter free models, or Ollama) the cost is $0. With paid models like DeepSeek via OpenRouter it is under $0.50 per investigation. For fully air-gapped deployments, Ollama runs entirely locally.

Optional Enrichment API Keys

Every enrichment source that requires a key degrades gracefully when the key is absent: it is skipped without failing the investigation. Keys marked "Yes" in the Free column require registration but have no cost.

| Key | What it does | Free | Sign up |
| --- | --- | --- | --- |
| `OTX_API_KEY` | AlienVault OTX threat pulses | Yes | otx.alienvault.com |
| `VT_API_KEY` | VirusTotal file hash AV detections | Yes (4 req/min) | virustotal.com |
| `ABUSECH_API_KEY` | MalwareBazaar, ThreatFox, URLhaus rate limits | Yes | abuse.ch |
| `ABUSEIPDB_API_KEY` | Community IP abuse reports; 1,000 checks/day | Yes | abuseipdb.com/register |
| `GREYNOISE_API_KEY` | Suppresses known scanner/researcher IPs | Free tier available | greynoise.io/pricing |
| `URLSCAN_API_KEY` | Higher rate limits for URLScan.io domain scans | Yes (public results without key) | urlscan.io/user/signup |
| `HYBRID_ANALYSIS_API_KEY` | Behavioral sandbox analysis for file hashes | Yes | hybrid-analysis.com/signup |
| `HIBP_API_KEY` | Email breach history (the most valuable email enrichment) | No ($3.50/month) | haveibeenpwned.com/API/Key |
| `EMAILREP_API_KEY` | Email reputation scoring; increases rate limits | Yes (reduced rate without key) | emailrep.io/key |
| `SECURITYTRAILS_API_KEY` | Richer DNS history for domains | Yes (50 queries/month) | securitytrails.com/corp/api |
| `GITHUB_TOKEN` | Raises GitHub scraping from 10 to 30 req/min | Yes | github.com/settings/tokens |
| `GITLAB_TOKEN` | Raises GitLab scraping from 15 to 60 req/min | Yes | gitlab.com/profile/personal_access_tokens |
| `BLOCKCYPHER_TOKEN` | BTC/ETH wallet balance and transaction graph | Yes | blockcypher.com |
| `ETHERSCAN_API_KEY` | ETH wallet lookups | Yes | etherscan.io/apis |
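
The skip-when-absent behaviour can be pictured as a simple filter over the configured sources. The sketch below is an assumption about how such a filter could look, not the project's code: `select_enrichers` and the `(name, required_key)` tuples are hypothetical, and it simplifies sources such as URLScan.io that still run at a reduced rate without a key.

```python
import os


def select_enrichers(enrichers, env=None):
    """Split enrichers into (active, skipped) based on which keys are set.

    `enrichers` is a list of (name, required_env_var) pairs; a required_env_var
    of None means the source needs no key.
    """
    env = os.environ if env is None else env
    active, skipped = [], []
    for name, required_key in enrichers:
        if required_key is None or env.get(required_key):
            active.append(name)
        else:
            # Missing or empty key: skip the source instead of raising.
            skipped.append(name)
    return active, skipped


# Hypothetical registry; key names are taken from the table above.
ENRICHERS = [
    ("virustotal", "VT_API_KEY"),
    ("greynoise", "GREYNOISE_API_KEY"),
    ("crt.sh", None),
]

active, skipped = select_enrichers(ENRICHERS, env={"VT_API_KEY": "example-key"})
print("active:", active)
print("skipped:", skipped)
```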

Cost Comparison

| Platform | Annual Cost | Self-Hosted | Open Source |
| --- | --- | --- | --- |
| Recorded Future | ~$25,000 | No | No |
| DarkOwl | ~$15,000 | No | No |
| Flare | ~$8,000 | No | No |
| VoidAccess | Free | Yes | Yes |

VoidAccess costs nothing to run with Groq's free tier, OpenRouter free models, or Ollama, and under $0.50 per investigation with paid models such as DeepSeek.

What's New in v1.3

10 new enrichment sources: GreyNoise (scanner suppression), AbuseIPDB, Feodo Tracker, C2IntelFeeds, crt.sh, URLScan.io, Wayback Machine, Hybrid Analysis, HaveIBeenPwned, EmailRep
4 new clearnet collection sources: paste sites, GitHub code search, GitLab code search, and 20 curated RSS security feeds
Curated .onion seed list — 31 seeds across 8 categories, relevance-scored per query
CIRCL passive DNS + RDAP WHOIS — infrastructure cluster detection for IPs and domains
Investigation cancellation — cancel a running pipeline at any checkpoint; partial results are preserved
Sources panel — per-investigation breakdown of which sources ran and what each returned
Infrastructure clusters panel — groups IPs and domains sharing ASN, CIDR block, or WHOIS registrant
Entity quality badges — C2, Malicious, Breached, Disposable, Archived, Taken Down, AV ratio
GreyNoise suppression — known benign scanner IPs are filtered from entity results automatically
MALWARE_FAMILY auto-creation from confirmed family names returned by hash enrichment
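
Of the features above, the infrastructure clusters panel is the easiest to illustrate in code. The sketch below covers only the CIDR part of that grouping, using the standard `ipaddress` module with a fixed /24 prefix. The function name and the prefix length are assumptions; grouping by ASN or WHOIS registrant would need external lookups and is not shown.

```python
import ipaddress
from collections import defaultdict


def cluster_by_prefix(ips, prefix_len=24):
    """Group IPv4 addresses that fall in the same network of the given prefix length.

    Returns only networks containing more than one address.
    """
    clusters = defaultdict(list)
    for ip in ips:
        # strict=False lets us derive the enclosing network from a host address.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        clusters[str(network)].append(ip)
    return {
        net: sorted(members, key=ipaddress.ip_address)
        for net, members in clusters.items()
        if len(members) > 1
    }


# Documentation-range addresses used as stand-ins for extracted IPs.
ips = ["198.51.100.42", "198.51.100.7", "203.0.113.9"]
print(cluster_by_prefix(ips))
```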

Quick Start

Prerequisites

Docker and Docker Compose
Python 3 (recommended): setup.sh uses it to generate secrets and falls back to /dev/urandom on Linux/macOS if it is absent; setup.bat on Windows may require it.
One LLM API key — or Ollama for fully local operation (free)

Free LLM options (no credit card required):

Groq — fast, free tier, Llama 3.3 70B
OpenRouter — free models including DeepSeek and Llama 3.3
Google AI Studio — Gemini free tier
Ollama — fully local, no internet required

Installation

macOS / Linux / WSL:

bash setup.sh

Windows (native):

setup.bat

The interactive wizard creates .env, generates JWT_SECRET and POSTGRES_PASSWORD, prompts for your LLM provider (one of: Groq, OpenRouter, Anthropic, OpenAI, Google Gemini, or Ollama), optionally collects threat-intel keys (OTX_API_KEY, VT_API_KEY), optionally enables Redis, sets the admin password, and starts the Docker stack.

Starting and Stopping

macOS / Linux / WSL:

./start.sh    # build and start all services
./stop.sh     # stop all services

Windows (native):

start.bat     :: build and start all services
stop.bat      :: stop all services

Once running, open http://localhost:3001 in your browser.

Getting a JWT (API access)

setup.sh creates a default admin account at [email protected] with the password you provided during the wizard.

curl -X POST http://localhost:8000/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "[email protected]", "password": "yourpassword"}'

Use the returned token in an Authorization: Bearer <token> header for API requests.

Running your first investigation (API)

curl -X POST http://localhost:8000/investigations \
  -H "Authorization: Bearer <your_jwt>" \
  -H "Content-Type: application/json" \
  -d '{"query": "LockBit ransomware infrastructure 2024"}'

The investigation starts in pending, moves to processing, and typically completes in 3–5 minutes, returning a summary, extracted entities, a relationship graph, and export-ready artifacts.
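
For scripted use, the two curl calls above translate directly to Python. The sketch below builds the same requests with the standard library. The endpoints, headers, and JSON bodies come from the curl examples; the helper names and the placeholder credentials are assumptions. Because this README does not show the login response schema, the sketch returns the parsed JSON as-is rather than guessing the token field name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def login_request(email, password):
    """Build the POST /auth/login request shown in the curl example."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def investigation_request(token, query):
    """Build the POST /investigations request with a Bearer token."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/investigations",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage against a running stack (placeholder credentials, not real defaults):
#   login = send(login_request("admin@example.com", "yourpassword"))
#   print(login)  # inspect the response to find the token field
#   result = send(investigation_request("<your_jwt>", "LockBit ransomware infrastructure 2024"))
```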

Architecture

Four Docker services:

| Service | Technology | Port |
| --- | --- | --- |
| postgres | PostgreSQL 16 | 5433 |
| tor | Tor SOCKS5 proxy | 9050 |
| fastapi | Python 3.11, FastAPI, SQLAlchemy | 8000 |
| nextjs | Next.js 14, TypeScript, Tailwind | 3001 |

The FastAPI backend runs a 13-step pipeline triggered by POST /investigations. Every external call is wrapped in try/except with a graceful fallback, so a failing source does not crash the pipeline. API docs are available at http://localhost:8000/docs when running.
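
The try/except-with-fallback pattern can be sketched as a small async wrapper. The example below is an illustration under assumptions, not the project's orchestrator: `safe_call` and the two stand-in sources are hypothetical. It shows how one failing source can return a fallback value while the others still complete.

```python
import asyncio


async def safe_call(name, coro_fn, fallback):
    """Await coro_fn(); on any exception return the fallback plus the error text."""
    try:
        return name, await coro_fn(), None
    except Exception as exc:  # broad on purpose: no source may crash the run
        return name, fallback, str(exc)


async def ok_source():
    """Stand-in for a source that responds normally."""
    return ["198.51.100.7"]


async def failing_source():
    """Stand-in for a source whose upstream call fails."""
    raise TimeoutError("upstream timed out")


async def run_sources():
    """Run all sources concurrently and collect (value, error) per source."""
    results = await asyncio.gather(
        safe_call("ok", ok_source, []),
        safe_call("failing", failing_source, []),
    )
    return {name: (value, error) for name, value, error in results}


outcome = asyncio.run(run_sources())
print(outcome)
```

Recording the error string next to the fallback keeps the failure visible, for example in a per-source breakdown, without interrupting the run.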

Source Tree

voidaccess/
├── analysis/      # Temporal patterns, OPSEC failure detection, anomaly scoring
├── api/           # FastAPI routes; investigation pipeline orchestrator
├── auth/          # JWT authentication and user management
├── crawler/       # Recursive .onion link discovery spider
├── db/            # SQLAlchemy ORM models and Alembic migrations
├── docs/          # Contributing, security, and usage policy documents
├── export/        # STIX 2.1, MISP, Sigma, and CSV artifact generation
├── extractor/     # Regex → NER → LLM entity extraction pipeline
├── fingerprint/   # Stylometry vectors and actor style profiling
├── graph/         # NetworkX MultiDiGraph builder and pyvis visualization
├── i18n/          # Language detection, translation, multilingual query expansion
├── infra/         # Docker Compose, Tor config, Postgres init
├── monitor/       # APScheduler watches, change diffing, Telegram/SMTP alerts
├── public/        # Logo, walkthrough screenshots, demo media
├── scraper/       # Async aiohttp and Playwright scrapers over Tor
├── scripts/       # Seed imports and operational utilities
├── search/        # 16+ .onion search engine fan-out with circuit breaker
├── sources/       # DarkSearch, Telegram, paste sites, threat-intel feeds
├── tests/         # Pytest suite (one test file per module)
├── utils/         # Async helpers, content safety, encryption, defang
├── vector/        # ChromaDB cache with sentence-transformer embeddings
├── voidaccess/    # LangChain LLM wrappers and provider registry
└── web/           # Next.js 14 + TypeScript + Tailwind frontend

Note on voidaccess/voidaccess/ — the nested directory holds the core LLM utilities (llm.py, llm_utils.py) and is imported at runtime by the API routes (from voidaccess.llm import ...). The nested naming reflects the original package structure from the project's pre-API baseline.

Troubleshooting

Services won't start:

docker compose -f infra/docker-compose.yml --project-directory . ps
docker compose -f infra/docker-compose.yml --project-directory . logs -f

Port conflicts (3001 or 8000 already in use):

macOS/Linux: lsof -i :3001 to find what's using it
Windows: netstat -ano | findstr :3001

Tor not connecting: The Tor service takes 30–60 seconds to bootstrap on first start. Check health with ./check_health.sh. This script verifies Tor proxy connectivity, LLM provider reachability, and dark web search engine availability.
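
If the health script is unavailable, a quick first check is whether anything is listening on the Tor SOCKS port at all. The sketch below is a generic TCP probe, not a reimplementation of check_health.sh. It assumes port 9050 from the Architecture table is reachable on 127.0.0.1, and a successful connection only shows that the port accepts connections, not that Tor has finished bootstrapping.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed host and port; adjust if the Compose file maps them differently.
    state = "open" if port_is_open("127.0.0.1", 9050) else "closed"
    print(f"Tor SOCKS port 9050 is {state}")
```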

No .env file: Run bash setup.sh (macOS/Linux/WSL) or setup.bat (Windows) before starting.

Docker build takes a long time: First build downloads ~3GB of layers. Subsequent builds use the Docker layer cache and are much faster.

Content Safety

Every investigation runs through mandatory content safety filters before results reach the UI or appear in the graph. CSAM, gore, snuff content, and other prohibited material are blocked at the query stage, URL validation, content scanning, and post-extraction entity filtering. These filters are mandatory and cannot be disabled.

Acceptable Use

VoidAccess is for authorized security research, threat intelligence gathering, and law enforcement purposes only. Users are responsible for ensuring compliance with all local laws and ethical standards. See docs/USAGE_POLICY.md for the full policy.

Repository: KatrielMoses/voidaccess
