anansi: open-source project for self-hosters

Anansi is a web scraper built on the premise that websites are inherently hostile and unstable. Rather than requiring constant maintenance when layouts change or bot detection kicks in, the tool is meant to handle those problems automatically. Where traditional scrapers break when selectors fail or sites deploy anti-bot measures, Anansi repairs its selectors through adaptive parsing, switches to browser rendering when a page needs it, and mimics Chrome’s TLS fingerprint to evade detection at the network level. It also includes an MCP server so LLMs can drive crawls through conversation, which makes it usable beyond Python scripting.

Key capabilities

Anansi combines several specialized features to stay operational on difficult sites. Its self-healing parser maintains CSS selectors with confidence scores and runs four recovery strategies when extraction fails, storing the winning selector for future use. Structured data extraction pulls JSON-LD, Open Graph, and Microdata automatically, bypassing CSS entirely for fields found in schema.org markup. TLS fingerprint mimicry uses curl-cffi to replicate Chrome’s ClientHello and HTTP/2 settings, targeting the network layer where most scrapers are vulnerable. When pages need JavaScript or contain SPA markers, the system auto-upgrades to a browser silently, caching the decision per domain. Built-in anti-bot bypass strips webdriver fingerprints and handles Cloudflare challenges automatically. The adaptive rate limiter doubles delays on 429s and adjusts based on error rates, while incremental crawling uses ETags and content hashes to skip unchanged pages. Additional tooling includes URL canonicalization that strips tracking parameters, Pydantic-based validation for extracted items, proxy rotation with auto-quarantine, and a concurrent crawler that survives process restarts via SQLite persistence.
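To make the rate-limiter behavior concrete, here is a minimal sketch of the doubling rule in plain Python. It is not Anansi's code: the DomainDelay class, its default values, the cap, and the halve-on-success recovery rule are all assumptions made for illustration. The only behavior taken from the description above is that a 429 doubles the delay.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DomainDelay:
    """Backoff state for one domain. Hypothetical class; defaults are assumptions."""

    base: float = 1.0      # assumed baseline delay in seconds
    maximum: float = 60.0  # assumed cap so repeated 429s cannot grow forever
    current: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # Start every domain at its baseline delay.
        self.current = self.base

    def record(self, status: int) -> float:
        """Update the delay from one HTTP status code and return the new delay."""
        if status == 429:
            # Double on "Too Many Requests", as the post describes.
            self.current = min(self.current * 2, self.maximum)
        elif 200 <= status < 400:
            # Assumed recovery rule: halve toward the baseline on a clean response.
            self.current = max(self.current / 2, self.base)
        # Any other status leaves the delay unchanged in this sketch.
        return self.current


# One independent backoff state per domain, created on first access.
delays = defaultdict(DomainDelay)
```

With these assumed defaults, two consecutive 429s raise a domain's delay from 1 to 2 to 4 seconds, and two clean responses bring it back down to 1. A real limiter would also weigh error rates over a window, which this sketch leaves out.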

Under the hood

The extraction pipeline begins with a structured data pre-pass that checks for JSON-LD, Open Graph, and Microdata before attempting any CSS selectors. If no structured match exists, known selectors are tried in order of confidence score, falling back to healing strategies when needed. The four healing strategies are text-pattern matching, fuzzy CSS class similarity, structural context, and XPath conversion. The auto-browser upgrade layer inspects HTML for React/Vue markers, low text density, or noscript redirects, and launches a stealth Playwright instance on demand. Rate limiting uses a per-domain sliding window that reacts to HTTP status codes and response patterns, gradually returning to baseline after clean windows. The MCP server exposes seventeen tools, including fetch, extract, crawl, screenshot, and cache controls, over stdio for LLM integration.
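The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not Anansi's implementation: JsonLdCollector, Candidate, extract_field, and the select callback are hypothetical names, the pre-pass covers only flat JSON-LD objects (Open Graph, Microdata, and @graph arrays are omitted), and the healing step is left as a placeholder.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Optional


class JsonLdCollector(HTMLParser):
    """Collects parsed <script type="application/ld+json"> objects from a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the page
            if isinstance(parsed, dict):
                self.documents.append(parsed)


@dataclass
class Candidate:
    """A known selector with its confidence score (hypothetical structure)."""

    selector: str
    confidence: float


def extract_field(
    html: str,
    field_name: str,
    candidates: list,
    select: Callable[[str, str], Optional[str]],
) -> tuple:
    """Return (value, source) for one field.

    `select` is a caller-supplied function that applies a CSS selector to the
    HTML and returns the matched text or None; in practice it would wrap a
    CSS library of your choice.
    """
    # 1. Structured data pre-pass: bypass CSS when JSON-LD has the field.
    collector = JsonLdCollector()
    collector.feed(html)
    for document in collector.documents:
        if field_name in document:
            return document[field_name], "json-ld"

    # 2. Known selectors, highest confidence first.
    for candidate in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        value = select(html, candidate.selector)
        if value is not None:
            return value, candidate.selector

    # 3. Healing strategies would run here; omitted in this sketch.
    return None, "unresolved"
```

Passing the selector engine in as a callback keeps the sketch free of third-party dependencies and makes the ordering easy to test with a stub. The second element of the return value records which source produced the field, which is the kind of information a self-healing parser would need in order to adjust confidence scores.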

Who it fits (and who it doesn’t)

Anansi suits developers running long-term crawls on sites known for bot detection, frequent redesigns, or JavaScript-heavy interfaces. It’s valuable when you need low-maintenance extraction and want to hand crawl control to an LLM agent. The trade-off is complexity: dependencies include Playwright, curl-cffi, and optional LLM SDKs, and TLS impersonation, provided through the [tls] extra, is operator-gated for authorized use. If you’re scraping simple, static pages with stable selectors, Anansi adds overhead without proportional benefit. The tool also assumes some familiarity with async Python and Pydantic models.

Setup, briefly

Anansi is a Python package installed directly from GitHub (it’s not on PyPI yet). For the core install, run pip install "git+https://github.com/mdowis/anansi", then playwright install chromium for browser support. The [tls] and [openai] extras are not part of the core install and must be installed explicitly from the same source. Once installed, the MCP server is available as anansi-mcp or via python -m anansi.mcp_server.server. Full install commands and extra configuration are in the README.

In the broader scraper landscape, Anansi positions itself between lightweight HTTP libraries like httpx and full browser automation frameworks like Playwright. It’s more sophisticated than Scrapy for adversarial scraping scenarios but less opinionated than commercial solutions like Apify or Scrapfly. The MCP integration is unusual for the space, offering a bridge to LLM agents that most competitors don’t provide. The source is on GitHub.

