Acon — The Intelligent Brain for Any Scraper

Acon doesn't replace Scrapling or Firecrawl. It tells them where to look.

Why Acon?

Most crawlers are dumb. They follow links blindly, return raw HTML, and break the moment a site changes its structure. Before you can extract anything useful, you need to understand what you're dealing with.

Acon is a site intelligence engine. It maps the structural "skeleton" of a website automatically — before any data extraction happens — so your scraper always knows where to look.

🏗️ The Core Thesis

Most modern web scrapers suffer from "URL Exhaustion": they can spend as much as 90% of their bandwidth fetching structurally identical product or blog pages. Acon introduces a Topology Orchestrator that maps, classifies, and samples site structures, then stops as soon as it has learned the site's DNA, with no wasted requests.

📊 Real-World Benchmark Results (v0.1.2)

The correct question: How many pages does each engine need to fully map a site's structure?

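BFS runs until its page budget is spent; the deep dives below list it as stopped by the budget cap. Acon stops as soon as low_information_gain fires, meaning the site's structural DNA is fully mapped, or when its discovery queue runs out (queue_exhausted, as on PyPI).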

Comparison Summary (4 Representative Sites)

Site	BFS Pages	Acon Pages	Request Reduction	Time Saved	Stopped By
books.toscrape.com	200	6	97.0%	93.7%	`low_information_gain`
Hacker News	50	9	82.0%	89.0%	`low_information_gain`
Wikipedia	100	8	92.0%	93.7%	`low_information_gain`
PyPI	100	20	80.0%	93.4%	`queue_exhausted`
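
The Request Reduction and Time Saved columns are plain ratios against the BFS baseline. As a quick sanity check, the snippet below recomputes them from the figures published in this section; the function name is illustrative, and wall-clock times are only available for the two deep dives.

```python
def reduction_pct(baseline: float, acon: float) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal."""
    return round((1 - acon / baseline) * 100, 1)


# (BFS pages, Acon pages) taken from the comparison summary
pages = {
    "books.toscrape.com": (200, 6),
    "Hacker News": (50, 9),
    "Wikipedia": (100, 8),
    "PyPI": (100, 20),
}
for site, (bfs, acon) in pages.items():
    print(f"{site}: {reduction_pct(bfs, acon)}% fewer requests")

# (BFS seconds, Acon seconds) taken from the two deep dives
print(f"books.toscrape.com: {reduction_pct(54.1, 3.4)}% time saved")
print(f"PyPI: {reduction_pct(100.7, 6.6)}% time saved")
```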

Deep Dive: books.toscrape.com (E-Commerce)

	Blind BFS	Acon
Pages Crawled	200	6
Time Taken	54.1s	3.4s
Stopped by	budget cap	`low_information_gain`
Topology Detected	—	`deep_uniform`

97% fewer requests. Acon stopped at 6 pages because it detected that the structural DNA (product pages, category pages) was already fully mapped.

Deep Dive: PyPI (Multi-Template Registry)

	Blind BFS	Acon
Pages Crawled	100	20
Time Taken	100.7s	6.6s
Stopped by	budget cap	`queue_exhausted`
Topology Detected	—	`thin`

80% fewer requests. Acon classified the site and exhausted the relevant discovery queue in just 20 pages.

The key insight: a blind crawler keeps crawling because it doesn't know what it doesn't know. Acon tracks information gain in a sliding window — once new pages stop adding structural novelty, it stops and hands you the map.
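
The sliding-window idea described above can be sketched in a few lines. This is not Acon's implementation: the class name, the window size, the threshold, and the use of a plain string "signature" as the unit of structural novelty are all assumptions made for illustration.

```python
from collections import deque


class InformationGainTracker:
    """Hypothetical sliding-window novelty tracker (not Acon's actual code)."""

    def __init__(self, window=5, min_gain=0.2):
        self.window = window        # how many recent pages to consider
        self.min_gain = min_gain    # stop when the novel fraction drops below this
        self.seen = set()           # structural signatures observed so far
        self.recent = deque(maxlen=window)  # True for each recent page that was novel

    def observe(self, signature):
        """Record one crawled page, given a string summarising its structure."""
        is_new = signature not in self.seen
        self.seen.add(signature)
        self.recent.append(is_new)

    def should_stop(self):
        """True once a full window contains too few novel pages."""
        if len(self.recent) < self.window:
            return False  # not enough evidence yet
        return sum(self.recent) / self.window < self.min_gain


tracker = InformationGainTracker(window=4, min_gain=0.25)
# Three distinct templates, then a long run of identical product pages
crawl_order = ["home", "category", "product"] + ["product"] * 9

pages_crawled = 0
for signature in crawl_order:
    tracker.observe(signature)
    pages_crawled += 1
    if tracker.should_stop():
        break

print(f"stopped after {pages_crawled} of {len(crawl_order)} pages")
```

In this toy run the tracker stops once the last four pages are all repeats, leaving the remaining product pages unfetched.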

🚀 Use Cases

Price Monitoring & E-Commerce Intelligence: Acon detects pagination patterns and repeating product templates automatically. No manual selector configuration per site.

Content Archival & Research: Feed Acon a publication's root URL. It identifies the site's content structure, prioritizes article pages over navigation noise, and hands you a clean discovery map.

Site Auditing & SEO Analysis: Get an instant structural report (template count, link depth, topology classification such as SPA vs static vs paginated) in a single run.
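
One way to picture "repeating templates" and pagination detection is to collapse URLs into path patterns. The sketch below shows that general technique under stated assumptions, not Acon's classifier: the helper names, the list of pagination parameters, and the rule that numeric or slug-like segments become placeholders are all illustrative, and the URLs are made up.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

PAGINATION_KEYS = {"page", "p", "offset", "start"}  # assumed, not exhaustive
SLUG_RE = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)+(?:\.html)?")


def url_template(url: str) -> str:
    """Collapse variable path segments so sibling pages share one pattern."""
    segments = []
    for seg in urlsplit(url).path.strip("/").split("/"):
        if not seg:
            continue
        if seg.isdigit():
            segments.append("{id}")
        elif SLUG_RE.fullmatch(seg):
            segments.append("{slug}")
        else:
            segments.append(seg)
    return "/" + "/".join(segments)


def is_paginated(url: str) -> bool:
    """True if the query string carries a common pagination parameter."""
    query = urlsplit(url).query
    return any(key.lower() in PAGINATION_KEYS for key, _ in parse_qsl(query))


urls = [
    "https://shop.example.com/product/blue-widget-42",
    "https://shop.example.com/product/red-gadget-7",
    "https://shop.example.com/category/3",
    "https://shop.example.com/catalogue?page=2",
]
templates = Counter(url_template(u) for u in urls)
print(templates.most_common())
print([u for u in urls if is_paginated(u)])
```

A template that dominates the counter is a candidate for sampling a handful of pages rather than fetching every instance.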

⚡ What Makes Acon Different

Capability	Typical Crawler	Acon
JS-rendered sites	Manual Playwright setup	Autonomous escalation
Site structure	Unknown until scraped	Detected before extraction
Large site performance	Degrades at scale	O(log N) priority queue
Bandwidth efficiency	Downloads everything	Asset blocking (Discovery mode)
Discovery Latency	Static only	Static-First Hybrid Escalation
Failed crawls	Lost progress	SQLite resumption (WAL)
Budget waste	Crawls until cap	Stops when structure is learned
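
The O(log N) figure in the table is what a binary heap gives for each push and pop. A minimal frontier built on Python's heapq might look like the sketch below; the class name and the scoring scheme (lower score means crawl sooner) are hypothetical and not taken from Acon.

```python
import heapq
import itertools


class Frontier:
    """Hypothetical crawl frontier: a min-heap keyed on a priority score."""

    def __init__(self):
        self._heap = []
        self._queued = set()               # URLs already enqueued, to skip duplicates
        self._counter = itertools.count()  # tie-breaker: equal scores pop in FIFO order

    def push(self, url: str, score: float) -> None:
        if url in self._queued:
            return
        self._queued.add(url)
        heapq.heappush(self._heap, (score, next(self._counter), url))  # O(log N)

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)  # O(log N)
        return url

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier()
frontier.push("https://example.com/about", 5.0)
frontier.push("https://example.com/products/1", 1.0)
frontier.push("https://example.com/products/1", 1.0)  # duplicate, ignored
frontier.push("https://example.com/blog/", 1.0)

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```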

🏗️ The Efficiency Pillars

Acon is optimized for production environments where every request costs money:

⚡ Static-First Discovery: Acon probes pages with raw HTTP first. It only launches a browser if the site is a SPA, saving 90% of compute on standard sites.
🚫 Intelligent Asset Blocking: During discovery, Acon automatically aborts requests for images, fonts, and CSS to slash bandwidth and CPU usage.
📉 Adaptive Early Stop (low_information_gain): Acon tracks structural novelty across a sliding window. When new pages stop adding unique signal, crawling stops — before the budget is spent.
🧬 Debounced Topology Detection: Structural analysis (DNA mapping) runs only at key milestones (1, 10, 25, 50 pages) rather than after every page, which keeps throughput high.
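
Static-first discovery depends on deciding, from the raw HTTP response alone, whether a page needs a browser. How Acon makes that call is not described here; the heuristic below (very little visible text, at least one script, and a common SPA mount point) and its thresholds are assumptions for illustration, using only the standard library.

```python
from html.parser import HTMLParser

SPA_ROOT_IDS = {"root", "app", "__next", "__nuxt"}  # common mount points (assumed list)


class _ShellProbe(HTMLParser):
    """Collects visible text length, script count, and SPA mount-point hints."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.scripts = 0
        self.has_spa_root = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.scripts += 1
        if dict(attrs).get("id") in SPA_ROOT_IDS:
            self.has_spa_root = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether raw HTML is an empty JS shell that must be rendered."""
    probe = _ShellProbe()
    probe.feed(html)
    return probe.scripts > 0 and probe.has_spa_root and probe.text_chars < min_text_chars


spa_shell = (
    '<html><head><title>App</title></head><body>'
    '<div id="root"></div><script src="/bundle.js"></script></body></html>'
)
static_page = (
    "<html><body><h1>Books</h1><p>"
    + "A long product description. " * 20
    + "</p></body></html>"
)
print(needs_browser(spa_shell), needs_browser(static_page))
```

Only pages flagged this way would be handed to a browser; everything else stays on plain HTTP.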

🏗️ The Unified Intelligence Stack (The Acon Alliance)

Acon doesn't just map sites; it orchestrates the most powerful open-source scraping tools into a single, high-fidelity pipeline.

🕵️ Stealth (Camoufox): Enable use_stealth=True to launch an "invisible" browser engine designed to get past Cloudflare and Akamai bot checks without manual setup.
📄 Content (Trafilatura): Enable extract_content=True to get clean, LLM-ready Markdown from every discovered page natively.
🚀 Speed (Scrapling): Use the scrapling_adapter to export Acon's "DNA Map" into Scrapling for turbo-charged mass extraction.

🛠️ Installation

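# Core package
pip install acon-intel

# Integrations used by extract_content, use_stealth, and scrapling_adapter
pip install trafilatura camoufox scrapling
# Browser binary for JS-rendered sites
playwright install chromium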

⚡ Quick Start

import asyncio
from acon import SiteCrawlOrchestrator, CrawlConfig

async def main():
    config = CrawlConfig(
        max_pages=50,          # Hard ceiling
        extract_content=True,  # Trafilatura: clean Markdown per page
        use_stealth=True       # Camoufox: bypass bot detection
    )

    brain = SiteCrawlOrchestrator()
    result = await brain.crawl_site("https://news.ycombinator.com", config)

    print(f"Topology: {result['topology']}")
    print(f"Pages crawled: {result['pages_crawled']}")
    print(f"Stopped by: {result['crawl_meta']['early_stop_reason']}")

    for page in result["page_summaries"]:
        print(f"  {page['url']} — {page['page_type']}")
        if page.get('content'):  # guard: content may be empty or absent
            print(f"    {page['content'][:80]}...")

if __name__ == "__main__":
    asyncio.run(main())

📦 The Output Shape

{
  "topology": "multi_template",
  "pages_crawled": 12,
  "pages_failed": 0,
  "page_summaries": [
    {
      "url": "https://pypi.org/project/requests/",
      "page_type": "standard",
      "js_required": false,
      "content": "# requests 2.31.0...",
      "parent_url": "https://pypi.org"
    }
  ],
  "crawl_meta": {
    "early_stop_reason": "low_information_gain",
    "crawl_duration_s": 29.5,
    "reflection": {
      "intelligence_score": 0.33,
      "failure_rate": 0.0,
      "advice": "Continue current strategy."
    }
  }
}
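
Given the shape above, downstream code can stay small. The helper below is an illustrative sketch that relies only on the keys shown in the sample output; the function name and the summary format it returns are not part of Acon.

```python
from collections import Counter


def summarize(result: dict) -> dict:
    """Condense a crawl result into counts a scraper can act on."""
    pages = result.get("page_summaries", [])
    return {
        "topology": result.get("topology"),
        "stopped_by": result.get("crawl_meta", {}).get("early_stop_reason"),
        "pages_by_type": dict(Counter(p.get("page_type") for p in pages)),
        # URLs that would need a browser during extraction
        "js_urls": [p["url"] for p in pages if p.get("js_required")],
    }


# Trimmed copy of the sample output shown above
sample = {
    "topology": "multi_template",
    "pages_crawled": 12,
    "pages_failed": 0,
    "page_summaries": [
        {
            "url": "https://pypi.org/project/requests/",
            "page_type": "standard",
            "js_required": False,
            "content": "# requests 2.31.0...",
            "parent_url": "https://pypi.org",
        }
    ],
    "crawl_meta": {"early_stop_reason": "low_information_gain", "crawl_duration_s": 29.5},
}
summary = summarize(sample)
print(summary)
```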

🛣️ Roadmap

Stealth Integration: Native support for Camoufox (Fingerprint bypass).
LLM-Ready Pipeline: Native Trafilatura integration for high-fidelity Markdown output.
Speed Pillar: Official Scrapling adapter for mass extraction.
Session Persistence: SQLite WAL-mode crawl resumption across process restarts.
Adaptive Intelligence: low_information_gain early stop — avoids burning crawl budgets.
Discovery API: Expose Acon as a standalone Discovery microservice.
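
The Session Persistence item can be illustrated with the standard-library sqlite3 module. The table layout and function names below are hypothetical and are not taken from Acon's source; the sketch only shows the general pattern of recording each URL's state in a WAL-mode database so a restarted process can pick up the pending ones.

```python
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the crawl-state database in WAL journal mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frontier ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> None:
    conn.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
    conn.commit()


def mark_done(conn: sqlite3.Connection, url: str) -> None:
    conn.execute("UPDATE frontier SET status = 'done' WHERE url = ?", (url,))
    conn.commit()


def pending(conn: sqlite3.Connection) -> list:
    rows = conn.execute(
        "SELECT url FROM frontier WHERE status = 'pending' ORDER BY rowid"
    )
    return [row[0] for row in rows]


# Simulate a crash and a restart against the same file
db_path = os.path.join(tempfile.mkdtemp(), "crawl_state.db")
conn = open_store(db_path)
for url in ("https://example.com/", "https://example.com/a", "https://example.com/b"):
    enqueue(conn, url)
mark_done(conn, "https://example.com/")
conn.close()  # the first process "dies" here

conn = open_store(db_path)  # a new process resumes
remaining = pending(conn)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
conn.close()
print(remaining, journal_mode)
```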

Acon: The connective tissue of the intelligent web.

WillyEverGreen/acon
