
Acon — The Intelligent Brain for Any Scraper

Acon doesn't replace Scrapling or Firecrawl. It tells them where to look.


Why Acon?

Most crawlers are dumb. They follow links blindly, return raw HTML, and break the moment a site changes its structure. Before you can extract anything useful, you need to understand what you're dealing with.

Acon is a site intelligence engine. It maps the structural "skeleton" of a website automatically — before any data extraction happens — so your scraper always knows where to look.


🏗️ The Core Thesis

Most modern web scrapers suffer from "URL Exhaustion" — they spend 90% of their bandwidth fetching identical product or blog pages. Acon introduces a Topology Orchestrator that maps, classifies, and samples site structures, then stops the moment it has fully learned the site's DNA — no wasted requests.


📊 Real-World Benchmark Results (v0.1.2)

The correct question: How many pages does each engine need to fully map a site's structure?

Both crawlers are given a budget that Acon never reaches. BFS runs until its budget cap is exhausted. Acon stops the moment low_information_gain fires, meaning the site's structural DNA is fully mapped, or when its discovery queue is empty (queue_exhausted).

Comparison Summary (4 Representative Sites)

| Site | BFS Pages | Acon Pages | Request Reduction | Time Saved | Stopped By |
|---|---|---|---|---|---|
| books.toscrape.com | 200 | 6 | 97.0% | 93.7% | low_information_gain |
| Hacker News | 50 | 9 | 82.0% | 89.0% | low_information_gain |
| Wikipedia | 100 | 8 | 92.0% | 93.7% | low_information_gain |
| PyPI | 100 | 20 | 80.0% | 93.4% | queue_exhausted |

Deep Dive: books.toscrape.com (E-Commerce)

| Metric | Blind BFS | Acon |
|---|---|---|
| Pages Crawled | 200 | 6 |
| Time Taken | 54.1s | 3.4s |
| Stopped by | budget cap | low_information_gain |
| Topology Detected | n/a | deep_uniform |

97% fewer requests. Acon stopped at 6 pages because it detected that the structural DNA (product pages, category pages) was already fully mapped.
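The percentages in the benchmark tables can be reproduced from the raw page counts and timings. The sketch below assumes both columns are computed as a simple relative reduction against the BFS baseline; `reduction_pct` is a hypothetical helper for illustration, not part of Acon.

```python
def reduction_pct(baseline: float, observed: float) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round((1 - observed / baseline) * 100, 1)


# Figures from the books.toscrape.com deep dive.
request_reduction = reduction_pct(200, 6)   # pages: blind BFS vs Acon
time_saved = reduction_pct(54.1, 3.4)       # seconds: blind BFS vs Acon

print(f"Request reduction: {request_reduction}%")
print(f"Time saved: {time_saved}%")
```

The same helper applied to the PyPI timings (100.7s vs 6.6s) gives the 93.4% shown in the summary table.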


Deep Dive: PyPI (Multi-Template Registry)

| Metric | Blind BFS | Acon |
|---|---|---|
| Pages Crawled | 100 | 20 |
| Time Taken | 100.7s | 6.6s |
| Stopped by | budget cap | queue_exhausted |
| Topology Detected | n/a | thin |

80% fewer requests. Acon classified the site and exhausted the relevant discovery queue in just 20 pages.


The key insight: a blind crawler keeps crawling because it doesn't know what it doesn't know. Acon tracks information gain in a sliding window — once new pages stop adding structural novelty, it stops and hands you the map.
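To make the sliding-window idea concrete, here is a minimal sketch of one way such an early stop could work. It is not Acon's implementation: `NoveltyWindow`, `crawl_until_learned`, the window size, the threshold, and the idea of reducing each page to a single string "signature" are all assumptions made for illustration.

```python
from collections import deque


class NoveltyWindow:
    """Tracks whether recent pages introduced structural signatures not seen before."""

    def __init__(self, window_size: int = 5, min_gain: float = 0.2):
        self.seen = set()                        # every signature observed so far
        self.recent = deque(maxlen=window_size)  # True if that page was novel
        self.min_gain = min_gain

    def observe(self, signature: str) -> None:
        self.recent.append(signature not in self.seen)
        self.seen.add(signature)

    def should_stop(self) -> bool:
        # Never stop before the window is full: too little evidence.
        if len(self.recent) < self.recent.maxlen:
            return False
        gain = sum(self.recent) / len(self.recent)
        return gain < self.min_gain


def crawl_until_learned(signatures, window_size=5, min_gain=0.2):
    """Consume page signatures in crawl order; return (pages_seen, stop_reason)."""
    window = NoveltyWindow(window_size, min_gain)
    count = 0
    for count, signature in enumerate(signatures, start=1):
        window.observe(signature)
        if window.should_stop():
            return count, "low_information_gain"
    return count, "queue_exhausted"


# A site with one home page, one category page, and many identical product pages.
pages = ["home", "category"] + ["product"] * 10
print(crawl_until_learned(pages))
```

With these assumed settings the crawl ends as soon as a full window of five pages contains no new template, well before the twelve available pages are fetched.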


🚀 Use Cases

Price Monitoring & E-Commerce Intelligence: Acon detects pagination patterns and repeating product templates automatically. No manual selector configuration per site.

Content Archival & Research: Feed Acon a publication's root URL. It identifies the site's content structure, prioritizes article pages over navigation noise, and hands you a clean discovery map.

Site Auditing & SEO Analysis: Get an instant structural report covering template count, link depth, and topology classification (SPA vs static vs paginated) in a single run.


⚡ What Makes Acon Different

| Capability | Typical Crawler | Acon |
|---|---|---|
| JS-rendered sites | Manual Playwright setup | Autonomous escalation |
| Site structure | Unknown until scraped | Detected before extraction |
| Large site performance | Degrades at scale | O(log N) priority queue |
| Bandwidth efficiency | Downloads everything | Asset blocking (Discovery mode) |
| Discovery Latency | Static only | Static-First Hybrid Escalation |
| Failed crawls | Lost progress | SQLite resumption (WAL) |
| Budget waste | Crawls until cap | Stops when structure is learned |
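The "O(log N) priority queue" row refers to a general technique that can be sketched with Python's standard `heapq` module, where each push and pop costs O(log N) in the number of queued URLs. The `PriorityFrontier` class and the priority scores below are hypothetical; how Acon actually scores URLs is not described here.

```python
import heapq
import itertools


class PriorityFrontier:
    """URL frontier that always yields the highest-priority unseen URL next."""

    def __init__(self):
        self._heap = []
        self._seen = set()                  # URLs ever queued, for de-duplication
        self._counter = itertools.count()   # tie-breaker: insertion order

    def push(self, url: str, priority: float) -> bool:
        """Queue a URL; returns False if it was already queued before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)


frontier = PriorityFrontier()
frontier.push("/product/1", 0.2)       # likely a repeat of a known template
frontier.push("/category/books", 0.9)  # likely to reveal new structure
frontier.push("/about", 0.5)
frontier.push("/about", 0.5)           # duplicate, ignored

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

Keeping a monotonically increasing counter in each heap entry makes ordering stable when two URLs share a priority and avoids comparing the URL strings themselves.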

🏗️ The Efficiency Pillars

Acon is optimized for production environments where every request costs money:

  • ⚡ Static-First Discovery: Acon probes pages with raw HTTP first. It only launches a browser if the site is a SPA, saving 90% of compute on standard sites.
  • 🚫 Intelligent Asset Blocking: During discovery, Acon automatically aborts requests for images, fonts, and CSS to slash bandwidth and CPU usage.
  • 📉 Adaptive Early Stop (low_information_gain): Acon tracks structural novelty across a sliding window. When new pages stop adding unique signal, crawling stops — before the budget is spent.
  • 🧬 Debounced Topology Detection: Structural analysis (DNA mapping) is throttled to key milestones (1, 10, 25, 50 pages) to ensure max throughput.

🏗️ The Unified Intelligence Stack (The Acon Alliance)

Acon doesn't just map sites; it orchestrates the most powerful open-source scraping tools into a single, high-fidelity pipeline.

  • 🕵️ Stealth (Camoufox): Enable use_stealth=True to launch an "invisible" browser engine that bypasses Cloudflare and Akamai automatically.
  • 📄 Content (Trafilatura): Enable extract_content=True to get clean, LLM-ready Markdown from every discovered page natively.
  • 🚀 Speed (Scrapling): Use the scrapling_adapter to export Acon's "DNA Map" into Scrapling for turbo-charged mass extraction.

🛠️ Installation

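# Core package
pip install acon-intel

# Optional integrations (Trafilatura, Camoufox, Scrapling)
pip install trafilatura camoufox scrapling
# Chromium browser binary for Playwright
playwright install chromium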

⚡ Quick Start

import asyncio
from acon import SiteCrawlOrchestrator, CrawlConfig

async def main():
    config = CrawlConfig(
        max_pages=50,          # Hard ceiling
        extract_content=True,  # Trafilatura: clean Markdown per page
        use_stealth=True       # Camoufox: bypass bot detection
    )

    brain = SiteCrawlOrchestrator()
    result = await brain.crawl_site("https://news.ycombinator.com", config)

    print(f"Topology: {result['topology']}")
    print(f"Pages crawled: {result['pages_crawled']}")
    print(f"Stopped by: {result['crawl_meta']['early_stop_reason']}")

    for page in result["page_summaries"]:
        print(f"  {page['url']} — {page['page_type']}")
        if page.get('content'):  # content may be missing or empty
            print(f"    {page['content'][:80]}...")

if __name__ == "__main__":
    asyncio.run(main())

📦 The Output Shape

{
  "topology": "multi_template",
  "pages_crawled": 12,
  "pages_failed": 0,
  "page_summaries": [
    {
      "url": "https://pypi.org/project/requests/",
      "page_type": "standard",
      "js_required": false,
      "content": "# requests 2.31.0...",
      "parent_url": "https://pypi.org"
    }
  ],
  "crawl_meta": {
    "early_stop_reason": "low_information_gain",
    "crawl_duration_s": 29.5,
    "reflection": {
      "intelligence_score": 0.33,
      "failure_rate": 0.0,
      "advice": "Continue current strategy."
    }
  }
}
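A result in this shape can be post-processed with plain dictionary access. The sketch below uses only the keys shown in the sample output; `summarize` is a hypothetical helper, and the defensive `.get` calls assume that some fields may be absent in other runs.

```python
from collections import Counter


def summarize(result: dict) -> dict:
    """Condense a crawl result into a few headline facts."""
    pages = result.get("page_summaries", [])
    return {
        "topology": result.get("topology"),
        "stopped_by": result.get("crawl_meta", {}).get("early_stop_reason"),
        "page_types": dict(Counter(p.get("page_type", "unknown") for p in pages)),
        "js_required_urls": [p["url"] for p in pages if p.get("js_required")],
        "pages_with_content": sum(1 for p in pages if p.get("content")),
    }


# Trimmed copy of the sample output, written as a Python dict.
sample = {
    "topology": "multi_template",
    "pages_crawled": 12,
    "pages_failed": 0,
    "page_summaries": [
        {
            "url": "https://pypi.org/project/requests/",
            "page_type": "standard",
            "js_required": False,
            "content": "# requests 2.31.0...",
            "parent_url": "https://pypi.org",
        }
    ],
    "crawl_meta": {"early_stop_reason": "low_information_gain"},
}

summary = summarize(sample)
print(summary)
```

Grouping by page_type and listing the URLs that required JavaScript is usually enough to decide which pages to hand to a downstream extractor.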

🛣️ Roadmap

  • Stealth Integration: Native support for Camoufox (Fingerprint bypass).
  • LLM-Ready Pipeline: Native Trafilatura integration for high-fidelity Markdown output.
  • Speed Pillar: Official Scrapling adapter for mass extraction.
  • Session Persistence: SQLite WAL-mode crawl resumption across process restarts.
  • Adaptive Intelligence: low_information_gain early stop — avoids burning crawl budgets.
  • Discovery API: Expose Acon as a standalone Discovery microservice.

Acon: The connective tissue of the intelligent web.