Acon: The Intelligent Brain for Any Scraper
Acon doesn't replace Scrapling or Firecrawl. It tells them where to look.
Why Acon?
Most crawlers are dumb. They follow links blindly, return raw HTML, and break the moment a site changes its structure. Before you can extract anything useful, you need to understand what you're dealing with.
Acon is a site intelligence engine. It maps the structural "skeleton" of a website automatically, before any data extraction happens, so your scraper always knows where to look.
The Core Thesis
Most modern web scrapers suffer from "URL Exhaustion": they spend 90% of their bandwidth fetching structurally identical product or blog pages. Acon introduces a Topology Orchestrator that maps, classifies, and samples site structures, then stops the moment it has fully learned the site's DNA, with no wasted requests.
Real-World Benchmark Results (v0.1.2)
The correct question: How many pages does each engine need to fully map a site's structure?
Both crawlers were given an uncapped budget. BFS runs until exhaustion. Acon stops the moment low_information_gain fires, meaning the site's structural DNA is fully mapped.
Comparison Summary (4 Representative Sites)
| Site | BFS Pages | Acon Pages | Request Reduction | Time Saved | Stopped By |
|---|---|---|---|---|---|
| books.toscrape.com | 200 | 6 | 97.0% | 93.7% | low_information_gain |
| Hacker News | 50 | 9 | 82.0% | 89.0% | low_information_gain |
| Wikipedia | 100 | 8 | 92.0% | 93.7% | low_information_gain |
| PyPI | 100 | 20 | 80.0% | 93.4% | queue_exhausted |
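The percentage columns in the table above follow from the raw counts. A minimal sketch of that arithmetic, assuming "Request Reduction" and "Time Saved" are both computed as one minus the ratio of Acon's value to the BFS value (the helper name `reduction_pct` is illustrative, not part of Acon):

```python
def reduction_pct(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal."""
    return round((1 - optimized / baseline) * 100, 1)


# Page counts copied from the comparison table: (BFS pages, Acon pages).
pages = {
    "books.toscrape.com": (200, 6),
    "Hacker News": (50, 9),
    "Wikipedia": (100, 8),
    "PyPI": (100, 20),
}

for site, (bfs, acon) in pages.items():
    print(f"{site}: {reduction_pct(bfs, acon)}% fewer requests")
```

The same helper applies to timings: for example, `reduction_pct(54.1, 3.4)` gives the 93.7% time saving listed for books.toscrape.com.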
Deep Dive: books.toscrape.com (E-Commerce)
| Metric | Blind BFS | Acon |
|---|---|---|
| Pages Crawled | 200 | 6 |
| Time Taken | 54.1s | 3.4s |
| Stopped by | budget cap | low_information_gain |
| Topology Detected | N/A | deep_uniform |
97% fewer requests. Acon stopped at 6 pages because it detected that the structural DNA (product pages, category pages) was already fully mapped.
Deep Dive: PyPI (Multi-Template Registry)
| Metric | Blind BFS | Acon |
|---|---|---|
| Pages Crawled | 100 | 20 |
| Time Taken | 100.7s | 6.6s |
| Stopped by | budget cap | queue_exhausted |
| Topology Detected | N/A | thin |
80% fewer requests. Acon classified the site and exhausted the relevant discovery queue in just 20 pages.
The key insight: a blind crawler keeps crawling because it doesn't know what it doesn't know. Acon tracks information gain in a sliding window: once new pages stop adding structural novelty, it stops and hands you the map.
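One way to picture the sliding-window idea is the sketch below. It is not Acon's actual implementation: the class name `NoveltyWindow`, the string fingerprints, the window size, and the threshold are all assumptions chosen for illustration.

```python
from collections import deque


class NoveltyWindow:
    """Tracks how many of the most recent pages had an unseen structure."""

    def __init__(self, window_size: int = 5, min_gain: float = 0.2):
        self.seen = set()                        # structural fingerprints seen so far
        self.recent = deque(maxlen=window_size)  # 1 = novel page, 0 = repeat
        self.min_gain = min_gain                 # hypothetical stop threshold

    def observe(self, fingerprint: str) -> None:
        """Record one crawled page, identified by a structural fingerprint."""
        is_new = fingerprint not in self.seen
        self.seen.add(fingerprint)
        self.recent.append(1 if is_new else 0)

    def should_stop(self) -> bool:
        """True once the window is full and its novelty ratio is below min_gain."""
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) < self.min_gain


window = NoveltyWindow(window_size=3, min_gain=0.5)
for fingerprint in ["product", "category", "product", "product"]:
    window.observe(fingerprint)
    if window.should_stop():
        print("stop: low_information_gain")
        break
```

Waiting for a full window before stopping avoids quitting on the first repeated template, at the cost of a few extra requests.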
Use Cases
Price Monitoring & E-Commerce Intelligence: Acon detects pagination patterns and repeating product templates automatically. No manual selector configuration per site.
Content Archival & Research: Feed Acon a publication's root URL. It identifies the site's content structure, prioritizes article pages over navigation noise, and hands you a clean discovery map.
Site Auditing & SEO Analysis: Get an instant structural report (template count, link depth, and topology classification such as SPA vs. static vs. paginated) in a single run.
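To make "repeating templates" and "pagination patterns" concrete, here is a rough URL-only heuristic. It is an illustrative assumption, not Acon's detection logic (which the text describes as structural): the helper names, the `{id}` masking rule, the pagination parameter list, and the `shop.example` URLs are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Hypothetical set of query parameters that usually indicate pagination.
PAGINATION_PARAMS = {"page", "p", "offset", "start"}


def url_template(url: str) -> str:
    """Mask every path segment that contains a digit so sibling pages collapse."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    masked = ["{id}" if re.search(r"\d", s) else s for s in segments]
    return "/" + "/".join(masked)


def is_paginated(url: str) -> bool:
    """True if the query string carries a known pagination parameter."""
    query = parse_qs(urlparse(url).query)
    return any(name in PAGINATION_PARAMS for name in query)


urls = [
    "https://shop.example/products/123",
    "https://shop.example/products/456",
    "https://shop.example/category/shoes?page=2",
    "https://shop.example/about",
]
template_counts = Counter(url_template(u) for u in urls)
print(template_counts.most_common())
print([u for u in urls if is_paginated(u)])
```

A template that appears many times (here `/products/{id}`) is a candidate for sampling rather than exhaustive crawling.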
⥠What Makes Acon Different
| Capability | Typical Crawler | Acon |
|---|---|---|
| JS-rendered sites | Manual Playwright setup | Autonomous escalation |
| Site structure | Unknown until scraped | Detected before extraction |
| Large site performance | Degrades at scale | O(log N) priority queue |
| Bandwidth efficiency | Downloads everything | Asset blocking (Discovery mode) |
| Discovery Latency | Static only | Static-First Hybrid Escalation |
| Failed crawls | Lost progress | SQLite resumption (WAL) |
| Budget waste | Crawls until cap | Stops when structure is learned |
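The "O(log N) priority queue" row can be illustrated with a binary heap, where both push and pop cost O(log N). This is a sketch under assumptions: the `Frontier` class and the notion of a numeric priority score are hypothetical, and Acon's own queue may be built differently.

```python
import heapq
import itertools


class Frontier:
    """URL frontier backed by a binary heap; highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps insertion order stable
        self._queued = set()             # avoids queueing the same URL twice

    def push(self, url: str, score: float) -> None:
        if url in self._queued:
            return
        self._queued.add(url)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier()
frontier.push("https://example.com/about", 0.2)
frontier.push("https://example.com/products/1", 0.9)
frontier.push("https://example.com/category/a", 0.5)
while frontier:
    print(frontier.pop())
```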
The Efficiency Pillars
Acon is optimized for production environments where every request costs money:
- Static-First Discovery: Acon probes pages with raw HTTP first. It only launches a browser if the site is a SPA, saving 90% of compute on standard sites.
- Intelligent Asset Blocking: During discovery, Acon automatically aborts requests for images, fonts, and CSS to cut bandwidth and CPU usage.
- Adaptive Early Stop (`low_information_gain`): Acon tracks structural novelty across a sliding window. When new pages stop adding unique signal, crawling stops before the budget is spent.
- Debounced Topology Detection: Structural analysis (DNA mapping) is throttled to key milestones (1, 10, 25, 50 pages) to maximize throughput.
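Three of the pillars above reduce to small decisions that can be sketched as pure functions. Everything here is an assumption for illustration: the function names, the 200-character text threshold, and the SPA heuristic itself are not taken from Acon's source. Only the milestone values (1, 10, 25, 50) and the blocked asset kinds come from the list above.

```python
import re

# Resource kinds to skip during discovery (images, fonts, CSS).
BLOCKED_RESOURCE_TYPES = {"image", "font", "stylesheet"}
# Page counts at which structural analysis runs, as listed above.
ANALYSIS_MILESTONES = {1, 10, 25, 50}


def looks_like_spa(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a raw HTTP response is a mostly empty JavaScript shell."""
    # Drop script/style bodies, then strip the remaining tags.
    body = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", body)
    visible = " ".join(text.split())
    has_scripts = re.search(r"(?i)<script\b", html) is not None
    return has_scripts and len(visible) < min_text_chars


def should_block(resource_type: str) -> bool:
    """Decide whether a sub-request should be aborted during discovery."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def should_analyze(pages_crawled: int) -> bool:
    """Run topology detection only at the configured milestones."""
    return pages_crawled in ANALYSIS_MILESTONES


shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_like_spa(shell))     # escalate to a browser only when this is True
print(should_block("image"))
print(should_analyze(10))
```

With Playwright, a predicate like `should_block` would typically be called from a `page.route("**/*", handler)` callback that checks `route.request.resource_type` and calls either `route.abort()` or `route.continue_()`.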
The Unified Intelligence Stack (The Acon Alliance)
Acon doesn't just map sites; it orchestrates the most powerful open-source scraping tools into a single, high-fidelity pipeline.
- Stealth (Camoufox): Enable `use_stealth=True` to launch an "invisible" browser engine that bypasses Cloudflare and Akamai automatically.
- Content (Trafilatura): Enable `extract_content=True` to get clean, LLM-ready Markdown from every discovered page natively.
- Speed (Scrapling): Use the `scrapling_adapter` to export Acon's "DNA Map" into Scrapling for turbo-charged mass extraction.
Installation
pip install acon-intel
pip install trafilatura camoufox scrapling
playwright install chromium
⥠Quick Start
import asyncio

from acon import SiteCrawlOrchestrator, CrawlConfig


async def main():
    config = CrawlConfig(
        max_pages=50,          # Hard ceiling
        extract_content=True,  # Trafilatura: clean Markdown per page
        use_stealth=True,      # Camoufox: bypass bot detection
    )
    brain = SiteCrawlOrchestrator()
    result = await brain.crawl_site("https://news.ycombinator.com", config)

    print(f"Topology: {result['topology']}")
    print(f"Pages crawled: {result['pages_crawled']}")
    print(f"Stopped by: {result['crawl_meta']['early_stop_reason']}")
    for page in result["page_summaries"]:
        print(f"  {page['url']} -> {page['page_type']}")
        # Content may be empty or absent, e.g. when extraction is disabled.
        if page.get("content"):
            print(f"    {page['content'][:80]}...")


if __name__ == "__main__":
    asyncio.run(main())
The Output Shape
{
  "topology": "multi_template",
  "pages_crawled": 12,
  "pages_failed": 0,
  "page_summaries": [
    {
      "url": "https://pypi.org/project/requests/",
      "page_type": "standard",
      "js_required": false,
      "content": "# requests 2.31.0...",
      "parent_url": "https://pypi.org"
    }
  ],
  "crawl_meta": {
    "early_stop_reason": "low_information_gain",
    "crawl_duration_s": 29.5,
    "reflection": {
      "intelligence_score": 0.33,
      "failure_rate": 0.0,
      "advice": "Continue current strategy."
    }
  }
}
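A result of this shape can be condensed into a short report with plain dictionary access. The sketch below assumes only the keys shown in the example output. The `summarize` helper is hypothetical and not part of Acon.

```python
from collections import Counter


def summarize(result: dict) -> dict:
    """Reduce a crawl result to topology, stop reason, and page-type counts."""
    summaries = result.get("page_summaries", [])
    return {
        "topology": result.get("topology"),
        "stopped_by": result.get("crawl_meta", {}).get("early_stop_reason"),
        "page_types": dict(Counter(p.get("page_type", "unknown") for p in summaries)),
        "js_required_urls": [p["url"] for p in summaries if p.get("js_required")],
    }


# Trimmed copy of the example output above.
sample = {
    "topology": "multi_template",
    "pages_crawled": 12,
    "page_summaries": [
        {
            "url": "https://pypi.org/project/requests/",
            "page_type": "standard",
            "js_required": False,
        }
    ],
    "crawl_meta": {"early_stop_reason": "low_information_gain"},
}
print(summarize(sample))
```

The `js_required_urls` list is a convenient input for a follow-up pass that renders only the pages needing a browser.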
Roadmap
- Stealth Integration: Native support for Camoufox (Fingerprint bypass).
- LLM-Ready Pipeline: Native Trafilatura integration for high-fidelity Markdown output.
- Speed Pillar: Official Scrapling adapter for mass extraction.
- Session Persistence: SQLite WAL-mode crawl resumption across process restarts.
- Adaptive Intelligence: `low_information_gain` early stop, which avoids burning crawl budgets.
- Discovery API: Expose Acon as a standalone Discovery microservice.
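The session persistence item in the list above can be pictured with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `frontier` table, its columns, and the helper functions are hypothetical and do not reflect Acon's actual schema.

```python
import sqlite3


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) the crawl state database in WAL journal mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frontier ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> None:
    """Add a URL once; re-adding a known URL is a no-op."""
    conn.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
    conn.commit()


def mark_done(conn: sqlite3.Connection, url: str) -> None:
    conn.execute("UPDATE frontier SET status = 'done' WHERE url = ?", (url,))
    conn.commit()


def pending(conn: sqlite3.Connection) -> list:
    """URLs still to crawl, in insertion order; this is what a restart resumes from."""
    rows = conn.execute(
        "SELECT url FROM frontier WHERE status = 'pending' ORDER BY rowid"
    )
    return [row[0] for row in rows]
```

After a crash or restart, calling `open_state` on the same file and reading `pending` yields the URLs that were queued but never finished.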
Acon: The connective tissue of the intelligent web.