Sherlock Maps is an open-source web crawler that extracts company information from Google Maps. It uses Playwright for browser automation.

Sherlock Maps

Open-Source Google Maps Webcrawler

Features

  • Object-Oriented - Cleanly structured with classes, dataclasses, and design patterns
  • Search - Search Google Maps with any search term
  • Detailed Company Information extraction:
    • Company name
    • Category / Industry
    • Address
    • Phone number
    • Website URL
    • Rating (stars)
    • Number of reviews
    • Plus Code
    • Opening hours
    • Attributes (wheelchair accessibility, etc.)
  • Deduplication based on company name + website
  • URL Validation (filters out invalid websites)
  • Multiple Output Formats: JSON, CSV, Pretty-Print, File, Print
  • REST API - Asynchronous job queue server with full-featured endpoints
  • Docker Support - Containerized deployment
  • Chrome Profile Persistence - Session data persists between runs
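The deduplication rule listed above (company name + website) could look roughly like the sketch below. The `Company` dataclass and the `dedupe` function are illustrative names, not the project's actual `DeduplicationProcessor`, and the normalization (ignoring case, surrounding whitespace, and a trailing slash) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Company:
    """Hypothetical stand-in for the project's company model."""
    name: str
    website: Optional[str] = None


def dedupe_key(company: Company) -> Tuple[str, str]:
    # Assumed normalization: ignore case, surrounding whitespace, trailing slash.
    name = company.name.strip().casefold()
    website = (company.website or "").strip().casefold().rstrip("/")
    return name, website


def dedupe(companies: Iterable[Company]) -> List[Company]:
    """Keep the first company seen for each (name, website) key."""
    seen = set()
    unique = []
    for company in companies:
        key = dedupe_key(company)
        if key not in seen:
            seen.add(key)
            unique.append(company)
    return unique
```

Keying on both fields means two branches of a chain that share a name but list different websites are kept as separate entries.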

Quick Start

Option 1: With Docker

The easiest way to get started: Docker handles all dependencies, Playwright, and browser setup automatically.

Using docker-compose (Simplest)

git clone https://github.com/Ayyouboss0011/SherlockMaps.git
cd SherlockMaps

# Start the API server
docker compose up -d

# The API is now running at http://localhost:8000
# Interactive documentation: http://localhost:8000/docs

Start a crawl via API

curl -X POST http://localhost:8000/crawl \
  -H "Content-Type: application/json" \
  -d '{"prompt": "restaurants berlin"}'

Get results

# Check job status
curl http://localhost:8000/status

# Get all results
curl http://localhost:8000/results

Stop the container

docker compose down

Using Docker CLI (without docker-compose)

# Clone the repository
git clone https://github.com/Ayyouboss0011/SherlockMaps.git
cd SherlockMaps

# Build the image
cd core && docker build -t sherlock-maps . && cd ..

# Run as API server
docker run -d -p 8000:8000 --name sherlock-maps sherlock-maps

# Run in CLI mode (one-time crawl)
docker run --rm -e PROMPT="restaurants berlin" sherlock-maps python /app/core/main_cli.py

Option 2: Without Docker

Install Python dependencies manually and run the crawler directly.

Prerequisites

  • Python 3.9 or higher
  • Git

Installation

# Clone the repository
git clone https://github.com/Ayyouboss0011/SherlockMaps.git
cd SherlockMaps

# Install Python dependencies
cd core
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

Run the CLI crawler

# Set search term
export PROMPT="restaurants berlin"

# Run the crawler
python main.py

Run the REST API server

# Start the API server on port 8000
python api_main.py

# The API is now running at http://localhost:8000
# Interactive documentation: http://localhost:8000/docs

Start a crawl via API

curl -X POST http://localhost:8000/crawl \
  -H "Content-Type: application/json" \
  -d '{"prompt": "restaurants berlin"}'

Use as a Python library

from core.crawler import run_crawler

results = run_crawler(
    prompt="restaurants berlin",
    headless=False,
    output_format="json"
)

for company in results:
    print(f"{company.name} - {company.website}")

CLI Mode

The crawler can be used directly from the command line. All results are output to stdout as JSON by default.

Output Formats

# JSON to stdout (default)
export PROMPT="restaurants berlin"
python main.py

# Save as JSON file
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="file"
python main.py

# Save as CSV file
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="csv"
python main.py

# Formatted output (one company per block)
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="print"
python main.py

# Human-readable output
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="pretty"
python main.py

# Headless mode (for production/servers)
export PROMPT="restaurants berlin"
export HEADLESS="true"
python main.py

Output Formats Overview

Format   Description
json     JSON array to stdout (default)
file     Saves results as sherlock-maps_YYYYMMDD_HHMMSS.json
csv      Saves results as sherlock-maps_YYYYMMDD_HHMMSS.csv
print    Each company individually with separator
pretty   Human-readable format with aligned fields
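The timestamped file names used by the file and csv formats follow the pattern sherlock-maps_YYYYMMDD_HHMMSS. A minimal sketch of how such a name can be built is shown here; `output_filename` is a hypothetical helper, and whether the project uses local time or UTC is not stated, so local time is assumed.

```python
from datetime import datetime
from typing import Optional


def output_filename(extension: str, now: Optional[datetime] = None) -> str:
    """Build a name like sherlock-maps_20260115_103000.json."""
    # Assumption: local time; pass `now` explicitly for reproducible names.
    moment = now or datetime.now()
    return f"sherlock-maps_{moment.strftime('%Y%m%d_%H%M%S')}.{extension}"
```

Because the timestamp is zero-padded and ordered from year to second, sorting these file names alphabetically also sorts them chronologically.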

Environment Variables

Variable        Description                                    Default
PROMPT          Search term for Google Maps                    Required
OUTPUT_FORMAT   Output format: json, print, file, csv, pretty  json
HEADLESS        Run browser in headless mode                   false
GOOGLE_API_KEY  Optional Google API key                        (empty)
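For scripts that wrap the crawler, the variables above can be read and checked before a run. The sketch below is an assumption about reasonable handling, not the project's own parsing: `read_settings` is a hypothetical helper, and treating only the string "true" as enabling headless mode is a guess.

```python
import os
from typing import Mapping, Optional

VALID_FORMATS = {"json", "print", "file", "csv", "pretty"}


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read PROMPT, OUTPUT_FORMAT and HEADLESS, applying the documented defaults."""
    env = os.environ if env is None else env
    prompt = env.get("PROMPT", "").strip()
    if not prompt:
        raise ValueError("PROMPT is required")
    output_format = env.get("OUTPUT_FORMAT", "json").strip().lower()
    if output_format not in VALID_FORMATS:
        raise ValueError(f"Unsupported OUTPUT_FORMAT: {output_format}")
    # Assumption: only the literal string "true" (any case) enables headless mode.
    headless = env.get("HEADLESS", "false").strip().lower() == "true"
    return {"prompt": prompt, "output_format": output_format, "headless": headless}
```

Passing a plain dict instead of `os.environ` makes the function easy to check without touching the real environment.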

Python Library

Simple (Convenience Function)

from core.crawler import run_crawler

results = run_crawler(
    prompt="restaurants berlin",
    headless=False,
    output_format="json"
)

for company in results:
    print(f"{company.name} - {company.website}")

Complete (With Configuration)

from core.models import CrawlerConfig, CompanyData
from core.crawler import GoogleMapsCrawler

# Create configuration
config = CrawlerConfig(
    search_prompt="restaurants berlin",
    headless=False,
    output_format="pretty",
)

# Use crawler with context manager
with GoogleMapsCrawler(config) as crawler:
    results = crawler.crawl()

# Process results
for company in results:
    if isinstance(company, CompanyData):
        print(f"{company.name}: {company.rating} stars ({company.reviews_count} reviews)")

Custom Search at Runtime

from core.models import CrawlerConfig
from core.crawler import GoogleMapsCrawler

config = CrawlerConfig(
    search_prompt="cafes berlin",
    output_format="json",
)

with GoogleMapsCrawler(config) as crawler:
    # First search
    results1 = crawler.crawl()

    # Second search with different term
    results2 = crawler.crawl(prompt="restaurants munich")

REST API

The crawler can run as a persistent service with a REST API. The container starts as an API server and processes multiple crawl jobs sequentially.

Start the API

# Build the image
cd core
docker build -t sherlock-maps .

# Start API server (port 8000)
docker run -p 8000:8000 sherlock-maps

# With custom port
docker run -p 8080:8080 -e API_PORT=8080 sherlock-maps

API Endpoints

Health & Status

Method  Path     Description
GET     /health  Health check (for Docker orchestrators)
GET     /status  Current status (idle/busy), active jobs, queue length
GET     /stats   Detailed statistics

Crawler Control

Method  Path                     Description
POST    /crawl                   Start a new crawl job
GET     /crawl/{job_id}          Get job status
GET     /crawl/{job_id}/results  Get job results
DELETE  /crawl/{job_id}          Cancel a running job
GET     /crawl/history           List all jobs with pagination

Data Management

Method  Path             Description
GET     /results         Get all results
POST    /results/export  Export results
DELETE  /results/clear   Clear all results

Configuration

Method  Path     Description
GET     /config  Get current configuration
PUT     /config  Update configuration

Browser

Method  Path              Description
GET     /browser/info     Browser information
POST    /browser/restart  Restart browser

API Examples

# Start a new crawl job
curl -X POST http://localhost:8000/crawl \
  -H "Content-Type: application/json" \
  -d '{"prompt": "restaurants berlin", "output_format": "json"}'

# Get job status
curl http://localhost:8000/crawl/<job_id>

# Get results
curl http://localhost:8000/crawl/<job_id>/results

# Get all results as CSV
curl "http://localhost:8000/results?format=csv"

# Get status
curl http://localhost:8000/status

# Health check
curl http://localhost:8000/health

# Cancel job
curl -X DELETE http://localhost:8000/crawl/<job_id>

# Job history
curl "http://localhost:8000/crawl/history?limit=10&offset=0"

Request Example

{
  "prompt": "restaurants berlin",
  "output_format": "json",
  "headless": false,
  "locale": "de-DE",
  "max_results": 100
}
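The same request body can be sent from Python using only the standard library. This is a sketch under assumptions: `build_crawl_request` and `submit` are hypothetical helper names, and the server is assumed to listen on http://localhost:8000 as in the examples above.

```python
import json
import urllib.request


def build_crawl_request(prompt: str, base_url: str = "http://localhost:8000",
                        **options) -> urllib.request.Request:
    """Prepare a POST /crawl request; extra options become JSON fields."""
    payload = {"prompt": prompt, **options}
    return urllib.request.Request(
        url=f"{base_url}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the API server running, `submit(build_crawl_request("restaurants berlin", max_results=100))` would send the request and return the decoded reply.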

Response Example (Job Status)

{
  "job_id": "abc-123-def",
  "status": "completed",
  "prompt": "restaurants berlin",
  "created_at": "2026-01-15T10:30:00Z",
  "completed_at": "2026-01-15T10:31:30Z",
  "results_count": 42,
  "error": null
}

Job Status

Status     Description
pending    In the queue
running    Currently running
completed  Successfully completed
failed     Failed
cancelled  Cancelled
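A client typically polls a job until it reaches one of the final states (completed, failed, cancelled). The loop below is a minimal sketch of that pattern; `wait_for_job` is a hypothetical helper, the polling interval and limit are arbitrary choices, and `fetch_status` stands for any callable that returns the job status document (for example, a wrapper around GET /crawl/{job_id}).

```python
import time
from typing import Callable

# Statuses after which a job no longer changes, per the table above.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 2.0,
                 max_polls: int = 150,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    for _ in range(max_polls):
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(interval)
    raise TimeoutError("job did not reach a terminal status in time")
```

Injecting `sleep` keeps the loop easy to exercise without real waiting; in normal use the default `time.sleep` applies.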

Interactive API Documentation

When the API server is running, interactive Swagger documentation is available:

http://localhost:8000/docs

How It Works

  1. Search - Navigates to Google Maps with the search term
  2. Scroll - Loads all search results by scrolling
  3. Extract - Navigates to each result's detail page and extracts:
    • Company name, category, address, phone, website
    • Rating and number of reviews
    • Opening hours
    • Attributes
  4. Filter - Removes duplicates and validates website URLs
  5. Output - Outputs results in the desired format
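The URL check in the Filter step can be illustrated with the standard library. This sketch is not the project's `URLValidator`; `is_valid_website` is a hypothetical function, and requiring a dot in the host name (which rejects hosts such as localhost) is an assumption.

```python
from urllib.parse import urlparse


def is_valid_website(url) -> bool:
    """Accept only absolute http(s) URLs whose host contains a dot."""
    if not isinstance(url, str) or not url.strip():
        return False
    try:
        parsed = urlparse(url.strip())
        host = parsed.hostname or ""
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 brackets.
        return False
    return parsed.scheme in ("http", "https") and "." in host
```

Entries that fail the check could either be dropped or kept with an empty website field; which of the two the crawler does is not stated here.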

Architecture

Sherlock Maps/
├── .gitignore
├── docker-compose.yml
├── README.md
├── public/
│   └── SherlockMaps.png
└── core/
    ├── __init__.py                   # Package exports
    ├── main.py                       # CLI entry point
    ├── main_cli.py                   # CLI logic
    ├── crawler.py                    # Main crawler class
    ├── requirements.txt              # Python dependencies
    ├── api/
    │   ├── __init__.py
    │   ├── models.py                 # API data models
    │   ├── queue_manager.py          # Job queue management
    │   └── server.py                 # FastAPI server
    ├── browser/
    │   ├── __init__.py
    │   └── browser_manager.py        # Browser lifecycle management
    ├── exceptions/
    │   ├── __init__.py
    │   └── crawler_exceptions.py     # Custom exceptions
    ├── extractors/
    │   ├── __init__.py
    │   └── maps_extractor.py         # Google Maps data extraction
    ├── models/
    │   ├── __init__.py
    │   ├── company.py                # CompanyData model
    │   └── crawler_config.py         # CrawlerConfig model
    ├── output/
    │   ├── __init__.py
    │   └── output_handler.py         # Output formats
    └── processors/
        ├── __init__.py
        ├── url_validator.py          # URL validation
        └── deduplication_processor.py # Deduplication

Class Overview

Class                   Module                                 Description
GoogleMapsCrawler       crawler.py                             Main class, orchestrates the entire crawling process
BrowserManager          browser/browser_manager.py             Manages Playwright browser lifecycle
MapsExtractor           extractors/maps_extractor.py           Extracts company data from Google Maps
CompanyData             models/company.py                      Data model for a company
CrawlerConfig           models/crawler_config.py               Crawler configuration
URLValidator            processors/url_validator.py            Validates HTTP(S) URLs
DeduplicationProcessor  processors/deduplication_processor.py  Removes duplicates
OutputHandler           output/output_handler.py               Formats and outputs results
CrawlerBaseException    exceptions/crawler_exceptions.py       Base exception class

Configuration

CrawlerConfig Attributes

Attribute            Type      Default           Description
search_prompt        str       ""                The search term for Google Maps
headless             bool      False             Run browser in headless mode
output_format        Literal   "json"            Output format
chrome_profile_path  str       "Chrome_Profile"  Path to Chrome user data directory
viewport             ViewPort  1920x1080         Browser viewport dimensions
locale               str       "de-DE"           Browser localization
page_timeout         int       30000             Maximum navigation timeout in ms
selector_timeout     int       15000             Maximum timeout for selectors in ms
scroll_timeout       int       45                Maximum time for scrolling in seconds
max_scroll_attempts  int       5                 Number of scroll attempts before stopping
max_retries          int       3                 Number of navigation retry attempts
request_timeout      int       25000             Request timeout in ms
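To make the defaults and units in the table concrete, the attributes can be mirrored in a small dataclass. This is an illustrative stand-in only: `CrawlerConfigSketch` and `ViewPortSketch` are hypothetical names, and the real classes in models/crawler_config.py may be defined differently.

```python
from dataclasses import dataclass, field


@dataclass
class ViewPortSketch:
    """Hypothetical viewport type; the field names are assumptions."""
    width: int = 1920
    height: int = 1080


@dataclass
class CrawlerConfigSketch:
    """Mirrors the documented attributes and defaults, nothing more."""
    search_prompt: str = ""
    headless: bool = False
    output_format: str = "json"
    chrome_profile_path: str = "Chrome_Profile"
    viewport: ViewPortSketch = field(default_factory=ViewPortSketch)
    locale: str = "de-DE"
    page_timeout: int = 30000      # milliseconds
    selector_timeout: int = 15000  # milliseconds
    scroll_timeout: int = 45       # seconds, unlike the other timeouts
    max_scroll_attempts: int = 5
    max_retries: int = 3
    request_timeout: int = 25000   # milliseconds
```

Note that scroll_timeout is given in seconds while the other timeouts are in milliseconds, so values should not be copied between them without conversion.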

Example Output

[
  {
    "name": "Restaurant Name",
    "category": "Restaurant",
    "address": "Musterstrasse 1, 10115 Berlin",
    "phone": "+49 30 12345678",
    "website": "https://www.restaurant-example.de",
    "rating": "4.5",
    "reviews_count": "234",
    "plus_code": "GVMF+8H Berlin",
    "opening_hours": "Mon: 12:00-22:00, Tue: 12:00-22:00, ...",
    "attributes": ["Wheelchair accessible entrance"]
  }
]
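In the example output, rating and reviews_count are strings, so they need converting before any numeric comparison. The sketch below shows one way to post-process a saved JSON result; `to_number` and `well_rated` are hypothetical helpers, and the thresholds are arbitrary.

```python
import json
from typing import List


def to_number(value, default: float = 0.0) -> float:
    """Convert strings such as "4.5" or "234" to float; use default on bad input."""
    try:
        return float(str(value))
    except ValueError:
        return default


def well_rated(companies: List[dict], min_rating: float = 4.0,
               min_reviews: int = 50) -> List[dict]:
    """Keep companies whose rating and review count meet both thresholds."""
    return [
        company for company in companies
        if to_number(company.get("rating")) >= min_rating
        and to_number(company.get("reviews_count")) >= min_reviews
    ]
```

Missing or unparseable values count as 0 here, so such entries are filtered out rather than raising an error.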

Limitations

  • Google Maps UI changes may break selectors (CSS classes like h1.DUwDvf are Google-specific)
  • Rate limiting: Google may show CAPTCHAs when requests arrive too quickly
  • German localization is hardcoded (hl=de); for other languages, browser_manager.py must be modified
  • Chromium requires a display unless it runs in headless mode
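When running several searches in a row, spacing them out is one common way to lower the request rate. The generator below is a sketch of that idea only: `paced` is a hypothetical helper, the delay bounds are arbitrary, and there is no guarantee that any particular pacing avoids CAPTCHAs.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(items: Iterable[T], min_delay: float = 2.0, max_delay: float = 5.0,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items with a random pause between consecutive ones."""
    for index, item in enumerate(items):
        if index:
            # No pause before the first item, a random one before each later item.
            sleep(random.uniform(min_delay, max_delay))
        yield item
```

A loop such as `for prompt in paced(["cafes berlin", "restaurants munich"]): ...` would then wait a few seconds between the two searches.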

Resources