A professional, open-source Google Maps web crawler that extracts company information from Google Maps. Built with Playwright for browser automation.
Sherlock Maps
Open-Source Google Maps Webcrawler
Features
- Object-Oriented - Cleanly structured with classes, dataclasses, and design patterns
- Search - Search Google Maps with any search term
- Detailed Company Information extraction:
  - Company name
  - Category / Industry
  - Address
  - Phone number
  - Website URL
  - Rating (stars)
  - Number of reviews
  - Plus Code
  - Opening hours
  - Attributes (wheelchair accessibility, etc.)
- Deduplication based on company name + website
- URL Validation (filters out invalid websites)
- Multiple Output Formats: JSON, CSV, Pretty-Print, File, Print
- REST API - Asynchronous job queue server with full-featured endpoints
- Docker Support - Containerized deployment
- Chrome Profile Persistence - Session data persists between runs
Quick Start
Option 1: Docker (Recommended)
The easiest way to get started. Docker handles all dependencies, Playwright, and browser setup automatically.
Using docker-compose (Simplest)
git clone https://github.com/Ayyouboss0011/SherlockMaps.git
cd SherlockMaps
# Start the API server
docker compose up -d
# The API is now running at http://localhost:8000
# Interactive documentation: http://localhost:8000/docs
Start a crawl via API
curl -X POST http://localhost:8000/crawl \
-H "Content-Type: application/json" \
-d '{"prompt": "restaurants berlin"}'
Get results
# Check job status
curl http://localhost:8000/status
# Get all results
curl http://localhost:8000/results
Stop the container
docker compose down
Using Docker CLI (without docker-compose)
# Clone the repository
git clone https://github.com/Ayyouboss0011/SherlockMaps.git
cd SherlockMaps
# Build the image
cd core && docker build -t sherlock-maps . && cd ..
# Run as API server
docker run -d -p 8000:8000 --name sherlock-maps sherlock-maps
# Run in CLI mode (one-time crawl)
docker run --rm -e PROMPT="restaurants berlin" sherlock-maps python /app/core/main_cli.py
Option 2: Without Docker
Install Python dependencies manually and run the crawler directly.
Prerequisites
Installation
# Clone the repository
git clone https://github.com/Ayyouboss0011/SherlockMaps.git
cd SherlockMaps
# Install Python dependencies
cd core
pip install -r requirements.txt
# Install Playwright browsers
playwright install chromium
Run the CLI crawler
# Set search term
export PROMPT="restaurants berlin"
# Run the crawler
python main.py
Run the REST API server
# Start the API server on port 8000
python api_main.py
# The API is now running at http://localhost:8000
# Interactive documentation: http://localhost:8000/docs
Start a crawl via API
curl -X POST http://localhost:8000/crawl \
-H "Content-Type: application/json" \
-d '{"prompt": "restaurants berlin"}'
Use as a Python library
from core.crawler import run_crawler

results = run_crawler(
    prompt="restaurants berlin",
    headless=False,
    output_format="json"
)

for company in results:
    print(f"{company.name} - {company.website}")
CLI Mode
The crawler can be used directly from the command line. All results are output to stdout as JSON by default.
# JSON to stdout (default)
export PROMPT="restaurants berlin"
python main.py
# Save as JSON file
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="file"
python main.py
# Save as CSV file
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="csv"
python main.py
# Formatted output (one company per block)
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="print"
python main.py
# Human-readable output
export PROMPT="restaurants berlin"
export OUTPUT_FORMAT="pretty"
python main.py
# Headless mode (for production/servers)
export PROMPT="restaurants berlin"
export HEADLESS="true"
python main.py
| Format | Description |
|--------|-------------|
| json | JSON array to stdout (default) |
| file | Saves results as sherlock-maps_YYYYMMDD_HHMMSS.json |
| csv | Saves results as sherlock-maps_YYYYMMDD_HHMMSS.csv |
| print | Each company individually with separator |
| pretty | Human-readable format with aligned fields |
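The `file` and `csv` formats write to a timestamped file. A minimal sketch of how such a name could be produced, assuming local time and `strftime`; the helper `output_filename` is hypothetical and not taken from the project:

```python
from datetime import datetime
from typing import Optional


def output_filename(extension: str, now: Optional[datetime] = None) -> str:
    """Build a name like sherlock-maps_YYYYMMDD_HHMMSS.<extension>."""
    # Hypothetical helper: the project's own naming code is not shown here.
    moment = now or datetime.now()
    return f"sherlock-maps_{moment.strftime('%Y%m%d_%H%M%S')}.{extension}"


print(output_filename("json", datetime(2026, 1, 15, 10, 30, 0)))
```

Passing the timestamp explicitly keeps the function easy to test; calling it without `now` uses the current time.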
Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| PROMPT | Search term for Google Maps | Required |
| OUTPUT_FORMAT | Output format: json, print, file, csv, pretty | json |
| HEADLESS | Run browser in headless mode | false |
| GOOGLE_API_KEY | Optional Google API key | (empty) |
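The defaults in the table can be illustrated with a small loader. This is a sketch only: `load_settings` is a hypothetical name, and the rule for which strings count as "true" for `HEADLESS` is an assumption, since the project's own parsing is not shown.

```python
import os
from typing import Mapping

VALID_FORMATS = {"json", "print", "file", "csv", "pretty"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read crawler settings from a mapping of environment variables."""
    prompt = env.get("PROMPT", "").strip()
    if not prompt:
        raise ValueError("PROMPT is required")
    output_format = env.get("OUTPUT_FORMAT", "json").lower()
    if output_format not in VALID_FORMATS:
        raise ValueError(f"Unsupported OUTPUT_FORMAT: {output_format}")
    # Assumption: "true", "1" and "yes" enable headless mode; the real parser may differ.
    headless = env.get("HEADLESS", "false").strip().lower() in {"true", "1", "yes"}
    return {
        "prompt": prompt,
        "output_format": output_format,
        "headless": headless,
        "google_api_key": env.get("GOOGLE_API_KEY", ""),
    }


print(load_settings({"PROMPT": "restaurants berlin", "HEADLESS": "true"}))
```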
Python Library
Simple (Convenience Function)
from core.crawler import run_crawler

results = run_crawler(
    prompt="restaurants berlin",
    headless=False,
    output_format="json"
)

for company in results:
    print(f"{company.name} - {company.website}")
Complete (With Configuration)
from core.models import CrawlerConfig, CompanyData
from core.crawler import GoogleMapsCrawler

# Create configuration
config = CrawlerConfig(
    search_prompt="restaurants berlin",
    headless=False,
    output_format="pretty",
)

# Use crawler with context manager
with GoogleMapsCrawler(config) as crawler:
    results = crawler.crawl()

# Process results
for company in results:
    if isinstance(company, CompanyData):
        print(f"{company.name}: {company.rating} stars ({company.reviews_count} reviews)")
Custom Search at Runtime
from core.models import CrawlerConfig
from core.crawler import GoogleMapsCrawler

config = CrawlerConfig(
    search_prompt="cafes berlin",
    output_format="json",
)

with GoogleMapsCrawler(config) as crawler:
    # First search
    results1 = crawler.crawl()
    # Second search with different term
    results2 = crawler.crawl(prompt="restaurants munich")
REST API
The crawler can run as a persistent service with a REST API. The container starts as an API server and processes queued crawl jobs one after another.
Start the API
# Build the image
cd core
docker build -t sherlock-maps .
# Start API server (port 8000)
docker run -p 8000:8000 sherlock-maps
# With custom port
docker run -p 8080:8080 -e API_PORT=8080 sherlock-maps
API Endpoints
Health & Status
| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health check (for Docker orchestrators) |
| GET | /status | Current status (idle/busy), active jobs, queue length |
| GET | /stats | Detailed statistics |
Crawler Control
| Method | Path | Description |
|--------|------|-------------|
| POST | /crawl | Start a new crawl job |
| GET | /crawl/{job_id} | Get job status |
| GET | /crawl/{job_id}/results | Get job results |
| DELETE | /crawl/{job_id} | Cancel a running job |
| GET | /crawl/history | List all jobs with pagination |
Data Management
| Method | Path | Description |
|--------|------|-------------|
| GET | /results | Get all results |
| POST | /results/export | Export results |
| DELETE | /results/clear | Clear all results |
Configuration
| Method | Path | Description |
|--------|------|-------------|
| GET | /config | Get current configuration |
| PUT | /config | Update configuration |
Browser
| Method | Path | Description |
|--------|------|-------------|
| GET | /browser/info | Browser information |
| POST | /browser/restart | Restart browser |
API Examples
# Start a new crawl job
curl -X POST http://localhost:8000/crawl \
-H "Content-Type: application/json" \
-d '{"prompt": "restaurants berlin", "output_format": "json"}'
# Get job status
curl http://localhost:8000/crawl/<job_id>
# Get results
curl http://localhost:8000/crawl/<job_id>/results
# Get all results as CSV
curl "http://localhost:8000/results?format=csv"
# Get status
curl http://localhost:8000/status
# Health check
curl http://localhost:8000/health
# Cancel job
curl -X DELETE http://localhost:8000/crawl/<job_id>
# Job history
curl "http://localhost:8000/crawl/history?limit=10&offset=0"
Request Example
{
  "prompt": "restaurants berlin",
  "output_format": "json",
  "headless": false,
  "locale": "de-DE",
  "max_results": 100
}
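For callers who prefer Python over curl, the same request body can be prepared with the standard library. The sketch below only builds the request and does not send it; `build_crawl_request` is a hypothetical helper, and the base URL assumes the default port used in this README.

```python
import json
import urllib.request


def build_crawl_request(prompt: str, base_url: str = "http://localhost:8000", **options) -> urllib.request.Request:
    """Prepare (but do not send) a POST /crawl request with a JSON body."""
    body = {"prompt": prompt, **options}
    return urllib.request.Request(
        url=f"{base_url}/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_crawl_request(
    "restaurants berlin",
    output_format="json",
    headless=False,
    locale="de-DE",
    max_results=100,
)
print(request.get_method(), request.full_url)

# To actually submit the job, the API server must be running:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```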
Response Example (Job Status)
{
  "job_id": "abc-123-def",
  "status": "completed",
  "prompt": "restaurants berlin",
  "created_at": "2026-01-15T10:30:00Z",
  "completed_at": "2026-01-15T10:31:30Z",
  "results_count": 42,
  "error": null
}
Job Status
| Status | Description |
|--------|-------------|
| pending | In the queue |
| running | Currently running |
| completed | Successfully completed |
| failed | Failed |
| cancelled | Cancelled |
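Because jobs run asynchronously, a client typically polls the job status until it reaches one of the final states (`completed`, `failed`, `cancelled`). The following is a sketch of such a loop, not part of the project: `wait_for_job` is a hypothetical helper, and `fetch_status` stands in for any callable that returns the parsed JSON of `GET /crawl/{job_id}`.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 2.0, max_polls: int = 150) -> dict:
    """Call fetch_status until the job reaches a terminal state or polls run out."""
    for attempt in range(max_polls):
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        if attempt < max_polls - 1:
            time.sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")


# Demonstration with a canned sequence of responses instead of a live server.
responses = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "results_count": 42},
])
final = wait_for_job(lambda: next(responses), interval=0)
print(final["status"], final["results_count"])
```

Injecting `fetch_status` keeps the loop independent of the HTTP client, so it can be exercised without a running server.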
Interactive API Documentation
When the API server is running, interactive Swagger documentation is available:
http://localhost:8000/docs
How It Works
- Search - Navigates to Google Maps with the search term
- Scroll - Loads all search results by scrolling
- Extract - Navigates to each result's detail page and extracts:
  - Company name, category, address, phone, website
  - Rating and number of reviews
  - Opening hours
  - Attributes
- Filter - Removes duplicates and validates website URLs
- Output - Outputs results in the desired format
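The Filter step can be sketched in a few lines. This is an illustration under stated assumptions, not the project's `URLValidator` or `DeduplicationProcessor`: it assumes entries with an invalid website are dropped entirely, that the duplicate key is the (name, website) pair compared case-insensitively, and that companies are plain dictionaries. The function names are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_website(url: str) -> bool:
    """Accept only absolute http(s) URLs that have a host part."""
    parsed = urlparse(url or "")
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def filter_companies(companies: list) -> list:
    """Drop entries with invalid websites, then deduplicate on (name, website)."""
    seen = set()
    unique = []
    for company in companies:
        website = company.get("website") or ""
        if not is_valid_website(website):
            continue
        # Assumption: comparison ignores case and a trailing slash.
        key = (
            (company.get("name") or "").strip().lower(),
            website.strip().lower().rstrip("/"),
        )
        if key not in seen:
            seen.add(key)
            unique.append(company)
    return unique


sample = [
    {"name": "Cafe A", "website": "https://cafe-a.example"},
    {"name": "cafe a", "website": "https://cafe-a.example/"},
    {"name": "Cafe B", "website": "ftp://cafe-b.example"},
    {"name": "Cafe C", "website": "not a url"},
]
print([c["name"] for c in filter_companies(sample)])
```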
Architecture
Sherlock Maps/
├── .gitignore
├── docker-compose.yml
├── README.md
├── public/
│   └── SherlockMaps.png
└── core/
    ├── __init__.py                      # Package exports
    ├── main.py                          # CLI entry point
    ├── main_cli.py                      # CLI logic
    ├── crawler.py                       # Main crawler class
    ├── requirements.txt                 # Python dependencies
    ├── api/
    │   ├── __init__.py
    │   ├── models.py                    # API data models
    │   ├── queue_manager.py             # Job queue management
    │   └── server.py                    # FastAPI server
    ├── browser/
    │   ├── __init__.py
    │   └── browser_manager.py           # Browser lifecycle management
    ├── exceptions/
    │   ├── __init__.py
    │   └── crawler_exceptions.py        # Custom exceptions
    ├── extractors/
    │   ├── __init__.py
    │   └── maps_extractor.py            # Google Maps data extraction
    ├── models/
    │   ├── __init__.py
    │   ├── company.py                   # CompanyData model
    │   └── crawler_config.py            # CrawlerConfig model
    ├── output/
    │   ├── __init__.py
    │   └── output_handler.py            # Output formats
    └── processors/
        ├── __init__.py
        ├── url_validator.py             # URL validation
        └── deduplication_processor.py   # Deduplication
Class Overview
| Class | Module | Description |
|-------|--------|-------------|
| GoogleMapsCrawler | crawler.py | Main class, orchestrates the entire crawling process |
| BrowserManager | browser/browser_manager.py | Manages Playwright browser lifecycle |
| MapsExtractor | extractors/maps_extractor.py | Extracts company data from Google Maps |
| CompanyData | models/company.py | Data model for a company |
| CrawlerConfig | models/crawler_config.py | Crawler configuration |
| URLValidator | processors/url_validator.py | Validates HTTP(S) URLs |
| DeduplicationProcessor | processors/deduplication_processor.py | Removes duplicates |
| OutputHandler | output/output_handler.py | Formats and outputs results |
| CrawlerBaseException | exceptions/crawler_exceptions.py | Base exception class |
Configuration
CrawlerConfig Attributes
| Attribute | Type | Default | Description |
|-----------|------|---------|-------------|
| search_prompt | str | "" | The search term for Google Maps |
| headless | bool | False | Run browser in headless mode |
| output_format | Literal | "json" | Output format |
| chrome_profile_path | str | "Chrome_Profile" | Path to Chrome user data directory |
| viewport | ViewPort | 1920x1080 | Browser viewport dimensions |
| locale | str | "de-DE" | Browser localization |
| page_timeout | int | 30000 | Maximum navigation timeout in ms |
| selector_timeout | int | 15000 | Maximum timeout for selectors in ms |
| scroll_timeout | int | 45 | Maximum time for scrolling in seconds |
| max_scroll_attempts | int | 5 | Number of scroll attempts before stop |
| max_retries | int | 3 | Number of navigation retry attempts |
| request_timeout | int | 25000 | Request timeout in ms |
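Note that the timeouts use mixed units: `scroll_timeout` is in seconds while the other timeouts are in milliseconds. The dataclass below mirrors the table to make that explicit. It is a stand-in for illustration only, not the real `CrawlerConfig`; the class name `CrawlerConfigSketch`, the `width`/`height` field names of `ViewPort`, and the `scroll_timeout_ms` helper are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ViewPort:
    # Assumed field names; the table only states the 1920x1080 default.
    width: int = 1920
    height: int = 1080


@dataclass
class CrawlerConfigSketch:
    """Illustrative mirror of the documented defaults, not the project's class."""
    search_prompt: str = ""
    headless: bool = False
    output_format: str = "json"
    chrome_profile_path: str = "Chrome_Profile"
    viewport: ViewPort = field(default_factory=ViewPort)
    locale: str = "de-DE"
    page_timeout: int = 30000       # milliseconds
    selector_timeout: int = 15000   # milliseconds
    scroll_timeout: int = 45        # seconds
    max_scroll_attempts: int = 5
    max_retries: int = 3
    request_timeout: int = 25000    # milliseconds

    def scroll_timeout_ms(self) -> int:
        """Convert the scroll timeout to milliseconds for comparison with the others."""
        return self.scroll_timeout * 1000


config = CrawlerConfigSketch(search_prompt="cafes berlin", headless=True)
print(config.scroll_timeout_ms(), config.viewport.width, config.viewport.height)
```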
Example Output
[
  {
    "name": "Restaurant Name",
    "category": "Restaurant",
    "address": "Musterstrasse 1, 10115 Berlin",
    "phone": "+49 30 12345678",
    "website": "https://www.restaurant-example.de",
    "rating": "4.5",
    "reviews_count": "234",
    "plus_code": "GVMF+8H Berlin",
    "opening_hours": "Mon: 12:00-22:00, Tue: 12:00-22:00, ...",
    "attributes": ["Wheelchair accessible entrance"]
  }
]
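In this output, `rating` and `reviews_count` are strings, so they need converting before sorting or filtering. A minimal post-processing sketch follows; the helpers `to_float` and `to_int` are hypothetical, and the tolerance for a decimal comma and thousands separators is an assumption based on the German locale, not confirmed behavior.

```python
import json
from typing import Optional

raw = """[
  {"name": "Restaurant Name", "rating": "4.5", "reviews_count": "234",
   "website": "https://www.restaurant-example.de"}
]"""


def to_float(value: Optional[str]) -> Optional[float]:
    """Parse a rating string; returns None for missing or malformed values."""
    try:
        # Assumption: a decimal comma ("4,5") may appear with a German locale.
        return float(value.replace(",", "."))
    except (AttributeError, ValueError):
        return None


def to_int(value: Optional[str]) -> Optional[int]:
    """Parse a review count, ignoring separators such as "1.234"."""
    digits = "".join(ch for ch in str(value or "") if ch.isdigit())
    return int(digits) if digits else None


companies = json.loads(raw)
for company in companies:
    company["rating"] = to_float(company.get("rating"))
    company["reviews_count"] = to_int(company.get("reviews_count"))

well_rated = [c["name"] for c in companies if (c["rating"] or 0) >= 4.0]
print(well_rated)
```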
Limitations
- Google Maps UI changes may break selectors (CSS classes like h1.DUwDvf are Google-specific)
- Rate limiting: Google may show CAPTCHAs when requests arrive too quickly
- German localization is hardcoded (hl=de); for other languages, browser_manager.py must be modified
- Requires a display or headless mode for Chromium
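To illustrate what making the language configurable might look like, the sketch below builds a Google Maps search URL with the `hl` parameter taken from an argument instead of a constant. How `browser_manager.py` actually constructs its URL is not shown in this README, so the function name `maps_search_url`, the URL shape, and the assumption that `hl` is honored alongside `api=1` are all illustrative.

```python
from urllib.parse import urlencode


def maps_search_url(term: str, language: str = "de") -> str:
    """Build a Maps search URL with a configurable interface language."""
    # Assumption: the crawler opens a URL of this general shape.
    query = urlencode({"api": 1, "query": term, "hl": language})
    return f"https://www.google.com/maps/search/?{query}"


print(maps_search_url("restaurants berlin", language="en"))
```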