Stop losing an afternoon to chasing dozens of reference PDFs by hand. One DOI in, every reference PDF out — using your existing institutional access.
Status: beta (v0.2.0). Windows + Microsoft Edge verified path. macOS / Linux / Chromium untested. Expect rough edges around supplementary downloads and publisher-site changes. PR-worthy issues welcome.
Heads up — not a paywall bypass. ref-downloader uses your institutional access. If your university or organization subscribes to a journal, those refs work. If they don't, those refs become
manual_pendingfor you to follow up on by hand.
Demo (30-second console preview)
$ python run_ref_downloader.py 10.1021/jacs.5c05017
=== Ref Downloader Wrapper ===
DOI: 10.1021/jacs.5c05017
PROJECT: jacs.5c05017
Config: config.example.toml + config.local.toml
>>> extract_refs.py
Title: Designing Natural Cell-Inspired Heme-Spurred Membrane...
References found: 38
>>> validate_refs.py
Total: 38 Verified: 38 Failed: 0 No DOI: 0
>>> download_refs.py
[ 1] downloaded (842 KB) Lee2016_NatEnergy.pdf
[ 2] downloaded (1.2 MB) Wang2018_AdvMater.pdf
[ 3] manual_pending (auth_redirect)
[ 4] downloaded (655 KB) Chen2019_JACS.pdf
[ 5] failed (challenge_timeout)
[ 6] ignored (ignored_institution_access)
... 31 more refs processed ...
[38] downloaded (956 KB) Park2024_JElectrochemSoc.pdf
========== Download report ==========
Total references: 38
Main PDFs: 33 downloaded · 3 manual_pending · 1 failed · 1 ignored
SI files: 12 captured
PDFs land in: ./jacs.5c05017_refs/jacs.5c05017/
=====================================
What you get
- Paywalled refs work without setup. Drives your real Microsoft Edge profile, so any institutional login already in your browser carries through. No API keys, no proxies, no reverse engineering.
- One DOI in, every reference PDF out. Crossref-driven extraction + 17+ publisher-specific download paths (Wiley PDFDirect, Elsevier viewer, AIP loading-page wait — see per-publisher reliability tier), not generic scraping.
- You always know which refs failed and why.
download_report.csvgives every ref a status + reason (manual_pending (auth_redirect),failed (challenge_timeout),ignored);events.jsonlkeeps the per-ref event trace. - Pick up where you left off after a VPN drop, browser crash, or
Ctrl+C. State persists per project; rerunning skips already-downloaded refs and retries only the failures.
Why not Zotero, scihub, or generic scrapers?
- vs. Zotero's Find Available PDF — walks one paper at a time and silently gives up at SSO redirects. ref-downloader walks the whole reference list at once and treats SSO as a configurable step instead of a dead end.
- vs. scihub-style tools — don't carry your institutional license, so paywalled refs you legitimately have access to just fail. ref-downloader uses your authenticated browser session, so subscriptions you already pay for actually count.
- vs. generic web scrapers — don't know Wiley needs PDFDirect, Elsevier needs a viewer click, or AIP serves a Chinese loading page first. ref-downloader has 17+ publisher-specific paths plus hot-session retry for Elsevier.
Quick start
The skill is self-contained under skills/ref-downloader/. Pick the install path for your agent framework:
git clone https://github.com/ltczding-gif/ref-downloader.git
# Pick ONE install destination for your agent framework:
# Claude Code: cp -r ref-downloader/skills/ref-downloader ~/.claude/skills/
# Codex CLI: cp -r ref-downloader/skills/ref-downloader ~/.codex/skills/
# Copilot CLI / VSC: cp -r ref-downloader/skills/ref-downloader .github/skills/
# Project-local: cp -r ref-downloader/skills/ref-downloader .agents/skills/
cd ~/.claude/skills/ref-downloader # or wherever you copied it
pip install playwright pymupdf
playwright install msedge
cp config.example.toml config.local.toml # then set [crossref].mailto
# In your agent: just describe the task; the skill triggers via its description.
# Direct CLI for testing: python scripts/run_ref_downloader.py 10.1021/jacs.5c05017
What you'll see: 30–80 refs discovered for a typical chemistry/physics paper, then a mix of downloaded (refs your institution covers), manual_pending (SSO bounce or paywall), and occasional failed (publisher quirk). Run on a DOI from a journal your institution actually subscribes to for the highest hit rate. Details below.
Requirements
- OS: Windows 10/11 (verified). macOS / Linux untested — PRs welcome.
- Browser: Microsoft Edge (Stable channel). The script claims your persistent Edge profile, so close all Edge windows before running.
- Python: 3.11 or newer (uses stdlib
tomllib). - Optional: A Zotero installation (auto-detects DOI from a PDF's filename via Zotero's SQLite database — much faster than text extraction).
- Optional: PyMuPDF (
pip install pymupdf) for DOI extraction from PDF text when Zotero lookup is unavailable.
Install
As an agent skill (recommended)
Pick the install path for your agent framework:
| Framework | Install command |
|---|---|
| Claude Code | cp -r skills/ref-downloader ~/.claude/skills/ |
| Claude Agent SDK | same (auto-discovers ~/.claude/skills/) |
| Codex CLI | cp -r skills/ref-downloader ~/.codex/skills/ |
| Copilot CLI / VS Code agent | cp -r skills/ref-downloader .github/skills/ |
| Any framework (project-local) | cp -r skills/ref-downloader .agents/skills/ |
Then install Python prereqs INSIDE the copied skill folder (the skill protocol doesn't manage Python deps):
cd ~/.claude/skills/ref-downloader # or wherever you copied it
pip install playwright pymupdf # or use the source's requirements.txt
playwright install msedge
cp config.example.toml config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.
# Windows: notepad config.local.toml
# macOS / Linux: $EDITOR config.local.toml (or vim / nano / code / ...)
As a Python tool (for developers)
If you want to hack on the code, the skill folder is a runnable Python project:
git clone https://github.com/ltczding-gif/ref-downloader.git
cd ref-downloader
pip install -r requirements.txt -r requirements-dev.txt
playwright install msedge
cp skills/ref-downloader/config.example.toml skills/ref-downloader/config.local.toml
# Edit config.local.toml — at minimum set [crossref].mailto.
# Run the offline test suite
python -m pytest tests/ -v
# Run the tool directly
python skills/ref-downloader/scripts/run_ref_downloader.py 10.1021/jacs.5c05017
Usage examples
(After install — paths assume the skill is at <SKILL_DIR>, e.g. ~/.claude/skills/ref-downloader/. In source, <SKILL_DIR> = skills/ref-downloader/.)
Input: a DOI
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017
Default output: <cwd>/jacs.5c05017_refs/jacs.5c05017/
Input: a local PDF (with DOI in metadata or in PDF text)
python <SKILL_DIR>/scripts/run_ref_downloader.py "C:\path\to\your_paper.pdf"
Default output: <pdf_dir>/your_paper_refs/<doi-derived-name>/
Custom output directory
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --output-dir refs/
Non-interactive (CI / batch)
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --yes --auto
Alternate config file
python <SKILL_DIR>/scripts/run_ref_downloader.py 10.1021/jacs.5c05017 --config ./alt.toml
Configuration
All configuration lives in config.local.toml (gitignored). Copy config.example.toml to bootstrap.
| Section | Key | Purpose |
|---|---|---|
[crossref] |
mailto |
Your email — entry into Crossref polite pool |
[zotero] |
db_path |
Optional path to zotero.sqlite for DOI lookup from PDF filename |
[browser] |
edge_profile_dir |
Edge profile directory; empty = OS default |
[browser] |
disable_extensions |
Set true to launch with --disable-extensions |
[institution] |
auth_hosts |
Hostnames that mean "you got bounced to SSO" (e.g. ["sso.your-uni.edu"]) |
[institution] |
auth_url_fragments |
URL substrings indicating SSO (e.g. ["oauth", "saml"]) |
[institution] |
auth_page_titles |
<title> text for SSO pages (catches HTML served as PDF) |
[institution] |
auth_loading_titles |
Loading-page titles (also reused for AIP/AVS publisher loading detection) |
[institution] |
ignored_access_dois |
DOIs you know are paywalled at your institution; skipped without retry |
Environment variables override file values:
| Variable | Maps to |
|---|---|
REF_DOWNLOADER_MAILTO |
crossref.mailto |
REF_DOWNLOADER_ZOTERO_DB |
zotero.db_path |
REF_DOWNLOADER_EDGE_PROFILE |
browser.edge_profile_dir |
REF_DOWNLOADER_DISABLE_EXTENSIONS |
browser.disable_extensions (1/true to enable) |
REF_DOWNLOADER_CONFIG |
Path to alternate TOML file |
See skills/ref-downloader/config.example.toml for full documentation.
Architecture
Three-stage pipeline + a wrapper:
skills/ref-downloader/
├── SKILL.md agent runbook (slim entry)
├── references/agent-runbook.md extended manual flow + DOI fallback
├── config.example.toml config schema (copy to config.local.toml)
└── scripts/
├── run_ref_downloader.py entry — config + DOI resolution + sequencing
│ └─> extract_refs.py (1) Crossref API: fetch parent's reference list
│ └─> validate_refs.py (2) Crossref API: per-ref metadata + publisher classify
│ └─> download_refs.py (3) Playwright/Edge: download main PDF + SI per publisher
└── _config.py TOML + env-var loader
You can also run the three scripts manually for debugging or partial restarts. See the agent runbook in skills/ref-downloader/references/agent-runbook.md for the manual flow.
Agent users can install or inspect the packaged skill at skills/ref-downloader/SKILL.md. The repository root remains the human-facing Python project; the skill bundle is kept separate so Codex does not treat README, changelog, tests, and source files as always-associated skill context.
Supported publishers
ACS, Nature, Science, Elsevier, Wiley, RSC, Springer, PNAS, ECS, IOP, AIP, AVS, IEEE, OSA, KPS, Beilstein, APS, Annual Reviews, Taylor & Francis. Maturity varies — see docs/SUPPORTED_PUBLISHERS.md for the per-publisher tier table and known issues.
Known limitations
- Windows + Microsoft Edge only: that's the verified path. macOS / Linux / Chromium support has not been tested. If you try, please open an issue with results.
- Headed mode required: empirically,
headless=Trueyields empty results for Wiley / ACS supplementary downloads. The default is headed. - Edge must be fully closed before running: Playwright needs exclusive access to the persistent profile. Check Task Manager for any background
msedge.exeprocesses. - SSO redirects are detected, not solved: when the script bounces to your institution's SSO, the ref becomes
manual_pendingso you can sign in interactively. Configure[institution]to teach it which redirects to recognize. - SI download is the most fragile path: main PDFs are reliable; SI lookup varies by publisher and is the area most likely to need a tweak when a publisher updates their site.
- Paywalled content needs institutional access: this is not a bypass tool.
- Crossref dependency: papers with no reference list deposited at Crossref can't be processed automatically.
Security
This tool launches your real Edge profile, with all your cookies and saved sessions. Read SECURITY.md before running it against a profile you also use for daily browsing.
Comments