PDF & Word 文件結構化解析與管理工具
This skill parses PDF and Word (.docx) documents, slices them into structured Markdown sections based on document headings, stores the segments in a local SQLite database, and supports full-text search, TOC viewing, chunk retrieval, and cascading deletion.
本工具(Skill)專為 Hermes 代理人 (Agent) 設計,用於解析 PDF 與 Word (.docx) 文件。它能根據標題層級將長文件切割成結構化的 Markdown 區塊,存入本地 SQLite 資料庫,並支援全文檢索、目錄(TOC)檢視、特定區塊讀取以及階層式刪除。
English Guide
Features
- Batch Processing: Parses PDF pages in 50-page batches using pymupdf4llm (~16% performance improvement on large files).
- Heading Page-Number Synchronization: Syncs headings to exact physical pages via the PDF's internal Table of Contents (Bookmarks).
- SQLite FTS5 Full-Text Search: Searches thousands of text chunks by keyword, with wildcard matching and a LIKE fallback.
- Crash-Safe Operations: Writes DB records before physical file updates, eliminating orphaned files on disk if the script crashes.
- Duplicate Document Guard: Automatically removes the old database entries and the old output folder when a file with the same name is parsed again.
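The 50-page batching from the first feature can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: `page_batches` is a hypothetical helper, and the comment about handing each batch to `pymupdf4llm.to_markdown` through its `pages` argument is an assumption about how the batches would be consumed.

```python
from typing import Iterator, List


def page_batches(total_pages: int, batch_size: int = 50) -> Iterator[List[int]]:
    """Yield lists of 0-based page indices, at most batch_size per list."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_pages, batch_size):
        yield list(range(start, min(start + batch_size, total_pages)))


# Hypothetical usage: each batch could be handed to a converter, for example
# pymupdf4llm.to_markdown(path, pages=batch), instead of the whole file at once.
batches = list(page_batches(120))
print([(b[0], b[-1]) for b in batches])  # [(0, 49), (50, 99), (100, 119)]
```

The last batch is simply shorter when the page count is not a multiple of the batch size.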
Prerequisites
- Python: 3.10+ (Python 3.11+ recommended)
- Required Libraries:
  - PyMuPDF (for PDF rendering & layout parsing)
  - pymupdf4llm (for Markdown conversions)
  - python-docx (for Word document structure parsing)
Installation
1. Register Skill
Depending on your agent setup, place this skill directory in one of the following locations:
- Project Workspace Scope: .agents/skills/document_structuring/
- Global User Scope: ~/.hermes/skills/document_structuring/ or ~/.gemini/config/skills/document_structuring/
2. Install Dependencies
Run the following command in your terminal to install the necessary packages:
pip install PyMuPDF pymupdf4llm python-docx
Usage Guide
Run the utility script via CLI. Always run commands from the project workspace root. Except for delete, all commands require a --output file path to write results.
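For callers that drive the CLI from Python, the convention above (every command except delete writes JSON to an --output file) could be wrapped roughly like this. The helper names `build_command` and `run_tool` are hypothetical, and reading the output file as UTF-8 is an assumption; only the script path and flags come from the commands in this guide.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Any, Dict, List

SCRIPT = "scripts/document_tool.py"  # path taken from the commands in this guide


def build_command(subcommand: str, output: str, **options: Any) -> List[str]:
    """Build the argument list, turning doc_id=1 into '--doc-id 1'."""
    cmd = [sys.executable, SCRIPT, subcommand]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd + ["--output", output]


def run_tool(subcommand: str, output: str, **options: Any) -> Dict[str, Any]:
    """Run the CLI from the workspace root and load the JSON file it writes."""
    subprocess.run(build_command(subcommand, output, **options), check=True)
    return json.loads(Path(output).read_text(encoding="utf-8"))


print(build_command("toc", "toc_data.json", doc_id=1)[2:])
# ['toc', '--doc-id', '1', '--output', 'toc_data.json']
```

With the tool installed, `run_tool("search", "search_results.json", query="ACPI")` would then return the parsed result dictionary.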
1. Parse a Document
Parses a PDF/DOCX and stores chunks.
python scripts/document_tool.py parse --file "your-manual.pdf" --output "parse_result.json"
- Output JSON Format (parse_result.json): { "success": true, "document_id": 1, "filename": "your-manual.pdf", "chunk_count": 9834 }
2. List All Parsed Documents
Lists all documents saved in the SQLite database.
python scripts/document_tool.py list --output "documents_list.json"
- Output JSON Format (documents_list.json): { "documents": [ { "id": 1, "filename": "your-manual.pdf", "upload_time": "2026-06-19 07:44:28", "chunk_count": 9834, "status": "success" } ] }
3. Retrieve Table of Contents (TOC)
Returns every heading of a document together with its chunk file path, sorted by section number.
python scripts/document_tool.py toc --doc-id 1 --output "toc_data.json"
- Output JSON Format (toc_data.json): { "toc": [ { "id": 1, "section_number": "1", "title": "Introduction", "page_start": 1, "file_path": "output/1/chunks/1_1_Introduction.md" }, { "id": 2, "section_number": "1.1", "title": "Background", "page_start": 2, "file_path": "output/1/chunks/2_1.1_Background.md" } ] }
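Sorting by section number needs a numeric comparison, because a plain string sort would place "1.10" before "1.2". The sketch below shows one way to do it; `section_key` is a hypothetical helper, and how the tool itself orders sections is not shown in this guide.

```python
from typing import List, Tuple


def section_key(section_number: str) -> Tuple[int, ...]:
    """Turn '1.10.2' into (1, 10, 2) so that 1.2 sorts before 1.10."""
    return tuple(int(part) for part in section_number.split(".") if part.isdigit())


sections: List[str] = ["1.10", "1.2", "2", "1", "1.1.2"]
print(sorted(sections, key=section_key))
# ['1', '1.1.2', '1.2', '1.10', '2']
```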
4. Search across Chunks (Full-Text)
Searches headings and contents using the SQLite FTS5 index.
python scripts/document_tool.py search --query "ACPI" --output "search_results.json"
- Output JSON Format (search_results.json): { "results": [ { "id": 123, "document_id": 1, "section_number": "3.2", "title": "ACPI States", "page_start": 45, "file_path": "output/1/chunks/123_3.2_ACPI_States.md", "document_name": "your-manual.pdf", "snippet": "This section explains ==ACPI== states and sleep modes..." } ] }
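A search with `==` highlight markers and a LIKE fallback, as in the result above, could look roughly like the sketch below. It is an illustration only: the `chunks_fts` index name, the reduced three-column `chunks` table, and the fallback policy are assumptions, not the tool's actual schema or logic.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO chunks (id, title, content) VALUES (?, ?, ?)",
    [
        (123, "ACPI States", "This section explains ACPI states and sleep modes."),
        (124, "Clock Gating", "Power savings through clock gating."),
    ],
)

try:
    # Hypothetical index name; requires an SQLite build with FTS5 enabled.
    conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(title, content)")
    conn.execute(
        "INSERT INTO chunks_fts (rowid, title, content) "
        "SELECT id, title, content FROM chunks"
    )
    HAS_FTS5 = True
except sqlite3.OperationalError:
    HAS_FTS5 = False


def search(query: str) -> List[Tuple[int, str, str]]:
    """Return (id, title, snippet) rows, trying FTS5 first and LIKE second."""
    if HAS_FTS5:
        try:
            return conn.execute(
                "SELECT rowid, title, snippet(chunks_fts, 1, '==', '==', '...', 12) "
                "FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY rank",
                (query,),
            ).fetchall()
        except sqlite3.OperationalError:
            pass  # e.g. the query is not valid FTS5 syntax
    like = f"%{query}%"
    return conn.execute(
        "SELECT id, title, content FROM chunks WHERE title LIKE ? OR content LIKE ?",
        (like, like),
    ).fetchall()


for row in search("ACPI"):
    print(row)
```

In the LIKE path the third column is the full content rather than a highlighted snippet, which keeps the fallback simple at the cost of longer results.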
5. Retrieve a Specific Chunk's Content
Reads the full markdown content of a single section block.
python scripts/document_tool.py get-chunk --chunk-id 123 --output "chunk_content.json"
- Output JSON Format (chunk_content.json): { "chunk": { "id": 123, "document_id": 1, "section_number": "3.2", "title": "ACPI States", "content": "Full markdown content of this section...", "page_start": 45, "file_path": "output/1/chunks/123_3.2_ACPI_States.md", "document_name": "your-manual.pdf" } }
6. Delete a Document
Performs a cascading deletion: removes the document's database entry, removes its chunk records through the foreign-key cascade, and deletes the files inside output/<doc_id>/ from disk.
python scripts/document_tool.py delete --doc-id 1
- Stdout Output:
Success: Document 'your-manual.pdf' (ID: 1) and all its associated chunks have been deleted.
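The cascading deletion can be illustrated with a small in-memory database. This is a sketch under stated assumptions, not the tool's code: the tables are reduced to a few columns, `delete_document` is a hypothetical helper, and a temporary directory stands in for output/. Note that SQLite only honours ON DELETE CASCADE when foreign keys are enabled on the connection.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # per connection; off by default in SQLite
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY AUTOINCREMENT, filename TEXT);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        document_id INTEGER REFERENCES documents(id) ON DELETE CASCADE,
        title TEXT
    );
    INSERT INTO documents (id, filename) VALUES (1, 'your-manual.pdf');
    INSERT INTO chunks (document_id, title) VALUES (1, 'Introduction'), (1, 'Background');
    """
)

output_root = Path(tempfile.mkdtemp())  # stand-in for the output/ folder
(output_root / "1" / "chunks").mkdir(parents=True)


def delete_document(doc_id: int) -> None:
    """Delete the document row (chunks follow via cascade), then its folder."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
    shutil.rmtree(output_root / str(doc_id), ignore_errors=True)


delete_document(1)
remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(remaining, (output_root / "1").exists())  # 0 False
```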
Database Schema
The SQLite database file documents.db contains three tables:
- _meta: Stores internal schema configuration (version checks).
- documents:
  - id: Auto-incrementing primary key.
  - filename: Name of the source file.
  - upload_time: Formatted timestamp.
  - chunk_count: Total chunks generated.
  - status: Import status (success, failed).
- chunks:
  - id: Auto-incrementing primary key (sequential rowid).
  - document_id: Foreign key cascading on delete.
  - section_number: Normalised section heading index (e.g. 1.1.2).
  - title: Sanitised heading title text.
  - content: Extracted text/markdown body.
  - page_start: 1-based page start in the source PDF.
  - file_path: Relative path of the physical .md file.
中文使用指南
功能特點
- 批次 PDF 解析:改用 50 頁批次解析模式調用 pymupdf4llm,大幅降低排版分析開銷(大文件解析速度提升約 16%)。
- 書籤目錄頁碼同步:透過 PDF 內建的 TOC (Table of Contents / Bookmark) 書籤頁碼比對,將標題解析的 page_start 與實體頁碼對齊。
- SQLite FTS5 全文檢索:將標題與內文同步更新至虛擬表,支援快速多詞 AND 檢索,並包含 LIKE 語法容錯。
- 斷電/崩潰安全設計:採用「先 Commit 資料庫,後寫入實體檔案」順序。即使寫檔途中崩潰,資料庫與磁碟狀態也保持一致,不產生多餘髒檔案。
- 重置除錯機制:重新 parse 同檔名文件時,會自動清除舊版資料庫列與實體磁碟目錄,防止空間洩漏。
環境要求
- Python 版本:3.10+ (建議 Python 3.11 以上)
- 依賴套件:
  - PyMuPDF (PDF 渲染及版面結構提取)
  - pymupdf4llm (Markdown 文本轉換)
  - python-docx (Word 文件結構解析)
安裝步驟
1. 置放 Skill
根據您的代理人配置,將此 skill 資料夾放入以下其中一個目錄:
- 本機專案 scope: .agents/skills/document_structuring/
- 全域環境 scope: ~/.hermes/skills/document_structuring/ 或 ~/.gemini/config/skills/document_structuring/
2. 安裝 Python 依賴
在終端機中執行以下指令以安裝必要的套件:
pip install PyMuPDF pymupdf4llm python-docx
使用說明
透過命令列 CLI 執行工具,指令請一律在 專案工作目錄根路徑 (Workspace Root) 下執行。除了 delete 指令外,其餘指令皆必須提供 --output 參數來寫出 JSON 結果。
1. 解析檔案 (Parse)
將 PDF 或 DOCX 長文件切片存入資料庫與磁碟。
python scripts/document_tool.py parse --file "your-manual.pdf" --output "parse_result.json"
- 輸出 JSON (parse_result.json): { "success": true, "document_id": 1, "filename": "your-manual.pdf", "chunk_count": 9834 }
2. 列出已解析文件 (List)
列出 SQLite 資料庫中目前所有管理的文件。
python scripts/document_tool.py list --output "documents_list.json"
- 輸出 JSON (documents_list.json): { "documents": [ { "id": 1, "filename": "your-manual.pdf", "upload_time": "2026-06-19 07:44:28", "chunk_count": 9834, "status": "success" } ] }
3. 讀取目錄結構 (TOC)
取得指定文件底下的所有章節與檔案路徑,依章節號排序。
python scripts/document_tool.py toc --doc-id 1 --output "toc_data.json"
- 輸出 JSON (toc_data.json): { "toc": [ { "id": 1, "section_number": "1", "title": "Introduction", "page_start": 1, "file_path": "output/1/chunks/1_1_Introduction.md" }, { "id": 2, "section_number": "1.1", "title": "Background", "page_start": 2, "file_path": "output/1/chunks/2_1.1_Background.md" } ] }
4. 全文檢索章節 (Search)
在所有文件的章節標題與內文進行關鍵字全文檢索。
python scripts/document_tool.py search --query "ACPI" --output "search_results.json"
- 輸出 JSON (search_results.json): { "results": [ { "id": 123, "document_id": 1, "section_number": "3.2", "title": "ACPI States", "page_start": 45, "file_path": "output/1/chunks/123_3.2_ACPI_States.md", "document_name": "your-manual.pdf", "snippet": "This section explains ==ACPI== states and sleep modes..." } ] }
5. 獲取特定章節內文 (Get Chunk)
藉由資料庫 Chunk ID 提取該章節的完整 Markdown 格式內容。
python scripts/document_tool.py get-chunk --chunk-id 123 --output "chunk_content.json"
- 輸出 JSON (chunk_content.json): { "chunk": { "id": 123, "document_id": 1, "section_number": "3.2", "title": "ACPI States", "content": "Full markdown content of this section...", "page_start": 45, "file_path": "output/1/chunks/123_3.2_ACPI_States.md", "document_name": "your-manual.pdf" } }
6. 刪除文件項目 (Delete)
自資料庫刪除文件(階層刪除 chunks 外鍵),並抹除磁碟上的實體 output/<doc_id>/ 資料夾。
python scripts/document_tool.py delete --doc-id 1
- 終端機輸出:
Success: Document 'your-manual.pdf' (ID: 1) and all its associated chunks have been deleted.
資料庫結構
SQLite 資料庫檔案預設為 documents.db,由以下三張資料表組成:
- _meta:記錄資料庫內部配置版本以提供升級防護。
- documents(文件主表):
  - id: 資料庫自動遞增主鍵。
  - filename: 來源檔名。
  - upload_time: 上傳時間(YYYY-MM-DD HH:MM:SS)。
  - chunk_count: 切片後的總章節數。
  - status: 解析狀態 (success, failed)。
- chunks(章節切片表):
  - id: 資料庫遞增主鍵(實體 rowid)。
  - document_id: 外鍵,串聯 documents.id(外鍵 ON DELETE CASCADE)。
  - section_number: 歸一化章節號 (例如 1.1.2)。
  - title: 標題文字。
  - content: 切出的實體文字/Markdown 內容。
  - page_start: 該段落起始頁碼 (1-based)。
  - file_path: 磁碟上實體 .md 檔案的相對路徑。
Token Economics / Token 經濟效益
Large technical manuals (e.g., 2,000+ page PDFs) strain or exceed LLM context limits. This tool cuts token usage sharply by serving only the targeted section chunks.
大型技術手冊(例如 2,000 頁以上之規格書)會帶來龐大的 Token 開銷。本工具利用全文檢索精確提取段落,可達到極高的 Token 壓縮效益:
| Scenario / 場景 | Raw PDF Tokens / 原始 Tokens | Via Chunk Retrieval / 使用切片提取 | Compression Ratio / 壓縮倍率 |
|---|---|---|---|
| Full 2,400-page PPR | ~4.5M tokens | FTS5 search → top 5 hits → ~2.5K tokens | ~1,800× |
| Chapter query (e.g. "MSR registers") | ~150K tokens (whole chapter) | get-chunk → ~3K tokens per section | ~50× |
| Specific register lookup | Full PDF needed otherwise | Targeted chunk → ~1.2K tokens | ~3,750× |
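The compression ratios in the table are the raw token count divided by the retrieved token count. The short check below reproduces them; `compression_ratio` is a hypothetical helper, and the third row is assumed to be measured against the same ~4.5M-token full PDF as the first.

```python
def compression_ratio(raw_tokens: int, retrieved_tokens: int) -> float:
    """How many times smaller the retrieved context is than the raw source."""
    return raw_tokens / retrieved_tokens


print(compression_ratio(4_500_000, 2_500))  # 1800.0
print(compression_ratio(150_000, 3_000))    # 50.0
print(compression_ratio(4_500_000, 1_200))  # 3750.0
```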
Troubleshooting / 常見問題與排除
ModuleNotFoundError: No module named 'fitz' / 'docx'
- Reason: Dependencies are missing in your active environment.
- Solution: Ensure your Python environment is active and install the libraries: pip install PyMuPDF pymupdf4llm python-docx
- 原因:目前作用中的 Python 環境尚未安裝必要的解析庫。
- 解法:請確保使用正確的 Python 環境,並執行: pip install PyMuPDF pymupdf4llm python-docx
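Before running the tool, the active environment can be checked for the three dependencies without importing them. The sketch below is a suggestion only: `missing_packages` is a hypothetical helper, and the import-name mapping relies on the `fitz` and `docx` module names that appear in the error message.

```python
from importlib.util import find_spec
from typing import Dict, List

# Import name -> pip package name. 'fitz' and 'docx' are the module names from
# the error message; 'pymupdf4llm' is assumed to import under its own name.
REQUIRED: Dict[str, str] = {
    "fitz": "PyMuPDF",
    "pymupdf4llm": "pymupdf4llm",
    "docx": "python-docx",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return pip package names whose import module cannot be found."""
    return [pkg for module, pkg in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing, install with: pip install " + " ".join(missing))
else:
    print("All dependencies are importable.")
```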
Missing --output Parameter / 缺少 --output 參數
- Reason: Except for delete, all CLI queries must output JSON to a designated file.
- Solution: Always append --output result.json to your commands.
- 原因:除了刪除指令之外,CLI 工具預期一律將結構化 JSON 資料導出至指定路徑。
- 解法:請務必在指令尾端加入 --output <檔案路徑.json>。
Relative Path / Working Directory Issues / 工作路徑混亂
- Reason: Running the CLI from inside the scripts/ folder or from a global path.
- Solution: Always navigate to the project workspace root (where documents.db and output/ should reside) before calling Python commands.
- 原因:於 scripts/ 資料夾內或任意全域路徑執行指令,會導致資料庫與 Markdown 檔案被建立在錯誤的地方。
- 解法:請一律將終端機切換至 專案工作目錄根路徑,再行執行 python scripts/document_tool.py ...。
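A quick guard against running from the wrong directory is to check that the script path used throughout this guide resolves from the current directory. `is_workspace_root` is a hypothetical helper, and treating the presence of scripts/document_tool.py as the marker of the workspace root is an assumption based on the commands above.

```python
from pathlib import Path


def is_workspace_root(directory: Path) -> bool:
    """True if scripts/document_tool.py exists directly under directory."""
    return (directory / "scripts" / "document_tool.py").is_file()


if not is_workspace_root(Path.cwd()):
    print("Warning: run commands from the project workspace root.")
```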