Commit 411ab43

Update README
1 parent d7edbe1 commit 411ab43

1 file changed: +57 -17 lines changed

README.md

Lines changed: 57 additions & 17 deletions
@@ -67,20 +67,60 @@ uvx pdf2sqlite --offline -p ../data/*.pdf -d data.db -a

 ### Integration with an LLM

-Some design guidelines:
-
-1. Pass the database schema to the LLM. The schema will contain some comments
-   that describe the different columns.
-
-2. To get the most out of the database, you will probably want to write a tool
-   that your LLM can call to convert binary PDF and image data stored in the
-   database into images and PDF pages. A good design is to allow the LLM to
-   pass in a table name, row id and column name, and receive the relevant
-   content as a response. The LLM will generally be able to discern the
-   necessary inputs from the schema, so the tool will be robust against future
-   schema changes.
-
-3. A backend (e.g. Amazon Bedrock) that supports returning PDFs as the
-   result of a tool call may be helpful, although it will probably work to
-   return the PDF as a separate content block alongside a tool call result that
-   just says "success, PDF will be delivered" or something similar.
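The tool design in removed guideline 2 (table name, row id and column name in, content out) could look roughly like the sketch below. It is not taken from pdf2sqlite: `fetch_blob`, `quote_ident`, and the `max_bytes` cap are hypothetical names, and the sketch assumes the target table is an ordinary rowid table. Table and column names cannot be bound as SQL parameters, so they are checked against the live schema before being interpolated.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQLite identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def fetch_blob(conn, table, row_id, column, max_bytes=10_000_000):
    """Return the bytes stored at (table, rowid, column), or None if absent.

    Hypothetical helper: names and the size cap are assumptions, not part
    of pdf2sqlite.
    """
    # Validate the identifiers against the schema before building the query.
    tables = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in tables:
        raise ValueError(f"unknown table: {table!r}")
    columns = {
        row[1] for row in conn.execute(f"PRAGMA table_info({quote_ident(table)})")
    }
    if column not in columns:
        raise ValueError(f"unknown column: {column!r}")

    # The row id is an ordinary value, so it is bound as a parameter.
    row = conn.execute(
        f"SELECT {quote_ident(column)} FROM {quote_ident(table)} WHERE rowid = ?",
        (row_id,),
    ).fetchone()
    if row is None or row[0] is None:
        return None
    value = row[0]
    if not isinstance(value, bytes):
        raise TypeError(f"{table}.{column} does not hold binary data")
    if len(value) > max_bytes:
        raise ValueError(f"blob is {len(value)} bytes, limit is {max_bytes}")
    return value
```

Because the identifiers are looked up in the schema at call time, a wrapper like this keeps working when tables or columns are renamed, which is the robustness the guideline refers to.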
+For many purposes, it should be enough to connect the LLM to a generic SQLite
+tool, either by using an MCP server like
+[this](https://github.com/modelcontextprotocol/servers-archived/tree/main/src/sqlite)
+reference server, or by giving a coding agent like Claude Code access to a CLI
+tool like `sqlite3`. Ordinary SQLite queries will let the LLM access the full
+text of each page, along with any textual transcriptions of tables or
+descriptions of figures included in the database.
+
+However, it's also possible for a vision model to directly examine the original
+pages, tables, or figures, since these are saved in the database. So, we ship a
+simple MCP server that includes tools and resources for retrieving these kinds
+of data.
+
+An example configuration for Claude Desktop might be:
+
+```json
+{
+  "mcpServers": {
+    "pdf2sqlite": {
+      "command": "uvx",
+      "args": [
+        "--from",
+        "pdf2sqlite",
+        "pdf2sqlite-mcp",
+        "--database",
+        "MyDatabase.db"
+      ]
+    }
+  }
+}
+```
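If the same entry is needed for several databases, it can be generated rather than typed by hand. The sketch below rebuilds the configuration shown above from a database path; `claude_desktop_entry` is a hypothetical helper, and resolving the path to an absolute one is an assumption on the grounds that the client may not launch the server from the directory containing the database.

```python
import json
from pathlib import Path


def claude_desktop_entry(database):
    """Build the `mcpServers` fragment for one pdf2sqlite database file."""
    return {
        "mcpServers": {
            "pdf2sqlite": {
                "command": "uvx",
                "args": [
                    "--from",
                    "pdf2sqlite",
                    "pdf2sqlite-mcp",
                    "--database",
                    # Assumption: an absolute path avoids depending on the
                    # working directory the client starts the server in.
                    str(Path(database).expanduser().resolve()),
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(claude_desktop_entry("MyDatabase.db"), indent=2))
```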
+
+Full usage details are below.
+
+```
+usage: pdf2sqlite-mcp [-h] [-d DATABASE] [--max-blob-bytes MAX_BLOB_BYTES]
+                      [--default-limit DEFAULT_LIMIT] [--max-limit MAX_LIMIT]
+                      [--transport {sse,stdio,streamable-http}] [--host HOST]
+                      [--port PORT]
+
+Expose pdf2sqlite databases over the Model Context Protocol
+
+options:
+  -h, --help            show this help message and exit
+  -d, --database DATABASE
+                        Path to the sqlite database produced by pdf2sqlite
+  --max-blob-bytes MAX_BLOB_BYTES
+                        Maximum blob size the server will return (bytes)
+  --default-limit DEFAULT_LIMIT
+                        Default limit for listing queries
+  --max-limit MAX_LIMIT
+                        Maximum limit for listing queries
+  --transport {sse,stdio,streamable-http}
+                        Transport to use when running the server
+  --host HOST           Host name for SSE or HTTP transports
+  --port PORT           Port for SSE or HTTP transports
+```
