@@ -67,20 +67,60 @@ uvx pdf2sqlite --offline -p ../data/*.pdf -d data.db -a
67 67
68 68 ### Integration with an LLM
69 69
70- Some design guidelines:
71-
72- 1. Pass the database schema to the LLM. The schema will contain some comments
73- that describe the different columns.
74-
75- 2. To get the most of the database, you will probably want to write a tool that
76- your LLM can call to convert binary pdf and image data stored in the
77- database into images and PDF pages. A good design is to allow the LLM to
78- pass in a table name, row id and column name, and receive the relevant
79- content as a response. The LLM will generally be able to discern the
80- necessary inputs from the schema, so the tool will be robust against future
81- schema changes.
82-
83- 3. A backend (like, e.g. Amazon Bedrock) that supports returning PDFs as the
84- result of a tool call may be helpful, although it will probably work to
85- return the PDF as a separate content block alongside a tool call result that
86- just says "success, PDF will be delivered" or something similar.
70+ For many purposes, it should be enough to connect the LLM to a generic SQLite
71+ tool, either through an MCP server like
72+ [this](https://github.com/modelcontextprotocol/servers-archived/tree/main/src/sqlite)
73+ reference server, or by giving a coding agent like Claude Code access to a CLI
74+ tool like `sqlite3`. Ordinary SQLite queries will let the LLM access the full
75+ text of each page, along with any textual transcriptions of tables or
76+ descriptions of figures included in the database.
77+
78+ However, a vision model can also examine the original pages, tables, or
79+ figures directly, since these are saved in the database. For that reason, we
80+ ship a simple MCP server that includes tools and resources for retrieving
81+ these kinds of data.
82+
83+ An example configuration for Claude Desktop might be:
84+
85+ ```json
86+ {
87+   "mcpServers": {
88+     "pdf2sqlite": {
89+       "command": "uvx",
90+       "args": [
91+         "--from",
92+         "pdf2sqlite",
93+         "pdf2sqlite-mcp",
94+         "--database",
95+         "MyDatabase.db"
96+       ]
97+     }
98+   }
99+ }
100+ ```
101+
102+ Full usage details are below.
103+
104+ ```
105+ usage: pdf2sqlite-mcp [-h] [-d DATABASE] [--max-blob-bytes MAX_BLOB_BYTES]
106+                       [--default-limit DEFAULT_LIMIT] [--max-limit MAX_LIMIT]
107+                       [--transport {sse,stdio,streamable-http}] [--host HOST]
108+                       [--port PORT]
109+
110+ Expose pdf2sqlite databases over the Model Context Protocol
111+
112+ options:
113+   -h, --help            show this help message and exit
114+   -d, --database DATABASE
115+                         Path to the sqlite database produced by pdf2sqlite
116+   --max-blob-bytes MAX_BLOB_BYTES
117+                         Maximum blob size the server will return (bytes)
118+   --default-limit DEFAULT_LIMIT
119+                         Default limit for listing queries
120+   --max-limit MAX_LIMIT
121+                         Maximum limit for listing queries
122+   --transport {sse,stdio,streamable-http}
123+                         Transport to use when running the server
124+   --host HOST           Host name for SSE or HTTP transports
125+   --port PORT           Port for SSE or HTTP transports
126+ ```
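
The generic route described in this change (plain SQLite access, plus a way to pull binary page or figure data out by table, column, and row) can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions, not the code that `pdf2sqlite-mcp` ships: the function names `describe_schema` and `fetch_blob` and the `max_bytes` default are hypothetical, and the sketch makes no assumption about the actual table or column names, which it reads from the database itself.

```python
import sqlite3
from contextlib import closing


def describe_schema(db_path):
    """Return the CREATE statements for all tables and views.

    The statements are returned as stored by SQLite, so any comments
    written inside them are preserved for the LLM to read.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master "
            "WHERE type IN ('table', 'view') AND sql IS NOT NULL "
            "ORDER BY name"
        ).fetchall()
    return "\n\n".join(row[0] for row in rows)


def fetch_blob(db_path, table, column, rowid, max_bytes=10_000_000):
    """Fetch one binary value by table name, column name, and rowid.

    Hypothetical helper: identifiers are checked against the live schema
    before being interpolated, because SQL parameters cannot stand in
    for table or column names. Assumes the table has a rowid.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        tables = {
            r[0]
            for r in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        if table not in tables:
            raise ValueError(f"unknown table: {table!r}")

        columns = {
            r[0]
            for r in conn.execute(
                "SELECT name FROM pragma_table_info(?)", (table,)
            )
        }
        if column not in columns:
            raise ValueError(f"unknown column: {column!r}")

        # Quote the validated identifiers, doubling any embedded quotes.
        quoted_table = '"' + table.replace('"', '""') + '"'
        quoted_column = '"' + column.replace('"', '""') + '"'
        row = conn.execute(
            f"SELECT {quoted_column} FROM {quoted_table} WHERE rowid = ?",
            (rowid,),
        ).fetchone()

    if row is None or row[0] is None:
        return None
    data = row[0]
    if not isinstance(data, bytes):
        raise TypeError(f"{table}.{column} does not hold binary data")
    if len(data) > max_bytes:
        raise ValueError("blob exceeds max_bytes")
    return data
```

A caller would first hand the output of `describe_schema("MyDatabase.db")` to the model, then let it request specific blobs by the names it finds there. Reading names from the schema at call time, instead of hard-coding them, keeps such a helper usable if the schema changes.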