diff --git a/README.md b/README.md
index 3112f65..3661ad0 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,10 @@ A way to share content from a specific domain using SQLite as an alternative to
 RSS feeds. The purpose of this library is to simply create a dataset for all the
 content on your website, using the XML sitemap as a starting point.
 
+Possibility to include vector search similarity features in the dataset very easily.
+
+Article that explains the rationale behind this type of datasets [here](https://philippeoger.com/pages/can-we-rag-the-whole-web/).
+
 
 ## Installation
 
@@ -15,15 +19,21 @@ pip install contentmap
 
 ## Quickstart
 
-To build your contentmap.db that will contain all your content using your XML
-sitemap as a starting point, you only need to write the following:
+To build your contentmap.db with vector search capabilities and containing all
+your content using your XML sitemap as a starting point, you only need to write the
+following:
 
 ```python
 from contentmap.sitemap import SitemapToContentDatabase
 
-database = SitemapToContentDatabase("https://yourblog.com/sitemap.xml")
-database.load()
+database = SitemapToContentDatabase(
+    sitemap_url="https://yourblog.com/sitemap.xml",
+    concurrency=10,
+    include_vss=True
+)
+database.build()
 
 ```
 
-You can control how many urls can be crawled concurrently and also set some timeout.
\ No newline at end of file
+This will automatically create the SQLite database file, with vector search
+capabilities (piggybacking on sqlite-vss integration on Langchain).
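
A minimal sketch, not part of this change, for checking what `database.build()` wrote to disk. It assumes the file is named `contentmap.db`, as in the Quickstart text. `list_tables` is a hypothetical helper, and the table names are read from the file itself because the schema is not shown in this diff. Only the standard-library `sqlite3` module is used.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path="contentmap.db"):
    """Return the names of all tables in the SQLite file at db_path.

    Hypothetical helper for illustration; not part of contentmap.
    """
    # Open read-only through a URI so that a missing file raises
    # sqlite3.OperationalError instead of silently creating an empty database.
    uri = f"file:{db_path}?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling `print(list_tables())` after a build should show which tables exist before you write queries against them. Reading `sqlite_master` only lists the schema, so it should not require loading the sqlite-vss extension.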