Add docs on generating embeddings from web #592

Open
wants to merge 1 commit into base: master
19 changes: 18 additions & 1 deletion docs/ai/build/rag.md
@@ -8,12 +8,29 @@ Retrieval Augmented Generation (RAG) allows developers to provide a knowledge ba

Defining the RAG dataset is largely up to the user. Currently only [Lance DB](https://lancedb.github.io/lancedb/) is supported. You can [review Lance DB's documentation](https://lancedb.github.io/lancedb/basic/) to determine the best way to ingest and embed your chosen RAG source data.
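
If you ingest a custom source yourself, the general shape is: generate an embedding for each chunk of text, then write the vectors and text into a Lance DB table. The snippet below is only a minimal sketch of that flow, assuming the `@lancedb/lancedb` and `ollama` packages, a locally running Ollama server, and an illustrative `content` column; adapt the schema to your own data.

```ts
import * as lancedb from "@lancedb/lancedb";
import ollama from "ollama"; // assumes the `ollama` npm package and a local Ollama server

// Example source data; in practice these would be chunks of your own documents
const docs = [
  "SubQuery is a data indexing toolkit.",
  "RAG grounds LLM answers in your own data.",
];

// Embed each chunk with the nomic-embed-text model
const rows = await Promise.all(
  docs.map(async (content) => {
    const { embedding } = await ollama.embeddings({
      model: "nomic-embed-text",
      prompt: content,
    });
    // The `content` column name is illustrative, not a required schema
    return { vector: embedding, content };
  }),
);

// Write the rows into a Lance DB table that the AI app can later query
const db = await lancedb.connect("./db");
await db.createTable("your-table-name", rows);
```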

We do provide an off the shelf way to create a table from markdown files. This will parse and chunk the content appropriately and use the `nomic-embed-text` model to generate vectors.
We provide a couple of tools to create a table from different sources.

### From Markdown

This will parse and chunk the content appropriately and use the `nomic-embed-text` model to generate vectors.

```shell
subql-ai embed-mdx -i ./path/to/dir/with/markdown -o ./db --table your-table-name --model nomic-embed-text
```

### From Web

This will parse all the visible text from the specified web page(s). You can specify a scope to control how far links are followed to pull in more data.
Collaborator

How do we scrape these pages? It would be good to provide some details on the library we use. And I imagine there are some limitations with dynamic websites, e.g. does this work with websites that need to execute JS?

Finally, how can I verify that this was able to scrape my website? Do we export the page content as text somewhere so I can verify this?


Scope options:
- `none` - Only the page at the specified URL
- `domain` - Only pages on the same domain as the URL
- `subdomain` - Pages on the domain of the URL and any of its subdomains

```shell
subql-ai embed-web -i https://subquery.network -o ./db --table your-table-name --model nomic-embed-text --scope domain
```
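
If you want to confirm what was scraped and embedded, one option is to open the resulting database and read a few rows back. The snippet below is a minimal sketch, assuming the `@lancedb/lancedb` client; the column names are illustrative and depend on how the table was created.

```ts
import * as lancedb from "@lancedb/lancedb";

// Open the database and table produced by `subql-ai embed-web`
const db = await lancedb.connect("./db");
const table = await db.openTable("your-table-name");

// Read back a handful of rows to confirm the scraped text was embedded
const rows = await table.query().limit(5).toArray();
for (const row of rows) {
  // `content` is an illustrative column name; inspect `row` to see the real schema
  console.log(row.content);
}
```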

::: info

You can follow a step-by-step tutorial on how to parse, vectorise, and add the resulting RAG database to your AI App in our [RAG quick start guide](../guides/subquery-docs-rag.md).
1 change: 1 addition & 0 deletions docs/ai/run/cli.md
@@ -6,6 +6,7 @@ Run a SubQuery AI app
Commands:
subql-ai Run a SubQuery AI app [default]
subql-ai info Get information on a project
subql-ai embed-web Creates a Lance db table with embeddings from a Web source
subql-ai embed-mdx Creates a Lance db table with embeddings from MDX files
subql-ai repl Creates a CLI chat with a running app
subql-ai publish Publishes a project to IPFS so it can be easily