Commit 0e7d22b

Jan documentation updates (#1612)

* Update workflow docs
* Docs cleanup

1 parent 63042d2 commit 0e7d22b

File tree

3 files changed, +53 -53 lines changed


Diff for: docs/developing.md

+1
@@ -51,6 +51,7 @@ Available scripts are:
  - `poetry run poe test_unit` - This will execute unit tests.
  - `poetry run poe test_integration` - This will execute integration tests.
  - `poetry run poe test_smoke` - This will execute smoke tests.
+ - `poetry run poe test_verbs` - This will execute tests of the basic workflows.
  - `poetry run poe check` - This will perform a suite of static checks across the package, including:
    - formatting
    - documentation formatting

Diff for: docs/index/default_dataflow.md

+51 -45
@@ -7,7 +7,7 @@ The knowledge model is a specification for data outputs that conform to our data
  - `Document` - An input document into the system. These either represent individual rows in a CSV or individual .txt file.
  - `TextUnit` - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set `CHUNK_BY_COLUMNS` to `id` so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.
  - `Entity` - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
- - `Relationship` - A relationship between two entities. These are generated from the covariates.
+ - `Relationship` - A relationship between two entities.
  - `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
  - `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
  - `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
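
To make the shape of this knowledge model concrete, here is a minimal sketch of the output records as Python dataclasses. The field names are illustrative assumptions drawn from the descriptions above, not the exact output schema of the pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    id: str
    text: str  # raw text of a .txt file or CSV row
    text_unit_ids: list[str] = field(default_factory=list)


@dataclass
class TextUnit:
    id: str
    text: str  # the chunk of text analyzed by the LLM
    document_ids: list[str] = field(default_factory=list)


@dataclass
class Entity:
    title: str
    type: str
    description: str  # summarized from per-chunk descriptions


@dataclass
class Relationship:
    source: str  # entity title
    target: str  # entity title
    description: str


@dataclass
class Covariate:
    subject_id: str  # entity the claim is about
    claim: str  # time-bound factual statement


@dataclass
class Community:
    id: str
    level: int  # depth in the hierarchical clustering
    entity_ids: list[str] = field(default_factory=list)


@dataclass
class CommunityReport:
    community_id: str
    title: str
    summary: str
    full_content: str
```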
@@ -24,7 +24,7 @@ title: Dataflow Overview
  flowchart TB
  subgraph phase1[Phase 1: Compose TextUnits]
  documents[Documents] --> chunk[Chunk]
- chunk --> embed[Embed] --> textUnits[Text Units]
+ chunk --> textUnits[Text Units]
  end
  subgraph phase2[Phase 2: Graph Extraction]
  textUnits --> graph_extract[Entity & Relationship Extraction]
@@ -34,39 +34,37 @@ flowchart TB
  end
  subgraph phase3[Phase 3: Graph Augmentation]
  graph_outputs --> community_detect[Community Detection]
- community_detect --> graph_embed[Graph Embedding]
- graph_embed --> augmented_graph[Augmented Graph Tables]
+ community_detect --> community_outputs[Communities Table]
  end
  subgraph phase4[Phase 4: Community Summarization]
- augmented_graph --> summarized_communities[Community Summarization]
- summarized_communities --> embed_communities[Community Embedding]
- embed_communities --> community_outputs[Community Tables]
+ community_outputs --> summarized_communities[Community Summarization]
+ summarized_communities --> community_report_outputs[Community Reports Table]
  end
  subgraph phase5[Phase 5: Document Processing]
  documents --> link_to_text_units[Link to TextUnits]
  textUnits --> link_to_text_units
- link_to_text_units --> embed_documents[Document Embedding]
- embed_documents --> document_graph[Document Graph Creation]
- document_graph --> document_outputs[Document Tables]
+ link_to_text_units --> document_outputs[Documents Table]
  end
  subgraph phase6[Phase 6: Network Visualization]
- document_outputs --> umap_docs[Umap Documents]
- augmented_graph --> umap_entities[Umap Entities]
- umap_docs --> combine_nodes[Nodes Table]
- umap_entities --> combine_nodes
+ graph_outputs --> graph_embed[Graph Embedding]
+ graph_embed --> umap_entities[Umap Entities]
+ umap_entities --> combine_nodes[Final Nodes]
+ end
+ subgraph phase7[Phase 7: Text Embeddings]
+ textUnits --> text_embed[Text Embedding]
+ graph_outputs --> description_embed[Description Embedding]
+ community_report_outputs --> content_embed[Content Embedding]
  end
  ```

  ## Phase 1: Compose TextUnits

- The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source tex.
+ The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source text.

  The chunk size (counted in tokens), is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single "glean" step. (A "glean" step is a follow-on extraction). Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.

  The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log)

- Each of these text-units are text-embedded and passed into the next phase of the pipeline.
-
  ```mermaid
  ---
  title: Documents into Text Chunks
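
As a rough illustration of the chunking step described in this hunk, the sketch below splits a document into overlapping token-based chunks. The `tiktoken` tokenizer, the encoding name, and the overlap value are assumptions for illustration; only the 300-token default comes from the docs above.

```python
import tiktoken


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 100) -> list[str]:
    """Split text into chunks of roughly chunk_size tokens with overlap tokens of overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, not the project default
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Each resulting chunk would become one TextUnit linked back to its source document.
text_units = chunk_text("long document text ... " * 200)
```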
@@ -95,43 +93,39 @@ flowchart LR

  ### Entity & Relationship Extraction

- In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of **entities** with a _name_, _type_, and _description_, and a list of **relationships** with a _source_, _target_, and _description_.
+ In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of **entities** with a _title_, _type_, and _description_, and a list of **relationships** with a _source_, _target_, and _description_.

- These subgraphs are merged together - any entities with the same _name_ and _type_ are merged by creating an array of their descriptions. Similarly, any relationships with the same _source_ and _target_ are merged by creating an array of their descriptions.
+ These subgraphs are merged together - any entities with the same _title_ and _type_ are merged by creating an array of their descriptions. Similarly, any relationships with the same _source_ and _target_ are merged by creating an array of their descriptions.

  ### Entity & Relationship Summarization

  Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.

- ### Claim Extraction & Emission
+ ### Claim Extraction (optional)

  Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called **Covariates**.

  Note: claim extraction is _optional_ and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.

  ## Phase 3: Graph Augmentation

- Now that we have a usable graph of entities and relationships, we want to understand their community structure and augment the graph with additional information. This is done in two steps: _Community Detection_ and _Graph Embedding_. These give us explicit (communities) and implicit (embeddings) ways of understanding the topological structure of our graph.
+ Now that we have a usable graph of entities and relationships, we want to understand their community structure. These give us explicit ways of understanding the topological structure of our graph.

  ```mermaid
  ---
  title: Graph Augmentation
  ---
  flowchart LR
- cd[Leiden Hierarchical Community Detection] --> ge[Node2Vec Graph Embedding] --> ag[Graph Table Emission]
+ cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]
  ```

  ### Community Detection

  In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.

- ### Graph Embedding
-
- In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
-
- ### Graph Tables Emission
+ ### Graph Tables

- Once our graph augmentation steps are complete, the final **Entities** and **Relationships** tables are exported after their text fields are text-embedded.
+ Once our graph augmentation steps are complete, the final **Entities**, **Relationships**, and **Communities** tables are exported.

  ## Phase 4: Community Summarization
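
The extraction-merge and community-detection steps described in this hunk can be sketched roughly as follows: per-TextUnit extraction results are merged into a single `networkx` graph keyed by entity title and type, and the merged graph is then clustered. The record shapes are assumptions for illustration, and `louvain_communities` is used only as a stand-in for the hierarchical Leiden clustering the docs describe.

```python
import networkx as nx

# Hypothetical per-TextUnit extraction results: (title, type, description) entities
# and (source, target, description) relationships, as described above.
chunk_results = [
    {
        "entities": [
            ("MICROSOFT", "ORGANIZATION", "A technology company"),
            ("SATYA NADELLA", "PERSON", "CEO of Microsoft"),
        ],
        "relationships": [("SATYA NADELLA", "MICROSOFT", "Satya Nadella leads Microsoft")],
    },
    # ... one record per TextUnit
]

graph = nx.Graph()
for result in chunk_results:
    for title, type_, description in result["entities"]:
        if graph.has_node(title) and graph.nodes[title].get("type") == type_:
            # merge by title + type: accumulate an array of descriptions
            graph.nodes[title]["descriptions"].append(description)
        else:
            graph.add_node(title, type=type_, descriptions=[description])
    for source, target, description in result["relationships"]:
        if graph.has_edge(source, target):
            graph.edges[source, target]["descriptions"].append(description)
        else:
            graph.add_edge(source, target, descriptions=[description])

# The description arrays would then be collapsed into a single summary per node/edge via the LLM.

# Community detection: louvain_communities is a stand-in for hierarchical Leiden here.
communities = nx.community.louvain_communities(graph, seed=42)
for community_id, members in enumerate(communities):
    print(community_id, sorted(members))
```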

@@ -140,10 +134,10 @@ Once our graph augmentation steps are complete, the final **Entities** and **Rel
  title: Community Summarization
  ---
  flowchart LR
- sc[Generate Community Reports] --> ss[Summarize Community Reports] --> ce[Community Embedding] --> co[Community Tables Emission]
+ sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]
  ```

- At this point, we have a functional graph of entities and relationships, a hierarchy of communities for the entities, as well as node2vec embeddings.
+ At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.

  Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several points of graph granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.
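
A community report, as described above, is essentially an LLM summary over the entities and relationships assigned to one community. The sketch below shows the general shape of that call using the `openai` client; the prompt wording and model name are placeholders, not the project's actual report-generation prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_community_report(entities: list[str], relationships: list[str]) -> str:
    """Ask the LLM for a report covering one community's entities and relationships."""
    context = "\n".join(["Entities:", *entities, "Relationships:", *relationships])
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Write a structured report about this community of entities."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content


report = generate_community_report(
    ["MICROSOFT: A technology company", "SATYA NADELLA: CEO of Microsoft"],
    ["SATYA NADELLA -> MICROSOFT: Satya Nadella leads Microsoft"],
)
```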

@@ -155,13 +149,9 @@ In this step, we generate a summary of each community using the LLM. This will a

  In this step, each _community report_ is then summarized via the LLM for shorthand use.

- ### Community Embedding
+ ### Community Reports Table

- In this step, we generate a vector representation of our communities by generating text embeddings of the community report, the community report summary, and the title of the community report.
-
- ### Community Tables Emission
-
- At this point, some bookkeeping work is performed and we export the **Communities** and **CommunityReports** tables.
+ At this point, some bookkeeping work is performed and we export the **Community Reports** tables.

  ## Phase 5: Document Processing
@@ -172,7 +162,7 @@ In this phase of the workflow, we create the _Documents_ table for the knowledge
  title: Document Processing
  ---
  flowchart LR
- aug[Augment] --> dp[Link to TextUnits] --> de[Avg. Embedding] --> dg[Document Table Emission]
+ aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]
  ```

  ### Augment with Columns (CSV Only)
@@ -183,15 +173,11 @@ If the workflow is operating on CSV data, you may configure your workflow to add

  In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.

- ### Document Embedding
-
- In this step, we generate a vector representation of our documents using an average embedding of document slices. We re-chunk documents without overlapping chunks, and then generate an embedding for each chunk. We create an average of these chunks weighted by token-count and use this as the document embedding. This will allow us to understand the implicit relationship between documents, and will help us generate a network representation of our documents.
-
- ### Documents Table Emission
+ ### Documents Table

  At this point, we can export the **Documents** table into the knowledge Model.

- ## Phase 6: Network Visualization
+ ## Phase 6: Network Visualization (optional)

  In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the _Entity-Relationship_ graph and the _Document_ graph.
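
The document-to-TextUnit linking step earlier in this hunk is essentially a grouping operation. A rough pandas sketch, with assumed column names, might look like this:

```python
import pandas as pd

# Hypothetical TextUnits table: each chunk remembers the document it came from.
text_units = pd.DataFrame(
    {
        "id": ["tu-1", "tu-2", "tu-3"],
        "document_id": ["doc-1", "doc-1", "doc-2"],
        "text": ["chunk one...", "chunk two...", "chunk three..."],
    }
)

documents = pd.DataFrame({"id": ["doc-1", "doc-2"], "title": ["report.txt", "notes.txt"]})

# Link each document to the ids of the TextUnits derived from it.
links = (
    text_units.groupby("document_id")["id"]
    .apply(list)
    .rename("text_unit_ids")
    .reset_index()
)
documents = documents.merge(
    links, left_on="id", right_on="document_id", how="left"
).drop(columns="document_id")
print(documents)
```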

@@ -200,7 +186,27 @@ In this phase of the workflow, we perform some steps to support network visualiz
  title: Network Visualization Workflows
  ---
  flowchart LR
- nv[Umap Documents] --> ne[Umap Entities] --> ng[Nodes Table Emission]
+ ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Nodes Table]
  ```

- For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of _Nodes_. The rows of this table include a discriminator indicating whether the node is a document or an entity, and the UMAP coordinates.
+ ### Graph Embedding
+
+ In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
+
+ ### Dimensionality Reduction
+
+ For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of _Nodes_. The rows of this table include the UMAP dimensions as x/y coordinates.
+
+ ## Phase 7: Text Embedding
+
+ For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.
+
+ ```mermaid
+ ---
+ title: Text Embedding Workflows
+ ---
+ flowchart LR
+ textUnits[Text Units] --> text_embed[Text Embedding]
+ graph_outputs[Graph Tables] --> description_embed[Description Embedding]
+ community_report_outputs[Community Reports] --> content_embed[Content Embedding]
+ ```
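
The graph-embedding and dimensionality-reduction steps added in this hunk can be sketched with `umap-learn`, which is an assumption here rather than the project's confirmed dependency. Random vectors stand in for the learned Node2Vec node embeddings; the point is only to show how entity nodes end up with x/y coordinates in a nodes table.

```python
import numpy as np
import pandas as pd
import umap  # from the umap-learn package

entity_titles = [
    "MICROSOFT", "SATYA NADELLA", "AZURE", "GITHUB",
    "OPENAI", "LINKEDIN", "XBOX", "WINDOWS",
]

# Stand-in for the Node2Vec graph embedding: in the real pipeline each entity node
# gets a learned vector; here we just use random vectors of the same shape.
rng = np.random.default_rng(42)
node_vectors = rng.normal(size=(len(entity_titles), 64))

# UMAP reduces the high-dimensional node vectors to 2D coordinates for plotting.
coords = umap.UMAP(n_neighbors=2, n_components=2, random_state=42).fit_transform(node_vectors)

nodes_table = pd.DataFrame(
    {"title": entity_titles, "x": coords[:, 0], "y": coords[:, 1]}
)
print(nodes_table)
```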

Diff for: docs/index/overview.md

+1 -8
@@ -10,15 +10,14 @@ Indexing Pipelines are configurable. They are composed of workflows, standard an
  - embed entities into a graph vector space
  - embed text chunks into a textual vector space

- The outputs of the pipeline can be stored in a variety of formats, including JSON and Parquet - or they can be handled manually via the Python API.
+ The outputs of the pipeline are stored as Parquet tables by default, and embeddings are written to your configured vector store.

  ## Getting Started

  ### Requirements

  See the [requirements](../developing.md#requirements) section in [Get Started](../get_started.md) for details on setting up a development environment.

- The Indexing Engine can be used in either a default configuration mode or with a custom pipeline.
  To configure GraphRAG, see the [configuration](../config/overview.md) documentation.
  After you have a config file you can run the pipeline using the CLI or the Python API.
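
Since the updated docs say the outputs are Parquet tables, a quick way to inspect them is to load them with pandas. The directory and file names below are hypothetical placeholders; check your configured output location for the actual table names.

```python
from pathlib import Path

import pandas as pd

output_dir = Path("./output")  # placeholder for the configured output location

# Hypothetical table names - the real pipeline's file names may differ.
entities = pd.read_parquet(output_dir / "entities.parquet")
relationships = pd.read_parquet(output_dir / "relationships.parquet")

print(entities.head())
print(relationships.head())
```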

@@ -29,12 +28,6 @@ After you have a config file you can run the pipeline using the CLI or the Pytho
  ```bash
  # Via Poetry
  poetry run poe cli --root <data_root> # default config mode
- poetry run poe cli --config your_pipeline.yml # custom config mode
-
- # Via Node
- yarn run:index --root <data_root> # default config mode
- yarn run:index --config your_pipeline.yml # custom config mode
-
  ```

  ### Python API
