The knowledge model is a specification for data outputs that conform to our data model definition.

- `Document` - An input document into the system. These either represent individual rows in a CSV or individual .txt files.
- `TextUnit` - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set `CHUNK_BY_COLUMNS` to `id` so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.
- `Entity` - An entity extracted from a TextUnit. These represent people, places, events, or some other entity model that you provide.
- `Relationship` - A relationship between two entities.
- `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
- `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
- `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
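For illustration, the knowledge-model artifacts above can be sketched as plain dataclasses. This is a sketch of how the types relate to one another, not the exact schema of the exported tables; the field names are assumptions.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the knowledge-model artifact types.
# Field names are assumptions, not the exact exported schema.

@dataclass
class Document:
    id: str
    text: str

@dataclass
class TextUnit:
    id: str
    text: str
    document_ids: List[str]  # 1-to-many with Documents by default

@dataclass
class Entity:
    title: str
    type: str
    description: str

@dataclass
class Relationship:
    source: str       # title of the source entity
    target: str       # title of the target entity
    description: str

@dataclass
class Covariate:
    subject_id: str
    description: str  # a claim about an entity, possibly time-bound

@dataclass
class CommunityReport:
    community_id: int
    summary: str

ada = Entity(title="Ada", type="person", description="a mathematician")
print(ada.title)  # Ada
```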
## Phase 1: Compose TextUnits

The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. They are also used as source references by extracted knowledge items, providing breadcrumbs and provenance that trace concepts back to their original source text.
The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single "glean" step. (A "glean" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
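The chunking scheme can be sketched as a sliding window over the tokenized text. This is a simplified stand-in: `tokens` would come from the LLM tokenizer, and the 100-token overlap here is an illustrative choice, not a documented default.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=100):
    """Split a token sequence into overlapping, fixed-size chunks
    (a sketch of TextUnit chunking over LLM-tokenizer tokens)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window reached the end of the document
        start += step
    return chunks

tokens = list(range(1000))  # stand-in for tokenizer output
chunks = chunk_tokens(tokens)
print(len(chunks))  # 5 windows of up to 300 tokens, each overlapping the last by 100
```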
The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log).
```mermaid
---
title: Documents into Text Chunks
---
flowchart LR
```
### Entity & Relationship Extraction
In this first step of graph extraction, we process each text unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph per TextUnit containing a list of **entities** with a _title_, _type_, and _description_, and a list of **relationships** with a _source_, _target_, and _description_.

These subgraphs are merged together - any entities with the same _title_ and _type_ are merged by creating an array of their descriptions. Similarly, any relationships with the same _source_ and _target_ are merged by creating an array of their descriptions.
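The merge can be sketched as a grouping pass over the per-TextUnit subgraphs, keying entities on (_title_, _type_) and relationships on (_source_, _target_). The dict shapes here are illustrative, not the pipeline's internal data structures.

```python
from collections import defaultdict

def merge_subgraphs(subgraphs):
    """Merge per-TextUnit subgraphs; duplicate entities and relationships
    accumulate description arrays for later summarization."""
    entities = defaultdict(list)       # (title, type) -> [descriptions]
    relationships = defaultdict(list)  # (source, target) -> [descriptions]
    for sub in subgraphs:
        for ent in sub["entities"]:
            entities[(ent["title"], ent["type"])].append(ent["description"])
        for rel in sub["relationships"]:
            relationships[(rel["source"], rel["target"])].append(rel["description"])
    return dict(entities), dict(relationships)

subgraphs = [
    {"entities": [{"title": "Ada", "type": "person", "description": "a mathematician"}],
     "relationships": []},
    {"entities": [{"title": "Ada", "type": "person", "description": "an author"}],
     "relationships": [{"source": "Ada", "target": "Babbage",
                        "description": "collaborated on the Analytical Engine"}]},
]
entities, relationships = merge_subgraphs(subgraphs)
print(entities[("Ada", "person")])  # ['a mathematician', 'an author']
```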
### Entity & Relationship Summarization
Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
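A minimal sketch of that summarization step, with a stand-in `llm` callable in place of the real model call and its tuned prompt:

```python
def summarize_descriptions(name, descriptions, llm):
    """Collapse a description array into one concise description.
    `llm` is a stand-in for the model call; the prompt wording is illustrative."""
    if len(set(descriptions)) == 1:
        return descriptions[0]  # nothing distinct to merge
    prompt = (
        f"Summarize the following notes about {name} into one concise "
        "description, keeping every distinct fact:\n"
        + "\n".join(f"- {d}" for d in descriptions)
    )
    return llm(prompt)

# Stand-in LLM for demonstration only.
fake_llm = lambda prompt: "Ada: a mathematician and author"
print(summarize_descriptions("Ada", ["a mathematician", "an author"], fake_llm))
```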
### Claim Extraction (optional)
Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called **Covariates**.
Note: claim extraction is _optional_ and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.
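For illustration, a single extracted claim might look like the record below. The field names are assumptions drawn from the description above (a statement with an evaluated status and optional time-bounds), not the exact Covariates schema.

```python
# Hypothetical claim record; field names are illustrative assumptions.
claim = {
    "subject_id": "Company A",          # entity the claim is about
    "object_id": "Agency B",
    "type": "regulatory action",
    "status": "TRUE",                   # evaluated status of the statement
    "start_date": "2022-01-10",         # optional time-bounds
    "end_date": "2022-01-10",
    "description": "Company A was fined by Agency B on 2022-01-10.",
    "source_text_unit_ids": ["tu-17"],  # provenance back to the source text
}
print(claim["status"])  # TRUE
```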
## Phase 3: Graph Augmentation
Now that we have a usable graph of entities and relationships, we want to understand their community structure. Community detection gives us an explicit way of understanding the topological structure of our graph.
```mermaid
---
title: Graph Augmentation
---
flowchart LR
    cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]
```
### Community Detection
In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.
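The recursion can be sketched as below. The `halve` partitioner is a trivial stand-in for one level of Leiden clustering, used only to show how communities keep splitting until they fall under the size threshold:

```python
def cluster_hierarchy(nodes, partition_fn, max_size):
    """Recursively partition `nodes` until every community has at most
    max_size members. `partition_fn` stands in for one level of Leiden."""
    if len(nodes) <= max_size:
        return nodes  # leaf community
    return [cluster_hierarchy(part, partition_fn, max_size)
            for part in partition_fn(nodes)]

# Trivial stand-in partitioner: split a node list in half.
halve = lambda ns: [ns[:len(ns) // 2], ns[len(ns) // 2:]]
tree = cluster_hierarchy(list(range(8)), halve, max_size=2)
print(tree)  # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```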
### Graph Tables
Once our graph augmentation steps are complete, the final **Entities**, **Relationships**, and **Communities** tables are exported.
## Phase 4: Community Summarization
```mermaid
---
title: Community Summarization
---
flowchart LR
    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]
```
At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.
Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several levels of granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.
### Generate Community Reports

In this step, we generate a summary of each community using the LLM.
### Summarize Community Reports

In this step, each _community report_ is summarized via the LLM for shorthand use.
### Community Reports Table
At this point, some bookkeeping work is performed and we export the **Community Reports** table.
## Phase 5: Document Processing
In this phase of the workflow, we create the _Documents_ table for the knowledge model.

```mermaid
---
title: Document Processing
---
flowchart LR
    aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]
```
### Augment with Columns (CSV Only)
If the workflow is operating on CSV data, you may configure your workflow to add additional fields to the documents output.
### Link to TextUnits

In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.
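A sketch of the linking step, assuming each text unit carries the ids of the documents it was chunked from:

```python
from collections import defaultdict

def link_documents(text_units):
    """Build the Document -> TextUnit mapping from each text unit's
    document ids (with the default 1-to-many grouping, each text unit
    carries a single document id)."""
    doc_to_units = defaultdict(list)
    for tu in text_units:
        for doc_id in tu["document_ids"]:
            doc_to_units[doc_id].append(tu["id"])
    return dict(doc_to_units)

units = [
    {"id": "tu-1", "document_ids": ["doc-a"]},
    {"id": "tu-2", "document_ids": ["doc-a"]},
    {"id": "tu-3", "document_ids": ["doc-b"]},
]
print(link_documents(units))  # {'doc-a': ['tu-1', 'tu-2'], 'doc-b': ['tu-3']}
```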
### Documents Table
At this point, we can export the **Documents** table into the knowledge model.
## Phase 6: Network Visualization (optional)
In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the _Entity-Relationship_ graph and the _Document_ graph.
### Graph Embedding
In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
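Node2Vec works by generating random walks over the graph and training a skip-gram model on them. The sketch below shows only the walk-generation half, with uniform transitions; real node2vec biases the walks with its return (p) and in-out (q) parameters, and the embedding itself comes from training on the walk corpus.

```python
import random

def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Generate the walk corpus a Node2Vec-style skip-gram model trains on.
    Uses uniform (unbiased) transitions for simplicity."""
    rng = random.Random(seed)
    walks = []
    for node in adjacency:
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
walks = random_walks(graph)
print(len(walks))  # 2 walks per node -> 6 walks
```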
### Dimensionality Reduction
For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of _Nodes_. The rows of this table include the UMAP dimensions as x/y coordinates.
## Phase 7: Text Embedding
For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.
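The vector-store contract can be sketched with a minimal in-memory stand-in; the `upsert`/`search` method names here are illustrative, not an actual store API.

```python
import math

class VectorStore:
    """Minimal in-memory stand-in for a configured vector store:
    stores (key, embedding) pairs and answers cosine-similarity queries."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, vector):
        self.rows[key] = vector

    def search(self, query, k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self.rows.items(),
                        key=lambda kv: cosine(query, kv[1]), reverse=True)
        return [key for key, _ in ranked[:k]]

store = VectorStore()
store.upsert("entity:Ada", [1.0, 0.0])      # e.g. an entity-description embedding
store.upsert("text_unit:tu-1", [0.0, 1.0])  # e.g. a text-unit embedding
print(store.search([0.9, 0.1]))  # ['entity:Ada']
```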