---
description: Using Deep Memory to improve the accuracy of your Vector Search
---

# Improving Search Accuracy using Deep Memory
## How to Use Deep Memory to Improve the Accuracy of your Vector Search <a href="#how-to-use-deep-memory-to-improve-the-accuracy-of-your-vector-search" id="how-to-use-deep-memory-to-improve-the-accuracy-of-your-vector-search"></a>
[Deep Memory](../../performance-features/deep-memory/) computes a transformation that converts your embeddings into an embedding space that is tailored for your use case, based on several examples for which the most relevant embedding is known. This can increase the accuracy of your Vector Search by up to 22%.
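Conceptually, this transformation can be pictured as a small model applied to the embeddings before similarity search. The sketch below is purely illustrative (Deep Memory's internals are not described here, and the matrix `W` is a random stand-in for a learned transform), but it shows where such a transformation sits in a retrieval pipeline:

```python
import numpy as np

# Illustrative only: `W` stands in for a learned transformation that maps
# raw embeddings into a space tailored to the retrieval task.
rng = np.random.default_rng(0)

dim = 8
W = rng.normal(size=(dim, dim))         # stand-in for the learned transform
corpus_emb = rng.normal(size=(5, dim))  # raw document embeddings
query_emb = rng.normal(size=(dim,))     # raw query embedding

def cosine_scores(query, docs):
    # Cosine similarity between one query vector and each document vector
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

# Search in the transformed space instead of the raw embedding space
scores = cosine_scores(W @ query_emb, corpus_emb @ W.T)
best = int(np.argmax(scores))
```

In practice, Deep Memory learns this mapping from the (query, relevant id) examples you provide, and applies it at query time.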
**In this example, we'll use Deep Memory to improve the accuracy of Vector Search on the [SciFact](https://allenai.org/data/scifact) dataset, where the input prompt is a scientific claim, and the search result is the corresponding abstract.**
### Downloading the Data <a href="#downloading-the-data" id="downloading-the-data"></a>
First, let's specify our Activeloop and OpenAI tokens. Make sure to run `pip install datasets`, because we'll download the source data from HuggingFace.

```python
from deeplake import VectorStore
import os
import getpass
import datasets
import openai
from pathlib import Path
```

```python
os.environ['OPENAI_API_KEY'] = getpass.getpass()
```

```python
# Skip this step if you logged in through the CLI
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass()
```

Next, let's download the dataset locally:

```python
corpus = datasets.load_dataset("scifact", "corpus")
```

### Creating the Vector Store <a href="#creating-the-vector-store" id="creating-the-vector-store"></a>

Now let's define an embedding function for the text data and create a Deep Lake Vector Store in our Managed Database. Deep Memory is only available for Vector Stores in our Managed Database.

```python
def embedding_function(texts, model="text-embedding-ada-002"):
    if isinstance(texts, str):
        texts = [texts]

    texts = [t.replace("\n", " ") for t in texts]
    return [data['embedding'] for data in openai.Embedding.create(input=texts, model=model)['data']]
```

```python
path = 'hub://<org_id>/<vector_store_name>'
```

```python
vectorstore = VectorStore(
    path=path,
    embedding_function=embedding_function,
    runtime={"tensor_db": True},
)
```

#### Adding data to the Vector Store <a href="#adding-data-to-the-vector-store" id="adding-data-to-the-vector-store"></a>

Next, let's extract the data from the SciFact dataset and add it to our Vector Store. In this example, we embed the abstracts of the scientific papers. Normally, the `id` tensor is auto-populated, but in this case, we want to use the ids in the SciFact dataset, in order to match the abstracts with the `evidence_doc_id` values referenced by the claims.

```python
ids = [f"{id_}" for id_ in corpus["train"]["doc_id"]]
texts = [text[0] for text in corpus["train"]["abstract"]]
metadata = [{"title": title} for title in corpus["train"]["title"]]
```

```python
vectorstore.add(
    text=texts,
    id=ids,
    embedding_data=texts,
    embedding_function=embedding_function,
    metadata=metadata,
)
```

#### Generating claims <a href="#generating-claims" id="generating-claims"></a>

We must create a relationship between the claims and their corresponding most relevant abstracts. This correspondence already exists in the SciFact dataset, and we extract that information using the helper function below.

```python
def preprocess_scifact(claims_dataset, dataset_type="train"):
    # Use a dictionary to store unique claims and their associated relevances
    claims_dict = {}

    for item in claims_dataset[dataset_type]:
        claim = item['claim']
        relevance = (item['evidence_doc_id'], 1)  # 1 indicates that the evidence is relevant to the claim

        # Check for non-empty relevance
        if relevance[0] != "":
            if claim not in claims_dict:
                claims_dict[claim] = [relevance]
            else:
                # If this relevance is not already recorded for the claim, append it
                if relevance not in claims_dict[claim]:
                    claims_dict[claim].append(relevance)

    # Split the dictionary into two lists: claims and relevances
    claims = list(claims_dict.keys())
    relevances = list(claims_dict.values())
    return claims, relevances
```

```python
claims_dataset = datasets.load_dataset('scifact', 'claims')
claims, relevances = preprocess_scifact(claims_dataset, dataset_type="train")
```

Let's print the first 10 claims and their relevant abstracts. The relevances are a list of tuples, where the id in each tuple corresponds to the `id` tensor value in the Abstracts Vector Store, and 1 indicates a positive relevance.

```python
claims[:10]
```

```
['1 in 5 million in UK have abnormal PrP positivity.',
 '32% of liver transplantation programs required patients to discontinue methadone treatment in 2001.',
 '40mg/day dosage of folic acid and 2mg/day dosage of vitamin B12 does not affect chronic kidney disease (CKD) progression.',
 '76-85% of people with severe mental disorder receive no treatment in low and middle income countries.',
 'A T helper 2 cell (Th2) environment impedes disease development in patients with systemic lupus erythematosus (SLE).',
 "A breast cancer patient's capacity to metabolize tamoxifen influences treatment outcome.",
 "A country's Vaccine Alliance (GAVI) eligibility is not indictivate of accelerated adoption of the Hub vaccine.",
 'A deficiency of folate increases blood levels of homocysteine.',
 'A diminished ovarian reserve does not solely indicate infertility in an a priori non-infertile population.',
 'A diminished ovarian reserve is a very strong indicator of infertility, even in an a priori non-infertile population.']
```

```python
relevances[:10]
```

```
[[('13734012', 1)],
 [('44265107', 1)],
 [('33409100', 1)],
 [('6490571', 1)],
 [('12670680', 1)],
 [('24341590', 1)],
 [('12428497', 1)],
 [('11705328', 1)],
 [('13497630', 1)],
 [('13497630', 1)]]
```

### Running the Deep Memory Training <a href="#running-the-deep-memory-training" id="running-the-deep-memory-training"></a>

Now we can run a Deep Memory training job, which runs asynchronously and executes on our managed service.

```python
job_id = vectorstore.deep_memory.train(
    queries=claims,
    relevance=relevances,
    embedding_function=embedding_function,
)
```

All of the Deep Memory training jobs for this Vector Store can be listed using the command below. The `PROGRESS` column tells us the state of each training job, as well as the recall improvement on the data.

**`recall@k` corresponds to the percentage of rows for which the correct (most relevant) answer was returned in the top `k` vector search results.**

```python
vectorstore.deep_memory.list_jobs()
```

```
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop-test/test-deepmemory-ivo
ID                        STATUS     RESULTS                      PROGRESS
6525a94bbfacbf7e75a08c76  completed  recall@10: 0.00% (+0.00%)    eta: 45.5 seconds
                                                                  recall@10: 0.00% (+0.00%)
6538186bc1d2ffd8e8cd3b49  completed  recall@10: 85.81% (+21.78%)  eta: 1.9 seconds
                                                                  recall@10: 85.81% (+21.78%)
```
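As a sanity check on numbers like these, `recall@k` can be computed by hand. The helper below is an illustrative sketch (not the deeplake implementation) that follows the definition directly: a query counts as a hit if any of its known relevant ids appears in its top `k` retrieved ids.

```python
def recall_at_k(retrieved_ids, relevances, k=10):
    # retrieved_ids: one list of retrieved document ids per query
    # relevances: one list of (doc_id, score) tuples per query, as built above
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevances):
        relevant_ids = {doc_id for doc_id, score in relevant if score == 1}
        if relevant_ids & set(retrieved[:k]):
            hits += 1
    return hits / len(retrieved_ids)

# Toy example with two queries: the first finds its relevant abstract, the second does not
retrieved = [["13734012", "999"], ["111", "222"]]
relevance = [[("13734012", 1)], [("44265107", 1)]]
print(recall_at_k(retrieved, relevance, k=10))  # 0.5
```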

### Evaluating Deep Memory's Performance <a href="#evaluating-deep-memorys-performance" id="evaluating-deep-memorys-performance"></a>

Let's evaluate the recall improvement for an evaluation dataset that was not used in the training process. Deep Memory inference, and by extension this evaluation process, runs on the client.

```python
validation_claims, validation_relevances = preprocess_scifact(claims_dataset, dataset_type="validation")
```

```python
recalls = vectorstore.deep_memory.evaluate(
    queries=validation_claims,
    relevance=validation_relevances,
    embedding_function=embedding_function,
)
```

We observe that the recall has improved by up to 30%, depending on the `k` value.

```python
recalls
```

```
---- Evaluating without model ----
Recall@1: 29.5%
Recall@3: 45.0%
Recall@5: 51.8%
Recall@10: 58.1%
Recall@50: 77.4%
Recall@100: 84.9%
---- Evaluating with model ----
Recall@1: 55.1%
Recall@3: 68.2%
Recall@5: 71.7%
Recall@10: 77.9%
Recall@50: 90.1%
Recall@100: 92.6%
```

### Using Deep Memory in your Application <a href="#using-deep-memory-in-your-application" id="using-deep-memory-in-your-application"></a>

To use Deep Memory in your applications, specify the `deep_memory = True` parameter during vector search. If you are using the LangChain integration, you may specify this parameter during Vector Store initialization. Let's run a search using a prompt, with and without Deep Memory.

```python
prompt = "Which diseases are inflammation-related processes"
```

```python
results = vectorstore.search(embedding_data=prompt)
```

```python
results['text']
```

```
['Inflammation is a fundamental protective response that sometimes goes awry and becomes a major cofactor in the pathogenesis of many chronic human diseases, including cancer.',
 'Kidney diseases, including chronic kidney disease (CKD) and acute kidney injury (AKI), are associated with inflammation.',
 'BACKGROUND Persistent inflammation has been proposed to contribute to various stages in the pathogenesis of cardiovascular disease.',
 'Inflammation accompanies obesity and its comorbidities-type 2 diabetes, non-alcoholic fatty liver disease and atherosclerosis, among others-and may contribute to their pathogenesis.']
```

```python
results_dm = vectorstore.search(embedding_data=prompt, deep_memory=True)
```

```python
results_dm['text']
```

```
['Kidney diseases, including chronic kidney disease (CKD) and acute kidney injury (AKI), are associated with inflammation.',
 'OBJECTIVES Calcific aortic valve (AV) disease is known to be an inflammation-related process.',
 "Crohn's disease and ulcerative colitis, the two main types of chronic inflammatory bowel disease, are multifactorial conditions of unknown aetiology.",
 'BACKGROUND Two inflammatory disorders, type 1 diabetes and celiac disease, cosegregate in populations, suggesting a common genetic origin.']
```

We observe that there are overlapping results for both search methods, but 50% of the answers differ.

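This kind of overlap can be quantified directly from the two result lists. The helper and the ids below are hypothetical stand-ins (in a real run you could substitute the `id` fields returned by the two searches above):

```python
# Hypothetical helper for comparing two search result lists; the ids below
# are toy stand-ins, not real SciFact document ids.
def overlap_fraction(results_a, results_b):
    set_a, set_b = set(results_a), set(results_b)
    return len(set_a & set_b) / max(len(set_a), 1)

without_dm = ["doc_inflammation", "doc_kidney", "doc_cardio", "doc_obesity"]
with_dm = ["doc_kidney", "doc_av_disease", "doc_crohns", "doc_cardio"]
print(overlap_fraction(without_dm, with_dm))  # 0.5 -> half the results overlap
```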
Congrats! You just used Deep Memory to improve the accuracy of Vector Search on a specific use-case! 🎉