Commit 5367094

Clean up code + add benchmarking
1 parent 5127301 commit 5367094

7 files changed: +1005 -52 lines changed

.gitignore

Lines changed: 3 additions & 1 deletion

```diff
@@ -4,4 +4,6 @@ node_modules
 embeddings/
 transcripts/
 *.json
-*.zip
+*.zip
+!package.json
+!package-lock.json
```

README.md

Lines changed: 36 additions & 0 deletions

# Semantic Retrieval

This repository contains the testing code (and probably the final code) we used to extract embeddings from video transcripts for [Bizarro-Devin](https://github.com/CodingTrain/Bizarro-Devin/). There are a few files in this repository, each with its own purpose:

- [embeddings-transformers.js](/embeddings-transformers.js) generates embeddings from the transcripts in the `transcripts` directory.
- [semantic-retrieval.js](/semantic-retrieval.js) can be used to retrieve the best-matching chunks from the embeddings for a given query.
- [semantic-retrieval-benchmark.js](/semantic-retrieval-benchmark.js) benchmarks the retrieval; in my own tests it came out at roughly 180 ms per retrieval (see the sketch after this list).
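
`semantic-retrieval-benchmark.js` itself isn't reproduced in this commit view, so here is a minimal sketch of what such a benchmark could look like; the `retrieve` stub, example query, and iteration count are hypothetical placeholders, not the repository's actual code:

```js
import { performance } from 'node:perf_hooks';

// Hypothetical stand-in for the query-embedding + ranking step in
// semantic-retrieval.js; swap in the real retrieval call.
async function retrieve(query) {
  /* embed the query, score every stored chunk, return the top 5 */
}

const iterations = 50; // assumed; enough runs to smooth out noise
const start = performance.now();
for (let i = 0; i < iterations; i++) {
  await retrieve('example query');
}
const elapsed = performance.now() - start;
console.log(`${(elapsed / iterations).toFixed(1)} ms / retrieval`);
```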

## How to use

### Generating embeddings

1. Make sure you've installed all dependencies by running `npm install`.
2. Create a directory called `transcripts` and place all of the JSON transcript files in it, each file being the transcript of one video. The transcript JSON should be in the following format:

```json
{
  "text": "full transcript text",
  "chunks": [
    {
      "timestamp": [0.48, 7.04],
      "text": "..."
    }
  ]
}
```
However, the `chunks` array is currently not used, so it can be left out.

3. Create an `embeddings` directory for the embeddings of each transcript to be written to.
4. Run `node embeddings-transformers.js` to generate the embeddings.

All embeddings should now be in the `embeddings` folder, and an `embeddings.json` file should be present in the current working directory. This `embeddings.json` file is the combination of all embeddings generated from the transcripts.
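
Judging from the `embeddings-transformers.js` diff further down this page, both the per-transcript files and the combined `embeddings.json` are flat arrays of chunk/embedding pairs. A sketch of the shape (the vector is truncated here; `bge-small-en-v1.5` embeddings are 384-dimensional):

```json
[
  {
    "text": "one chunk of transcript text, ending on a sentence boundary.",
    "embedding": [0.0123, -0.0456, 0.0789]
  }
]
```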

### Semantic retrieval from embeddings

1. Make sure you've installed all dependencies by running `npm install`.
2. Make sure the embeddings you want to retrieve from are in an `embeddings.json` file. This file already exists if you generated the embeddings by following the [generating embeddings](#generating-embeddings) section above.
3. Open the `semantic-retrieval.js` file and edit your query on line `25`.
4. Save the file and run `node semantic-retrieval.js` to retrieve the top 5 results from the embeddings.
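
`semantic-retrieval.js` itself isn't reproduced in this commit view. A minimal sketch of the approach the steps above imply (embed the query with the same model, then rank the stored chunks by cosine similarity; because the vectors are normalized, a plain dot product suffices; the query string is just an example):

```js
import fs from 'fs';
import { pipeline } from '@xenova/transformers';

// Same model that generated the stored embeddings
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/bge-small-en-v1.5'
);

const embeddings = JSON.parse(fs.readFileSync('embeddings.json', 'utf-8'));

// Embed the query the same way the chunks were embedded
const query = 'how do neural networks learn?'; // example query
const output = await extractor(query, { pooling: 'mean', normalize: true });
const queryEmbedding = output.tolist()[0];

// With normalized vectors, the dot product equals cosine similarity
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const top5 = embeddings
  .map(({ text, embedding }) => ({ text, score: dot(queryEmbedding, embedding) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 5);

console.log(top5);
```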

embeddings-transformers.js

Lines changed: 74 additions & 19 deletions

```diff
@@ -3,28 +3,83 @@ import { pipeline } from '@xenova/transformers';
 
 // Load the embeddings model
 const extractor = await pipeline(
-  'feature-extraction',
-  'Xenova/bge-small-en-v1.5'
+  'feature-extraction',
+  'Xenova/bge-small-en-v1.5'
 );
 
-const raw = fs.readFileSync('transcripts/_-AfhLQfb6w.json', 'utf-8');
-const json = JSON.parse(raw);
-const txt = json.text;
-// console.log(txt);
+const fullOutput = [];
 
-const chunks = txt.split(/[.?!]/);
+(async () => {
+  // Scan transcripts directory for all json files
+  const files = fs.readdirSync('transcripts');
 
-let outputJSON = { embeddings: [] };
+  // Iterate through each file and calculate the embeddings
+  for (const file of files) {
+    const rawContents = fs.readFileSync(`transcripts/${file}`, 'utf-8');
+    const json = JSON.parse(rawContents);
 
-for (let chunk of chunks) {
-  let output = await extractor(chunk, {
-    pooling: 'mean',
-    normalize: true,
-  });
-  const embedding = output.tolist()[0];
-  outputJSON.embeddings.push({ text: chunk, embedding });
-}
+    const text = json.text;
+
+    // Calculate chunks based on this text
+    const chunks = calculateChunks(text);
+
+    // Extract embeddings for each chunk
+    const output = [];
+
+    for (const chunk of chunks) {
+      const embeddingOutput = await extractor(chunk, {
+        pooling: 'mean',
+        normalize: true,
+      });
+
+      const embedding = embeddingOutput.tolist()[0];
+      output.push({ text: chunk, embedding });
+      fullOutput.push({ text: chunk, embedding });
+    }
+
+    // Save the embeddings to a file
+    const fileOut = `embeddings/${file}`;
+    fs.writeFileSync(fileOut, JSON.stringify(output));
 
-const fileOut = `embeddings.json`;
-fs.writeFileSync(fileOut, JSON.stringify(outputJSON));
-console.log(`Embeddings saved to ${fileOut}`);
+    console.log(
+      `Embeddings saved for ${file} to ${fileOut} (${
+        output.length
+      } chunks) (${files.indexOf(file) + 1}/${files.length})`
+    );
+  }
+
+  // Save the full output to a single file
+  const fileOut = `embeddings.json`;
+  fs.writeFileSync(fileOut, JSON.stringify(fullOutput));
+  console.log(`Complete embeddings saved to ${fileOut}`);
+})();
+
+function calculateChunks(text) {
+  // We want to split the text into chunks of at least 100 characters; after
+  // this we keep adding to the chunk until we find a sentence boundary
+  const chunks = [];
+  let chunk = '';
+  for (let i = 0; i < text.length; i++) {
+    chunk += text[i];
+
+    // If the current character is a punctuation mark, split the chunk here
+    if (
+      chunk.length >= 100 &&
+      (text[i] === '.' || text[i] === '?' || text[i] === '!')
+    ) {
+      chunks.push(chunk.trim());
+      chunk = '';
+    }
+
+    // If we exceed 150 characters without finding a punctuation mark, split
+    // the chunk at the last space
+    if (chunk.length >= 150) {
+      let lastSpace = chunk.lastIndexOf(' ');
+      if (lastSpace === -1) {
+        lastSpace = chunk.length;
+      }
+      chunks.push(chunk.slice(0, lastSpace).trim());
+      chunk = chunk.slice(lastSpace).trim();
+    }
+  }
+
+  return chunks;
+}
```
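
A quick way to sanity-check the new chunking is to paste `calculateChunks` from the diff above into a Node REPL and feed it a sample string (the sample here is made up):

```js
// Assumes calculateChunks from the diff above has been pasted into the REPL.
const sample =
  'Short sentences accumulate until the chunk passes 100 characters. ' +
  'Then the next period, question mark, or exclamation mark closes it.';

console.log(calculateChunks(sample));
// The first '.' falls under the 100-character minimum, so the whole sample
// comes back as a single chunk ending at the second '.':
// [ 'Short sentences accumulate until the chunk passes 100 characters. Then the next period, question mark, or exclamation mark closes it.' ]
```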
