+ %% Information Retrieval Using OpenAI Document Embedding
+ % This example shows how to find documents to answer queries using the "text-embedding-3-small"
+ % document embedding model. Embeddings represent documents and queries
+ % in a high-dimensional space, allowing for the efficient retrieval of relevant
+ % information based on semantic similarity.
+ %
+ % The example consists of four steps:
+ %%
+ % * Download and preprocess text from several MATLAB documentation pages.
+ % * Embed the query and the document corpus using the "text-embedding-3-small"
+ % embedding model.
+ % * Find the documentation page most relevant to the query using cosine
+ % similarity scores.
+ % * Generate an answer to the query based on the most relevant documentation
+ % page.
+ %%
+ % This process is sometimes referred to as retrieval-augmented generation (RAG),
+ % similar to the application found in the example <./ExampleRetrievalAugmentedGeneration.mlx
+ % ExampleRetrievalAugmentedGeneration.mlx>.
+ %
+ % This example requires Text Analytics Toolbox™.
+ %
+ % To run this example, you need a valid API key from a paid OpenAI API account.
+
+ loadenv(".env")
+ addpath('..')
+ %% Embed Query Document
+ % Convert the query into a numerical vector using the |extractOpenAIEmbeddings|
+ % function. Specify the model as "text-embedding-3-small".
+
+ query = "What is the best way to store data made up of rows and columns?";
+ [qEmb, ~] = extractOpenAIEmbeddings(query, ModelName="text-embedding-3-small");
+ qEmb(1:5)
+ %% Download and Embed Source Text
+ % In this example, we will scrape content from several MATLAB documentation
+ % pages.
+ %
+ % This requires the following steps:
+ %%
+ % # Start with a list of websites. This example uses pages from the MATLAB documentation.
+ % # Extract the content of the pages using |extractHTMLText|.
+ % # Embed the page text using |extractOpenAIEmbeddings|.
+
+ metadata = ["https://www.mathworks.com/help/matlab/numeric-types.html";
+     "https://www.mathworks.com/help/matlab/characters-and-strings.html";
+     "https://www.mathworks.com/help/matlab/date-and-time-operations.html";
+     "https://www.mathworks.com/help/matlab/categorical-arrays.html";
+     "https://www.mathworks.com/help/matlab/tables.html"];
+ id = (1:numel(metadata))';
+ document = strings(numel(metadata),1);
+ embedding = [];
+ ok = false(numel(metadata),1);   % marks pages that were embedded successfully
+ for ii = id'
+     page = webread(metadata(ii));
+     tree = htmlTree(page);
+     subtree = findElement(tree,"body");
+     document(ii) = extractHTMLText(subtree, ExtractionMethod="article");
+     try
+         [emb, ~] = extractOpenAIEmbeddings(document(ii),ModelName="text-embedding-3-small");
+         embedding = [embedding; emb];
+         ok(ii) = true;
+     catch ME
+         % Report the failure instead of silently skipping the page
+         warning("Embedding failed for %s: %s", metadata(ii), ME.message);
+     end
+ end
+ % Keep only the pages that were embedded so that all table variables have the same height
+ vectorTable = table(id(ok),document(ok),metadata(ok),embedding, ...
+     VariableNames=["id","document","metadata","embedding"]);
+ %% Generate Answer to Query
+ % Define the system prompt in |openAIChat| to answer questions based on context.
+
+ chat = openAIChat("You are a helpful MATLAB assistant. You will get a context for each question");
+ %%
+ % Calculate the cosine similarity scores between the query and each of the documentation
+ % pages using the |cosineSimilarity| function.
+
+ s = cosineSimilarity(vectorTable.embedding,qEmb);
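+ %%
+ % As a rough illustration of what each score measures (not necessarily how
+ % |cosineSimilarity| is implemented), the cosine similarity between a row vector |e|
+ % and the query |q| is |dot(e,q)/(norm(e)*norm(q))|. The sketch below computes it by
+ % hand on toy data. The names |toyE|, |toyQ|, and |sToy| are hypothetical and are used
+ % only for this illustration.
+
+ toyE = [1 0; 0 1; 1 1];   % three toy "document" embeddings, one per row
+ toyQ = [1 0];             % toy "query" embedding
+ % Dot product of each row with the query, divided by the product of the vector norms
+ sToy = (toyE*toyQ') ./ (vecnorm(toyE,2,2)*norm(toyQ));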
+ %%
+ % Use the most similar documentation content to feed extra context into the
+ % prompt for generation.
+
+ [~,idx] = max(s);
+ context = vectorTable.document(idx);
+ prompt = "Context: " ...
+     + context + newline + "Answer the following question: " + query;
+ wrapText(prompt)
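+ %%
+ % Using only the single best page can miss relevant material spread across several
+ % pages. One possible variation, sketched here on toy scores, keeps the |k|
+ % highest-scoring pages instead. The names |toyScores|, |k|, |order|, and |topIdx| are
+ % hypothetical and not part of the example itself. Applied to |s|, the same indexing
+ % would select rows of |vectorTable|.
+
+ toyScores = [0.2; 0.9; 0.5; 0.7];         % toy similarity scores, one per page
+ k = 2;                                    % number of pages to keep (assumed value)
+ [~,order] = sort(toyScores,"descend");    % rank pages from most to least similar
+ topIdx = order(1:min(k,numel(order)));    % indices of the k best pages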
+ %%
+ % Pass the question and the context for generation to get a contextualized answer.
+
+ response = generate(chat, prompt);
+ wrapText(response)
+ %% Helper Function
+ % Helper function to wrap text for easier reading in the live script.
+
+ function wrappedText = wrapText(text)
+     wrappedText = splitSentences(text);
+     wrappedText = join(wrappedText,newline);
+ end
+ %%
+ % _Copyright 2024 The MathWorks, Inc._
+ %
+ %