docs/en/sql-reference/20-sql-functions/11-ai-functions/04-ai-cosine-distance.md
Lines changed: 58 additions & 42 deletions
Original file line number
Diff line number
Diff line change
@@ -3,64 +3,80 @@ title: 'COSINE_DISTANCE'
3
3
description: 'Measuring similarity using the cosine_distance function in Databend'
4
4
---
5
5
6
-
This document provides an overview of the cosine_distance function in Databend and demonstrates how to measure document similarity using this function.
6
+
Calculates the cosine distance between two vectors, measuring how dissimilar they are.
7
7
8
-
:::info
8
+
## Syntax
9
+
10
+
```sql
11
+
COSINE_DISTANCE(vector1, vector2)
12
+
```
9
13
10
-
The cosine_distance function performs vector computations within Databend and does not rely on the (Azure) OpenAI API.
14
+
## Arguments
11
15
12
-
:::
16
+
- `vector1`: First vector (ARRAY(FLOAT32 NOT NULL))
17
+
- `vector2`: Second vector (ARRAY(FLOAT32 NOT NULL))
18
+
19
+
## Returns
20
+
21
+
Returns a FLOAT value. For vectors separated by an angle of at most 90 degrees, the value lies between 0 and 1:
22
+
- 0: Identical vectors (completely similar)
23
+
- 1: Orthogonal vectors (completely dissimilar)
24
+
25
+
## Description
26
+
27
+
The cosine distance measures the dissimilarity between two vectors based on the angle between them, regardless of their magnitude. The function:
28
+
29
+
1. Verifies that both input vectors have the same length
30
+
2. Computes the sum of element-wise products (dot product) of the two vectors
31
+
3. Calculates the square root of the sum of squares for each vector (vector magnitudes)
The cosine_distance function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.
Where v1ᵢ and v2ᵢ are the elements of the input vectors.
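The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not Databend's implementation: the helper name `cosine_distance` and the final step (subtracting the ratio of the dot product to the product of the magnitudes from 1) are assumptions based on the standard definition of cosine distance, since the formula itself is not visible in this excerpt. The sample vectors are the ones used in the Examples section.

```python
import math


def cosine_distance(v1, v2):
    """Hypothetical reference helper; assumes the standard definition
    distance = 1 - dot(v1, v2) / (|v1| * |v2|)."""
    # Step 1: both vectors must have the same length.
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Step 2: sum of element-wise products (dot product).
    dot = sum(a * b for a, b in zip(v1, v2))
    # Step 3: square root of the sum of squares (magnitude) of each vector.
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Assumed final step: 1 minus the cosine similarity.
    # A zero-magnitude vector raises ZeroDivisionError in this sketch.
    return 1.0 - dot / (norm1 * norm2)


# Sample rows from the Examples section, compared against [1, 2, 3].
query = [1.0, 2.0, 3.0]
rows = {1: [1.0, 2.0, 3.0], 2: [1.0, 2.2, 3.0], 3: [4.0, 5.0, 6.0]}
distances = {row_id: cosine_distance(vec, query) for row_id, vec in rows.items()}
ranked = sorted(distances, key=distances.get)  # smallest distance first
```

Under this definition, row 1 has a distance of approximately 0 and row 3 a distance of roughly 0.0254, so ordering by distance ascending puts the closest match first. Vectors pointing in opposite directions would yield 2 under the same definition.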
41
+
42
+
:::info
43
+
This function performs vector computations within Databend and does not rely on external APIs.
44
+
:::
15
45
16
-
Cosine distance is a measure of similarity between two vectors, based on the cosine of the angle between them. The function takes two input vectors and returns a value between 0 and 1, with 0 indicating identical vectors and 1 indicating orthogonal (completely dissimilar) vectors.
17
46
18
47
## Examples
19
48
20
-
**Creating a Table and Inserting Sample Data**
49
+
Create a table with vector data:
21
50
22
-
Let's create a table to store some sample text documents and their corresponding embeddings:
23
51
```sql
24
-
CREATE TABLE articles (
52
+
CREATE OR REPLACE TABLE vectors (
25
53
id INT,
26
-
title VARCHAR,
27
-
content VARCHAR,
28
-
embedding ARRAY(FLOAT32)
54
+
vec ARRAY(FLOAT32 NOT NULL)
29
55
);
30
-
```
31
56
32
-
Now, let's insert some sample documents into the table:
33
-
```sql
34
-
INSERT INTO articles (id, title, content, embedding)
35
-
VALUES
36
-
(1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
37
-
(2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
38
-
(3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
57
+
INSERT INTO vectors VALUES
58
+
(1, [1.0000, 2.0000, 3.0000]),
59
+
(2, [1.0000, 2.2000, 3.0000]),
60
+
(3, [4.0000, 5.0000, 6.0000]);
39
61
```
40
62
41
-
**Querying for Similar Documents**
63
+
Find the vector most similar to [1, 2, 3]:
42
64
43
-
Now, let's find the documents that are most similar to a given query using the cosine_distance function:
44
65
```sql
45
-
SELECT
46
-
id,
47
-
title,
48
-
content,
49
-
cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
50
-
FROM
51
-
articles
52
-
ORDER BY
53
-
similarity ASC
54
-
LIMIT 3;
66
+
SELECT
67
+
vec,
68
+
COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 |
63
-
| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 |
64
-
| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |