Commit c5e9f5f

Update 04-ai-cosine-distance.md
1 parent e3e3fa7 commit c5e9f5f

1 file changed

+58
-42
lines changed

docs/en/sql-reference/20-sql-functions/11-ai-functions/04-ai-cosine-distance.md

Lines changed: 58 additions & 42 deletions
@@ -3,64 +3,80 @@ title: 'COSINE_DISTANCE'
 description: 'Measuring similarity using the cosine_distance function in Databend'
 ---

-This document provides an overview of the cosine_distance function in Databend and demonstrates how to measure document similarity using this function.
+Calculates the cosine distance between two vectors, measuring how dissimilar they are.

-:::info
+## Syntax
+
+```sql
+COSINE_DISTANCE(vector1, vector2)
+```

-The cosine_distance function performs vector computations within Databend and does not rely on the (Azure) OpenAI API.
+## Arguments

-:::
+- `vector1`: First vector (ARRAY(FLOAT32 NOT NULL))
+- `vector2`: Second vector (ARRAY(FLOAT32 NOT NULL))
+
+## Returns
+
+Returns a FLOAT value between 0 and 2:
+- 0: Vectors pointing in the same direction (most similar)
+- 1: Orthogonal vectors; values up to 2 indicate vectors pointing in opposite directions
+
+## Description
+
+The cosine distance measures the dissimilarity between two vectors based on the angle between them, regardless of their magnitude. The function:
+
+1. Verifies that both input vectors have the same length
+2. Computes the sum of element-wise products (dot product) of the two vectors
+3. Calculates the square root of the sum of squares for each vector (vector magnitudes)
+4. Returns `1 - (dot_product / (magnitude1 * magnitude2))`
+
+The mathematical formula implemented is:

-The cosine_distance function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.
+```
+cosine_distance(v1, v2) = 1 - (Σ(v1ᵢ * v2ᵢ) / (√Σ(v1ᵢ²) * √Σ(v2ᵢ²)))
+```
+
+Where v1ᵢ and v2ᵢ are the elements of the input vectors.
+
+:::info
+This function performs vector computations within Databend and does not rely on external APIs.
+:::

-Cosine distance is a measure of similarity between two vectors, based on the cosine of the angle between them. The function takes two input vectors and returns a value between 0 and 1, with 0 indicating identical vectors and 1 indicating orthogonal (completely dissimilar) vectors.

 ## Examples

-**Creating a Table and Inserting Sample Data**
+Create a table with vector data:

-Let's create a table to store some sample text documents and their corresponding embeddings:
 ```sql
-CREATE TABLE articles (
+CREATE OR REPLACE TABLE vectors (
     id INT,
-    title VARCHAR,
-    content VARCHAR,
-    embedding ARRAY(FLOAT32)
+    vec ARRAY(FLOAT32 NOT NULL)
 );
-```

-Now, let's insert some sample documents into the table:
-```sql
-INSERT INTO articles (id, title, content, embedding)
-VALUES
-    (1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
-    (2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
-    (3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
+INSERT INTO vectors VALUES
+    (1, [1.0000, 2.0000, 3.0000]),
+    (2, [1.0000, 2.2000, 3.0000]),
+    (3, [4.0000, 5.0000, 6.0000]);
 ```

-**Querying for Similar Documents**
+Find the vector most similar to [1, 2, 3]:

-Now, let's find the documents that are most similar to a given query using the cosine_distance function:
 ```sql
-SELECT
-    id,
-    title,
-    content,
-    cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
-FROM
-    articles
-ORDER BY
-    similarity ASC
-LIMIT 3;
+SELECT
+    vec,
+    COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
+FROM
+    vectors
+ORDER BY
+    distance ASC
+LIMIT 1;
 ```

-Result:
-```sql
-+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
-| id | title | content | similarity |
-+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
-| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 |
-| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 |
-| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
-+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
-```
+```
++-------------------------+----------+
+| vec                     | distance |
++-------------------------+----------+
+| [1.0000,2.0000,3.0000]  | 0.0      |
++-------------------------+----------+
+```
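
The four steps listed under Description in the changed file can be sketched in plain Python and applied to the three sample rows from the Examples section. This is only an illustration of the arithmetic: the Python function `cosine_distance` and the `rows` dictionary are hypothetical names, the code is not Databend's implementation, and it uses Python's 64-bit floats rather than FLOAT32, so the last digits may differ from what Databend prints.

```python
import math


def cosine_distance(v1, v2):
    """Illustrative re-implementation of the documented formula (not Databend code)."""
    # Step 1: both vectors must have the same length.
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Step 2: dot product, the sum of element-wise products.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    # Step 3: magnitudes, the square root of the sum of squares of each vector.
    magnitude1 = math.sqrt(sum(a * a for a in v1))
    magnitude2 = math.sqrt(sum(b * b for b in v2))
    # Step 4: one minus the cosine of the angle between the vectors.
    # A zero vector would divide by zero here; this sketch does not handle it.
    return 1 - dot_product / (magnitude1 * magnitude2)


# Sample rows copied from the INSERT statement in the Examples section.
rows = {
    1: [1.0, 2.0, 3.0],
    2: [1.0, 2.2, 3.0],
    3: [4.0, 5.0, 6.0],
}
query = [1.0, 2.0, 3.0]

# Distance of every row to the query vector, then the row with the smallest one,
# which mirrors ORDER BY distance ASC LIMIT 1.
distances = {row_id: cosine_distance(vec, query) for row_id, vec in rows.items()}
nearest = min(distances, key=distances.get)

for row_id, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(row_id, rows[row_id], distance)
print("nearest row:", nearest)
```

Under this arithmetic, row 1 matches the query exactly and has a distance of 0 up to floating-point rounding, row 2 is roughly 0.001 away, and row 3 roughly 0.025. The same function returns 1 for the orthogonal pair `[1, 0]` and `[0, 1]`, and 2 for the opposite pair `[1, 0]` and `[-1, 0]`.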
