Documentation for Blobs #141

Merged 7 commits on Feb 13, 2025

Changes from all commits
11 changes: 10 additions & 1 deletion docs/deployments/configuration.md
@@ -821,7 +821,7 @@ storage:
prefetchWrites: true
```

`path` - _Type_: string; _Default_: `<rootPath>/database`

The `path` configuration sets where all database files should reside.

@@ -831,6 +831,15 @@
```
_**Note:**_ This configuration applies to all database files, including the system tables used internally by HarperDB. For this reason, if you wish to use a non-default `path` value, you must move any existing schemas into your `path` location. Existing schemas will likely include the system schema, which can be found at `<rootPath>/schema/system`.

`blobPaths` - _Type_: string; _Default_: `<rootPath>/blobs`

The `blobPaths` configuration sets where all blob files should reside. It accepts a single path or an array of paths; if multiple paths are provided, blobs are distributed across them.

```yaml
storage:
blobPaths:
- /users/harperdb/big-storage
```
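
If multiple volumes are available, a hypothetical multi-path layout might look like this (the directory paths are illustrative):

```yaml
storage:
  blobPaths:
    - /mnt/volume-a/blobs
    - /mnt/volume-b/blobs
```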

`pageSize` - _Type_: number; _Default_: the operating system's default page size

1 change: 1 addition & 0 deletions docs/developers/applications/defining-schemas.md
@@ -201,6 +201,7 @@ HarperDB supports the following field types in addition to user defined (object)
* `Any`: Any primitive, object, or array is allowed.
* `Date`: A Date object.
* `Bytes`: Binary data (as a Buffer or Uint8Array).
* `Blob`: Binary data designed for large blocks of data that can be streamed. It is recommended that you use this for binary data that will typically be larger than 20KB (see the sketch below).
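
For example, a hypothetical schema might use `Bytes` for small payloads and `Blob` for large, streamable ones (the type and field names below are illustrative):

```graphql
type Media {
  id: ID! @primaryKey
  thumbnail: Bytes  # small binary payload, stored inline with the record
  video: Blob       # large binary payload, stored separately and streamable
}
```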

#### Renaming Tables

39 changes: 21 additions & 18 deletions docs/developers/operations-api/clustering.md
@@ -128,33 +128,36 @@ _Operation is restricted to super_user roles only_
"type": "cluster-status",
"connections": [
{
"url": "wss://server-two:9925",
"subscriptions": [
{
"schema": "dev",
"table": "my-table",
"publish": true,
"subscribe": true
}
],
"name": "server-two",
"replicateByDefault": true,
"replicates": true,
"url": "wss://server-2.domain.com:9933",
"name": "server-2.domain.com",
"subscriptions": null,
"database_sockets": [
{
"database": "dev",
"database": "data",
"connected": true,
"latency": 0.84197798371315,
"threadId": 1,
"latency": 0.70,
"thread_id": 1,
"nodes": [
"server-two"
]
}
]
"server-2.domain.com"
],
"lastCommitConfirmed": "Wed, 12 Feb 2025 19:09:34 GMT",
"lastReceivedRemoteTime": "Wed, 12 Feb 2025 16:49:29 GMT",
"lastReceivedLocalTime": "Wed, 12 Feb 2025 16:50:59 GMT",
"lastSendTime": "Wed, 12 Feb 2025 16:50:59 GMT"
},
}
],
"node_name": "server-one",
"node_name": "server-1.domain.com",
"is_enabled": true
}
```
There is a separate socket for each database for each node. Each node is represented in the `connections` array, and each database connection to that node is represented in the `database_sockets` array. Additional timing statistics include:
* `lastCommitConfirmed`: When a commit is sent out, it should receive a confirmation from the remote server; this is the time of the last confirmation received for an outgoing commit.
* `lastReceivedRemoteTime`: The timestamp of the transaction that was last received. The timestamp is from when the original transaction occurred.
* `lastReceivedLocalTime`: The local time when the last transaction was received. If this differs from `lastReceivedRemoteTime`, there is a delay between the original transaction and its receipt, so the node is probably catching up (behind).
* `sendingMessage`: The timestamp of the transaction that is actively being sent. This won't exist if the replicator is waiting for the next transaction to send.
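
As an illustration of how these statistics might be consumed, the hypothetical snippet below estimates replication lag for a connection by comparing the two received timestamps (assuming the fields shown in the response above):

```javascript
// Sketch: estimate replication lag for one entry of the `connections`
// array returned by the cluster_status operation.
function estimateLagMs(connection) {
	const remote = Date.parse(connection.lastReceivedRemoteTime);
	const local = Date.parse(connection.lastReceivedLocalTime);
	// A large positive difference suggests this node is catching up (behind).
	return local - remote;
}
```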

---

78 changes: 78 additions & 0 deletions docs/technical-details/reference/blob.md
@@ -0,0 +1,78 @@
Blobs are binary large objects that can be used to store any type of unstructured/binary data, and they are designed for large content. Blobs support streaming and offer better performance for content larger than about 20KB. Blobs are built on the native JavaScript `Blob` type, which HarperDB extends for integrated storage with the database. To use blobs, you will generally want to declare a field as a `Blob` type in your schema:
```graphql
type MyTable {
id: Any! @primaryKey
data: Blob
}
```

You can then create a blob, which writes the binary data to disk and can be included (as a reference) in a record. For example, you can create a record with a blob like:

```javascript
let blob = await createBlob(largeBuffer);
await MyTable.put({ id: 'my-record', data: blob });
```
The `data` attribute in this example is a blob reference, and can be used like any other attribute in the record, but it is stored separately, and the data must be accessed asynchronously. You can retrieve the blob data with the standard `Blob` methods:

```javascript
let buffer = await blob.bytes();
```
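
Because HarperDB blobs extend the native JavaScript `Blob` type, the other standard accessors should be available as well; a brief sketch (assuming `blob` is the value read from a record):

```javascript
let text = await blob.text(); // decode the contents as a UTF-8 string
let arrayBuffer = await blob.arrayBuffer(); // the contents as an ArrayBuffer
let readable = blob.stream(); // a ReadableStream for incremental consumption
```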

If you are creating a resource method, you can return a `Response` object with a blob as the body:

```javascript
export class MyEndpoint extends MyTable {
	async get() {
		return {
			status: 200,
			headers: {},
			body: this.data // this.data is a blob
		};
	}
}
```
One of the important characteristics of blobs is that they natively support asynchronous streaming of data. This is important for both the creation and retrieval of large data. When we create a blob with `createBlob`, the returned blob will create the storage entry, but the data will be streamed to storage. This means that you can create a blob from a buffer or from a stream. You can also create a record that references a blob before the blob is fully written to storage. For example, you can create a blob from a stream:

```javascript
let blob = await createBlob(stream);
// at this point the blob exists, but the data is still being written to storage
await MyTable.put({ id: 'my-record', data: blob });
// we now have written a record that references the blob
let record = await MyTable.get('my-record');
// we now have a record that gives us access to the blob. We can asynchronously access the blob's data or stream it, and it becomes available as the stream is written to the blob.
let dataStream = record.data.stream();
```
This can be powerful functionality for large media content: data can be streamed into storage while simultaneously being streamed out to users in real time as it is received.
Alternatively, we can wait for the blob to be fully written to storage before creating a record that references it:

```javascript
let blob = await createBlob(stream);
// at this point the blob exists, but the data may not be fully written to storage yet
await blob.save(MyTable);
// we now know the blob is fully written to storage
await MyTable.put({ id: 'my-record', data: blob });
```

Note that this means blobs are _not_ atomic or [ACID](https://en.wikipedia.org/wiki/ACID) compliant; streaming functionality is deliberately the opposite of ACID/atomic writes, which would prevent access to data while it is being written.

### Error Handling
Because blobs can be streamed and referenced prior to their completion, there is a chance that an error or interruption could occur while data is being streamed to the blob (after the record is committed). We can register an error handler on the blob to handle an interrupted blob:

```javascript
export class MyEndpoint extends MyTable {
	async get() {
		let blob = this.data;
		blob.on('error', () => {
			// if this was a caching table, we may want to invalidate or delete this record:
			this.invalidate();
		});
		return {
			status: 200,
			headers: {},
			body: blob
		};
	}
}
```

See the [configuration](../../deployments/configuration.md) documentation for more information on configuring where blobs are stored.
25 changes: 25 additions & 0 deletions docs/technical-details/release-notes/4.tucker/4.5.0.md
@@ -0,0 +1,25 @@
# 4.5.0

#### HarperDB 4.5.0

2/?/2025

### Blob Storage
4.5 introduces a new [Blob storage system](../../reference/blob.md) that is designed to efficiently handle large binary objects, with built-in support for streaming large content/media in and out of storage. This provides significantly better performance and functionality for large unstructured data such as HTML, images, video, and other large files. Components can leverage this functionality through the JavaScript `Blob` interface and the new `createBlob` function. Blobs are fully replicated and integrated.
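
For example (a minimal sketch borrowing the `createBlob` usage from the Blob reference; the table and field names are illustrative):

```javascript
const blob = await createBlob(largeVideoBuffer);
await MyTable.put({ id: 'video-1', data: blob });
```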

### Password Hashing Upgrade
4.5 adds two new password hashing algorithms for better security (replacing md5):
* `sha256`: A solid general-purpose password hashing algorithm with good security properties and excellent performance. This is the default algorithm in 4.5.
* `argon2id`: This provides the highest level of security and is recommended for deployments that do not require frequent password verifications. However, it is more CPU intensive and may not be suitable for environments with a high frequency of password verifications.

### Resource and Storage Analytics
4.5 includes numerous new analytics for resources and storage, including page faults, context switches, free space, disk usage, and other metrics.

### Default Replication Port
The default port for replication has been changed from 9925 to 9933.
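
As a hypothetical illustration, deployments that need to pin the port explicitly might set it in the `replication` section of the configuration (assuming a `replication.port` key; consult the configuration docs for the authoritative setting):

```yaml
replication:
  port: 9933
```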

### Property Forwarding
Record properties on resource instances are now accessible through standard property access syntax, regardless of whether the property was declared in a schema. Previously, only properties declared in a schema were accessible this way. This change allows for more consistent and intuitive access to record properties, regardless of how they were defined. It is still recommended to declare properties in a schema for better performance and documentation. A brief sketch follows.
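
For example, a hypothetical resource method might now read declared and undeclared properties the same way (the field names are illustrative):

```javascript
export class MyEndpoint extends MyTable {
	async get() {
		// `title` is declared in the schema; `extra` is not.
		// Both are now reachable with standard property access.
		return { title: this.title, extra: this.extra };
	}
}
```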

### Cluster Status Information
The `cluster_status` operation now includes new statistics for replication, including the timestamps of last received transactions, sent transactions, and committed transactions.