Update of S3 clients documentation #43

Open · wants to merge 5 commits into base: main

58 changes: 58 additions & 0 deletions docs/data/big-data-tips-and-tricks.md
@@ -0,0 +1,58 @@
# Big Data Tips and Tricks

For **Big Data** sets, we recommend using the `boto3` library or the `s5cmd` tool.

## Fewer big files are better than many small files

When transferring **Big Data**, it is better to transfer fewer big files than many small files.

Be aware that when you transfer a lot of small files, the overhead of the transfer process can be significant.
You can save time and resources by packing the small files into a single big file and transferring it as one object.
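For illustration, a minimal sketch of this approach using Python's `tarfile` module together with `boto3` might look as follows; the bucket name, paths, endpoint, and credentials are placeholders, not values prescribed by this guide:

```python
import tarfile

import boto3

# Pack the many small files into a single archive, so only one object is transferred.
with tarfile.open("dataset.tar", "w") as tar:
    tar.add("my_small_files/", arcname="my_small_files")

# Upload the archive as a single object.
# The endpoint below is taken from the rclone example on the Storage Department page;
# verify it for your own allocation.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.cl4.du.cesnet.cz",
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)
s3.upload_file("dataset.tar", "my-bucket", "dataset.tar")
```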

## Chunk size matters

When transferring big files, the upload (or download) process is divided into chunks - so-called `multipart uploads` (or downloads).
The size of these chunks can have a significant impact on the transfer speed.

The optimal chunk size depends on the size of the files you are transferring and the network conditions.

There is no one-size-fits-all solution, so you should experiment with different chunk sizes to find the optimal one for your use case.
We recommend starting with a chunk size of `file_size / 1000` (where `file_size` is the size of the file you are transferring).
You can then adjust the chunk size based on the results of your experiments.
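For example, with `boto3` the chunk size can be tuned via `TransferConfig`; the sketch below starts from the `file_size / 1000` rule of thumb (bucket name, endpoint, and credentials are again placeholders):

```python
import os

import boto3
from boto3.s3.transfer import TransferConfig

file_path = "big_dataset.tar"
file_size = os.path.getsize(file_path)

config = TransferConfig(
    multipart_threshold=128 * 1024 * 1024,  # use multipart transfers above 128 MB
    # Start around file_size / 1000, but keep a sane floor (S3 requires parts of at least 5 MB).
    multipart_chunksize=max(file_size // 1000, 8 * 1024 * 1024),
    max_concurrency=10,  # number of parts transferred in parallel
)

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.cl4.du.cesnet.cz",  # placeholder - use your endpoint
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)
s3.upload_file(file_path, "my-bucket", "big_dataset.tar", Config=config)
```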

## Cluster choice matters

Some clusters offer a better `network interface` than others.

When transferring big files, it is important to choose a cluster with a good network interface.
One such cluster is `halmir`, whose machines offer a `10 Gbps` network interface.

You can check the available clusters and their network interfaces on the [official website](https://metavo.metacentrum.cz/pbsmon2/nodes/physical) of MetaCentrum.

## Hard disk speed does not matter

Our research has shown that the speed of the hard disk does not have a significant impact on the transfer speed.

When transferring big files, the network interface is the bottleneck, not the hard disk speed.

Therefore, you do not need to worry about using `tmpfs` or a `ramdisk` when transferring big files.

## Utilize compression

When transferring big files, it is a good idea to utilize compression.

You can compress the files before transferring them, effectively reducing the time and resources needed for the transfer.

The choice of compression algorithm depends on the type of files you are transferring; there is no one-size-fits-all solution.
We recommend using the `zstandard` algorithm, as it offers a good balance between compression ratio and decompression speed.
Depending on the type of your files, you can also consider using the `gzip`, `bzip2`, or `xz` algorithms.
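As an illustration, compressing an archive with the Python `zstandard` package (`pip install zstandard`) before the transfer might look like this; the file names are placeholders:

```python
import zstandard as zstd

# Level 3 is the library default and a reasonable starting point;
# threads=-1 enables multi-threaded compression on all logical CPUs.
cctx = zstd.ZstdCompressor(level=3, threads=-1)

with open("dataset.tar", "rb") as src, open("dataset.tar.zst", "wb") as dst:
    cctx.copy_stream(src, dst)
```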

For more information about the compression algorithms, please check this [comparison](https://quixdb.github.io/squash-benchmark/).

## Use the right tool for the job

When transferring big files, it is important to use the right tool for the job.

If you are unsure which tool to use, we recommend checking the [Storage Department](storage-department.md) page with a table of S3 service clients.

In short, we recommend using the `boto3` library or the `s5cmd` tool for **Big Data** transfers.
54 changes: 44 additions & 10 deletions docs/data/storage-department.md
@@ -1,26 +1,57 @@
# Storage Department services

The CESNET Storage Department provides various types of data services.
It is available to all users with **MetaCentrum login and password**.

Storage Department data policies are described on this page only to a certain level of detail.
For more detailed information, users should navigate to the [Storage Department documentation pages](https://docs.du.cesnet.cz).

!!! warning "Data storage technology in the Data Storage Department has changed by May 2024"
For a long time the data were stored on hierarchical storage machines ("HSM" for short) with a directory structure accessible from `/storage/du-cesnet`.<br/> Due to technological innovation of operated system were the HSM storages disconnected and decommissioned. User data have been transferred to [machines with Object storage technology](https://docs.du.cesnet.cz/en/object-storage-s3/s3-service).<br/> Object storage is successor of HSM with slightly different set of commands, i.e. it **does not** work in the same way.

## Object storage
S3 storage is available for all MetaCentrum users.
You can generate your credentials via the [Gatekeeper service](https://access.du.cesnet.cz/#/).
There you select your MetaCentrum account and obtain your `access_key` and `secret_key`.

### Simple storage - use when you simply need to store your data

You can use the S3 storage as simple storage to store your data.
You can use your credentials to configure some of the supported S3 clients such as `s3cmd`, `s5cmd` (for large data sets), and `rclone`.
The detailed tutorial for S3 client configuration can be found in the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients).

### Direct usage in the job file
You can add `s5cmd` and `rclone` commands directly into your job file.

!!! warning "Bucket creation"
Do not forget that the bucket being used for staging MUST exist on the remote S3 data storage. If you plan to stage-out your data into a non-existing bucket the job will fail. You need to prepare the bucket for stage-out in advance. You can use the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients) for particular S3 client.
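For illustration, one possible way to prepare such a bucket in advance from Python is sketched below with `boto3`; the bucket name, endpoint, and credentials are placeholders:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.cl4.du.cesnet.cz",  # placeholder - use your endpoint
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)

# The stage-out bucket must exist before the job runs.
s3.create_bucket(Bucket="my-bucket")
```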

### Big Data transfers

For **Big Data** sets, we recommend using the `boto3` library or the `s5cmd` tool.

For general tips and tricks regarding **Big Data** and **CESNET S3 storage**, please visit the [Big Data Tips and Tricks](big-data-tips-and-tricks.md) page.
### S3 service clients

| Binary | Source code language | Library | Console usage | Python usage | Fit for Big Data transfers |
|-----------------|----------------------|-----------------|---------------|--------------|----------------------------|
| aws cli | Python | aws cli | Yes | Yes | No |
| s3cmd | Python | s3cmd | Yes | Yes | No |
| s4cmd | Python | [boto3](#boto3) | No | Yes | Yes |
| [s5cmd](#s5cmd) | Go | --- ? --- | Yes | No | Yes |

For further details and more information about all the possible S3 clients, please refer to the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/docs/object-storage-s3/s3-service).

#### boto3

`boto3` is a **Python** library that allows you to interact with the S3 storage.
You have to use it from your **Python** scripts - it is not a standalone tool like `s3cmd` or `s5cmd`.
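A minimal usage sketch is shown below; the endpoint, credentials, bucket, and file names are placeholders (they mirror the staging examples further down on this page):

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.cl4.du.cesnet.cz",  # placeholder - use your endpoint
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)

s3.upload_file("h2o.out", "my-bucket", "h2o.out")    # stage out a result file
s3.download_file("my-bucket", "h2o.out", "h2o.out")  # stage in the same object
```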

For more details and information about `boto3`, please check the [Data Storage guide](https://docs.du.cesnet.cz/en/docs/object-storage-s3/boto3).

#### s5cmd

To use the `s5cmd` tool, you need to create a credentials file (copy the content below) in your home directory, e.g. `/storage/brno2/home/<your-login-name>/.aws/credentials`.

```
[profile-name]
@@ -32,7 +63,8 @@ multipart_threshold = 128MB
multipart_chunksize = 32MB
```

Then you can continue to use `s5cmd` via the commands described in the [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/s5cmd).
Alternatively, you can directly add the following lines into your job file.

```
#define CREDDIR, where you stored your S3 credentials; default is your home directory
@@ -47,7 +79,9 @@ s5cmd --credentials-file "${S3CRED}" --profile profile-name --endpoint-url=https

#### rclone

Alternatively, you can use the `rclone` tool, which is less handy for large data sets.
For large data sets (tens of terabytes), please use `s5cmd` or `boto3`, mentioned above.
For `rclone` you need to create a credentials file (copy the content below) in your home directory, e.g. `/storage/brno2/home/<your-login-name>/.config/rclone/rclone.conf`.

```
[profile-name]
@@ -59,7 +93,8 @@ endpoint = s3.cl4.du.cesnet.cz
acl = private
```

Then you can continue to use `rclone` via the commands described in the [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/rclone).
Or you can directly add the following lines into your job file.

```
#define CREDDIR, where you stored your S3 credentials; default is your home directory
@@ -71,4 +106,3 @@ rclone sync --progress --fast-list --config ${S3CRED} profile-name:my-bucket/h2o
#stage out command for rclone
rclone sync --progress --fast-list --config ${S3CRED} ${DATADIR}/h2o.out profile-name:my-bucket/
```