The CESNET Storage Department provides various types of data services. These services are available to all users with a MetaCentrum login and password.
Storage Department data policies are described on this page only to a certain level of detail. For more detailed information, users should consult the Storage Department documentation pages.
!!! warning "Data storage technology in the Data Storage Department changed as of May 2024"
    For a long time the data were stored on hierarchical storage machines ("HSM" for short) with a directory structure accessible from /storage/du-cesnet.
    Due to a technological innovation of the operated systems, the HSM storages were disconnected and decommissioned. User data have been transferred to machines with Object storage technology.
    Object storage is the successor of HSM with a slightly different set of commands, i.e. it does not work in the same way.
S3 storage is available to all MetaCentrum users.
You can generate your credentials via the Gatekeeper service, where you select your MetaCentrum account and obtain your `access_key` and `secret_key`.
You can use the S3 storage as a simple storage for your data. Use your credentials to configure one of the supported S3 clients, such as `s3cmd`, `s5cmd` (large datasets), or `rclone`. A detailed tutorial for S3 client configuration can be found in the official Data Storage Department tutorials.
You can add `s5cmd` and `rclone` commands directly into your job file.
!!! warning "Bucket creation" Do not forget that the bucket being used for staging MUST exist on the remote S3 data storage. If you plan to stage-out your data into a non-existing bucket the job will fail. You need to prepare the bucket for stage-out in advance. You can use the official Data Storage Department tutorials for particular S3 client.
For Big Data sets, we recommend using the `boto3` library or the `s5cmd` tool.
For general tips and tricks regarding Big Data and CESNET S3 storage, please visit the Big Data Tips and Tricks page.
| Binary | Source code language | Library | Console usage | Python usage | Fit for Big Data transfers |
|---|---|---|---|---|---|
| aws cli | Python | aws cli | Yes | Yes | No |
| s3cmd | Python | s3cmd | Yes | Yes | No |
| s4cmd | Python | boto3 | No | Yes | Yes |
| s5cmd | Go | --- ? --- | Yes | No | Yes |
For further details and more information about all the possible S3 clients, please refer to the official Data Storage Department tutorials.
`boto3` is a Python library that allows you to interact with the S3 storage. You have to use it from your Python scripts; it is not a standalone tool like `s3cmd` or `s5cmd`.
!!! warning "Issues with uploads within the latest boto3 versions"
There have been some issues lately with the latest releases of the boto3
library and third-party S3 storage providers (like the CESNET S3 storage).
As of now, the issues still persist with versions 1.36.x
(1.36.12
to be exact). For this reason, we recommend using the boto3
library with the version 1.35.99
or lower.
For further information, see this GitHub issue.
In order to use the `boto3` library, you need to install it first. You can do so by running the following command inside your Python environment:
pip install boto3
Or you can install a specific version of the library:
pip install boto3==1.35.99
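If you want to check which version ended up in your environment (see the warning above), you can, for example, print the version string from Python:

```python
# Print the installed boto3 version; it should be 1.35.99 or lower (see the warning above)
import boto3

print(boto3.__version__)
```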
First, you need to create an instance of the `s3` client object with your credentials and the endpoint URL:
import boto3
access_key = "********************"
secret_key = "****************************************"
endpoint_url = "https://s3.cl4.du.cesnet.cz"
s3 = boto3.client("s3", aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url)
(Side note: you can use the `~/.aws/credentials` file to store your credentials and use them in your scripts. For more information, see the official boto3 documentation.)
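As an illustration, here is a minimal sketch of building the client from a named profile stored in `~/.aws/credentials` instead of hard-coding the keys; the profile name `profile-name` is only a placeholder matching the credentials file example later on this page:

```python
import boto3

# Load access_key/secret_key from the [profile-name] section of ~/.aws/credentials
session = boto3.Session(profile_name="profile-name")

# The endpoint URL of the CESNET S3 storage still has to be passed explicitly
s3 = session.client("s3", endpoint_url="https://s3.cl4.du.cesnet.cz")
```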
Then you can use the `s3` object to interact with the S3 storage.
For example, you can list all the buckets:
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(f"{bucket['Name']}")
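In the same way you can list the objects inside a bucket. For larger buckets it is safer to use a paginator, since a single listing call returns at most 1000 objects; the bucket name below is only an example:

```python
# List all objects in a bucket (a paginator handles buckets with more than 1000 objects)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bucket-name"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```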
Or you can upload a file to the S3 storage:
s3.upload_file("/local/path/to/file", "bucket-name", "remote/path/to/object")
Alternatively, you can download an object from the S3 storage (be aware of the parameter order!):
s3.download_file("bucket-name", "remote/path/to/object", "/local/path/to/file")
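If you also need to make sure that the stage-out bucket exists (see the bucket creation warning above), or to tune multipart transfers for large files, a sketch along the following lines may help; the bucket name, paths, and transfer values are only illustrative, not prescribed by the CESNET storage:

```python
from boto3.s3.transfer import TransferConfig

# Create the stage-out bucket only if it does not exist yet (bucket name is illustrative)
existing_buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
if "my-bucket" not in existing_buckets:
    s3.create_bucket(Bucket="my-bucket")

# Multipart settings mirroring the 128 MB threshold / 32 MB chunks used in the
# credentials file example below (values are illustrative and can be tuned)
config = TransferConfig(
    multipart_threshold=128 * 1024 * 1024,
    multipart_chunksize=32 * 1024 * 1024,
    max_concurrency=16,
)

s3.upload_file("/local/path/to/large-file", "my-bucket", "remote/path/to/object", Config=config)
```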
To use the `s5cmd` tool, you need to create a credentials file (copy the content below) in your home directory, e.g. /storage/brno2/home/<your-login-name>/.aws/credentials.
[profile-name]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXX
max_concurrent_requests = 200
max_queue_size = 20000
multipart_threshold = 128MB
multipart_chunksize = 32MB
Then you can continue to use `s5cmd` via the commands described in the Data Storage guide.
Alternatively, you can directly add the following lines into your job file.
# define S3CRED with the path to the file where you stored your S3 credentials; by default it is in your home directory
S3CRED=/storage/brno2/home/<your-login-name>/.aws/credentials
#stage in command for s5cmd
s5cmd --credentials-file "${S3CRED}" --profile profile-name --endpoint-url=https://s3.cl4.du.cesnet.cz cp s3://my-bucket/h2o.com ${DATADIR}/
#stage out command for s5cmd
s5cmd --credentials-file "${S3CRED}" --profile profile-name --endpoint-url=https://s3.cl4.du.cesnet.cz cp ${DATADIR}/h2o.out s3://my-bucket/
Alternatively, you can use the `rclone` tool, which is less handy for large data sets. In the case of large data sets (tens of terabytes), please use `s5cmd` or `boto3`, mentioned above.
For `rclone` you need to create a credentials file (copy the content below) in your home directory, e.g. /storage/brno2/home/<your-login-name>/.config/rclone/rclone.conf.
[profile-name]
type = s3
provider = Ceph
access_key_id = XXXXXXXXXXXXXXXXXXX
secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXX
endpoint = s3.cl4.du.cesnet.cz
acl = private
Then you can continue to use `rclone` via the commands described in the Data Storage guide.
Or you can directly add the following lines into your job file.
# define S3CRED with the path to the rclone configuration file where you stored your S3 credentials; by default it is in your home directory
S3CRED=/storage/brno2/home/<your-login-name>/.config/rclone/rclone.conf
#stage in command for rclone
rclone sync --progress --fast-list --config ${S3CRED} profile-name:my-bucket/h2o.com ${DATADIR}
#stage out command for rclone
rclone sync --progress --fast-list --config ${S3CRED} ${DATADIR}/h2o.out profile-name:my-bucket/