
Commit ccb601f

Add bulk storage page
1 parent d178ac1 commit ccb601f


2 files changed

+107
-0
lines changed

docs/mkdocs.yaml

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ nav:
 - Glossary: concepts/glossary.md
 - System Administration:
 - Database Administration: sysadmin/dba.md
+- Bulk Storage Systems: sysadmin/bulk-storage.md
 - File Storage: sysadmin/filestore.md
 - Backups and Recovery: sysadmin/backup.md
 - Database Server Hosting: sysadmin/hosting.md

docs/src/sysadmin/bulk-storage.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# Bulk Storage Systems

## Why External Bulk Storage?

DataJoint supports storing large data objects associated with
relational records outside the MySQL database itself. This is useful
for several reasons.

### Cost

High-performance storage of the kind commonly used for database
systems is more expensive than typical commodity storage. Keeping the
smaller identifying information typically used in queries on fast
relational database storage, and placing the larger bulk data used for
analysis or processing on lower-cost commodity storage, can therefore
yield large savings in storage expense.
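
As a rough illustration of the cost argument, the sketch below compares
keeping everything on database-grade storage with splitting the data
across two tiers. The function name, per-terabyte prices, and data
volumes are hypothetical placeholders chosen for the example, not
figures from any vendor or from DataJoint.

```python
def monthly_storage_cost(relational_tb, bulk_tb, fast_price,
                         commodity_price, split=True):
    """Monthly storage cost for a dataset.

    Prices are per TB per month. With split=False, all data is assumed
    to sit on the fast (database-grade) tier.
    """
    if split:
        return relational_tb * fast_price + bulk_tb * commodity_price
    return (relational_tb + bulk_tb) * fast_price


# Hypothetical numbers, for illustration only.
FAST_PRICE = 100.0       # assumed database-grade storage price per TB-month
COMMODITY_PRICE = 20.0   # assumed commodity storage price per TB-month

all_fast = monthly_storage_cost(0.5, 50.0, FAST_PRICE, COMMODITY_PRICE,
                                split=False)
tiered = monthly_storage_cost(0.5, 50.0, FAST_PRICE, COMMODITY_PRICE,
                              split=True)

print(f"all on fast storage: {all_fast:.2f}")
print(f"tiered:              {tiered:.2f}")
print(f"saving:              {all_fast - tiered:.2f}")
```

The saving scales with the ratio of bulk to relational data, so
substituting real prices and volumes is the only way to tell whether it
matters for a given deployment.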

### Flexibility

Storing bulk data separately also allows more flexibility in usage,
since the bulk data can be managed using maintenance processes
separate from those used for the relational storage.

For example, larger relational databases may require many hours to be
restored in the event of system failure. If the relational portion of
the data is stored separately, with the larger bulk data on another
storage system, this downtime can be reduced to a matter of minutes.
Similarly, because bulk commodity storage costs less, more can be
invested in redundancy and backups to help protect the non-relational
data.

### Performance

Storing the non-relational bulk data separately can improve system
performance by moving data transfer, disk I/O, and memory load off the
database server and onto the bulk storage system. Additionally,
DataJoint supports caching of bulk data records, which can speed up
processing of records that have already been retrieved in previous
queries.
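
The caching idea can be sketched generically as a cache-first read: look
for the object on local disk and go to the bulk store only on a miss.
The `LocalObjectCache` class below is a hypothetical illustration using
only the Python standard library. It is not DataJoint's cache
implementation, and an in-memory dict stands in for the remote store.

```python
import hashlib
import tempfile
from pathlib import Path


class LocalObjectCache:
    """Hypothetical cache-first reader (not DataJoint's implementation)."""

    def __init__(self, cache_dir, fetch_remote):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_remote = fetch_remote  # callable: key -> bytes
        self.remote_reads = 0             # counts trips to the bulk store

    def get(self, key):
        # Hash the key so that any string maps to a safe file name.
        path = self.cache_dir / hashlib.sha256(key.encode()).hexdigest()
        if path.exists():
            return path.read_bytes()      # cache hit: no remote access
        data = self.fetch_remote(key)     # cache miss: fetch and keep a copy
        self.remote_reads += 1
        path.write_bytes(data)
        return data


# An in-memory dict stands in for the remote bulk store.
remote_store = {"scan-001": b"\x00" * 1024}

with tempfile.TemporaryDirectory() as tmp:
    cache = LocalObjectCache(tmp, remote_store.__getitem__)
    first = cache.get("scan-001")    # fetched from the stand-in store
    second = cache.get("scan-001")   # served from the local cache
    print("remote reads:", cache.remote_reads)
```

A real cache would also need an eviction policy and a way to detect
stale entries, both of which are omitted here.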

### Data Sharing

DataJoint provides pluggable support for different external bulk
storage backends. This can benefit data sharing, since bulk data can
be published to S3-protocol-compatible data shares, both in the cloud
and on locally managed systems, and through other common data sharing
tools such as Globus.

## Bulk Storage Scenarios

Typical bulk storage considerations relate to the cost of the storage
backend per unit of storage, the amount of data to be stored, the
primary goal of the deployment (system performance, data flexibility,
or data sharing), and data access. Some common scenarios are given in
the following table:

| Scenario | Storage Solution | System Requirements | Notes |
| -- | -- | -- | -- |
| Local Object Cache | Local External Storage | Local Hard Drive | Used to speed access to other storage |
| LAN Object Cache | Network External Storage | Local Network Share | Used to speed access to other storage and to reduce cloud/network costs and overhead |
| Local Object Store | Local/Network External Storage | Local/Network Storage | Used to store objects externally from the database |
| Local S3-Compatible Store | Local S3-Compatible Server | Network S3 Server | Used to host S3-compatible services locally (e.g. MinIO) for internal use or to lower cloud costs |
| Cloud S3-Compatible Storage | Cloud Provider | Internet Connectivity | Used to reduce or remove the need for external storage management, and for data sharing |
| Globus Storage | Globus Endpoint | Local/Local Network Storage, Internet Connectivity | Used for institutional data transfer or publishing |

## Bulk Storage Considerations

Although external bulk storage provides a variety of advantages for
storage cost and data sharing, it also uses slightly different data
input/retrieval semantics and, as a result, has different performance
characteristics.

### Performance Characteristics

In the direct database connection scenario, entire result sets are
either added to or retrieved from the database in a single stream
action. With external storage, individual record components are
retrieved in a set of sequential actions per record, each one subject
to the network round trip to the given storage medium. As such, tables
with many small records may be ill-suited to external storage in the
absence of a caching mechanism. While some of these impacts may be
addressed by code changes in a future release of DataJoint, to some
extent the impact results directly from the need to coordinate the
activities of the database data stream with the external storage
system, and so cannot be avoided.
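
A back-of-the-envelope model shows why per-record round trips dominate
for small records. The sketch below assumes strictly sequential
requests with no parallelism or pipelining. The 20 ms round trip and
100 MB/s throughput are made-up values, and `fetch_time_seconds` is a
hypothetical helper rather than anything DataJoint provides.

```python
def fetch_time_seconds(n_records, bytes_per_record, round_trip_s,
                       bandwidth_bytes_per_s, per_record_requests=True):
    """Estimate wall-clock time to retrieve n_records objects.

    per_record_requests=True models external storage, where each record
    pays its own round trip. False models a single streamed result set.
    """
    transfer = n_records * bytes_per_record / bandwidth_bytes_per_s
    round_trips = n_records if per_record_requests else 1
    return round_trips * round_trip_s + transfer


# Assumed link characteristics. All three cases move 1 GB in total.
RTT = 0.02                 # assumed 20 ms round trip
BANDWIDTH = 100_000_000    # assumed 100 MB/s throughput

many_small = fetch_time_seconds(1_000_000, 1_000, RTT, BANDWIDTH)
few_large = fetch_time_seconds(100, 10_000_000, RTT, BANDWIDTH)
streamed = fetch_time_seconds(1_000_000, 1_000, RTT, BANDWIDTH,
                              per_record_requests=False)

print(f"1,000,000 x 1 KB, one request each: {many_small:,.0f} s")
print(f"100 x 10 MB, one request each:      {few_large:,.0f} s")
print(f"1,000,000 x 1 KB, single stream:    {streamed:,.0f} s")
```

Under these assumptions the transfer time is the same in every case, so
the gap between them comes entirely from the number of round trips.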

### Network Traffic

Some of the external storage solutions mentioned above incur costs for
both the volume of data stored and the bandwidth used to transfer it.
In these cases, the number of users querying the database, their data
access patterns, and the use of caches should be considered in order
to reduce this cost where applicable.

### Data Coherency

When all data is stored directly in the relational data store, it is
relatively easy to ensure that the data in the database remains
consistent in the event of system issues such as crash recoveries,
since MySQL’s relational storage engine manages this for you. When
using external storage, however, it is important to ensure that any
recovery of the database system is paired with a matching
point-in-time state of the external storage system. While DataJoint
uses hashing to help ensure that external files are uniquely named
throughout their lifecycle, the pairing of a given relational dataset
with a given filesystem state is loosely coupled, and so an incorrect
pairing could result in processing failures or other issues.
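
One way to detect a mismatched pairing after a restore is to compare
the object names the database refers to with the names actually present
in the external store. The sketch below is a generic illustration of
that check, not DataJoint's own mechanism. The SHA-256 naming scheme
and the `content_name` and `check_pairing` helpers are assumptions made
for the example.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Name an object by the SHA-256 of its contents (an assumed scheme)."""
    return hashlib.sha256(data).hexdigest()


def check_pairing(referenced_names, stored_names):
    """Compare names referenced by the database with names in the store."""
    referenced, stored = set(referenced_names), set(stored_names)
    return {
        # Referenced by the database but absent from the store.
        "missing": sorted(referenced - stored),
        # Present in the store but not referenced by any record.
        "orphaned": sorted(stored - referenced),
    }


a, b, c = (content_name(x) for x in (b"alpha", b"beta", b"gamma"))

# Simulated mismatch: the database refers to a and b, the store holds b and c.
report = check_pairing(referenced_names=[a, b], stored_names=[b, c])
print("missing: ", report["missing"])
print("orphaned:", report["orphaned"])
```

Missing objects are the ones likely to cause processing failures, while
orphaned objects mainly waste space, so the two lists usually call for
different follow-up.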
