| 1 | +# Bulk Storage Systems |
| 2 | + |
| 3 | +## Why External Bulk Storage? |
| 4 | + |
| 5 | +DataJoint supports storing large data objects associated with |
| 6 | +relational records outside the MySQL database itself. This is |
| 7 | +useful for several reasons. |
| 8 | + |
| 9 | +### Cost |
| 10 | + |
| 11 | +The high-performance storage commonly used by database systems is |
| 12 | +more expensive than typical commodity storage. Keeping the smaller |
| 13 | +identifying information used in queries on fast relational database |
| 14 | +storage, and the larger bulk data used for analysis or processing on |
| 15 | +lower-cost commodity storage, can yield large savings in storage |
| 16 | +expense. |
| 17 | + |
| 18 | +### Flexibility |
| 19 | + |
| 20 | +Storing bulk data separately also allows more flexibility in usage, |
| 21 | +since the bulk data can be managed using maintenance processes |
| 22 | +separate from those used for the relational storage. |
| 23 | + |
| 24 | +For example, larger relational databases may require many hours to be |
| 25 | +restored in the event of system failures. If the relational portion of |
| 26 | +the data is stored separately, with the larger bulk data stored on |
| 27 | +another storage system, this downtime can be reduced to a matter of |
| 28 | +minutes. Similarly, because bulk commodity storage costs less, more |
| 29 | +can be invested in redundancy and backups to help protect the |
| 30 | +non-relational data. |
| 31 | + |
| 32 | +### Performance |
| 33 | + |
| 34 | +Storing the non-relational bulk data separately can improve system |
| 35 | +performance by removing data transfer, disk I/O, and memory load from |
| 36 | +the database server and shifting these to the bulk storage system. |
| 37 | +Additionally, DataJoint supports caching of bulk data records, which |
| 38 | +can allow faster processing of records that have already been |
| 39 | +retrieved in previous queries. |
| 40 | + |
| 41 | +### Data Sharing |
| 42 | + |
| 43 | +DataJoint provides pluggable support for different external bulk |
| 44 | +storage backends. This can benefit data sharing by publishing bulk |
| 45 | +data to S3-compatible data shares, both in the cloud and on locally |
| 46 | +managed systems, and to other common data sharing tools such as |
| 47 | +Globus. |
| 48 | + |
| 49 | +## Bulk Storage Scenarios |
| 50 | + |
| 51 | +Typical bulk storage considerations relate to the cost of the storage |
| 52 | +backend per unit of storage, the amount of data that will be stored, |
| 53 | +the desired focus of the shared data (system performance, data |
| 54 | +flexibility, data sharing), and data access patterns. Some common |
| 55 | +scenarios are given in the following table: |
| 56 | + |
| 57 | +| Scenario | Storage Solution | System Requirements | Notes | |
| 58 | +| -- | -- | -- | -- | |
| 59 | +| Local Object Cache | Local External Storage | Local Hard Drive | Used to speed access to other storage | |
| 60 | +| LAN Object Cache | Network External Storage | Local Network Share | Used to speed access to other storage and to reduce cloud/network costs and overhead | |
| 61 | +| Local Object Store | Local/Network External Storage | Local/Network Storage | Used to store objects externally from the database | |
| 62 | +| Local S3-Compatible Store | Local S3-Compatible Server | Network S3-Server | Used to host S3-compatible services locally (e.g. MinIO) for internal use or to lower cloud costs | |
| 63 | +| Cloud S3-Compatible Storage | Cloud Provider | Internet Connectivity | Used to reduce or remove the need for external storage management, and for data sharing | |
| 64 | +| Globus Storage | Globus Endpoint | Local/Local Network Storage, Internet Connectivity | Used for institutional data transfer or publishing | |
| 65 | + |
| 66 | +## Bulk Storage Considerations |
| 67 | + |
| 68 | +Although external bulk storage provides a variety of advantages for |
| 69 | +storage cost and data sharing, it also uses slightly different data |
| 70 | +input/retrieval semantics and as such has different performance |
| 71 | +characteristics. |
| 72 | + |
| 73 | +### Performance Characteristics |
| 74 | + |
| 75 | +In the direct database connection scenario, entire result sets are |
| 76 | +either added to or retrieved from the database in a single stream |
| 77 | +action. In the case of external storage, individual record components |
| 78 | +are retrieved in a set of sequential actions per record, each one |
| 79 | +subject to the network round trip to the given storage medium. As |
| 80 | +such, tables using many small records may be ill-suited to external |
| 81 | +storage usage in the absence of a caching mechanism. While some of |
| 82 | +these impacts may be addressed by code changes in a future release of |
| 83 | +DataJoint, to some extent the impact results directly from the need |
| 84 | +to coordinate the activities of the database data stream with the |
| 85 | +external storage system, and so cannot be avoided. |
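
The round-trip effect described above can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not a measurement of DataJoint itself: the function names, the fixed per-record transfer time, and the treatment of cache hits as free are all hypothetical simplifications.

```python
def direct_fetch_seconds(n_records, round_trip_s, per_record_transfer_s):
    """Streamed result set: assume one round trip, then continuous transfer."""
    return round_trip_s + n_records * per_record_transfer_s


def external_fetch_seconds(n_records, round_trip_s, per_record_transfer_s,
                           cache_hit_rate=0.0):
    """Sequential per-record retrieval: every uncached record pays a round trip.

    Assumption: records served from a local cache cost nothing.
    """
    uncached = n_records * (1.0 - cache_hit_rate)
    return uncached * (round_trip_s + per_record_transfer_s)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 small records, 20 ms round trip,
    # 0.1 ms of transfer time per record.
    n, rtt, transfer = 10_000, 0.020, 0.0001
    print("direct:", direct_fetch_seconds(n, rtt, transfer))
    print("external, no cache:", external_fetch_seconds(n, rtt, transfer))
    print("external, 90% cached:", external_fetch_seconds(n, rtt, transfer, 0.9))
```

With these illustrative numbers the streamed fetch takes roughly one second while the uncached per-record fetch takes on the order of hundreds of seconds, which is why many small records tend to be a poor fit without a cache.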
| 86 | + |
| 87 | +### Network Traffic |
| 88 | + |
| 89 | +Some of the external storage solutions mentioned above incur costs |
| 90 | +for both data volume and transfer bandwidth. The number of users |
| 91 | +querying the database, data access patterns, and use of caches should |
| 92 | +be considered in these cases to reduce this cost where applicable. |
| 93 | + |
| 94 | +### Data Coherency |
| 95 | + |
| 96 | +When storing all data directly in the relational data store, it is |
| 97 | +relatively easy to ensure that all data in the database is consistent |
| 98 | +in the event of system issues such as crash recoveries, since MySQL’s |
| 99 | +relational storage engine manages this for you. When using external |
| 100 | +storage, however, it is important to ensure that any data recoveries |
| 101 | +of the database system are paired with a matching point-in-time of the |
| 102 | +external storage system. While DataJoint does use hashing to help |
| 103 | +ensure that external files are uniquely named throughout their |
| 104 | +lifecycle, the pairing of a given relational dataset against a given |
| 105 | +filesystem state is loosely coupled, and so an incorrect pairing could |
| 106 | +result in processing failures or other issues. |
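
As a rough illustration of the idea, the sketch below names objects by a hash of their content and then checks whether every name referenced by a restored database is present in the external store. It is not DataJoint's actual naming scheme or API: the choice of SHA-256 and both function names are assumptions made only for this example.

```python
import hashlib


def content_addressed_name(payload: bytes) -> str:
    """Derive a file name from the object's content (hypothetical scheme).

    Identical content always maps to the same name, and different content
    maps to a different name with overwhelming probability.
    """
    return hashlib.sha256(payload).hexdigest()


def find_missing_objects(referenced_names, stored_names):
    """Return names the database refers to that the external store lacks."""
    return sorted(set(referenced_names) - set(stored_names))


if __name__ == "__main__":
    # Names recorded in the (restored) relational database.
    referenced = [content_addressed_name(b"trial-1"),
                  content_addressed_name(b"trial-2")]
    # Names present in an older snapshot of the external store.
    stored = [content_addressed_name(b"trial-1")]
    # A non-empty result signals a mismatched database/storage pairing.
    print(find_missing_objects(referenced, stored))
```

A check along these lines, run after any recovery, is one way to detect that the database and the external store were restored to different points in time before processing jobs start failing.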