Add bulk storage page

kabilar · kabilar · commit ccb601f08778 · 2023-07-06T22:16:32.000-05:00
diff --git a/docs/mkdocs.yaml b/docs/mkdocs.yaml
@@ -14,6 +14,7 @@ nav:
     - Glossary: concepts/glossary.md
   - System Administration: 
     - Database Administration: sysadmin/dba.md
+    - Bulk Storage Systems: sysadmin/bulk-storage.md
     - File Storage: sysadmin/filestore.md
     - Backups and Recovery: sysadmin/backup.md
     - Database Server Hosting: sysadmin/hosting.md
diff --git a/docs/src/sysadmin/bulk-storage.md b/docs/src/sysadmin/bulk-storage.md
@@ -0,0 +1,106 @@
+# Bulk Storage Systems
+
+## Why External Bulk Storage?
+
+DataJoint supports the storage of large data objects associated with
+relational records externally from the MySQL Database itself. This is
+significant and useful for a number of reasons.
+
+### Cost
+
+One of these is that the high-performance storage commonly used in
+database systems is more expensive than that used in more typical
+commodity storage, and so storing the smaller identifying information
+typically used in queries on fast, relational database storage and
+storing the larger bulk data used for analysis or processing on lower
+cost commodity storage can allow for large savings in storage expense.
+
+### Flexibility
+
+Storing bulk data separately also facilitates more flexibility in
+usage, since the bulk data can managed using separate maintenance
+processes than that in the relational storage.
+
+For example, larger relational databases may require many hours to be
+restored in the event of system failures. If the relational portion of
+the data is stored separately, with the larger bulk data stored on
+another storage system, this downtime can be reduced to a matter of
+minutes. Similarly, due to the lower cost of bulk commodity storage,
+more emphasis can be put into redundancy of this data and backups to
+help protect the non-relational data.
+
+### Performance
+
+Storing the non-relational bulk data separately can have system
+performance impacts by removing data transfer, disk I/O, and memory
+load from the database server and shifting these to the bulk storage
+system. Additionally, DataJoint supports caching of bulk data records
+which can allow for faster processing of records which already have
+been retrieved in previous queries.
+
+### Data Sharing
+
+DataJoint provides pluggable support for different external bulk
+storage backends, which can provide benefits for data sharing by
+publishing bulk data to S3-Protocol compatible data shares both in the
+cloud and on locally managed systems and other common tools for data
+sharing, such as Globus, etc.
+
+## Bulk Storage Scenarios
+
+Typical bulk storage considerations relate to the cost of the storage
+backend per unit of storage, the amount of data which will be stored,
+the desired focus of the shared data (system performance, data
+flexibility, data sharing), and data access. Some common scenarios are
+given in the following table:
+
+| Scenario | Storage Solution | System Requirements | Notes |
+| -- | -- | -- | -- |
+| Local Object Cache | Local External Storage | Local Hard Drive | Used to Speed Access to other Storage |
+| LAN Object Cache | Network External Storage | Local Network Share | Used to Speed Access to other storage, reduce Cloud/Network Costs/Overhead |
+| Local Object Store | Local/Network External Storage | Local/Network Storage | Used to store objects externally from the database |
+| Local S3-Compatible Store | Local S3-Compatible Server | Network S3-Server | Used to host S3-Compatible services locally (e.g. minio) for internal use or to lower cloud costs |
+| Cloud S3-Compatible Storage | Cloud Provider | Internet Connectivity | Used to reduce/remove requirement for external storage management, data sharing |
+| Globus Storage | Globus Endpoint | Local/Local Network Storage, Internet Connectivity | Used for institutional data transfer or publishing. |
+
+## Bulk Storage Considerations
+
+Although external bulk storage provides a variety of advantages for
+storage cost and data sharing, it also uses slightly different data
+input/retrieval semantics and as such has different performance
+characteristics.
+
+### Performance Characteristics
+
+In the direct database connection scenario, entire result sets are
+either added or retrieved from the database in a single stream
+action. In the case of external storage, individual record components
+are retrieved in a set of sequential actions per record, each one
+subject to the network round trip to the given storage medium. As
+such, tables using many small records may be ill suited to external
+storage usage in the absence of a caching mechanism. While some of
+these impacts may be addressed by code changes in a future release of
+DataJoint, to some extent, the impact is directly related from needing
+to coordinate the activities of the database data stream with the
+external storage system, and so cannot be avoided.
+
+### Network Traffic
+
+Some of the external storage solutions mentioned above incur cost both
+at a data volume and transfer bandwidth level. The number of users
+querying the database, data access, and use of caches should be
+considered in these cases to reduce this cost if applicable.
+
+### Data Coherency
+
+When storing all data directly in the relational data store, it is
+relatively easy to ensure that all data in the database is consistent
+in the event of system issues such as crash recoveries, since MySQL’s
+relational storage engine manages this for you. When using external
+storage however, it is important to ensure that any data recoveries of
+the database system are paired with a matching point-in-time of the
+external storage system. While DataJoint does use hashing to help
+facilitate a guarantee that external files are uniquely named
+throughout their lifecycle, the pairing of a given relational dataset
+against a given filesystem state is loosely coupled, and so an
+incorrect pairing could result in processing failures or other issues.