I just want to serve five terabytes of data, across multiple cloud providers.
Cacheroach may be of interest to you if you need a multi-tenant, multi-region, multi-cloud, file-storage abstraction built on top of CockroachDB. It has both an HTTP and a gRPC API, a robust security model, and can be deployed within your favorite serverless or containerized runtime. For users with data-domiciling needs, Cacheroach's database schema is built to take advantage of CockroachDB's support for geo-partitioned workloads.
git clone git@github.com:bobvawter/cacheroach
cd cacheroach
# To build the cacheroach CLI tool
go install
# Launch demo stack consisting of a CRDB node and cacheroach
docker-compose -f docker-compose.quickstart.yml up
# Create a root user, writing access data to root.cfg
cacheroach -v -c root.cfg bootstrap --hmacKey @./cacheroach-data/hmac http://root@localhost:13013/
# List tenants, should show the default scratchpad
cacheroach -v -c root.cfg tenant ls
# Create a working user, this will write a new configuration file.
# You can add a --password flag to enable password-based login.
cacheroach -v -c root.cfg principal create $USER --out $HOME/.cacheroach/config
# Grant the new user read/write access to a tenant.
cacheroach -v -c root.cfg session delegate --for <PRINCIPAL> --on tenant --id <TENANT> --capabilities read,write
# Use the real user to list files
cacheroach -v file ls -t <TENANT>
# Set a default tenant id to minimize typing
cacheroach tenant default <TENANT>
# Upload a file
echo "Hello World." > hello.txt
cacheroach -v file put / hello.txt
# Look at HTTP vhost mapping
cacheroach -v -c root.cfg vhost ls
# Fetch a file using the default HTTP VHost endpoint
# Use -H "Authorization: Bearer <JWT TOKEN>" for non-public tenants.
curl -O http://127.0.0.1:13013/cacheroach
# Generate a signed retrieval URL.
cacheroach file sign /hello.txt
See the CLI docs for additional information.
A CockroachDB cluster can easily hold many terabytes of data; however, it is fundamentally designed to service "Online Transaction Processing" (OLTP) and System-of-Record (SoR) use-cases. The implementation choices necessary to provide consistent and reliable transactions at scale impose certain limitations on the queries that can be effectively processed. The chief limitations to overcome when building a file store are the maximum row size (512 MiB) and the maximum transaction size (64 MiB).
Cacheroach breaks files into content-addressable chunks of 512 KiB which are assembled into content-addressable ropes. Ropes are then referenced by a filesystem abstraction which provides the necessary metadata needed by clients. In this context, "content-addressable" means that we use the cryptographic hash of the contents of a chunk or of a rope in order to identify it.
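The chunk-and-rope layout described above can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function; the `Rope` type and `buildRope` function are hypothetical names chosen for this example and are not taken from the Cacheroach source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkSize matches the 512 KiB chunk size described above.
const chunkSize = 512 * 1024

// Hash identifies a chunk or a rope by the digest of its contents.
// SHA-256 is an assumption made for this sketch.
type Hash [sha256.Size]byte

func (h Hash) String() string { return hex.EncodeToString(h[:]) }

// Rope is a hypothetical in-memory stand-in for a rope: an ordered list
// of chunk hashes, identified by the hash of the full contents.
type Rope struct {
	ID     Hash
	Chunks []Hash
}

// buildRope cuts data into chunks of at most chunkSize bytes and
// records the hash of each chunk in order.
func buildRope(data []byte) Rope {
	r := Rope{ID: sha256.Sum256(data)}
	for len(data) > 0 {
		n := chunkSize
		if len(data) < n {
			n = len(data)
		}
		r.Chunks = append(r.Chunks, sha256.Sum256(data[:n]))
		data = data[n:]
	}
	return r
}

func main() {
	// One full chunk plus a 10-byte partial chunk.
	r := buildRope(make([]byte, chunkSize+10))
	fmt.Println("rope:", r.ID)
	for i, c := range r.Chunks {
		fmt.Printf("chunk %d: %s\n", i, c)
	}
}
```

Because identifiers are derived from contents, two uploads of the same bytes produce the same chunk and rope identifiers, which is what makes deduplication possible.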
Cacheroach forgoes the typical approach of "one request, one transaction" due to the aforementioned limits on the size of any given transaction. Rather, we use an idempotent approach to most data-storage operations. Writing the same chunk twice is effectively a no-op. We build on this when manipulating ropes or performing operations on filesystem manifests. The use of single-statement, implicit SQL transactions allows us to take advantage of CockroachDB's automatic transaction retry mechanisms in cases where there is a transactional serialization conflict.
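To make the idempotency property concrete, here is a small sketch that uses an in-memory map in place of the database. The `chunkStore` type and its `put` method are hypothetical, and SHA-256 is assumed as the content hash; the point is only that a write keyed by content hash can be safely repeated.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// chunkStore is a hypothetical in-memory stand-in for the chunk table.
type chunkStore struct {
	mu     sync.Mutex
	chunks map[[sha256.Size]byte][]byte
}

func newChunkStore() *chunkStore {
	return &chunkStore{chunks: make(map[[sha256.Size]byte][]byte)}
}

// put stores data under its content hash. It reports whether a new
// entry was created; a repeated write of the same bytes changes nothing.
func (s *chunkStore) put(data []byte) (key [sha256.Size]byte, created bool) {
	key = sha256.Sum256(data)
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.chunks[key]; ok {
		return key, false
	}
	s.chunks[key] = append([]byte(nil), data...)
	return key, true
}

func main() {
	s := newChunkStore()
	_, first := s.put([]byte("Hello World."))
	_, second := s.put([]byte("Hello World."))
	fmt.Println(first, second, len(s.chunks))
}
```

A client that is unsure whether an earlier upload succeeded can simply send the chunk again, since the second write has no additional effect.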
One extra layer of data organization is applied to chunks, ropes, and files: tenancy. This allows a single Cacheroach deployment to serve multiple use-cases. The database schema also allows CockroachDB zone configurations to be applied to better control where any given tenant's data lives.
Cacheroach uses a "capability, delegate, target" approach to authorization. A Principal may have zero or more durable Sessions which grant the principal the permission to perform operations within the system.
Automatic principal provisioning can be enabled through OIDC integration. Cacheroach will request OIDC credentials with an offline scope. Principals are periodically re-validated using the OIDC refresh token. A whitelist of email domains is provided as part of cacheroach's configuration to limit access to specified users.
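The email-domain restriction amounts to an exact match on the domain part of the address returned by the OIDC provider. The following is a minimal sketch of such a check; `domainAllowed` is a hypothetical helper, not a function from the Cacheroach code base.

```go
package main

import (
	"fmt"
	"strings"
)

// domainAllowed reports whether the domain part of email exactly
// matches one of the allowed domains, ignoring case.
func domainAllowed(email string, allowed []string) bool {
	at := strings.LastIndex(email, "@")
	if at <= 0 || at == len(email)-1 {
		return false // No local part or no domain part.
	}
	domain := strings.ToLower(email[at+1:])
	for _, d := range allowed {
		if domain == strings.ToLower(d) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"yourcompany.com"}
	fmt.Println(domainAllowed("alice@yourcompany.com", allowed))
	fmt.Println(domainAllowed("mallory@yourcompany.com.example.net", allowed))
}
```

Comparing the whole domain, rather than checking for a suffix, keeps look-alike domains such as `yourcompany.com.example.net` from matching.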
Sessions are exposed as signed JWT tokens. Active sessions are maintained in a table to facilitate occasional invalidation checks.
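As an illustration of a signed session token combined with a server-side invalidation check, the sketch below builds and verifies an HS256 JWT using only the Go standard library. The claim layout, the `active` map standing in for the sessions table, and the choice of HS256 are all assumptions made for this example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// claims is a hypothetical, minimal claim set for a session token.
type claims struct {
	Subject string `json:"sub"` // Principal identifier.
	ID      string `json:"jti"` // Session identifier.
}

var b64 = base64.RawURLEncoding

// header is the fixed JOSE header used by this sketch.
var header = b64.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))

func mac(key []byte, msg string) []byte {
	h := hmac.New(sha256.New, key)
	h.Write([]byte(msg))
	return h.Sum(nil)
}

// sign encodes the claims and appends an HMAC-SHA256 signature.
func sign(key []byte, c claims) (string, error) {
	body, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	unsigned := header + "." + b64.EncodeToString(body)
	return unsigned + "." + b64.EncodeToString(mac(key, unsigned)), nil
}

// verify checks the header and signature, then consults the set of
// active session IDs so that a revoked session is rejected.
func verify(key []byte, token string, active map[string]bool) (claims, error) {
	var c claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 || parts[0] != header {
		return c, errors.New("malformed token")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil || !hmac.Equal(sig, mac(key, parts[0]+"."+parts[1])) {
		return c, errors.New("bad signature")
	}
	body, err := b64.DecodeString(parts[1])
	if err != nil {
		return c, err
	}
	if err := json.Unmarshal(body, &c); err != nil {
		return c, err
	}
	if !active[c.ID] {
		return c, errors.New("session is not active")
	}
	return c, nil
}

func main() {
	key := []byte("example-key-not-for-production")
	token, _ := sign(key, claims{Subject: "principal-1", ID: "session-1"})
	c, err := verify(key, token, map[string]bool{"session-1": true})
	fmt.Println(c.Subject, err)
}
```

The signature alone proves the token was issued by the server; the lookup against active sessions is what allows a token to be invalidated before it would otherwise stop being accepted.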
The API surface area uses a declarative model to implement ACL checks in a centralized manner. All access checks will have been performed by the time an RPC method has been invoked. The return values are also checked and elided. An RPC call will be rejected if a client "says" something that it's not allowed to "say," and it cannot "hear" anything that it could not "say" later.
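One way to picture a declarative, centralized access check is a rule table that maps each method to the capability it requires, consulted before the method body runs. The sketch below is a simplified illustration only: the method names, the `session` shape, and the `authorize` function are hypothetical and do not reflect Cacheroach's actual rule format.

```go
package main

import (
	"errors"
	"fmt"
)

// capability is a bit set of permitted operations.
type capability int

const (
	capRead capability = 1 << iota
	capWrite
)

// target names the thing a session applies to, such as a tenant.
type target struct {
	Kind string
	ID   string
}

// session grants a principal some capabilities on a target.
type session struct {
	Principal    string
	Capabilities capability
	Target       target
}

// rules declares which capability each method requires.
// The method names are hypothetical.
var rules = map[string]capability{
	"File.List": capRead,
	"File.Put":  capWrite,
}

var errDenied = errors.New("permission denied")

// authorize is called before a method is invoked. Methods without a
// declared rule are denied rather than allowed by default.
func authorize(method string, on target, sessions []session) error {
	need, ok := rules[method]
	if !ok {
		return errDenied
	}
	for _, s := range sessions {
		if s.Target == on && s.Capabilities&need == need {
			return nil
		}
	}
	return errDenied
}

func main() {
	tenant := target{Kind: "tenant", ID: "tenant-a"}
	readOnly := []session{{Principal: "alice", Capabilities: capRead, Target: tenant}}
	fmt.Println(authorize("File.List", tenant, readOnly))
	fmt.Println(authorize("File.Put", tenant, readOnly))
}
```

Keeping the rules in one table means that individual method implementations never need to contain their own permission logic.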
Cacheroach supports using an OIDC provider that provides a discovery URL. Pass an OAuth client id, secret, and the OIDC discovery URL to the cacheroach server to enable automatic principal provisioning. An email-domain principal can provide automatic access delegation to newly-created principals.
Using your preferred OIDC provider, create a new OAuth web-server
integration. Here's how you'd do it using Google Accounts, but any OIDC
implementation ought to work. You will need the discovery URL, the
OAuth2 client id and secret, and a list of user email domains for which you
want to automatically provision accounts. You'll also need to configure an
OAuth redirect URL, such as https://your.cacheroach.server/_/oidc/receive
cacheroach start \
  --oidcIssuer https://accounts.google.com \
  --oidcDomains yourcompany.com \
  --oidcClientID xyzzy \
  --oidcClientSecret soupOrSecret \
  ....
On the client side, run the command
cacheroach auth login https://your.cacheroach.server/
and you'll be prompted with a URL to open in a browser to authenticate with
the OIDC provider. If all goes well, your browser will connect to the running
cacheroach instance, which redirects it to the OIDC provider. The provider
will authenticate your browser and redirect back to the cacheroach server to
complete the handoff. The cacheroach server will then redirect you to an
ephemeral webserver started by the local cacheroach CLI process to transfer
the cacheroach session data.
Using a superuser token, you can create a cacheroach principal that represents
all users with a given OIDC domain by running the following command:
cacheroach principal create --emailDomain yourcompany.com "Your Company"
Any grants provided to the domain principal will be inherited by
OIDC-provisioned principals.
Cacheroach can generate durable signed URLs that allow an otherwise-unauthenticated client to retrieve a file through Cacheroach's HTTP endpoint.
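The general idea behind a signed retrieval URL can be sketched as an HMAC over the file path, carried in a query parameter that the server recomputes on each request. This is an illustration of the technique only: the `sig` parameter name, the use of HMAC-SHA256, and the `signURL` and `verifyURL` functions are assumptions, and the URLs produced by `cacheroach file sign` may look quite different.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"net/url"
)

// pathMAC computes an HMAC-SHA256 over the file path.
func pathMAC(key []byte, path string) []byte {
	h := hmac.New(sha256.New, key)
	h.Write([]byte(path))
	return h.Sum(nil)
}

// signURL returns base joined with path, plus a hypothetical "sig"
// query parameter that authenticates the path.
func signURL(key []byte, base, path string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = path
	q := url.Values{}
	q.Set("sig", base64.RawURLEncoding.EncodeToString(pathMAC(key, path)))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// verifyURL recomputes the signature for the requested path and
// compares it with the one supplied, in constant time.
func verifyURL(key []byte, raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	got, err := base64.RawURLEncoding.DecodeString(u.Query().Get("sig"))
	if err != nil {
		return false
	}
	return hmac.Equal(got, pathMAC(key, u.Path))
}

func main() {
	key := []byte("example-key-not-for-production")
	signed, _ := signURL(key, "http://127.0.0.1:13013", "/hello.txt")
	fmt.Println(signed)
	fmt.Println(verifyURL(key, signed))
}
```

Anyone holding the URL can fetch that one path without further credentials, so a signed URL should be shared as carefully as the file itself.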
A tenant's filesystem can be bound to a virtual hostname and served over a regular HTTP endpoint.