- VPC (`aws_vpc.datachain_cluster`): A Virtual Private Cloud with CIDR block `172.18.0.0/16` to isolate the compute cluster.
- Subnets (`aws_subnet.datachain_cluster`): Public subnets are created in all available availability zones to ensure high availability.
- Internet Gateway (`aws_internet_gateway.datachain_cluster`): Provides internet access for resources within the VPC.
- Route Table (`aws_route_table.datachain_cluster`): Configures routing for internet-bound traffic through the Internet Gateway.
- Security Group (`aws_security_group.datachain_cluster`): Controls inbound and outbound traffic for the cluster, allowing all traffic within the VPC and outbound traffic to the internet.
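For orientation, here is a simplified sketch of how networking pieces like these are typically declared in Terraform. It is not a copy of this repository's code: the availability-zone data source, subnet CIDR math, and other details are illustrative assumptions, and the authoritative definitions are the ones in the `.tf` files here.

```hcl
# Simplified sketch; see this repository's .tf files for the real definitions.
data "aws_availability_zones" "available" {}

resource "aws_vpc" "datachain_cluster" {
  cidr_block = "172.18.0.0/16"
}

# One public subnet per availability zone for high availability.
resource "aws_subnet" "datachain_cluster" {
  count                   = length(data.aws_availability_zones.available.names)
  vpc_id                  = aws_vpc.datachain_cluster.id
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  cidr_block              = cidrsubnet(aws_vpc.datachain_cluster.cidr_block, 8, count.index)
  map_public_ip_on_launch = true
}

resource "aws_internet_gateway" "datachain_cluster" {
  vpc_id = aws_vpc.datachain_cluster.id
}

# Route internet-bound traffic through the Internet Gateway.
# (Route table associations to the subnets are omitted for brevity.)
resource "aws_route_table" "datachain_cluster" {
  vpc_id = aws_vpc.datachain_cluster.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.datachain_cluster.id
  }
}

# Allow all traffic within the VPC and all outbound traffic to the internet.
resource "aws_security_group" "datachain_cluster" {
  vpc_id = aws_vpc.datachain_cluster.id

  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [aws_vpc.datachain_cluster.cidr_block]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```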
- Cluster Role (`aws_iam_role.datachain_cluster`): Assumed by the EKS cluster to manage AWS resources.
- Node Role (`aws_iam_role.datachain_cluster_node`): Assumed by EC2 instances (worker nodes) in the cluster.
- Pod Role (`aws_iam_role.datachain_cluster_pod`): Assumed by EKS pods for accessing AWS resources.
- OIDC Compute Role (`aws_iam_role.datachain_oidc_compute`): Assumed by DataChain Studio (SaaS) to create, delete, and manage DataChain clusters.
- OIDC Storage Role (`aws_iam_role.datachain_oidc_storage`): Assumed by DataChain Studio (SaaS) to read and write object storage (S3) buckets.
- OIDC Provider (`aws_iam_openid_connect_provider.datachain_oidc`): Configures an OIDC provider for the cluster, enabling secure authentication for pods.
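For reference, the sketch below shows the general shape of this wiring: an OIDC provider plus a role whose trust policy only permits `sts:AssumeRoleWithWebIdentity` through that provider. The issuer URL, thumbprint, and role name are placeholders, not values taken from this repository.

```hcl
# Illustrative OIDC/web-identity wiring; all concrete values are placeholders.
resource "aws_iam_openid_connect_provider" "datachain_oidc" {
  # In practice the URL is the cluster's OIDC issuer URL.
  url             = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["0000000000000000000000000000000000000000"]
}

# Trust policy: the role can only be assumed via web identity federation
# through the OIDC provider declared above.
data "aws_iam_policy_document" "pod_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.datachain_oidc.arn]
    }
  }
}

resource "aws_iam_role" "datachain_cluster_pod" {
  name               = "datachain-cluster-pod" # placeholder name
  assume_role_policy = data.aws_iam_policy_document.pod_assume.json
}
```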
- VPC ID (`datachain_cluster_vpc_id`)
- Subnet IDs (`datachain_cluster_subnet_ids`)
- Security Group IDs (`datachain_cluster_security_group_ids`)
- IAM Role ARNs for the cluster, nodes, and OIDC roles.
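These are ordinary Terraform outputs, so they can be read with `terraform output` or consumed from another configuration. A minimal sketch of what such declarations might look like follows; the value expressions are assumptions based on the resource names listed above, not copies of this repository's code.

```hcl
# Output names match the list above; value expressions are illustrative.
output "datachain_cluster_vpc_id" {
  value = aws_vpc.datachain_cluster.id
}

output "datachain_cluster_subnet_ids" {
  value = aws_subnet.datachain_cluster[*].id # assumes the subnets use count
}

output "datachain_cluster_security_group_ids" {
  value = [aws_security_group.datachain_cluster.id]
}
```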
- IAM Policies: The roles are attached to AWS-managed policies to ensure least-privilege access and to restrict allowed actions to DataChain-managed resources.
- OIDC Integration: Using OIDC allows DataChain Studio to securely manage cloud resources in the target account, eliminating the need for static credentials.
- Network Isolation: The VPC and security groups ensure that the cluster is isolated from external networks, with controlled ingress and egress rules.
DataChain Studio is split into 2 main components:
- Control Plane — typically hosted by us as a fully managed service
- Compute & Data Plane — typically hosted on your cloud accounts
Compute resources will be provisioned through managed Kubernetes clusters that we automatically deploy in your account, using the permissions described in this repository.
Update the `storage_buckets` list in `variables.tf` with the list of S3 bucket names DataChain Studio Jobs should have access to, and run `terraform apply`.
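A hedged example of that change, assuming `storage_buckets` is a plain list of bucket names (check the variable's actual type and description in `variables.tf`; the bucket names below are placeholders):

```hcl
# variables.tf: bucket names are placeholders, replace with your own.
variable "storage_buckets" {
  description = "S3 buckets DataChain Studio Jobs should have access to"
  type        = list(string)
  default = [
    "my-datachain-datasets",
    "my-datachain-results",
  ]
}
```

If you prefer not to edit the default, the same value can be supplied through a `terraform.tfvars` file or a `-var` flag at apply time.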
You can securely inject sensitive configuration (such as tokens, passwords, or private URLs) into your compute jobs by referencing AWS Secrets Manager secrets through environment variables. This avoids hardcoding credentials and allows fine-grained secret management.
- Create a JSON Secret in AWS Secrets Manager

  Store your secret as a JSON object. For example:

  `{ "EXAMPLE_SECRET": "your-secret-value-or-url" }`
- Grant Access to the Secret through Terraform

  Update the `secrets` list in `variables.tf` to grant access to the created secret, and run `terraform apply` (see the Terraform sketch after this list, which also shows creating the secret itself).
- Set an Environment Variable in the Studio Job Settings

  In DataChain Studio, configure your job with an environment variable that references the secret using the special `awssecret://` syntax:

  `EXAMPLE_SECRET=awssecret://arn:aws:secretsmanager:us-east-1:000000000000:secret:example-secret/test-abcdef#EXAMPLE_SECRET`

  - Replace `arn:aws:secretsmanager:us-east-1:000000000000:secret:example-secret/test-abcdef` with the full ARN of your secret.
  - The part after the `#` (e.g., `#EXAMPLE_SECRET`) refers to the key in your JSON secret.
  - Add the full ARN of your secret to `variables.tf`.
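The sketch below shows one way to carry out the first two steps entirely in Terraform. It is illustrative only: the secret name and value are placeholders, and the `secrets` variable is assumed to be a list of secret ARNs (verify its actual shape in `variables.tf` before copying anything).

```hcl
# Step 1 (Terraform alternative to the console): create the JSON secret.
resource "aws_secretsmanager_secret" "example" {
  name = "example-secret" # placeholder name
}

resource "aws_secretsmanager_secret_version" "example" {
  secret_id     = aws_secretsmanager_secret.example.id
  secret_string = jsonencode({ EXAMPLE_SECRET = "your-secret-value-or-url" })
}

# Step 2: grant access by listing the secret's ARN in the `secrets` variable
# (assumed shape: list(string)), then run `terraform apply`.
variable "secrets" {
  description = "Secrets Manager secret ARNs DataChain Studio Jobs may read"
  type        = list(string)
  default = [
    "arn:aws:secretsmanager:us-east-1:000000000000:secret:example-secret/test-abcdef",
  ]
}
```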