The following excerpt from `airflow.yaml` shows the UserData commands that configure Airflow on the EC2 instance (the `${...}` placeholders are substituted by CloudFormation when the stack is launched):

```bash
# Update the RDS connection in the Airflow config file
sed -i '/sql_alchemy_conn/s/^/#/g' ~/airflow/airflow.cfg
sed -i '/sql_alchemy_conn/ a sql_alchemy_conn = postgresql://airflow:${DBPassword}@${DBInstance.Endpoint.Address}:${DBInstance.Endpoint.Port}/airflowdb' ~/airflow/airflow.cfg

# Update the type of executor in the Airflow config file
sed -i '/executor = SequentialExecutor/s/^/#/g' ~/airflow/airflow.cfg
sed -i '/executor = SequentialExecutor/ a executor = LocalExecutor' ~/airflow/airflow.cfg

airflow initdb

# Move all the files to the ~/airflow directory. The Airflow config file is set up
# to keep all the DAG-related files in the ~/airflow/ folder.
# Replace the name of the S3 bucket in each of the .scala files. CHANGE THE HIGHLIGHTED
# PORTION BELOW TO THE NAME OF THE S3 BUCKET YOU CREATED IN STEP 1. The command below
# replaces every instance of the string '<s3-bucket>' in the scripts with the name of
# the actual bucket.
sed -i 's/<s3-bucket>/${S3BucketName}/g' /root/airflow/dags/transform/*

# Run the Airflow webserver
airflow webserver
```
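
As a quick sanity check (illustrative only, not part of the original template), the two pairs of `sed` edits above should leave the default entries commented out with the new values appended right after them:

```bash
# Illustrative check that the sed edits took effect; the exact default values
# depend on the Airflow version installed by the template.
grep -nE 'sql_alchemy_conn|executor =' ~/airflow/airflow.cfg
# Expected output, roughly:
#   #sql_alchemy_conn = sqlite:...                      <- default, now commented out
#   sql_alchemy_conn = postgresql://airflow:...@<rds-endpoint>:<port>/airflowdb
#   #executor = SequentialExecutor                      <- default, now commented out
#   executor = LocalExecutor
```
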
The same template attaches CloudFormation::Init metadata to the EC2 instance and declares the RDS database and EC2 security group that the instance depends on:

```yaml
    # (Metadata and DependsOn below belong to the Airflow EC2 instance resource)
    Metadata:
      AWS::CloudFormation::Init:
        configSets:
          install:
            - gcc
        gcc:
          packages:
            yum:
              gcc: []
    DependsOn:
      - DBInstance
      - AirflowEC2SecurityGroup

  DBInstance:
    Type: AWS::RDS::DBInstance
    DeletionPolicy: Delete
    Properties:
      DBName: airflowdb
      Engine: postgres
      MasterUsername: airflow
      MasterUserPassword: !Ref 'DBPassword'
      DBInstanceClass: db.t2.small
      AllocatedStorage: 5
      DBSecurityGroups:
        - Ref: DBSecurityGroup

  AirflowEC2SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: AirflowEC2SG
      GroupDescription: Enable HTTP access via port 80 + SSH access
```
The script is publicly available and can be imported from -> https://s3.amazonaws.com/aws-bigdata-blog/artifacts/airflow.livy.emr/airflow.yaml

**This requires access to an Amazon EC2 key pair in the AWS Region where you launch your CloudFormation stack. Please make sure to create a key pair in that Region first. Follow: [create-your-key-pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair)**

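If you prefer the CLI to the console flow in the linked guide, a key pair can also be created roughly like this (a sketch; the key name is a placeholder you choose):

```bash
# Sketch: create and save a new key pair from the AWS CLI (key name is a placeholder).
aws ec2 create-key-pair \
  --key-name airflow-blog-key \
  --query 'KeyMaterial' \
  --output text > airflow-blog-key.pem
chmod 400 airflow-blog-key.pem   # restrict permissions so ssh accepts the key file
```
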
Steps to import:
1. Go to the AWS Console -> search for the CloudFormation service and open it.
2. Click **Create stack** -> select **Template is ready**.
3. In the **Amazon S3 URL** field, paste the URL mentioned above.
4. This loads the template from `airflow.yaml`.
5. Click Next -> specify **DBPassword**, **KeyName** (the existing key pair), and **S3BucketName** (this bucket must not already exist; the stack creates it automatically).
6. Click Next -> Next to run the stack (an equivalent AWS CLI command is sketched below).

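For reference, the same stack can also be launched from the AWS CLI. This is only a sketch: the stack name is arbitrary, the parameter keys are assumed to match the ones shown in the console (DBPassword, KeyName, S3BucketName), and you may need `--capabilities` if the template creates IAM resources.

```bash
# Sketch only: launch the same template from the AWS CLI.
# <...> values are placeholders; parameter keys assumed from the console steps above.
aws cloudformation create-stack \
  --stack-name airflow-livy-emr \
  --template-url https://s3.amazonaws.com/aws-bigdata-blog/artifacts/airflow.livy.emr/airflow.yaml \
  --parameters \
      ParameterKey=DBPassword,ParameterValue=<your-db-password> \
      ParameterKey=KeyName,ParameterValue=<existing-key-pair-name> \
      ParameterKey=S3BucketName,ParameterValue=<new-unique-bucket-name>
# If creation fails on IAM resources, re-run with: --capabilities CAPABILITY_IAM
```
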
After the stack has been created successfully, go to EC2 and you will see a new instance launched. Connect to the instance over SSH, either with PuTTY or from the command line.

[Connect to EC2 using PuTTY](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html)

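From a terminal, the SSH connection looks roughly like this (the key file and hostname are placeholders; the login user depends on the AMI, e.g. `ec2-user` on Amazon Linux):

```bash
# Illustrative SSH connection; substitute your key file and the instance's public DNS.
ssh -i /path/to/<your-key-pair>.pem ec2-user@<ec2-public-dns>
```
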
In this project, I will build a data lake on AWS using Spark and an AWS EMR cluster. The data lake will be the single source for the analytics platform: Spark jobs perform ELT operations that pick up data from the S3 landing zone, then transform it and store it in the S3 processed zone.

For this project, a data pipeline workflow was created with Apache Airflow. ETL jobs are scheduled, and project-related custom plugins and operators are created to automate the pipeline execution.

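Once the webserver (and a scheduler) are running on the instance, the DAGs can also be inspected and triggered from the Airflow CLI. A small sketch using the Airflow 1.x commands that match the `airflow initdb` call above; the DAG id is a placeholder:

```bash
# Sketch (Airflow 1.x CLI, matching the `airflow initdb` used above):
airflow scheduler &            # run the scheduler alongside the webserver
airflow list_dags              # confirm the DAGs under ~/airflow/dags are picked up
airflow trigger_dag <dag_id>   # manually trigger one run of a DAG (placeholder id)
```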