Commit 099b9f2

Airflow config + README update
1 parent 1b08b8d commit 099b9f2

11 files changed, +344 −1 lines changed

AWS_Services/README.md (+2)

@@ -22,3 +22,5 @@
 
 # AWS s3 CLI Cheat Sheet
 ![S3 CLI cheat sheet](/AWS_Services/aws-s3-cheat-sheet.png)
+
+![AWS Big Data Pipeline](images/aws_big_data_pipeline.png)
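The cheat sheet itself is an image; as a rough text companion, here are a few common `aws s3` commands of the kind such a sheet covers (the bucket and file names below are placeholders, not from the repo):

```bash
# Make a bucket (names must be globally unique)
aws s3 mb s3://my-example-bucket

# Copy a local file up to the bucket, and back down again
aws s3 cp data.csv s3://my-example-bucket/data.csv
aws s3 cp s3://my-example-bucket/data.csv ./data.csv

# List bucket contents recursively
aws s3 ls s3://my-example-bucket --recursive

# Mirror a local directory to a bucket prefix
aws s3 sync ./local-dir s3://my-example-bucket/prefix/

# Delete an object, then remove the (now empty) bucket
aws s3 rm s3://my-example-bucket/data.csv
aws s3 rb s3://my-example-bucket
```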

Airflow_CloudFormation.yaml (+267, new file)
```yaml
AWSTemplateFormatVersion: '2010-09-09'

Description: Airflow server backed by Postgres RDS

Parameters:
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access into the Airflow web server
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: Must be the name of an existing EC2 KeyPair
  S3BucketName:
    Description: REQUIRED - A new S3 bucket name. This bucket will be used to read and write the MovieLens dataset.
    Type: String
    AllowedPattern: '.+'
  DBPassword:
    Default: airflowpassword
    NoEcho: 'true'
    Description: Airflow database admin account password
    Type: String
    MinLength: '8'
    MaxLength: '41'
    AllowedPattern: '[a-zA-Z0-9]*'
    ConstraintDescription: Must contain only alphanumeric characters

# Mapping to find the Amazon Linux AMI in each region.
Mappings:
  RegionMap:
    us-east-1:
      AMI: ami-97785bed
    us-east-2:
      AMI: ami-f63b1193
    us-west-1:
      AMI: ami-824c4ee2
    us-west-2:
      AMI: ami-f2d3638a
    ca-central-1:
      AMI: ami-a954d1cd
    eu-west-1:
      AMI: ami-d834aba1
    eu-west-2:
      AMI: ami-403e2524
    eu-west-3:
      AMI: ami-8ee056f3
    eu-central-1:
      AMI: ami-5652ce39
    sa-east-1:
      AMI: ami-84175ae8
    ap-south-1:
      AMI: ami-531a4c3c
    ap-southeast-1:
      AMI: ami-68097514
    ap-southeast-2:
      AMI: ami-942dd1f6
    ap-northeast-1:
      AMI: ami-ceafcba8
    ap-northeast-2:
      AMI: ami-863090e8
Resources:
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      KeyName: !Ref 'KeyName'
      SecurityGroups: [!Ref 'AirflowEC2SecurityGroup']
      InstanceType: 'm4.xlarge'
      IamInstanceProfile:
        Ref: EC2InstanceProfile
      Tags:
        - Key: Name
          Value: Airflow
      ImageId: !FindInMap
        - RegionMap
        - !Ref 'AWS::Region'
        - AMI
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          set -x
          exec > >(tee /var/log/user-data.log|logger -t user-data ) 2>&1
          # Get the latest CloudFormation package
          echo "Installing aws-cfn"
          yum install -y aws-cfn-bootstrap
          # Start cfn-init
          /opt/aws/bin/cfn-init -v -c install --stack ${AWS::StackId} --resource EC2Instance --region ${AWS::Region}
          # Download and unzip the MovieLens dataset
          wget http://files.grouplens.org/datasets/movielens/ml-latest.zip && unzip ml-latest.zip
          # Upload the MovieLens dataset files to the S3 bucket
          aws s3 cp ml-latest s3://${S3BucketName} --recursive
          # Install git
          sudo yum install -y git
          # Clone the git repository
          git clone https://github.com/aws-samples/aws-concurrent-data-orchestration-pipeline-emr-livy.git
          sudo pip install boto3
          # Install Airflow using pip
          echo "Install Apache Airflow"
          sudo SLUGIFY_USES_TEXT_UNIDECODE=yes pip install -U apache-airflow
          # Encrypt connection passwords in the metadata DB
          sudo pip install apache-airflow[crypto]
          # Postgres operators and hook, support as an Airflow backend
          sudo pip install apache-airflow[postgres]
          sudo -H pip install six==1.10.0
          sudo pip install --upgrade six
          sudo pip install markupsafe
          sudo pip install --upgrade MarkupSafe
          echo 'export PATH=/usr/local/bin:$PATH' >> /root/.bash_profile
          source /root/.bash_profile
          # Initialize Airflow
          airflow initdb
          # Update the RDS connection in the Airflow config file
          sed -i '/sql_alchemy_conn/s/^/#/g' ~/airflow/airflow.cfg
          sed -i '/sql_alchemy_conn/ a sql_alchemy_conn = postgresql://airflow:${DBPassword}@${DBInstance.Endpoint.Address}:${DBInstance.Endpoint.Port}/airflowdb' ~/airflow/airflow.cfg
          # Update the type of executor in the Airflow config file
          sed -i '/executor = SequentialExecutor/s/^/#/g' ~/airflow/airflow.cfg
          sed -i '/executor = SequentialExecutor/ a executor = LocalExecutor' ~/airflow/airflow.cfg
          airflow initdb
          # Move all the files to the ~/airflow directory. The Airflow config file is set up to hold all the DAG-related files in the ~/airflow/ folder.
          mv aws-concurrent-data-orchestration-pipeline-emr-livy/* ~/airflow/
          # Delete the higher-level git repository directory
          rm -rf aws-concurrent-data-orchestration-pipeline-emr-livy
          # Replace the '<s3-bucket>' placeholder in each of the .scala scripts with the name of the actual S3 bucket created above.
          sed -i 's/<s3-bucket>/${S3BucketName}/g' /root/airflow/dags/transform/*
          # Run the Airflow webserver
          airflow webserver
    Metadata:
      AWS::CloudFormation::Init:
        configSets:
          install:
            - gcc
        gcc:
          packages:
            yum:
              gcc: []
    DependsOn:
      - DBInstance
      - AirflowEC2SecurityGroup
  DBInstance:
    Type: AWS::RDS::DBInstance
    DeletionPolicy: Delete
    Properties:
      DBName: airflowdb
      Engine: postgres
      MasterUsername: airflow
      MasterUserPassword: !Ref 'DBPassword'
      DBInstanceClass: db.t2.small
      AllocatedStorage: 5
      DBSecurityGroups:
        - Ref: DBSecurityGroup
  AirflowEC2SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: AirflowEC2SG
      GroupDescription: Enable HTTP access via port 80 + SSH access
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0
  AirflowEMRMasterEC2SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: AirflowEMRMasterSG
      GroupDescription: Airflow EMR Master SG
    DependsOn:
      - AirflowEC2SecurityGroup
  AirflowEMRMasterInboundRule:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: tcp
      FromPort: '8998'
      ToPort: '8998'
      SourceSecurityGroupName: !Ref 'AirflowEC2SecurityGroup'
      GroupName: !Ref 'AirflowEMRMasterEC2SecurityGroup'
  AirflowEMRSlaveEC2SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: AirflowEMRSlaveSG
      GroupDescription: Airflow EMR Slave SG
  DBSecurityGroup:
    Type: AWS::RDS::DBSecurityGroup
    Properties:
      GroupDescription: Frontend Access
      DBSecurityGroupIngress:
        EC2SecurityGroupName:
          Ref: AirflowEC2SecurityGroup
  EC2Role:
    Type: AWS::IAM::Role
    Properties:
      RoleName: AirflowInstanceRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "ec2.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess
  EC2InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: AirflowInstanceProfile
      Roles:
        - Ref: EC2Role
  EmrRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: EmrRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "elasticmapreduce.amazonaws.com"
                - "s3.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess
  EmrEc2Role:
    Type: AWS::IAM::Role
    Properties:
      RoleName: EmrEc2Role
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "ec2.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
  EmrEc2InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: EmrEc2InstanceProfile
      Roles:
        - Ref: EmrEc2Role
  S3Bucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: Retain
    Properties:
      AccessControl: BucketOwnerFullControl
      BucketName: !Ref 'S3BucketName'
Outputs:
  AirflowEC2PublicDNSName:
    Description: Public DNS Name of the Airflow EC2 instance
    Value: !Join ["", ["http://", !GetAtt EC2Instance.PublicDnsName, ":8080"]]
```
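
The guide below imports this template through the CloudFormation console. As an alternative, here is a minimal sketch of launching the same stack from the AWS CLI; the stack name and parameter values are placeholders, and the CLI is assumed to be configured with credentials and a region:

```bash
# Placeholder stack/parameter values -- substitute your own.
# CAPABILITY_NAMED_IAM is required because the template creates named IAM roles.
aws cloudformation create-stack \
  --stack-name airflow-server \
  --template-body file://Airflow_CloudFormation.yaml \
  --parameters \
      ParameterKey=KeyName,ParameterValue=airflow_key_pair \
      ParameterKey=S3BucketName,ParameterValue=my-unique-airflow-bucket \
      ParameterKey=DBPassword,ParameterValue=MySecretPass123 \
  --capabilities CAPABILITY_NAMED_IAM

# Block until the stack is ready, then print the Airflow URL from the outputs.
aws cloudformation wait stack-create-complete --stack-name airflow-server
aws cloudformation describe-stacks --stack-name airflow-server \
  --query "Stacks[0].Outputs[?OutputKey=='AirflowEC2PublicDNSName'].OutputValue" \
  --output text
```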

Airflow_Livy_Config_CloudFormation.md (+68, new file)
## Data Orchestration Pipeline Using Amazon EMR and Apache Livy
### Setting up Airflow using an AWS CloudFormation script

![Airflow_Livy_Architecture](https://github.com/AuFeld/Data_Engineering_Projects/blob/main/images/airflow_livy.png)

The script is publicly available and can be imported from https://s3.amazonaws.com/aws-bigdata-blog/artifacts/airflow.livy.emr/airflow.yaml

**This requires access to an Amazon EC2 key pair in the AWS Region where you launch your CloudFormation stack. Make sure to create a key pair in that Region first. Follow: [create-your-key-pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair)**

Steps to import:
1. Go to the AWS Console, search for the CloudFormation service, and open it.
2. Click **Create stack** and select **Template is ready**.
3. Paste the URL above into the Amazon S3 URL field.
4. This loads the template from `airflow.yaml`.
5. Click **Next**, then specify `DBPassword`, `KeyName` (the already existing key pair), and `S3BucketName` (the bucket must not already exist; a new bucket is created automatically).
6. Click **Next** -> **Next** to run the stack.

After the stack run completes successfully, go to EC2 and you will see a newly launched instance. Connect to the instance over SSH, either with PuTTY or from the command line.

[Connect to EC2 using putty](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html)

**Connect using ssh from the command line**

```bash
chmod 400 airflow_key_pair.pem
ssh -i "airflow_key_pair.pem" ec2-user@<EC2-public-dns-name>
```

After you are logged in, run:

```bash
# sudo as the root user
sudo su

export AIRFLOW_HOME=~/airflow
# Navigate to the airflow directory, which was created by the CloudFormation template (see the user-data section).
cd ~/airflow
source ~/.bash_profile
```

#### Airflow initialization and running the webserver

```bash
# Initialize the SQLite database;
# the command below picks up changes from airflow.cfg
airflow initdb
```

Open two new terminals: one to start the webserver (you can set the port as well) and the other for the scheduler.

```bash
# Run the webserver on the custom port you specify.
# MAKE SURE THIS PORT IS SPECIFIED IN YOUR SECURITY GROUP FOR INBOUND TRAFFIC.
# READ THE ARTICLE LINKED BELOW FOR MORE DETAILS.
airflow webserver --port=<your port number>

# RUN THE SCHEDULER
airflow scheduler
```
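The CloudFormation template above already opens ports 80, 8080, and 22 in the `AirflowEC2SG` security group it creates. If you pick a different webserver port, a minimal sketch of opening it with the AWS CLI (the group name comes from the template; the port value is a placeholder):

```bash
# Allow inbound TCP traffic on the custom webserver port (8081 as an example).
# For anything beyond a quick test, restrict --cidr to your own IP.
aws ec2 authorize-security-group-ingress \
  --group-name AirflowEC2SG \
  --protocol tcp \
  --port 8081 \
  --cidr 0.0.0.0/0
```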
[Authorizing Access To An Instance](https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/authorizing-access-to-an-instance.html)

#### Once the scheduler is running, you can access the Airflow UI in your browser

To see the Airflow webserver, open any browser and go to:

```
<EC2-public-dns-name>:<your-port-number>
```

REFERENCES:

[Build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy](https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/)

[Airflow Installation Steps](https://airflow.apache.org/docs/apache-airflow/stable/installation.html)

README.md (+6 −1)

@@ -27,4 +27,9 @@ Link: [Data Warehouse](https://github.com/AuFeld/Data_Engineering_Projects/tree/
 ## Project 4: Data Lake
 In this project, I will build a Data Lake on AWS using Spark and an AWS EMR cluster. The data lake will be the single source for the analytics platform, with Spark jobs performing ELT operations that pick up data from the S3 landing zone, then transform it and store it in the S3 processed zone.
 
-Link: [Data Lake](https://github.com/AuFeld/Data_Engineering_Projects/tree/main/Data_Lake)
+Link: [Data Lake](https://github.com/AuFeld/Data_Engineering_Projects/tree/main/Data_Lake)
+
+## Project 5: Data Pipelines with Airflow
+For this project, a data pipeline workflow was created with Apache Airflow. I will schedule ETL jobs and create project-related custom plugins and operators to automate the pipeline execution.
+
+Link: **Coming Soon**

images/airflow_livy.png (51.4 KB)

images/aws_big_data_pipeline.png (393 KB)

images/connections.png (84.9 KB)

images/data-pipeline.png (83.7 KB)

images/validate-on-redshift.png (86.7 KB)

images/variables.png (52.4 KB)

requirements.txt (+1)

@@ -6,3 +6,4 @@ cassandra-driver
 boto3
 pyspark
 pyspark[sql]
+apache-airflow
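
Nothing repo-specific here: as with any requirements file, the dependencies install with standard pip, ideally inside a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```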
