DE최우형 - W5M2 #273

Open
wants to merge 51 commits into
base: DE최우형_W5
51 commits
4d37dbb
#[W2M2] : init
dn7638 Jul 8, 2024
b170433
W1 : squash commit
dn7638 Jul 10, 2024
0669677
W2 : squash commit
dn7638 Jul 10, 2024
68b9b2f
W2 : squash commit
dn7638 Jul 10, 2024
76bb203
W1 : refactor for building docker image
dn7638 Jul 10, 2024
8e4f139
W1 : refactor for making docker image
dn7638 Jul 10, 2024
615ec68
W1 : refactor for making docker image
dn7638 Jul 10, 2024
a915ed8
W1 : add docker script
dn7638 Jul 11, 2024
59fc0f2
W2 : add files for build docker image
dn7638 Jul 11, 2024
3277a79
W2 : try to make correct Dockerfile
dn7638 Jul 11, 2024
0dbf58d
W2 : Dockerfile for amazone linux 2
dn7638 Jul 11, 2024
492d952
resolve merge conflict M1<-M2
dn7638 Jul 11, 2024
9a534c2
W1M3
dn7638 Jul 11, 2024
9c4953c
W2 : update W2 README.md
dn7638 Jul 12, 2024
79d1ef2
W1M3 : apply review
dn7638 Jul 14, 2024
f659b05
W1M3 : apply review
dn7638 Jul 14, 2024
3408ac7
W2M1_4 : add explanation
dn7638 Jul 14, 2024
53846c8
Merge branch 'W1M3' into W2/main
dn7638 Jul 14, 2024
07fde9a
W2 : add explation
dn7638 Jul 14, 2024
0c3ff6d
W3_main : init for W3 missions
dn7638 Jul 15, 2024
335f174
W3M2 : init
dn7638 Jul 17, 2024
30da9b2
W3M2 : add Dockerfile.datanode
dn7638 Jul 17, 2024
dea5272
W3M2 : add Dockerfile.namenode
dn7638 Jul 17, 2024
9f9684f
W3M2 : add hadoop config files
dn7638 Jul 17, 2024
e38f253
W3M2 : add docker-compose and shell script for datanode
dn7638 Jul 17, 2024
c1727cf
W3M2 : add config file
dn7638 Jul 18, 2024
ab990cc
W3M2 : add start_script for hadoop services
dn7638 Jul 18, 2024
bbdeca2
W3M2 : add Dockerfiles
dn7638 Jul 18, 2024
bd94179
W3M2 : add script for scp script
dn7638 Jul 18, 2024
9064161
W3M2 : add script for building docker image and runging compose
dn7638 Jul 18, 2024
900a66a
W3M2 : feat modify, verification, mapreduce script
dn7638 Jul 20, 2024
b396fe3
W3M2 : update readme.md
dn7638 Jul 21, 2024
1f036dc
W3M2 : update readme.md
dn7638 Jul 21, 2024
e2c2f87
main : update gitignore
dn7638 Jul 21, 2024
33a9307
W4MAIN : init
dn7638 Jul 23, 2024
05e728c
W4MAIN : init
dn7638 Jul 23, 2024
c19db8a
W4M1 : init
dn7638 Jul 23, 2024
c1ba33c
W4M1 : add spark test script
dn7638 Jul 23, 2024
ff3f70c
W4M1 : add README.md
dn7638 Jul 23, 2024
c772ad6
W4M1 : trivial change
dn7638 Jul 29, 2024
8e89e55
W5M1 : init
dn7638 Jul 29, 2024
ad0c61c
Merge branch 'W3M2' into W5M1
dn7638 Aug 1, 2024
eed9d4c
Merge branch 'W2/main' into W5M1
dn7638 Aug 11, 2024
4503783
W5M1 : delete dir
dn7638 Aug 19, 2024
102b00f
W5M2 : Week 5 mission 2
dn7638 Aug 19, 2024
270d976
add nginx.conf
Sep 3, 2024
67e50e1
update nginx.conf
Sep 3, 2024
424dc09
update docker compose
Sep 3, 2024
572f5c9
update yarn hostname
Sep 3, 2024
bde3afa
add yarn cluster manager
Sep 3, 2024
dbf2736
df, rdd test
Sep 4, 2024
16 changes: 16 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -158,3 +158,19 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

.DS_Store
._.DS_Store
**/.DS_Store
**/._.DS_Store

.viscid
vscode
.vscode/
vscode/


.untracked
._untracked
**/untracked
**/._untracked
23 changes: 23 additions & 0 deletions missions/W1/README.md
@@ -0,0 +1,23 @@
## 1. mtcars

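# Learning keywords
* jupyter
* matplotlib
* pandas

# Learning materials
* mtcars.csv
* Additional information
* Difficulty


# Learning objectives
[
- Learn how to analyze the mtcars dataset and plot graphs.
- Learn the basics of using Jupyter notebook.
- Learn the basics of using the pandas DataFrame.
- Learn how to plot graphs with matplotlib.
]

# Learning content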
...
2 changes: 2 additions & 0 deletions missions/W1/etl/.dockerignore
@@ -0,0 +1,2 @@
# .dockerignore
.venv
1 change: 1 addition & 0 deletions missions/W1/etl/Countries_by_GDP.json
@@ -0,0 +1 @@
[{"0":"World","1":"109,529,216"},{"0":"United States","1":"28,781,083"},{"0":"China","1":"18,532,633"},{"0":"Germany","1":"4,591,100"},{"0":"Japan","1":"4,110,452"},{"0":"India","1":"3,937,011"},{"0":"United Kingdom","1":"3,495,261"},{"0":"France","1":"3,130,014"},{"0":"Brazil","1":"2,331,391"},{"0":"Italy","1":"2,328,028"},{"0":"Canada","1":"2,242,182"},{"0":"Russia","1":"2,056,844"},{"0":"Mexico","1":"2,017,025"},{"0":"Australia","1":"1,790,348"},{"0":"South Korea","1":"1,760,947"},{"0":"Spain","1":"1,647,114"},{"0":"Indonesia","1":"1,475,690"},{"0":"Netherlands","1":"1,142,513"},{"0":"Turkey","1":"1,113,561"},{"0":"Saudi Arabia","1":"1,106,015"},{"0":"Switzerland","1":"938,458"},{"0":"Poland","1":"844,623"},{"0":"Taiwan","1":"802,958"},{"0":"Belgium","1":"655,192"},{"0":"Sweden","1":"623,048"},{"0":"Argentina","1":"604,260"},{"0":"Ireland","1":"564,020"},{"0":"Thailand","1":"548,890"},{"0":"Austria","1":"540,887"},{"0":"Israel","1":"530,664"},{"0":"United Arab Emirates","1":"527,796"},{"0":"Norway","1":"526,951"},{"0":"Singapore","1":"525,228"},{"0":"Philippines","1":"471,516"},{"0":"Vietnam","1":"465,814"},{"0":"Iran","1":"464,181"},{"0":"Bangladesh","1":"455,162"},{"0":"Malaysia","1":"445,519"},{"0":"Denmark","1":"409,989"},{"0":"Hong Kong","1":"406,775"},{"0":"Colombia","1":"386,076"},{"0":"South Africa","1":"373,233"},{"0":"Romania","1":"369,971"},{"0":"Egypt","1":"347,594"},{"0":"Pakistan","1":"338,237"},{"0":"Chile","1":"333,760"},{"0":"Czech Republic","1":"325,880"},{"0":"Finland","1":"308,055"},{"0":"Portugal","1":"298,949"},{"0":"Kazakhstan","1":"296,740"},{"0":"Peru","1":"282,458"},{"0":"Algeria","1":"266,780"},{"0":"Iraq","1":"265,894"},{"0":"New 
Zealand","1":"257,625"},{"0":"Nigeria","1":"252,738"},{"0":"Greece","1":"250,276"},{"0":"Qatar","1":"244,686"},{"0":"Hungary","1":"223,413"},{"0":"Ethiopia","1":"205,130"},{"0":"Ukraine","1":"188,943"},{"0":"Kuwait","1":"160,397"},{"0":"Morocco","1":"152,377"},{"0":"Cuba","1":"-1"},{"0":"Slovakia","1":"140,808"},{"0":"Dominican Republic","1":"127,356"},{"0":"Ecuador","1":"121,592"},{"0":"Puerto Rico","1":"117,763"},{"0":"Guatemala","1":"110,035"},{"0":"Oman","1":"108,927"},{"0":"Bulgaria","1":"107,933"},{"0":"Kenya","1":"104,001"},{"0":"Venezuela","1":"102,328"},{"0":"Uzbekistan","1":"97,956"},{"0":"Costa Rica","1":"96,058"},{"0":"Angola","1":"92,123"},{"0":"Luxembourg","1":"88,556"},{"0":"Croatia","1":"88,076"},{"0":"Panama","1":"87,347"},{"0":"Ivory Coast","1":"86,911"},{"0":"Uruguay","1":"82,605"},{"0":"Turkmenistan","1":"81,896"},{"0":"Serbia","1":"81,873"},{"0":"Lithuania","1":"81,170"},{"0":"Tanzania","1":"79,605"},{"0":"Azerbaijan","1":"78,749"},{"0":"Ghana","1":"75,244"},{"0":"Sri Lanka","1":"74,846"},{"0":"DR Congo","1":"73,761"},{"0":"Slovenia","1":"72,101"},{"0":"Belarus","1":"69,048"},{"0":"Myanmar","1":"68,006"},{"0":"Uganda","1":"56,310"},{"0":"Tunisia","1":"54,708"},{"0":"Macau","1":"54,677"},{"0":"Jordan","1":"53,570"},{"0":"Cameroon","1":"53,205"},{"0":"Bolivia","1":"49,334"},{"0":"Libya","1":"48,221"},{"0":"Bahrain","1":"46,790"},{"0":"Paraguay","1":"45,817"},{"0":"Latvia","1":"45,466"},{"0":"Cambodia","1":"45,150"},{"0":"Nepal","1":"44,179"},{"0":"Estonia","1":"43,486"},{"0":"Honduras","1":"37,355"},{"0":"Senegal","1":"35,450"},{"0":"El Salvador","1":"35,333"},{"0":"Zimbabwe","1":"34,405"},{"0":"Cyprus","1":"34,221"},{"0":"Iceland","1":"33,338"},{"0":"Georgia","1":"32,865"},{"0":"Papua New Guinea","1":"31,716"},{"0":"Zambia","1":"29,872"},{"0":"Bosnia and Herzegovina","1":"29,078"},{"0":"Trinidad and 
Tobago","1":"28,365"},{"0":"Sudan","1":"26,865"},{"0":"Guinea","1":"25,447"},{"0":"Albania","1":"25,431"},{"0":"Armenia","1":"25,408"},{"0":"Haiti","1":"24,046"},{"0":"Mozambique","1":"22,975"},{"0":"Malta","1":"22,737"},{"0":"Mongolia","1":"21,943"},{"0":"Burkina Faso","1":"21,902"},{"0":"Lebanon","1":"21,780"},{"0":"Mali","1":"21,662"},{"0":"Botswana","1":"21,418"},{"0":"Benin","1":"21,371"},{"0":"Guyana","1":"21,178"},{"0":"Gabon","1":"21,013"},{"0":"Jamaica","1":"20,098"},{"0":"Nicaragua","1":"18,829"},{"0":"Niger","1":"18,816"},{"0":"Chad","1":"18,697"},{"0":"Palestine","1":"18,602"},{"0":"Moldova","1":"18,356"},{"0":"Yemen","1":"16,940"},{"0":"Madagascar","1":"16,465"},{"0":"Mauritius","1":"16,359"},{"0":"North Macedonia","1":"15,873"},{"0":"Brunei","1":"15,510"},{"0":"Congo","1":"15,501"},{"0":"Laos","1":"15,190"},{"0":"North Korea","1":"-1"},{"0":"Afghanistan","1":"14,467"},{"0":"Bahamas","1":"14,390"},{"0":"Rwanda","1":"13,701"},{"0":"Kyrgyzstan","1":"13,599"},{"0":"Tajikistan","1":"12,953"},{"0":"Somalia","1":"12,804"},{"0":"Namibia","1":"12,765"},{"0":"Kosovo","1":"11,318"},{"0":"Malawi","1":"11,241"},{"0":"Equatorial Guinea","1":"10,708"},{"0":"Mauritania","1":"10,628"},{"0":"Togo","1":"9,832"},{"0":"New Caledonia","1":"-1"},{"0":"Syria","1":"-1"},{"0":"Monaco","1":"-1"},{"0":"Montenegro","1":"8,010"},{"0":"Liechtenstein","1":"-1"},{"0":"Bermuda","1":"-1"},{"0":"Maldives","1":"7,199"},{"0":"Barbados","1":"6,863"},{"0":"Cayman Islands","1":"-1"},{"0":"South Sudan","1":"6,517"},{"0":"French Polynesia","1":"-1"},{"0":"Fiji","1":"5,801"},{"0":"Eswatini","1":"5,085"},{"0":"Liberia","1":"4,754"},{"0":"Sierra Leone","1":"4,558"},{"0":"Djibouti","1":"4,364"},{"0":"Suriname","1":"4,337"},{"0":"Aruba","1":"4,069"},{"0":"Andorra","1":"3,897"},{"0":"Belize","1":"3,296"},{"0":"Greenland","1":"-1"},{"0":"Bhutan","1":"3,110"},{"0":"Cura\u00e7ao","1":"-1"},{"0":"Burundi","1":"3,075"},{"0":"Central African Republic","1":"2,810"},{"0":"Cape 
Verde","1":"2,718"},{"0":"Gambia","1":"2,694"},{"0":"Saint Lucia","1":"2,582"},{"0":"Lesotho","1":"2,395"},{"0":"Eritrea","1":"-1"},{"0":"Zanzibar","1":"-1"},{"0":"Seychelles","1":"2,203"},{"0":"Guinea-Bissau","1":"2,151"},{"0":"Antigua and Barbuda","1":"2,127"},{"0":"San Marino","1":"2,033"},{"0":"East Timor","1":"1,992"},{"0":"Solomon Islands","1":"1,707"},{"0":"Sint Maarten","1":"-1"},{"0":"Comoros","1":"1,422"},{"0":"Grenada","1":"1,406"},{"0":"Vanuatu","1":"1,289"},{"0":"Turks and Caicos Islands","1":"-1"},{"0":"Saint Kitts and Nevis","1":"1,134"},{"0":"Saint Vincent and the Grenadines","1":"1,128"},{"0":"Samoa","1":"1,024"},{"0":"S\u00e3o Tom\u00e9 and Pr\u00edncipe","1":"751"},{"0":"Dominica","1":"708"},{"0":"Tonga","1":"581"},{"0":"Micronesia","1":"484"},{"0":"Kiribati","1":"311"},{"0":"Palau","1":"308"},{"0":"Marshall Islands","1":"305"},{"0":"Nauru","1":"161"},{"0":"Tuvalu","1":"66"}]
14 changes: 14 additions & 0 deletions missions/W1/etl/Dockerfile
@@ -0,0 +1,14 @@
# Use Python 3.9 as the base image
FROM python:3.9

# Set the working directory
WORKDIR /etl

# Copy the files in the current directory into the container's working directory
COPY . /etl

# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt

# Run the Python script
CMD ["python", "etl_project_gdp_with_sql.py"]
14 changes: 14 additions & 0 deletions missions/W1/etl/README.md
@@ -0,0 +1,14 @@
# ETL learning

# Data flow
* Wiki -- extract table data --> json | scraping
* json -- load as pandas object (in RAM) --> print to console

# Advanced data flow
* Wiki -- extract table data --> json | scraping
* json -- transform to DB --> db
* db -- load as pandas object (in RAM) --> print to console

* The json file serves as temporary storage.
* For reference, see the 'The Solution' process in the Week 1 script.
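
The "json -- transform to DB --> db" and "db -- load" steps could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the script in this PR: it uses only the standard library (`json`, `sqlite3`) in place of pandas, an in-memory database, and a hypothetical table `Countries_by_GDP` with hypothetical column names. The two sample records are copied from `Countries_by_GDP.json`.

```python
import json
import sqlite3

# Two records in the same shape as Countries_by_GDP.json
# (GDP in millions of USD, stored as comma-separated strings).
raw_json = '[{"0":"Germany","1":"4,591,100"},{"0":"Cuba","1":"-1"}]'


def json_to_db(conn: sqlite3.Connection, text: str) -> None:
    """Transform step: parse the JSON text and store it in a table."""
    records = json.loads(text)
    # Table and column names are assumptions made for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Countries_by_GDP "
        "(Country TEXT PRIMARY KEY, GDP_USD_billion REAL)"
    )
    rows = [
        (rec["0"], round(float(rec["1"].replace(",", "")) / 1000, 2))
        for rec in records
        if rec["1"] != "-1"  # -1 is the placeholder for a missing GDP
    ]
    conn.executemany("INSERT OR REPLACE INTO Countries_by_GDP VALUES (?, ?)", rows)
    conn.commit()


def load_from_db(conn: sqlite3.Connection) -> list:
    """Load step: read the table back, largest GDP first."""
    cur = conn.execute(
        "SELECT Country, GDP_USD_billion FROM Countries_by_GDP "
        "ORDER BY GDP_USD_billion DESC"
    )
    return cur.fetchall()


# Assumption: an in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
json_to_db(conn, raw_json)
result = load_from_db(conn)
print(result)
```

Keeping the transform and load steps as separate functions mirrors the E/T/L stages listed above, so each stage can be logged or tested on its own.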

Binary file added missions/W1/etl/World_Economies.db
Binary file not shown.
203 changes: 203 additions & 0 deletions missions/W1/etl/etl_project_gdp.py
@@ -0,0 +1,203 @@
from bs4 import BeautifulSoup
import pandas as pd
import requests
import datetime


"""
Scrape GDP by country from the wiki and return it as a list
in the form [[nation, GDP], [nation, GDP], ...]
The first row is [all nations, total GDP]
"""
def scroll_wiki() -> list:
    # Initialize the list that stores the data
    table_data = []
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
    response = requests.get(url)

    # Check whether the request succeeded
    if response.status_code == 200:

        # Parse the HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find_all('table')
        tbody = table[2].find('tbody')
        rows = tbody.find_all('tr')

        # Extract each row's data as a list
        for row in rows:
            cells = row.find_all('td')[:2]
            if not cells:
                continue
            cell_values = [cell.get_text(strip=True) for cell in cells]
            # A dash (U+2014) marks a missing GDP value; store '-1' instead
            if cell_values[1] == '\u2014':
                cell_values[1] = '-1'
            table_data.append(cell_values)

    # Print a message if the request failed
    else:
        print("Failed to retrieve the web page")

    return table_data


"""
Take list data as input and export it to a json file
"""
def list_to_json(_list) -> None:
    df = pd.DataFrame(_list)
    json_data = df.to_json(orient='records')

    with open('Countries_by_GDP.json', 'w') as f:
        f.write(json_data)


def open_json() -> pd.DataFrame:
    json_path = './Countries_by_GDP.json'
    df = pd.read_json(json_path)

    return df


"""
Arrange the content loaded as a DataFrame to fit the requirements and print it
Print the nations whose GDP exceeds 100B USD
"""
def analyze_1(df):
    # Rename the columns
    df.rename(columns={0: 'Nation', 1: 'GDP'}, inplace=True)

    # Extract rows with a GDP of at least 100B USD. Values are strings in
    # millions of USD, so 7 or more characters (e.g. '100,000') means >= 100B.
    filtered_df = df[df['GDP'].str.len()>=7][['Nation','GDP']]
    # Drop row 0, the world total
    filtered_df = filtered_df.drop(0).reset_index(drop=True)

    # Convert the strings to float for analysis
    filtered_df['GDP'] = filtered_df['GDP'].str.replace(',', '').astype(float)
    filtered_df['GDP'] = (filtered_df['GDP'] / 1000).round(2)

    # Print the analysis result
    print('\n[Nations with GDP exceeding 100B USD (Unit:Billion $)]')
    print(filtered_df)



"""
Arrange the content loaded as a DataFrame to fit the requirements and print it
Print the average GDP of the top 5 nations in each continent
"""
def analyze_2(df):
    # Prepared nation-to-continent mapping, "region.csv"
    df_nation_conti = pd.read_csv('region.csv')

    # Rename the columns
    df.rename(columns={0: 'Nation', 1: 'GDP'}, inplace=True)

    # Drop row 0 (the world total)
    df = df.drop(0).reset_index(drop=True)

    # Convert the strings to float for analysis
    df['GDP'] = df['GDP'].str.replace(',', '').astype(float)
    df['GDP'] = (df['GDP'] / 1000).round(2)

    # Exclude rows whose GDP is not positive (missing values were stored as -1)
    df = df[df['GDP'] > 0]

    # Left-join df with df_nation_conti for the analysis
    df_joined = pd.merge(df, df_nation_conti, on='Nation', how='left')

    # Filter out nations that have no continent information
    df_joined['Continent'] = df_joined['Continent'].fillna('Unknown')
    df_joined = df_joined[df_joined['Continent'] != 'Unknown']

    # Take the top 5 nations by GDP in each continent and average them
    top_5_gdp_per_continent = df_joined.sort_values(['Continent', 'GDP'], ascending=[True, False]).groupby('Continent').head(5)
    average_gdp_per_continent = top_5_gdp_per_continent.groupby('Continent')['GDP'].mean().round(2)

    # Print the average GDP of the top 5 nations in each continent
    print('\n[Top 5 GDP averages in each region (Unit:Billion $)]')
    print(average_gdp_per_continent)
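
Review note: the sort / `groupby` / `head(5)` / `mean` chain in `analyze_2` amounts to "group by continent, keep the largest five values, average them". The sketch below spells out that logic with the standard library only, as an illustration and not a replacement for the pandas code. `top_n_average` is a hypothetical helper name, and the nation names, continent labels, and figures are made-up sample values.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def top_n_average(rows: List[Tuple[str, str, float]], n: int = 5) -> Dict[str, float]:
    """Average the n largest GDP values within each continent."""
    by_continent = defaultdict(list)
    for _nation, continent, gdp in rows:
        by_continent[continent].append(gdp)
    # Sort each group in descending order, keep the first n, then average.
    return {
        continent: round(mean(sorted(values, reverse=True)[:n]), 2)
        for continent, values in sorted(by_continent.items())
    }


# Made-up sample rows: (nation, continent, GDP in billions of USD)
rows = [
    ("a", "X", 10.0),
    ("b", "X", 6.0),
    ("c", "X", 2.0),
    ("d", "Y", 8.0),
    ("e", "Y", 4.0),
]
averages = top_n_average(rows, n=2)
print(averages)
```

With `n=2`, only the two largest values per continent enter the average, so nation "c" does not affect the result for "X".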


"""
* Deprecated *
Build a dictionary from the file that holds nation and continent information
Build a second dictionary from the distinct continent values
{key = continent : value = list of GDP values}
"""
def trans_region_data() -> tuple[dict, dict]:
    # Path to the text file
    txt_file = 'region.csv'

    # Create an empty dictionary and an empty set
    nation_continent_dict = dict()
    continent_set = set()

    # Read the text file
    with open(txt_file, mode='r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()  # Strip the newline character
            if line:
                nation, continent = line.split(',')
                nation_continent_dict[nation] = continent
                continent_set.add(continent)

    continent_GDP_dict = {item : [] for item in continent_set}

    return nation_continent_dict, continent_GDP_dict


"""
Return the current time as a suitably formatted string, using the datetime library
"""
def get_cur_time():
    # Get the current time
    now = datetime.datetime.now()
    # Convert it to the desired format
    formatted_now = now.strftime('%Y-%B-%d-%H-%M-%S')
    return formatted_now

"""
Write the current time and the given message to the log file
"""
def log(process : str) -> None:
    filename = 'etl_project_log.txt'
    cur_time = get_cur_time()
    log_string = ','.join([cur_time, process])

    with open(filename, 'a+') as file:
        # Append the log line to the file
        file.write(log_string + '\n')


###############################
## wiki -- scraping --> json ##
###############################

#E : start extract
log('E : start extract')
gdp_list = scroll_wiki()

#E : end extract
log('E : end extract')

#T : start transform (list -> json)
log('T : start transform (list -> json)')
list_to_json(gdp_list)

#T : end transform (list -> json)
log('T : end transform (list -> json)')

######################################
## json -- open, dataframe -> print ##
######################################

#L : start load
log('L : start load')
nation_gdp_df = open_json()

#L : end load
log('L : end load')

# analyze : there are two methods of analysis
analyze_1(nation_gdp_df)
analyze_2(nation_gdp_df)
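
Review note: `analyze_1` selects the 100B USD rows by string length (`str.len()>=7`), which works only because the values are comma-formatted millions of USD. A numeric comparison states the same threshold more directly. The sketch below is a standard-library illustration under that assumption, not the code in this PR; `parse_gdp_millions` and `over_100b` are hypothetical helper names, and the sample rows are taken from `Countries_by_GDP.json`.

```python
from typing import List, Optional, Tuple


def parse_gdp_millions(text: str) -> Optional[float]:
    """Convert a string such as '4,591,100' to a float; return None for the '-1' placeholder."""
    value = float(text.replace(",", ""))
    return None if value < 0 else value


def over_100b(rows: List[List[str]]) -> List[Tuple[str, float]]:
    """Return (nation, GDP in billions of USD) for rows with a GDP of at least 100B USD."""
    selected = []
    for nation, gdp_text in rows:
        gdp = parse_gdp_millions(gdp_text)
        if gdp is not None and gdp >= 100_000:  # 100,000 million USD = 100B USD
            selected.append((nation, round(gdp / 1000, 2)))
    return selected


sample = [
    ["Germany", "4,591,100"],
    ["Kenya", "104,001"],
    ["Uzbekistan", "97,956"],
    ["Cuba", "-1"],
]
print(over_100b(sample))
```

The threshold uses `>=` so that it matches the behaviour of the length-based filter, which also admits a value of exactly `100,000`.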