-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathmain_prepdata.py
81 lines (61 loc) · 2.26 KB
/
main_prepdata.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
'''
So you can either do it two ways
One is to run each of the files
individually, e.g.
cd ./algaebloom
python get_dem.py
python get_lag.py
python get_sat.py
These take along time, and dem/sat can
fail periodically
They are not idempotent, so incrementally save
the values to a SQLite database (in the data folder)
The easier way though to replicate the SQLite database
though is to run this script, it presumes you have the files
algaebloom/data
- metadata.csv # from competition
- train_labels.csv # from competition
- submission_format.csv # from competition
- elevation_dem.csv # generated via get_dem.py
- spat_lag.csv # generated via get_lag.py
- sat.csv # generated via get_sat.py
- split_pred.csv # generated via get_split.py
The final solution does not use the lag values, but have
included to replicate my prior tests. Get split needs to be
run AFTER the elevation stats are prepped, but otherwise
the order does not matter
any questions? Feel free to email me,
Andy Wheeler
'''
import pandas as pd
import sqlite3
db = './data/data.sqlite'
tab_names = ['elevation_dem','split_pred','spat_lag','sat']
meta_names = ['meta','labels', 'format']
meta_csv = ['metadata','train_labels','submission_format']
fd = {c:t for c,t in zip(meta_csv,meta_names)}
ft = {t:t for t in tab_names}
fd.update(ft)
# Function to save out csv files
def save_csv(db=db):
db_con = sqlite3.connect(db)
for t in tab_names:
res = pd.read_sql(f'SELECT * FROM {t}',db_con)
res.to_csv(f'./data/{t}.csv',index=False)
# Function to prep SQLite DB
def prep_sql(db=db):
db_con = sqlite3.connect(db)
for csv,tab_name in fd.items():
res = pd.read_csv(f'./data/{csv}.csv')
res.to_sql(tab_name,index=False,if_exists='replace',con=db_con)
# Showing resulting table names
query = "SELECT name FROM sqlite_master WHERE type='table';"
rt = pd.read_sql(query,db_con)
print('Resulting table names in sqlite')
print(rt['name'].tolist())
if __name__ == "__main__":
print('Executing script to prep data in sqlite')
print('This expects the csv files listed at front')
print('of script to be in the data folder')
prep_sql()