How to deal with csv, import to sql and so on
Let’s do it something with CSV ;) Let’s talk today how to deal with CSV. and pandas will help us.
import pandas as pd
There is good package for pandas - swifter A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.
# !pip install swifter
I’ve already installed package, that’s why code was commented above.
import swifter
Also we can treat with pandas dataframe like sql data. For that we need another one package - pandasql. For installing: !pip install pandasql .
import pandasql as ps
%load_ext sql
The sql extension is already loaded. To reload it, use:
%reload_ext sql
url = """https://raw.githubusercontent.com/NLPH/NLPH_Resources/master/code/VerbInflector/resources/The%20Verb%20Index.csv"""
col_names = ["name", "binyan", "type"]
df = pd.read_csv(url, error_bad_lines=False,header=None, skipinitialspace=True, names=col_names, dtype={'binyan':'category'})
df.head(2)
name | binyan | type | |
---|---|---|---|
0 | אבד | A | 16 |
1 | אבזר | C | 20 |
unique values in the each columns
pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])
Column_Name | Num_Unique | |
---|---|---|
1 | binyan | 7 |
2 | type | 60 |
0 | name | 3530 |
We could replace or map values of a column to values, that are more convenient for us
labels = df['binyan'].astype('category').cat.categories.tolist()
replace_map = {'binyan' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
replace_map
{'binyan': {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}}
replace_map_name = {'binyan': {'A': 'paal', 'B': 'nifal', 'C': 'piel', 'D': 'pual',
'E': 'hitpael', 'F': 'hifil', 'G': 'hufal'}}
df.replace(replace_map_name, inplace=True)
df
name | binyan | type | |
---|---|---|---|
0 | אבד | paal | 16 |
1 | אבזר | piel | 20 |
2 | אבחן | piel | 25 |
3 | אבטח | piel | 26 |
4 | אגד | paal | 14 |
... | ... | ... | ... |
4242 | תרגם | piel | 20 |
4243 | תרם | paal | 1 |
4244 | תשאל | piel | 24 |
4245 | תש | paal | 40 |
4246 | תשש | paal | 1 |
4247 rows × 3 columns
q1 = """SELECT * FROM df WHERE binyan='paal' and name LIKE 'א%' """
ps.sqldf(q1, locals()).head(10)
name | binyan | type | |
---|---|---|---|
0 | אבד | paal | 16 |
1 | אגד | paal | 14 |
2 | אגף | paal | 14 |
3 | אגר | paal | 14 |
4 | אהב | paal | 21 |
5 | אהד | paal | 20 |
6 | אזל | paal | 15 |
7 | אזר | paal | 14 |
8 | אחז | paal | 20 |
9 | אטם | paal | 14 |
We could get data also from sql database. For example from postgresql. Sure, we need install drivers’ packages. !pip install psycopg2 , !pip install pyodbc. And let’s connect (it’s just connection string: database_driver://user:password@host/dbname ):
%sql postgresql+psycopg2://postgres:123@localhost/hebrew
'Connected: postgres@hebrew'
result = %sql SELECT * FROM generate_series('2020-01-01 00:00'::timestamp, '2020-02-04 12:00', '10 hours') as date;
df = result.DataFrame()
* postgresql+psycopg2://postgres:***@localhost/hebrew
83 rows affected.
df.head(5)
date | |
---|---|
0 | 2020-01-01 00:00:00 |
1 | 2020-01-01 10:00:00 |
2 | 2020-01-01 20:00:00 |
3 | 2020-01-02 06:00:00 |
4 | 2020-01-02 16:00:00 |
As we see, it’s really easy to get data to dataframe.
url = """https://raw.githubusercontent.com/NLPH/NLPH_Resources/master/code/VerbInflector/resources/Inflected%20verbs%20Extended.txt"""
col_names = ["binyan", "type", "inflection", "properties", "verb"]
df_infl = pd.read_csv(url, error_bad_lines=False,header=None, skipinitialspace=True, names=col_names, dtype={'binyan':'category'})
df_infl.head(2)
binyan | type | inflection | properties | verb | |
---|---|---|---|---|---|
0 | A | 1 | בָּגַדְתִּי | PAST+FIRST+MF+SINGULAR+COMPLETE | בָּגַד |
1 | A | 1 | בָּגַדְתָּ | PAST+SECOND+M+SINGULAR+COMPLETE | בָּגַד |
In the properties column we have data with ‘+’ separator. But we want to split it. Let’s do it:
df_infl[['form','number', 'gender', 'number', 'property']] = df_infl['properties'].str.split('+',expand=True)
df_infl.head(2)
binyan | type | inflection | properties | verb | form | number | gender | property | |
---|---|---|---|---|---|---|---|---|---|
0 | A | 1 | בָּגַדְתִּי | PAST+FIRST+MF+SINGULAR+COMPLETE | בָּגַד | PAST | SINGULAR | MF | COMPLETE |
1 | A | 1 | בָּגַדְתָּ | PAST+SECOND+M+SINGULAR+COMPLETE | בָּגַד | PAST | SINGULAR | M | COMPLETE |
And we can push it to our sql server really easy
import sqlalchemy
database_username = 'username'
database_password = 'password'
database_ip = 'host'
database_name = 'dbname'
conn = sqlalchemy.create_engine('postgresql+psycopg2://{0}:{1}@{2}/{3}'.format(database_username, database_password, database_ip, database_name))
df_infl(con=conn, name='tablename', if_exists='append')