How to deal with csv, import to sql and so on

Let’s do it something with CSV ;) Let’s talk today how to deal with CSV. and pandas will help us.

import pandas as pd

There is good package for pandas - swifter A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.

# !pip install swifter

I’ve already installed package, that’s why code was commented above.

import swifter

Also we can treat with pandas dataframe like sql data. For that we need another one package - pandasql. For installing: !pip install pandasql .

import pandasql as ps
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql

url = """https://raw.githubusercontent.com/NLPH/NLPH_Resources/master/code/VerbInflector/resources/The%20Verb%20Index.csv"""

col_names = ["name", "binyan", "type"]
df = pd.read_csv(url, error_bad_lines=False,header=None, skipinitialspace=True, names=col_names, dtype={'binyan':'category'})
df.head(2)

	name	binyan	type
0	אבד	A	16
1	אבזר	C	20

unique values in the each columns

pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
                          columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])

	Column_Name	Num_Unique
1	binyan	7
2	type	60
0	name	3530

We could replace or map values of a column to values, that are more convenient for us

labels = df['binyan'].astype('category').cat.categories.tolist()
replace_map = {'binyan' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
replace_map

{'binyan': {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}}

replace_map_name = {'binyan': {'A': 'paal', 'B': 'nifal', 'C': 'piel', 'D': 'pual',
                                  'E': 'hitpael', 'F': 'hifil', 'G': 'hufal'}}

df.replace(replace_map_name, inplace=True)
df

	name	binyan	type
0	אבד	paal	16
1	אבזר	piel	20
2	אבחן	piel	25
3	אבטח	piel	26
4	אגד	paal	14
...	...	...	...
4242	תרגם	piel	20
4243	תרם	paal	1
4244	תשאל	piel	24
4245	תש	paal	40
4246	תשש	paal	1

4247 rows × 3 columns

q1 = """SELECT * FROM df WHERE binyan='paal' and name LIKE 'א%' """
ps.sqldf(q1, locals()).head(10)

	name	binyan	type
0	אבד	paal	16
1	אגד	paal	14
2	אגף	paal	14
3	אגר	paal	14
4	אהב	paal	21
5	אהד	paal	20
6	אזל	paal	15
7	אזר	paal	14
8	אחז	paal	20
9	אטם	paal	14

We could get data also from sql database. For example from postgresql. Sure, we need install drivers’ packages. !pip install psycopg2 , !pip install pyodbc. And let’s connect (it’s just connection string: database_driver://user:password@host/dbname ):

%sql postgresql+psycopg2://postgres:123@localhost/hebrew

'Connected: postgres@hebrew'

result = %sql SELECT * FROM generate_series('2020-01-01 00:00'::timestamp, '2020-02-04 12:00', '10 hours') as date;
df = result.DataFrame()

 * postgresql+psycopg2://postgres:***@localhost/hebrew
83 rows affected.

df.head(5)

	date
0	2020-01-01 00:00:00
1	2020-01-01 10:00:00
2	2020-01-01 20:00:00
3	2020-01-02 06:00:00
4	2020-01-02 16:00:00

As we see, it’s really easy to get data to dataframe.

url = """https://raw.githubusercontent.com/NLPH/NLPH_Resources/master/code/VerbInflector/resources/Inflected%20verbs%20Extended.txt"""

col_names = ["binyan", "type", "inflection", "properties", "verb"]
df_infl = pd.read_csv(url, error_bad_lines=False,header=None, skipinitialspace=True, names=col_names, dtype={'binyan':'category'})
df_infl.head(2)

	binyan	type	inflection	properties	verb
0	A	1	בָּגַדְתִּי	PAST+FIRST+MF+SINGULAR+COMPLETE	בָּגַד
1	A	1	בָּגַדְתָּ	PAST+SECOND+M+SINGULAR+COMPLETE	בָּגַד

In the properties column we have data with ‘+’ separator. But we want to split it. Let’s do it:

df_infl[['form','number', 'gender', 'number', 'property']] = df_infl['properties'].str.split('+',expand=True)

df_infl.head(2)

	binyan	type	inflection	properties	verb	form	number	gender	property
0	A	1	בָּגַדְתִּי	PAST+FIRST+MF+SINGULAR+COMPLETE	בָּגַד	PAST	SINGULAR	MF	COMPLETE
1	A	1	בָּגַדְתָּ	PAST+SECOND+M+SINGULAR+COMPLETE	בָּגַד	PAST	SINGULAR	M	COMPLETE

And we can push it to our sql server really easy

import sqlalchemy
database_username = 'username'
database_password = 'password'
database_ip       = 'host'
database_name     = 'dbname'
conn = sqlalchemy.create_engine('postgresql+psycopg2://{0}:{1}@{2}/{3}'.format(database_username, database_password, database_ip, database_name))
df_infl(con=conn, name='tablename', if_exists='append')