Let’s do it something with CSV ;) Let’s talk today how to deal with CSV. and pandas will help us.

import pandas as pd

There is good package for pandas - swifter A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.

# !pip install swifter

I’ve already installed package, that’s why code was commented above.

import swifter

Also we can treat with pandas dataframe like sql data. For that we need another one package - pandasql. For installing: !pip install pandasql .

import pandasql as ps
%load_ext sql
The sql extension is already loaded. To reload it, use:
  %reload_ext sql
url = """https://raw.githubusercontent.com/NLPH/NLPH_Resources/master/code/VerbInflector/resources/The%20Verb%20Index.csv"""
col_names = ["name", "binyan", "type"]
df = pd.read_csv(url, error_bad_lines=False,header=None, skipinitialspace=True, names=col_names, dtype={'binyan':'category'})
df.head(2)
name binyan type
0 אבד A 16
1 אבזר C 20

unique values in the each columns

pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
                          columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])
Column_Name Num_Unique
1 binyan 7
2 type 60
0 name 3530

We could replace or map values of a column to values, that are more convenient for us

labels = df['binyan'].astype('category').cat.categories.tolist()
replace_map = {'binyan' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
replace_map
{'binyan': {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}}
replace_map_name = {'binyan': {'A': 'paal', 'B': 'nifal', 'C': 'piel', 'D': 'pual',
                                  'E': 'hitpael', 'F': 'hifil', 'G': 'hufal'}}
df.replace(replace_map_name, inplace=True)
df
name binyan type
0 אבד paal 16
1 אבזר piel 20
2 אבחן piel 25
3 אבטח piel 26
4 אגד paal 14
... ... ... ...
4242 תרגם piel 20
4243 תרם paal 1
4244 תשאל piel 24
4245 תש paal 40
4246 תשש paal 1

4247 rows × 3 columns

q1 = """SELECT * FROM df WHERE binyan='paal' and name LIKE 'א%' """
ps.sqldf(q1, locals()).head(10)
name binyan type
0 אבד paal 16
1 אגד paal 14
2 אגף paal 14
3 אגר paal 14
4 אהב paal 21
5 אהד paal 20
6 אזל paal 15
7 אזר paal 14
8 אחז paal 20
9 אטם paal 14

We could get data also from sql database. For example from postgresql. Sure, we need install drivers’ packages. !pip install psycopg2 , !pip install pyodbc. And let’s connect (it’s just connection string: database_driver://user:password@host/dbname ):

%sql postgresql+psycopg2://postgres:123@localhost/hebrew
'Connected: postgres@hebrew'
result = %sql SELECT * FROM generate_series('2020-01-01 00:00'::timestamp, '2020-02-04 12:00', '10 hours') as date;
df = result.DataFrame()
 * postgresql+psycopg2://postgres:***@localhost/hebrew
83 rows affected.
df.head(5)
date
0 2020-01-01 00:00:00
1 2020-01-01 10:00:00
2 2020-01-01 20:00:00
3 2020-01-02 06:00:00
4 2020-01-02 16:00:00

As we see, it’s really easy to get data to dataframe.

url = """https://raw.githubusercontent.com/NLPH/NLPH_Resources/master/code/VerbInflector/resources/Inflected%20verbs%20Extended.txt"""
col_names = ["binyan", "type", "inflection", "properties", "verb"]
df_infl = pd.read_csv(url, error_bad_lines=False,header=None, skipinitialspace=True, names=col_names, dtype={'binyan':'category'})
df_infl.head(2)
binyan type inflection properties verb
0 A 1 בָּגַדְתִּי PAST+FIRST+MF+SINGULAR+COMPLETE בָּגַד
1 A 1 בָּגַדְתָּ PAST+SECOND+M+SINGULAR+COMPLETE בָּגַד

In the properties column we have data with ‘+’ separator. But we want to split it. Let’s do it:

df_infl[['form','number', 'gender', 'number', 'property']] = df_infl['properties'].str.split('+',expand=True)
df_infl.head(2)
binyan type inflection properties verb form number gender property
0 A 1 בָּגַדְתִּי PAST+FIRST+MF+SINGULAR+COMPLETE בָּגַד PAST SINGULAR MF COMPLETE
1 A 1 בָּגַדְתָּ PAST+SECOND+M+SINGULAR+COMPLETE בָּגַד PAST SINGULAR M COMPLETE

And we can push it to our sql server really easy

import sqlalchemy
database_username = 'username'
database_password = 'password'
database_ip       = 'host'
database_name     = 'dbname'
conn = sqlalchemy.create_engine('postgresql+psycopg2://{0}:{1}@{2}/{3}'.format(database_username, database_password, database_ip, database_name))
df_infl(con=conn, name='tablename', if_exists='append')