### “MUST KNOW” from Python-Pandas for Data Science.

Pandas is a very popular Python library for data analysis, manipulation, and visualization. I would like to share my personal list of the functions and snippets I use most often for data analysis.

1. Import Pandas to Python

import pandas as pd

2. Import data from CSV/Excel file

df = pd.read_csv('C:/Folder/mlhype.csv') #imports whole csv to pd dataframe
df = pd.read_csv('C:/Folder/mlhype.csv', usecols=['abv', 'ibu']) #imports selected columns
df = pd.read_excel('C:/Folder/mlhype.xlsx') #imports excel file
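To try the import calls without a file on disk, here is a minimal sketch that reads a small in-memory CSV through `io.StringIO`. The sample rows and the `name` column are made up for illustration; `abv` and `ibu` mirror the column names used above.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = "name,abv,ibu\nPale Ale,0.05,40\nStout,0.07,35\nLager,0.045,\n"

# read_csv accepts a path or any file-like object.
df_all = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the listed columns while parsing.
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=["abv", "ibu"])

print(df_all.shape)             # (3, 3)
print(list(df_subset.columns))  # ['abv', 'ibu']
print(df_all["ibu"].isna().sum())  # 1, the empty field is read as NaN
```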

3. Save data to CSV/Excel

df.to_csv('C:/Folder/mlhype.csv') #saves data frame to csv
df.to_excel('C:/Folder/mlhype.xlsx') #saves data frame to excel
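One detail that often surprises people when saving: `to_csv` writes the row index as an extra first column unless you pass `index=False`. The sketch below shows the difference on a made-up two-row frame; calling `to_csv()` without a path returns the CSV as a string, which keeps the example free of file paths.

```python
import io

import pandas as pd

# Illustrative data, not from a real file.
df = pd.DataFrame({"abv": [0.05, 0.07], "ibu": [40, 35]})

# With no path, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,abv,ibu
print(without_index.splitlines()[0])  # abv,ibu

# Reading the index-free text back gives the same two columns.
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))       # ['abv', 'ibu']
```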

4. Exploring data

df.head(5) #returns top 5 rows of data
df.tail(5) #returns bottom 5 rows of data
df.sample(5) #returns 5 random rows of data
df.shape #returns number of rows and columns
df.info() #prints index, data types, memory information
df.describe() #returns basic statistical summary of numeric columns
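A quick sketch of the exploration calls on a tiny made-up frame. Note that `shape` is an attribute (no parentheses), and that passing `random_state` to `sample` makes the random pick reproducible.

```python
import pandas as pd

# Illustrative data with one numeric and one text column.
df = pd.DataFrame({
    "abv": [0.05, 0.07, 0.045, 0.06],
    "style": ["ale", "stout", "lager", "ale"],
})

print(df.shape)  # (4, 2)

first_two = df.head(2)   # rows 0 and 1
last_two = df.tail(2)    # rows 2 and 3

# random_state fixes the seed so the same rows come back every run.
picked = df.sample(2, random_state=0)

# By default describe() summarizes numeric columns only.
summary = df.describe()
print(list(summary.columns))  # ['abv']
```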

5. Basic statistical functions

df.mean() #returns mean of each column (pass numeric_only=True if some columns are not numeric)
df.corr() #returns correlation table of numeric columns
df.count() #returns count of non-null values in each column
df.max() #returns max value in each column
df.min() #returns min value in each column
df.median() #returns median of each column
df.std() #returns standard deviation of each column
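Real data frames usually mix text and numeric columns, and recent pandas versions raise an error if you ask for the mean of a text column. The sketch below, on made-up data, restricts the statistics to numeric columns and shows that missing values are skipped.

```python
import pandas as pd

# Illustrative data: one text column and one missing value.
df = pd.DataFrame({
    "style": ["ale", "stout", "lager", "ale"],
    "abv": [0.05, 0.07, 0.04, 0.06],
    "ibu": [40.0, 35.0, None, 45.0],
})

# numeric_only=True leaves the text column out of the calculation.
means = df.mean(numeric_only=True)
print(means["ibu"])  # 40.0, the missing value is ignored

# count() reports non-null entries per column.
counts = df.count()
print(counts["ibu"])  # 3

# Select the numeric columns explicitly before correlating.
corr = df[["abv", "ibu"]].corr()
print(corr.shape)  # (2, 2)
```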

6. Selecting subsets

df['ColumnName'] #returns column 'ColumnName'
df[['ColumnName1','ColumnName2']] #returns multiple columns from the list
df.iloc[2,:] #selection by position - whole third row (positions start at 0)
df.iloc[:,2] #selection by position - whole third column
df.loc[:10,'ColumnName'] #returns rows up to and including label 10 (the first 11 rows with the default index)
df.loc[2,'ColumnName'] #returns the element at row label 2 (df.ix was removed in pandas 1.0)
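The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the index labels do not match the row positions. A minimal sketch with a made-up frame whose labels are 10, 20, 30, 40:

```python
import pandas as pd

# Labels deliberately differ from positions 0..3.
df = pd.DataFrame(
    {"abv": [0.05, 0.07, 0.04, 0.06], "ibu": [40, 35, 20, 45]},
    index=[10, 20, 30, 40],
)

third_row = df.iloc[2]         # position 2 is the third row (label 30)
by_label = df.loc[20, "abv"]   # the value at label 20
up_to_30 = df.loc[:30, "abv"]  # label slices include the end label
first_two = df.iloc[:2]        # positional slices exclude the end

print(third_row.name)  # 30
print(by_label)        # 0.07
print(len(up_to_30))   # 3
print(len(first_two))  # 2
```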

7. Data cleansing

df.columns = ['a','b','c','d','e','f','g','h'] #renames columns (the list must have one name per column)
df.dropna() #returns a copy without the rows that contain missing values
df.fillna(0) #returns a copy with missing values replaced by 0 (or any other passed value)
df.fillna(df.mean()) #returns a copy with missing values replaced by the column mean (or any other per-column statistic)
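It is worth remembering that `dropna` and `fillna` return new data frames and leave the original untouched, so the result has to be assigned to a variable. A small sketch on made-up data:

```python
import pandas as pd

# Illustrative data with one missing value per column.
df = pd.DataFrame({"abv": [0.05, None, 0.07], "ibu": [40.0, 35.0, None]})

dropped = df.dropna()               # only row 0 is complete
zero_filled = df.fillna(0)          # every gap becomes 0
mean_filled = df.fillna(df.mean())  # each gap gets its own column's mean

print(len(dropped))                # 1
print(mean_filled.loc[2, "ibu"])   # 37.5

# The original frame still has its two missing values.
print(df.isna().sum().sum())       # 2
```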

8. Filtering/sorting

df[df['ColumnName'] > 0.08] #returns rows meeting the criterion
df[(df['ColumnName1']>2004) & (df['ColumnName2']==9)] #returns rows meeting multiple criteria
df.sort_values('ColumnName') #sorts by column in ascending order
df.sort_values('ColumnName',ascending=False) #sorts by column in descending order
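Here is a short worked example of the filters and sorts on made-up data. Each condition needs its own parentheses when combined with `&`, because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "year": [2003, 2005, 2010, 2012],
    "month": [9, 9, 3, 9],
    "abv": [0.05, 0.09, 0.10, 0.04],
})

strong = df[df["abv"] > 0.08]
recent_sept = df[(df["year"] > 2004) & (df["month"] == 9)]
by_abv_desc = df.sort_values("abv", ascending=False)

print(list(strong.index))       # [1, 2]
print(list(recent_sept.index))  # [1, 3]
print(list(by_abv_desc.index))  # [2, 1, 0, 3]
```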

9. Data frames concatenation

pd.concat([DataFrame1, DataFrame2],axis=0) #stacks the rows of both data frames vertically
pd.concat([DataFrame1, DataFrame2],axis=1) #places the columns of both data frames side by side
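A small sketch of both directions on made-up frames. When stacking rows, the original index labels are kept, so they repeat unless you pass `ignore_index=True`; when joining side by side, rows are matched on their index labels.

```python
import pandas as pd

# Illustrative frames; the names are assumptions for this sketch.
left = pd.DataFrame({"abv": [0.05, 0.07]})
right = pd.DataFrame({"abv": [0.04, 0.06]})
extra = pd.DataFrame({"ibu": [40, 35]})

stacked = pd.concat([left, right], axis=0)
stacked_clean = pd.concat([left, right], axis=0, ignore_index=True)
side_by_side = pd.concat([left, extra], axis=1)

print(list(stacked.index))         # [0, 1, 0, 1]
print(list(stacked_clean.index))   # [0, 1, 2, 3]
print(list(side_by_side.columns))  # ['abv', 'ibu']
```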

10. Adding new columns

df['NewColumn'] = 50 #creates new column with value 50 in each row
df['NewColumn3'] = df['NewColumn1']+df['NewColumn2'] #new column with value equal to sum of other columns
del df['NewColumn'] #deletes column
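The same steps on a tiny made-up frame, so the result can be checked by eye. As an alternative to `del`, `df.drop(columns=[...])` returns a new frame without the listed columns and leaves the original as it was.

```python
import pandas as pd

# Illustrative data; column names are assumptions for this sketch.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

df["constant"] = 50             # same value in every row
df["total"] = df["a"] + df["b"]  # element-wise sum of two columns
del df["constant"]              # removes the column in place

# drop returns a new frame; df itself keeps column "b".
trimmed = df.drop(columns=["b"])

print(list(df.columns))       # ['a', 'b', 'total']
print(df["total"].tolist())   # [11, 22, 33]
print(list(trimmed.columns))  # ['a', 'total']
```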

I hope you find the above useful. If you need more information on pandas, I recommend going to the Pandas documentation or getting one of these books: