“MUST KNOW” from Python-Pandas for Data Science

“MUST KNOW” from Python-Pandas for Data Science.

Pandas is very popular Python library for data analysis, manipulation, and visualization, I would like to share my personal view on the list of most often used functions/snippets for data analysis.

1.Import Pandas to Python

import pandas as pd

2. Import data from CSV/Excel file

df=pd.read_csv('C:/Folder/mlhype.csv')   #imports whole csv to pd dataframe
df = pd.read_csv('C:/Folder/mlhype.csv', usecols=['abv', 'ibu'])  #imports selected columns
df = pd.read_excel('C:/Folder/mlhype.xlsx')  #imports excel file

3. Save data to CSV/Excel

df.to_csv('C:/Folder/mlhype.csv') #saves data frame to csv
df.to_excel('C:/Folder/mlhype.xlsx') #saves data frame to excel

4. Exploring data

df.head(5) #returns top 5 rows of data
df.tail(5) #returns bottom 5 rows of data
df.sample(5) #returns random 5 rows of data
df.shape #returns number of rows and columns
df.info() #returns index,data types, memory information
df.describe() #returns basic statistical summary of columns

5. Basic statistical functions

df.mean() #returns mean of columns
df.corr() #returns correlation table
df.count() #returns count of non-null's in column
df.max() #returns max value in each column
df.min() #returns min value in each column
df.median() #returns median of each colun
df.std() #returns standard deviation of each column

6. Selecting subsets

df['ColumnName'] #returns column 'ColumnName'
df[['ColumnName1','ColumnName2']] #returns multiple columns from the list
df.iloc[2,:] #selection by position - whole second row
df.iloc[:,2] #selection by position - whole second column
df.loc[:10,'ColumnName'] #returns first 11 rows of column
df.ix[2,'ColumnName'] #returns second element of column

7. Data cleansing

df.columns = ['a','b','c','d','e','f','g','h'] #rename column names
df.dropna() #drops all rows that contain missing values
df.fillna(0) #replaces missing values with 0 (or any other passed value)
df.fillna(df.mean()) #replaces missing values with mean(or any other passed function)


df[df['ColumnName'] > 0.08] #returns rows with meeting criterion 
df[(df['ColumnName1']>2004) & (df['ColumnName2']==9)] #returns rows meeting multiple criteria
df.sort_values('ColumnName') #sorts by column in ascending order
df.sort_values('ColumnName',ascending=False) #sort by column in descending order

9. Data frames concatenation

pd.concat([DateFrame1, DataFrame2],axis=0) #concatenate rows vertically
pd.concat([DateFrame1, DataFrame2],axis=1) #concatenate rows horizontally

10.Adding new columns

df['NewColumn'] = 50 #creates new column with value 50 in each row
df['NewColumn3'] = df['NewColumn1']+df['NewColumn2'] #new column with value equal to sum of other columns
del df['NewColumn'] #deletes column

I hope you will find above useful, if you need more information on pandas, I recommend going to Pandas documentation or getting one of these books:

Free eBooks for Python.

Free eBooks to learn Python.

Below a list of 5 eBooks to learn Python for data science, they are all free to use:

  1. A Byte of Python by Swaroop C H
  2. Python for Everybody – Prof.Charles Severance
  3. Python Programming  by Wikibooks

  4. Think Python: How would you Think Like a Computer Scientist by Allen Downey

  5. Dive Into Python 3 by Mark Pilgrim

Hope you will enjoy if for some reason you prefer to have a paper copy, I would recommend these:



What is Hadoop YARN?

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop. As its architectural center, YARN enhances a Hadoop compute cluster in the following ways: Multitenancy, Cluster utilization, Scalability and Compatibility. Multi-tenant data processing improves an enterprises’ return on Hadoop investments. YARNs dynamic allocation of cluster resources improves utilization over more static MapReduce rules. YARN’s resource manager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes. Existing MapReduce applications developed for Hadoop 1 can run YARN without any disruptions to the processes that already work.

What is Hadoop Flume?

Hadoop Flume was created in the course of incubator Apache project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, among other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform the stream in some manner, which could be to compress or uncompress data, modify data by adding or removing pieces of information, and more. Flume allows you a number of different configurations and topologies, allowing you to choose the right setup for your application. Flume is a distributed system which runs across multiple machines. It can collect large volumes of data from many applications and systems. It includes mechanisms for load balancing and failover, and it can be extended and customized in many ways. Flume is a scalable, reliable, configurable and extensible system for management the movement of large volumes of data.

What is Apache Kafka?

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. The design is heavily influenced by transaction logs. Apache Kafka was originally developed by LinkedIn and was subsequently open sourced in early 2011. Graduation from the Apache Incubator occurred on 23 October 2012. Due to its widespread integration into enterprise-level infrastructures, monitoring Kafka performance at scale has become an increasingly important issue. Monitoring end-to-end performance requires tracking metrics from brokers, consumer, and producers, in addition to monitoring ZooKeeper, which is used by Kafka for coordination among consumers. There are currently several monitoring platforms to track Kafka performance, both open-source, like LinkedIn’s Burrow, as well as paid, like Datadog. In addition to these platforms, collecting Kafka data can also be performed using tools commonly bundled with Java, including JConsole.

What is Hadoop Zookeeper?

Hadoop Zookeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space, etc. Applications can leverage these services to coordinate distributed processing across large clusters. Name services, group services, synchronization services, configuration management, and more, are available in Zookeeper, which means that each of these projects can embed ZooKeeper without having to build synchronization services from scratch into each project. Interaction with ZooKeeper occurs via Java or C interfaces time. Within ZooKeeper, an application can create what is called a znode (a file that persists in memory on the ZooKeeper servers). The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode (in ZooKeeper parlance, a server can be set up to “watch” a specific znode). Using this znode infrastructure, applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.

What is Hadoop Hbase?

Hadoop Hbase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key, and all access attempts to HBase tables must use this Primary Key. HBase allows for many attributes to be grouped together into what are known as column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. HBase is very flexible and therefore able to adapt to changing application requirements. HBase is built on concepts similar to those of MapReduce and HDFS (NameNode and slave nodes). In HBase a master node manages the cluster and region servers store portions of the tables and perform the work on the data. In the same way HDFS has some enterprise concerns due to the availability of the NameNode, HBase is also sensitive to the loss of its master node.

What is Hadoop Sqoop?

Hadoop Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB. Sqoop does the following to integrate bulk data movement between Hadoop and structured datastores: Import sequential datasets from a mainframe, parallel data transfer, fast data copies, efficient data analysis, load balancing.

What is Hadoop Hive?

Hadoop Hive is a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational data-base developers) to leverage the Hadoop platform right out of the gate. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. HQL is limited in the commands it understands, but it is still useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, the queries have a very high latency (many minutes). This makes Hive not appropriate for applications that need very fast response times, as required by a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

What is Hadoop Pig?

Hadoop Pig was initially developed at Yahoo to allow people using Hadoop to focus more on analyzing large datasets and spend less time writing mappers and reduce programs. This would allow people to do what they want to do instead of thinking about mapper and reducer tasks. Name Pig was given to the programming language with a hint on it being designed to handle any kind of data, which has a resemblance to an actual pig, who eat almost anything.
Pig is made up of two components: the first is the language itself, which is called PigLatin, and the second is a runtime environment where PigLatin programs are executed. The program written in Pig can be split into three stages: LOAD, Transformations, and DUMP. First, you load the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which subsequently are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.