TensorFlow Intro

What is TensorFlow?

The shortest definition would be: TensorFlow is a general-purpose library for graph-based computation.

But there is a variety of other ways to define TensorFlow. For example, Rodolfo Bonnin, in his book Building Machine Learning Projects with TensorFlow, offers a definition like this:

“TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) passed between them.”

To quote the TensorFlow website, TensorFlow is an “open source software library for numerical computation using data flow graphs”. The name TensorFlow derives from the operations which neural networks perform on multidimensional data arrays, often referred to as ‘tensors’. It uses data flow graphs and is capable of building and training a variety of different machine learning algorithms, including deep neural networks; at the same time, it is general enough to be applicable in a wide variety of other domains as well. Its flexible architecture allows deploying computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

TensorFlow is Google Brain’s second-generation machine learning system, released as open source software in 2015. TensorFlow is available on 64-bit Linux, macOS, and mobile computing platforms including Android and iOS. TensorFlow provides a Python API, as well as C++, Haskell, Java and Go APIs. Google’s machine learning framework has lately become one of the ‘hottest’ in the data science world; it is particularly useful for building deep learning systems for predictive models involving natural language processing, audio, and images.

 

What is ‘Graph’ or ‘Data Flow Graph’? What is TensorFlow Session?

 

Trying to define what TensorFlow is, it is hard to avoid using the words ‘graph’ or ‘data flow graph’, so what are they? The shortest definition would be: a TensorFlow graph is a description of computations. Deep learning (neural networks with many layers) mostly uses very simple mathematical operations – just many of them, on high-dimensional data structures (tensors). Neural networks can have thousands or even millions of weights. Computing them by interpreting every step in Python would take forever.

That’s why we create a graph made up of defined tensors and mathematical operations, and even initial values for variables. Only after we’ve created this ‘recipe’ can we pass it to what TensorFlow calls a session. To compute anything, a graph must be launched in a session. The session runs the graph using very efficient and optimized code. Not only that, but many of the operations, such as matrix multiplication, can be parallelised on a supported GPU (Graphics Processing Unit), and the session will do that for you. Also, TensorFlow is built to be able to distribute the processing across multiple machines and/or GPUs.

TensorFlow programs are usually divided into a construction phase, which assembles a graph, and an execution phase, which uses a session to execute operations in the graph. To do machine learning in TensorFlow, you create tensors, add operations (that output other tensors), and then execute the computation (run the computational graph). In particular, it’s important to realize that when you add an operation on tensors, it doesn’t execute immediately. TensorFlow waits for you to define all the operations you want to perform, then optimizes the computation graph, ‘deciding’ how to execute the computation, before generating the data. Because of this, tensors in TensorFlow are not so much containers for the data as placeholders for the data, waiting for the data to arrive when a computation is executed.
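
As a minimal sketch of these two phases (assuming the TensorFlow 1.x graph-and-session API used throughout this post), the pattern looks like this:

# construction phase: describe the computation, nothing is executed yet
import tensorflow as tf
a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b

# execution phase: launch the graph in a session and run only what we need
with tf.Session() as sess:
    print(sess.run(c))  # 30.0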

 

Prerequisites:

 

NEURAL NETWORKS – basics.

Before we move on to create our first model in TensorFlow, we’ll need to get the basics right and talk a bit about the structure of a simple neural network.

A simple neural network has some input units, where the input goes. It also has hidden units, so called because from a user’s perspective they’re hidden. And there are output units, from which we get the results. Off to the side are also bias units, which are there to help control the values emitted from the hidden and output units. Connecting all of these units are a bunch of weights, which are just numbers, each of which is associated with two units. Training a neural network means finding suitable values for all those weights. One step in ‘running’ the neural network is to multiply the value of each weight by the value of its input unit and sum the results into the associated unit, usually followed by an activation function.
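
To make that concrete, below is a minimal sketch of a single forward pass through a tiny network with two inputs, two hidden units and one output, written in plain NumPy; the layer sizes, weight values and the sigmoid activation are illustrative assumptions, not part of any particular library.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([0.5, -1.2])            # two input units

w_hidden = np.array([[0.1, 0.4],
                     [-0.3, 0.2]])        # weights: inputs -> 2 hidden units
b_hidden = np.array([0.05, -0.05])        # bias for each hidden unit

w_output = np.array([0.7, -0.6])          # weights: hidden units -> 1 output unit
b_output = 0.1                            # bias for the output unit

# forward pass: multiply weights by inputs, sum into each unit, apply activation
hidden = sigmoid(w_hidden.dot(inputs) + b_hidden)
output = sigmoid(w_output.dot(hidden) + b_output)
print(output)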

There are plenty of resources available online to get more background on neural network architectures; a few examples below:

 

MATHEMATICS

Deep learning uses very simple mathematical operations, but it is recommended to learn or refresh at least the basics of them. I recommend starting with one of the following:

 

PYTHON

It is advisable to have basic Python programming skills before moving forward; a few available resources:

 

Let’s do it… TensorFlow first example code.

 

To keep things simple, let’s start with a ‘Hello World’ example.

Importing TensorFlow:

 

import tensorflow as tf

Declaring constants/variables: TensorFlow constants can be declared using the tf.constant function, and variables with the tf.Variable function. The first argument in both is the value to be assigned to the constant/variable when it is initialised. TensorFlow will infer the type from the initial value, but it can also be set explicitly using the optional dtype argument. It’s important to note that, as the Python code runs through these commands, the variables haven’t actually been declared as they would have been with a standard Python assignment.

x = tf.constant(2.0) 
y = tf.Variable(3.0)

Let’s make our code compute something: a simple multiplication.

z = y * x
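
(As a quick illustration of the lazy evaluation described earlier: printing z at this point shows only a description of the pending operation, not a value – the exact tensor name may differ on your machine.)

print(z)  # prints something like: Tensor("mul:0", shape=(), dtype=float32)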

Now comes the time when we would like to see the outcome, except nothing has been computed yet… welcome to TensorFlow. To make use of TensorFlow variables and perform calculations, a Session must be created and all variables must be initialized. We can do this using the following statements.

sess = tf.Session()

init = tf.global_variables_initializer()

sess.run(init)

 

We have a Session and all our constants/variables in place. Let’s see the outcome.

print("z = y * x = ", sess.run(z))

 

If you see something like this:
‘z = y * x = 6.0’
Congratulations, you have just coded your first TensorFlow ‘model’.

Below is the whole code in one piece:

import tensorflow as tf
x = tf.constant(2.0)
y = tf.Variable(3.0)
z = y * x
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print("z = y * x = ", sess.run(z))

This tutorial, of course, does not end here and will be continued soon… in the next part, we will code our first neural network in TensorFlow.

If you liked this post, please share it on your social media; if you have any questions or comments, please make use of the contact form.

Recommended reading list below:

Monthly Data Science Digest

Handpicked articles, news, and stories from Data Science world.

NEWS

  • CUDA 9 Features Revealed  – At the GPU Technology Conference, NVIDIA announced CUDA 9, the latest version of CUDA’s powerful parallel computing platform and programming model.

  • Explaining How End-to-End Deep Learning Steers a Self-Driving Car – As part of a complete software stack for autonomous driving, NVIDIA has created a deep-learning-based system, known as PilotNet, which learns to emulate the behavior of human drivers and can be deployed as a self-driving car controller.

  • Microsoft Build 2017: Microsoft AI – Amplify human ingenuity – Thanks to the convergence of three major forces — increased computing power in the cloud, powerful algorithms that run on deep neural networks and access to massive amounts of data — we’re finally able to realize the dream of AI.

 

  • AlphaGo’s next move – Chinese Go Grandmaster and world number one Ke Jie departed from his typical style of play and opened with a “3:3 point” strategy – a highly unusual approach aimed at quickly claiming corner territory at the start of the game.

 

  • Integrate Your Amazon Lex Bot with Any Messaging Service – Is your Amazon Lex chatbot ready to talk to the world? When it is, chances are that you’ll want it to be able to interact with as many users as possible. Amazon Lex offers built-in integration with Facebook, Slack and Twilio. But what if you want to connect to a messaging service that isn’t supported? Well, there’s an API for that–the Amazon Lex API.

  • How Our Company Learned to Make Better Predictions About Everything – In Silicon Valley, everyone makes bets. Founders bet years of their lives on finding product-market fit, investors bet billions on the future value of ambitious startups, and executives bet that their strategies will increase a company’s prospects. Here, predicting the future is not a theoretical superpower, it’s part of the job.

  • Are Pop Lyrics Getting More Repetitive? – In 1977, the great computer scientist Donald Knuth published a paper called The Complexity of Songs, which is basically one long joke about the repetitive lyrics of newfangled music (example quote: “the advent of modern drugs has led to demands for still less memory, and the ultimate improvement of Theorem 1 has consequently just been announced”).
  • Home advantages and wanderlust – When Burnley got beat 3-1 by Everton at Goodison Park on the 15th April, 33 games into their Premier League season, they’d gained only 4 points out of a possible 51 in their away fixtures. But during this time they’d also managed to accrue 32 points out of a possible 48 at Turf Moor; if the league table were based upon only home fixtures, they’d be in a highly impressive 6th place.
  • Facebook Wants to Merge AI Systems for a Smarter Chatbot – It’s possible to have a computer use conversational tricks that provide the appearance of intelligence for a few minutes, but the complexity of language eventually destroys the illusion.

  • MAGIC AI: THESE ARE THE OPTICAL ILLUSIONS THAT TRICK, FOOL, AND FLUMMOX COMPUTERS – To a human, a fooling image might look like a random tie-dye pattern or a burst of TV static, but show it to an AI image classifier and it’ll say with confidence: “Yep, that’s a gibbon,” or “My, what a shiny red motorbike.”

  • The Simple, Economic Value of Artificial Intelligence – How does this framing now apply to our emerging AI revolution? After decades of promise and hype, AI seems to have finally arrived – driven by the explosive growth of big data, inexpensive computing power and storage, and advanced algorithms like machine learning that enable us to analyze and extract insights from all that data.

BOOKS

Immortal Life: A Soon To Be True Story Kindle Edition by Stanley Bing

Neural Network Programming with Python Kindle Edition by Fabio M. Soares, Rodrigo Nunes

 

If you have found the above useful, please don’t forget to share it with others on social media.

“MUST KNOW” from Python-Pandas for Data Science

Pandas is a very popular Python library for data analysis, manipulation, and visualization. I would like to share my personal view on the most often used functions/snippets for data analysis.

1. Import Pandas to Python

import pandas as pd

2. Import data from CSV/Excel file

df=pd.read_csv('C:/Folder/mlhype.csv')   #imports whole csv to pd dataframe
df = pd.read_csv('C:/Folder/mlhype.csv', usecols=['abv', 'ibu'])  #imports selected columns
df = pd.read_excel('C:/Folder/mlhype.xlsx')  #imports excel file

3. Save data to CSV/Excel

df.to_csv('C:/Folder/mlhype.csv') #saves data frame to csv
df.to_excel('C:/Folder/mlhype.xlsx') #saves data frame to excel

4. Exploring data

df.head(5) #returns top 5 rows of data
df.tail(5) #returns bottom 5 rows of data
df.sample(5) #returns random 5 rows of data
df.shape #returns number of rows and columns
df.info() #returns index,data types, memory information
df.describe() #returns basic statistical summary of columns

5. Basic statistical functions

df.mean() #returns mean of columns
df.corr() #returns correlation table
df.count() #returns count of non-null's in column
df.max() #returns max value in each column
df.min() #returns min value in each column
df.median() #returns median of each column
df.std() #returns standard deviation of each column

6. Selecting subsets

df['ColumnName'] #returns column 'ColumnName'
df[['ColumnName1','ColumnName2']] #returns multiple columns from the list
df.iloc[2,:] #selection by position - whole row at index 2 (the third row)
df.iloc[:,2] #selection by position - whole column at index 2 (the third column)
df.loc[:10,'ColumnName'] #returns first 11 rows of column (label-based, end label inclusive)
df.loc[2,'ColumnName'] #returns element at index 2 of column (replaces the deprecated df.ix)

7. Data cleansing

df.columns = ['a','b','c','d','e','f','g','h'] #rename column names
df.dropna() #drops all rows that contain missing values
df.fillna(0) #replaces missing values with 0 (or any other passed value)
df.fillna(df.mean()) #replaces missing values with mean(or any other passed function)

8. Filtering/sorting

df[df['ColumnName'] > 0.08] #returns rows meeting the criterion
df[(df['ColumnName1']>2004) & (df['ColumnName2']==9)] #returns rows meeting multiple criteria
df.sort_values('ColumnName') #sorts by column in ascending order
df.sort_values('ColumnName',ascending=False) #sort by column in descending order

9. Data frames concatenation

pd.concat([DataFrame1, DataFrame2],axis=0) #concatenates vertically (stacks rows)
pd.concat([DataFrame1, DataFrame2],axis=1) #concatenates horizontally (adds columns side by side)

10. Adding new columns

df['NewColumn'] = 50 #creates new column with value 50 in each row
df['NewColumn3'] = df['NewColumn1']+df['NewColumn2'] #new column with value equal to sum of other columns
del df['NewColumn'] #deletes column

I hope you will find the above useful. If you need more information on pandas, I recommend going to the Pandas documentation or getting one of these books:

Free eBooks for Python.

Free eBooks to learn Python.

Below is a list of 5 eBooks to learn Python for data science; they are all free to use:

  1. A Byte of Python by Swaroop C H
  2. Python for Everybody by Prof. Charles Severance
  3. Python Programming by Wikibooks
  4. Think Python: How to Think Like a Computer Scientist by Allen Downey
  5. Dive Into Python 3 by Mark Pilgrim

Hope you will enjoy them. If for some reason you prefer to have a paper copy, I would recommend these:


 

What is Hadoop YARN?

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost-effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop. As its architectural center, YARN enhances a Hadoop compute cluster in the following ways: multi-tenancy, cluster utilization, scalability, and compatibility. Multi-tenant data processing improves an enterprise’s return on Hadoop investments. YARN’s dynamic allocation of cluster resources improves utilization over the more static MapReduce rules. YARN’s resource manager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes. Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to the processes that already work.

What is Hadoop Flume?

Hadoop Flume was created as an Apache incubator project to allow you to flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, as in other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform the stream in some manner, for example by compressing or decompressing data, or by modifying data through adding or removing pieces of information. Flume allows a number of different configurations and topologies, letting you choose the right setup for your application. Flume is a distributed system which runs across multiple machines. It can collect large volumes of data from many applications and systems. It includes mechanisms for load balancing and failover, and it can be extended and customized in many ways. Flume is a scalable, reliable, configurable and extensible system for managing the movement of large volumes of data.

What is Apache Kafka?

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log”, making it highly valuable for enterprise infrastructures that process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. The design is heavily influenced by transaction logs. Apache Kafka was originally developed by LinkedIn and was subsequently open sourced in early 2011. Graduation from the Apache Incubator occurred on 23 October 2012. Due to its widespread integration into enterprise-level infrastructures, monitoring Kafka performance at scale has become an increasingly important issue. Monitoring end-to-end performance requires tracking metrics from brokers, consumers, and producers, in addition to monitoring ZooKeeper, which is used by Kafka for coordination among consumers. There are currently several monitoring platforms to track Kafka performance, both open source, like LinkedIn’s Burrow, and paid, like Datadog. In addition to these platforms, collecting Kafka data can also be performed using tools commonly bundled with Java, including JConsole.
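
To make the pub/sub model concrete, here is a minimal sketch using the third-party kafka-python client; the broker address localhost:9092 and the topic name mlhype-events are assumptions made for the example, not part of Kafka itself.

from kafka import KafkaProducer, KafkaConsumer

# publish a message to a topic (assumes a broker is reachable at localhost:9092)
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('mlhype-events', b'hello kafka')
producer.flush()

# read messages back from the same topic, starting from the earliest offset
consumer = KafkaConsumer('mlhype-events',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)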

What is Hadoop ZooKeeper?

Hadoop ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space, and so on. Applications can leverage these services to coordinate distributed processing across large clusters. Name services, group services, synchronization services, configuration management, and more are available in ZooKeeper, which means that applications and other Hadoop projects can embed ZooKeeper without having to build synchronization services from scratch. Interaction with ZooKeeper occurs via Java or C interfaces. Within ZooKeeper, an application can create what is called a znode (a file that persists in memory on the ZooKeeper servers). The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode (in ZooKeeper parlance, a server can be set up to “watch” a specific znode). Using this znode infrastructure, applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.
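
To make the znode idea concrete, here is a minimal sketch using the third-party kazoo Python client; the host 127.0.0.1:2181 and the znode path /mlhype/status are assumptions made for the example.

from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# create a znode holding a small piece of shared state
zk.ensure_path('/mlhype')
zk.create('/mlhype/status', b'starting')

# register a watch: the callback fires whenever the znode's data changes
@zk.DataWatch('/mlhype/status')
def on_change(data, stat):
    print("status is now:", data)

# any node in the cluster could update the znode; all watchers get notified
zk.set('/mlhype/status', b'running')

zk.stop()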

What is Hadoop HBase?

Hadoop HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a primary key, and all access attempts to HBase tables must use this primary key. HBase allows many attributes to be grouped together into what are known as column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. HBase is very flexible and therefore able to adapt to changing application requirements. HBase is built on concepts similar to those of MapReduce and HDFS (NameNode and slave nodes): in HBase, a master node manages the cluster, and region servers store portions of the tables and perform the work on the data. In the same way that HDFS has availability concerns due to its dependence on the NameNode, HBase is also sensitive to the loss of its master node.
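
A minimal sketch of the row-key and column-family model, using the third-party happybase Python client (which talks to HBase through its Thrift gateway); the table name users and the column family cf are assumptions made for the example.

import happybase

# connect to the HBase Thrift server (assumed to be running on localhost)
connection = happybase.Connection('localhost')
table = connection.table('users')

# every access goes through the row key; columns live inside column families
table.put(b'row-001', {b'cf:name': b'Ada', b'cf:country': b'PL'})

row = table.row(b'row-001')
print(row[b'cf:name'])

connection.close()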

What is Hadoop Sqoop?

Hadoop Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB. To integrate bulk data movement between Hadoop and structured datastores, Sqoop provides import of sequential datasets from a mainframe, parallel data transfer, fast data copies, efficient data analysis, and load balancing.