Free Data Science Courses available from Coursera.

Below is a list of data science-related courses from Coursera. I hope you will find it useful:

Introduction to Data Science in Python – University of Michigan

The Data Scientist’s Toolbox – Johns Hopkins University

R Programming – Johns Hopkins University

Getting and Cleaning Data – Johns Hopkins University

Exploratory Data Analysis – Johns Hopkins University

Reproducible Research – Johns Hopkins University

Statistical Inference – Johns Hopkins University

Regression Models – Johns Hopkins University

Practical Machine Learning – Johns Hopkins University

Developing Data Products – Johns Hopkins University

Introduction to Genomic Technologies – Johns Hopkins University

Genomic Data Science with Galaxy – Johns Hopkins University

Python for Genomic Data Science – Johns Hopkins University

Command Line Tools for Genomic Data Science – Johns Hopkins University

Algorithms for DNA Sequencing – Johns Hopkins University

Bioconductor for Genomic Data Science – Johns Hopkins University

Statistics for Genomic Data Science – Johns Hopkins University

Machine Learning Foundations: A Case Study Approach – University of Washington

Regression – University of Washington

Classification – University of Washington

Clustering & Retrieval – University of Washington

Communicating Data Science Results – University of Washington

Practical Predictive Analytics: Models and Methods – University of Washington

Data Manipulation at Scale: Systems and Algorithms – University of Washington

Introduction to Probability and Data – Duke University

Inferential Statistics – Duke University

Linear Regression and Modeling – Duke University

Bayesian Statistics – Duke University

Introduction to Big Data – University of California

Big Data Modeling and Management Systems – University of California

Big Data Integration and Processing – University of California

Machine Learning with Big Data – University of California

Graph Analytics for Big Data – University of California

Genomic Data Science and Clustering – University of California

Data Visualization – University of Illinois at Urbana-Champaign

Text Retrieval and Search Engines – University of Illinois at Urbana-Champaign

Text Mining and Analytics – University of Illinois at Urbana-Champaign

Pattern Discovery in Data Mining – University of Illinois at Urbana-Champaign

Cluster Analysis in Data Mining – University of Illinois at Urbana-Champaign

Data Management and Visualization – Wesleyan University

Data Analysis Tools – Wesleyan University

Regression Modeling in Practice – Wesleyan University

Machine Learning for Data Analysis – Wesleyan University

Introduction to Recommender Systems: Non-Personalized and Content-Based – University of Minnesota

Nearest Neighbor Collaborative Filtering – University of Minnesota

Recommender Systems: Evaluation and Metrics – University of Minnesota

Matrix Factorization and Advanced Techniques – University of Minnesota

Process Mining: Data science in Action – Eindhoven University of Technology

 

Machine learning competitions.

In this post, I want to show how simple it is to start competing in a machine learning tournament, Numerai. I will go step by step, line by line, explaining what each line does and why it is needed.

Numerai is a global artificial intelligence tournament to predict the stock market. It is a little like Kaggle, but with clean datasets, so we can skip the long data cleansing process. You just download the data, build a model, and upload your predictions; that’s it. To get the most out of the data you would normally start with some feature engineering, but to keep this intro simple we will skip that. We will also skip splitting out a validation set: the main aim of this exercise is to fit a machine learning model to the training dataset and then use the fitted model to generate predictions. Altogether it shouldn’t take more than 14 simple lines of Python code, which you can run as one script or piece by piece in interactive mode.

Let’s go, let’s do some machine learning…

The first thing to do is go to numer.ai, click on ‘Download Training Data’ and download the datasets. After unzipping the archive you will have a few files; we are mainly interested in three of them. Note the path to the folder, as we will need it later.

I assume you have Python and the required libraries installed. If not, there are plenty of online tutorials on how to do it; I recommend installing the Anaconda distribution. It is time to open whatever IDE you use and start coding. The first few lines just import what we will use later, that is pandas and scikit-learn.

import pandas as pd 
from sklearn.ensemble import GradientBoostingClassifier

Pandas is used to import data from csv files and do some basic data manipulation. GradientBoostingClassifier, part of scikit-learn, is the model we will fit and use to predict. Now that the required libraries are imported, let’s use them. In the next three lines we will load the data from csv into memory using the ‘read_csv’ function from pandas; all you need to do is change the full path of each file to wherever you extracted numerai_datasets.zip.

train = pd.read_csv("/home/m/Numerai/numerai_datasets/numerai_training_data.csv")
test = pd.read_csv("/home/m/Numerai/numerai_datasets/numerai_tournament_data.csv")
sub = pd.read_csv("/home/m/Numerai/numerai_datasets/example_predictions.csv")

The code above creates three data frames from the csv files we previously extracted from the downloaded numerai_datasets.zip:

‘train’ – this dataset contains everything required to train our model. It has both ‘features’ and ‘labels’, so you can say it has both the questions and the answers that our model will ‘learn’ from.

‘test’ – this one contains features but no ‘labels’; you can say it contains the questions, and our model will deliver the answers.

‘sub’ – this is just a template for uploading our predictions.
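If you would rather type the folder path only once, the loading step can be sketched roughly like this. The helper name `load_numerai` and the tiny made-up files (including the `feature1` and `feature2` column names) are my own placeholders so the snippet runs without the real download; only the file names and the `t_id`, `target` and `probability` columns come from this post.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_numerai(data_dir):
    """Read the three csv files from one folder and return them as data frames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "numerai_training_data.csv")
    test = pd.read_csv(data_dir / "numerai_tournament_data.csv")
    sub = pd.read_csv(data_dir / "example_predictions.csv")
    return train, test, sub


# Tiny made-up files (placeholder feature names) standing in for the real download.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "numerai_training_data.csv").write_text(
    "feature1,feature2,target\n0.1,0.9,1\n0.7,0.2,0\n"
)
(demo_dir / "numerai_tournament_data.csv").write_text(
    "t_id,feature1,feature2\n11,0.3,0.4\n12,0.8,0.5\n13,0.6,0.1\n"
)
(demo_dir / "example_predictions.csv").write_text(
    "t_id,probability\n11,0.5\n12,0.5\n13,0.5\n"
)

train, test, sub = load_numerai(demo_dir)

# A quick look at what each frame holds: row/column counts and column names.
for name, frame in [("train", train), ("test", test), ("sub", sub)]:
    print(name, frame.shape, list(frame.columns))
```

With the real data you would pass the folder where you unzipped the archive instead of `demo_dir`.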

Let’s move on. In the next line we copy all the unique row ids from ‘test’ to ‘sub’ to make sure each predicted value is assigned to the right set of features. It is like putting the question number next to each answer, so that whoever checks the test knows which answer belongs to which question.

sub["t_id"]=test["t_id"]

As we have copied the ids to ‘sub’, we don’t need them in ‘test’ anymore (all rows will stay in the same order), so we can get rid of them.

test.drop("t_id", axis=1,inplace=True)

In the next two lines, we will separate the ‘labels’, or target values, from the ‘train’ dataset.

labels=train["target"]

train.drop("target", axis=1,inplace=True)
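The lines above change ‘train’ and ‘test’ in place. If you prefer to keep the original frames untouched, for example to rerun a step in interactive mode, a variant along these lines should work. The miniature frames, the feature names and the names `X_train`, `X_test` and `ids` are made up for illustration.

```python
import pandas as pd

# Made-up miniature versions of the real frames (feature names are placeholders).
train = pd.DataFrame(
    {"feature1": [0.1, 0.7, 0.4], "feature2": [0.9, 0.2, 0.5], "target": [1, 0, 1]}
)
test = pd.DataFrame({"t_id": [11, 12], "feature1": [0.3, 0.8], "feature2": [0.4, 0.5]})

# drop() without inplace=True returns a new frame and leaves the original alone.
ids = test["t_id"]
X_test = test.drop(columns="t_id")
labels = train["target"]
X_train = train.drop(columns="target")

# The model must see the same feature columns, in the same order, in both frames.
print(list(X_train.columns) == list(X_test.columns))
```

The trade-off is memory: each non-inplace `drop` makes a copy, which may matter with a large training file.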

Now that the ‘train’ dataset is prepared, we can get our model to learn from it. First we select the model we want to use. It will be GradientBoostingClassifier from scikit-learn; there is no specific reason for using this one, and you can use whatever classifier you like, e.g. random forest or logistic regression.

grd = GradientBoostingClassifier()

Now that we have a model defined, let’s have it learn from the ‘train’ data.

grd.fit(train,labels)
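Swapping the model is easy because scikit-learn classifiers share the same `fit` and `predict_proba` methods. Here is a small sketch on synthetic data from `make_classification` (not the Numerai data); the settings such as `n_estimators=50` and `max_iter=1000` are arbitrary choices of mine, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for the 'train' features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

shapes = {}
for name, model in models.items():
    model.fit(X, y)                      # same call for every classifier
    proba = model.predict_proba(X[:3])   # same call for every classifier
    shapes[name] = proba.shape
    print(name, proba.shape)
```

In the walkthrough, that means only the line that creates `grd` would need to change.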

OK, now our model is trained and ready to make predictions. As the task is a ‘classification’, we will predict the probability that each set of features belongs to each of the two classes, ‘0’ or ‘1’.

y_pred = grd.predict_proba(test)

We now have a long array of predicted probabilities called ‘y_pred’, with one column per class. Let’s take the second column, the probability of class ‘1’, and attach it to the ids we separated previously.

sub["probability"]=y_pred[:,1]
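If the `[:,1]` part looks mysterious, this small sketch on synthetic data (again not the Numerai files) shows what `predict_proba` returns: one row per sample and one column per class, in the order given by the fitted model’s `classes_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic two-class data with labels 0 and 1.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
grd = GradientBoostingClassifier(random_state=0).fit(X, y)

y_pred = grd.predict_proba(X[:4])
print(grd.classes_)         # the class each column of y_pred refers to
print(y_pred.shape)         # rows are samples, columns are classes
print(y_pred.sum(axis=1))   # the probabilities in each row add up to 1

# Look up the column for class 1 explicitly instead of hard-coding the index.
col = list(grd.classes_).index(1)
positive = y_pred[:, col]
```

With labels ‘0’ and ‘1’ the lookup gives the same column as `y_pred[:,1]`; spelling it out just makes the intent visible.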

And save it in csv format, ready to be uploaded.

sub.to_csv("/home/m/Numerai/numerai_datasets/SimplePrediction.csv", index=False)
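Before uploading, it can be worth checking the frame you are about to save. The helper `check_submission` below is hypothetical, and the rules it enforces (exactly the `t_id` and `probability` columns, no duplicate ids, probabilities between 0 and 1) are my assumptions based on the columns used in this post, not an official specification of the upload format.

```python
import pandas as pd


def check_submission(sub, expected_rows):
    """Raise ValueError if the submission frame looks malformed (assumed rules)."""
    if list(sub.columns) != ["t_id", "probability"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    if len(sub) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(sub)}")
    if sub["t_id"].duplicated().any():
        raise ValueError("duplicate ids found")
    prob = sub["probability"]
    if prob.isna().any() or not prob.between(0, 1).all():
        raise ValueError("probabilities must be numbers between 0 and 1")
    return True


# Made-up example frame; in the walkthrough you would pass the real 'sub'
# and the number of rows in 'test'.
sub = pd.DataFrame({"t_id": [11, 12, 13], "probability": [0.42, 0.57, 0.5]})
print(check_submission(sub, expected_rows=3))
```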

The last thing to do is to go back to the numer.ai website and click on ‘Upload Predictions’. Good luck.

This was a very simple, introductory example to get you started with numer.ai competitions and machine learning. I will try to come back with gradually more complicated versions. If you have any questions, suggestions or comments, please go to the ‘About’ section and contact me directly.

The full code below:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
train = pd.read_csv("C:/Users/Downloads/numerai_datasets/numerai_training_data.csv")
test = pd.read_csv("C:/Users/Downloads/numerai_datasets/numerai_tournament_data.csv")
sub = pd.read_csv("C:/Users/Downloads/numerai_datasets/example_predictions.csv")
sub["t_id"] = test["t_id"]
test.drop("t_id", axis=1, inplace=True)
labels = train["target"]
train.drop("target", axis=1, inplace=True)
grd = GradientBoostingClassifier()
grd.fit(train, labels)
y_pred = grd.predict_proba(test)
sub["probability"] = y_pred[:, 1]
sub.to_csv("C:/Users/Downloads/numerai_datasets/SimplePrediction.csv", index=False)
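This walkthrough skipped splitting out a validation set. As a rough sketch of what that step could look like, here is a version on synthetic data using `train_test_split` and log loss. The 80/20 split and the choice of log loss as the metric are my assumptions for illustration; check the tournament rules for how predictions are actually scored.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for 'train' and 'labels'.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold back 20% of the rows; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

grd = GradientBoostingClassifier(random_state=0)
grd.fit(X_fit, y_fit)

# Score the predicted probability of class 1 on the held-back rows.
val_pred = grd.predict_proba(X_val)[:, 1]
score = log_loss(y_val, val_pred)
print(f"validation log loss: {score:.4f}")
```

A lower log loss is better, and scoring on held-back rows gives a more honest estimate than scoring on the rows the model was fitted to.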

Data Science definition.

There is much debate about it, but a short definition of data science is:

“Data science is an interdisciplinary field of using scientific methods to get information from data in various forms.”

Data science uses methods from statistics, computer science and mathematics to interpret data for business decisions. The amount of data available in modern society grows as technology plays a larger part in people’s lives. These massive sets of structured and unstructured data help reveal patterns and trends for business opportunities or academic research. An increasing number of traditional fields of science now carry the adjective ‘computational’ or ‘quantitative’. In industry, data science is transforming everything from healthcare to media, and this trend is likely to pick up in the future.