Simple Plot Graph with Python: Capital Letters Matter

Today we will see how to create simple plot graphs in Python using Seaborn. In a data science book by Sinan Ozdemir I found a simple graph that plots sales against advertising spend for different media: TV, radio and newspaper.

Let's see how it works. Where is it better to put the money? First of all, import the pandas and seaborn packages and load the dataset:

import pandas as pd
import seaborn as sns
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0) #import data in CSV format using Pandas
data.head() # let's see how the data are structured

      TV  radio  newspaper  sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9

Let's plot the data using some magic. With the seaborn package we can tell Python to plot three x_vars, taken from the first three columns of our dataset, against sales as the y_vars. Let's see the result:

%matplotlib inline
sns.pairplot(data, x_vars=['TV','radio','newspaper'], y_vars=['sales'], size=4.5, aspect=0.7) # note: newer seaborn versions renamed size= to height=
Plot graph of sales for different media

But imagine writing the same command with Radio and Newspaper capitalized. Python is case sensitive (at least the Anaconda version I'm using), so take care to use the correct variable names.

%matplotlib inline
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars=['sales'], size=4.5, aspect=0.7) # this fails: the columns are actually named 'radio' and 'newspaper', in lowercase

Where can you find the correct variable names? You can see them:

  • in the head of your table
  • in Anaconda, in the Variable Explorer, in the Value column
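You can also ask pandas directly. For the advertising data loaded above, a one-line check prints the exact column names:

print(data.columns.tolist()) # ['TV', 'radio', 'newspaper', 'sales']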

If you need further info on the magic function %matplotlib inline, you can have a look at this post on Stack Overflow.

Did you like this post on a simple plot graph? Share it with friends, colleagues or Martians using the buttons below, and do not forget to subscribe to our fun newsletter.

How to impress your boss: Infographics, dashboard, visualization tools

An image is worth more than a thousand data points, sorry, words. Sometimes you have a lot of data to present in a very short time. What is better than a good dashboard where you can see your KPIs at a glance, or a nice infographic that shows tons of info in a few seconds?

People don't have time, so having everything on one page is very useful for presenting, but also for good storytelling that helps your audience digest complex info and memorize the important messages.

63% of your audience could remember stories, but only 5% could remember a single statistic (Source: Stanford professor Chip Heath)

Create an infographic

You can easily create an infographic with Piktochart.

Infographics – Photo by rawpixel on Unsplash

Here is how it works:

  • Sign in to Piktochart
  • Choose a template
  • From the left menu, select the elements you want to change or adjust (graphics, background, text, colors, tools)
  • From the top bar you can save, share or download your work

Below you can find some examples 

A simple but powerful infographic can be found here:

How quitting smoking affects your body

Infographics on Pinterest

All you have to do now is think about which data you would like to present and how to build a good story that your audience will remember.

If you need to analyze your data quickly, consider reading: Analyze data with Python in 5 minutes using Pandas.

Build your dashboard in 5 minutes

A dashboard helps you understand immediately what is going well (for example with green numbers and up arrows) and where to investigate further, maybe with other self-service reports.

An example of an interactive dashboard – Photo by rawpixel on Unsplash

To create a powerful visualization you need to answer the following questions:

What do I want to explain with this dashboard? Maybe I want to show whether we have reached our sales target, which products contribute most to growth, or which products are behind plan.

Test how simple and easy to read your dashboard is: go to a colleague who is less familiar with technology and ask them to explain the content of your report. If they report the right message, you have created a good one. Otherwise, ask other people what is difficult to read or unclear, and simplify.

Create your dashboard: you have several tools to choose from:

  • Excel: the best info is at Chandoo.org, where you will discover how to create and manage your dashboard.
  • Python: more complicated, but you can define every aspect of your dashboard.
    • Plotly and Bokeh are the modules you can use here (a minimal sketch follows below).
    • An interesting example is this Bokeh dashboard of Kickstarter projects by category and status (successful, cancelled, …), which also shows the name of the project when you hover over it.

An example of a Kickstarter dashboard built in Bokeh

  • R: the best choice: here you can customize everything using Shiny and R Markdown, with less code than Python.
    • An interesting example is the CRAN download monitor, where on one page you can see the evolution of downloads for a given package name.

CRAN downloads – R dashboard example
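Going back to the Python option above, here is a minimal Bokeh sketch; the monthly sales figures are invented, just to show the idea, and it draws a single interactive chart rather than a full dashboard like the examples above.

from bokeh.plotting import figure, show

# invented sample data, for illustration only
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 142, 170, 185]

p = figure(title="Monthly sales", x_axis_label="month", y_axis_label="sales")
p.line(months, sales, line_width=2) # a simple line glyph

show(p) # opens the interactive chart in the browser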

If you want more, please consider subscribing to our fun newsletter and checking our recent posts.

 

3 fundamental statistics skills for data science

Today I would like to discuss the statistics skills you need to develop if you want to become a data scientist. Do you need a degree in statistics to do this job? The quick reply: some statistics is needed, but practice is more important.

Statistics skills: what are we talking about?

I really love this quote by Manish Tripathi:

Data Science without Statistics is like owning a Ferrari without brakes. You can enjoy sitting in Ferrari, show off your newly owned car to others, but you can’t enjoy the drive for long because you would crash land soon!

http://qr.ae/TUpAIF

A good data scientist needs to know how to:

- use statistics to explore and visualize data

- apply the most important statistical theories (like hypothesis testing and Bayesian analysis)

- know the most common statistical models and decide which is best to use (like linear regression or time series analysis)

- evaluate whether the model works for the purpose of your analysis.

But theory is not everything: the best way to learn these skills is through a practical approach. Don't expect to become a good data scientist only by reading books or learning theory.

Use statistics skills to explore data: 

Understand & summarize your data: 

If you are new to the world of data, datasets and graphs, you can start with this free course: Analyzing categorical data, provided by Khan Academy. Here you will learn how to identify individuals and variables, read different types of graphs, and much more. I suggest stopping at the first module if you are at a basic level.

Grouping & Visualization

This is a fundamental exercise to do with your dataset. Let's use a free dataset from Kaggle: the Wine Reviews dataset.

This dataset contains about 130k wine reviews, with wines coming from all over the world, scored by wine tasters from 0 to 100, plus a lot of info on the qualitative features of each wine.

We would like to understand, in a few minutes:

  • which variables do we have in the dataset?
  • what the data types are (numbers? if so, what kind of numbers? or maybe strings of text?)
  • how much data we have (number of rows and columns)
  • for some variables, what the minimum, average and maximum values are
  • what the median (midpoint of the dataset) and the mode (most frequent observation) are
  • how our data are distributed: for one variable, what is the value at the first 25% of the dataset, and at 50%?

To see these analyses in action, have a look at: Analyze data with Python in 5 minutes using Pandas.
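As a quick preview, here is a minimal sketch that answers the row/column count, median and mode questions on the wine dataset (the file name and path are assumed; adjust them to wherever you saved the Kaggle CSV):

import pandas as pd

df = pd.read_csv('winemag-data-130k-v2.csv', index_col=0) # assumed file name and location

print(df.shape)                  # (number of rows, number of columns)
print(df['points'].median())     # midpoint of the points distribution
print(df['points'].mode().iloc[0]) # most frequent points value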

Take advantage of simple statistical concepts

Let's briefly review some simple statistical concepts that will be covered in depth in separate posts.

  • Descriptive statistics: you are probably familiar with the mean, median, mode, range and quartiles. This info will help you understand what your dataset looks like.

Coming back to our wine dataset, with just one command you can obtain much of this information. In this case you will see that your dataset has around 130,000 records, with an average score of 88.45 points and an average reported price of $35.36.

Simple statistics skills: descriptive statistics with Python and pandas (df.describe())

The minimum is 80 for points and $4 for price; the maximum is 100 for points and $3,300 for price (wow!).

Percentiles: the 25% mark, also called the first quartile, means that observation 32,492 sits at 25% of your dataset (in ascending order). At this point the score is 86 points and the price is $17.

It is interesting to see that moving on to 50% of the dataset, points increase by only 2 (to 88) while the price rises much more, to about $25.
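To read these percentiles directly from the data, a one-line sketch (again assuming the wine reviews are loaded in df as above):

print(df[['points', 'price']].quantile([0.25, 0.5, 0.75])) # quartiles of the two numeric columns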

More details on percentiles can be found at StatisticsHowTo or in Statistics for Dummies.

  • Distributions: they describe how your data are likely to be spread out. The most famous is the normal distribution, also known as the “bell curve” (which shows up many times in nature). Another important distribution is the binomial, which models two outcomes, e.g. success or failure of a new drug. We will discuss distributions in a separate post.
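To get a feel for these two distributions, here is a minimal sketch using NumPy and Matplotlib; the sample sizes and parameters are arbitrary, chosen only for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

normal_sample = rng.normal(loc=88, scale=3, size=10_000)  # bell curve, e.g. wine points
binomial_sample = rng.binomial(n=1, p=0.3, size=10_000)   # two outcomes: success/failure

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(normal_sample, bins=50)
axes[0].set_title('Normal ("bell curve")')
axes[1].hist(binomial_sample, bins=3)
axes[1].set_title('Binomial (success / failure)')
plt.show()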

In the next posts we will also discuss hypothesis testing, regression models, time series analysis and other intermediate statistical concepts.

Stay connected and subscribe to our newsletter to learn more about how to become a great data scientist.

If you liked this post on fundamental statistics skills for data scientists, please share it through the social buttons. Let me know your thoughts by adding a comment!

 

Analyze data with Python in 5 minutes using Pandas

Ready to learn how to analyze data with Python in a few minutes, without knowing too much about the language? You can easily import 130,000 rows in a few seconds with the pandas module. And with fewer than 10 commands you can explore the number of records and columns, and start to learn the mean, max, minimum and a lot more about your dataset.

 pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Let's see the code to import a CSV. It takes less than a minute:

import pandas as pd # import the pandas module

# use pandas to load the CSV file
Location = r'C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv'
df = pd.read_csv(Location) # read the CSV at that location

Attention: replace 'C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv' with your own path. If you are not familiar with this, check out: 3 simple ways to change your path in Anaconda/Python.

WHAT KIND OF VARIABLES DO WE HAVE IN THE DATASET?

Using Anaconda, analyzing data with Python and pandas is simple. We can see that our df (DataFrame) has around 130k records and 14 variables: (129971, 14).
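That (rows, columns) pair comes straight from the DataFrame itself:

df.shape # (129971, 14): number of rows and columns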

Using command:

df.dtypes

you can easily check what kind of data you have. Here almost all variables are of type object (text), while points is an integer (int64) and price is a floating-point number (float64, i.e. with decimals).

Out[4]:
Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object


If you want to learn more about the most common statistical and math skills, discover our dedicated page.

LEARN THE MAIN STATISTICS WITH ONE COMMAND: DF.DESCRIBE()

Let's focus only on the numerical variables. Points: a number from 0 to 100, the wine taster's score. Price: no need to explain.

We can see this if we type:

df.describe()

Analyzing your data with Python and pandas: df.describe()

We can see that the points variable is available for all rows (129,971) and has a minimum of 80, a maximum of 100 and an average of 88.45.

The first quartile of the dataset (row 32,492 in ascending order) sits at 86 points; the median (50%) is at 88 points.

Of course you can do the same in Excel, but you would need to create several cells and write several formulas, so Python saves you some time here.

UNDERSTAND IF YOU HAVE MISSING VALUES IN YOUR DATASET

Another powerful way to analyze data with Python is to check the quality of your dataset. Do you have missing values? How many? In which variables?

Here too you can see the value of using Python instead of Excel.

Write these two lines of code and you will find how many missing values you have in each column:

null_columns = df.columns[df.isnull().any()] # columns with at least one missing value

df[null_columns].isnull().sum() # number of missing values in each of those columns

 

WHAT IS THE AVERAGE SCORE AND PRICE IN EACH COUNTRY?

One of the most common things to do when analyzing data with Python is to compute averages, maybe grouping by some of your variables. If you want to know the average score and price by country, you can use

.groupby().operation()

where inside the parentheses you put the variable to group by, followed by the operation you want to apply, for example .mean() or .sum():

df.groupby('country').mean() # on recent pandas versions you may need .mean(numeric_only=True)

Analyze data with Python: groupby with pandas

So we discover that in our dataset the average price in Argentina is $24.5 with an average score of 86.7 points, while Austria averages 90 points but you have to pay around $30.

Of course you can group by multiple columns (e.g. country and province), as shown below.
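A minimal sketch of grouping by two columns, using country and province, which we know exist from the dtypes list above:

df.groupby(['country', 'province'])[['points', 'price']].mean() # average points and price by country and province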

FILTER ONLY DATA WITH PRICE > $90

Maybe you want to know how many records in your dataset exceed a particular threshold. In this case we would like to know how many records have a price above $90: the result is more than 4,000 records.

df[df.price>90]
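If you only need the count of matching rows rather than the rows themselves, a short variant:

expensive = df[df.price > 90] # rows with price above $90
print(len(expensive))         # number of such records (just over 4,000 in this dataset)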

HOW TO EXPORT TO CSV OR EXCEL WHEN YOU ANALYZE DATA WITH PYTHON?

Easy, just write

df.to_csv('first.csv') # create a CSV file called first.csv

df.to_excel('first.xlsx', sheet_name='Sheet1') # create an Excel file called first.xlsx, with the data in Sheet1

OTHER USEFUL COMMANDS TO ANALYZE DATA WITH PYTHON

df.head() = show the first rows of your dataset

df.columns = show your column names

df.tail(3) = show the last 3 records of your dataset

df.index = show the index (range) of your dataset

Stay tuned: next time we will describe how to use a word cloud to explore string variables. Subscribe to our newsletter for more news.

Don’t forget to leave your comment and to share the article if you like it. 

5 things you can do better with data mining

Today we will discover more about data mining. If you are not familiar with the concept, it is worth understanding what is behind it: we are talking about powerful tools and techniques that will help you get insights from your big data.

A simple definition is: 

Data mining is the set of techniques and methodologies used to collect information from different sources and process it automatically through algorithms and logical patterns.

How can data mining help you collect data?

Data are growing fast, not only in social and open databases, but everywhere.

With data mining techniques like data scraping (taking data from the internet, such as e-commerce prices, weather data, stock exchange quotes, …), you can increase the number of data sources available for your analysis.

Did you know that you can also get data from images? Discover more here:

https://www.datacamp.com/community/tutorials/datasets-for-images

In a few minutes, with very few lines of code, you can learn how to scrape data from the web using Python or R.
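As an illustration of the idea, here is a minimal Python sketch using the requests and BeautifulSoup libraries; the URL and the tag being extracted are placeholders, not a real data source:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/products' # placeholder: replace with the page you actually want to scrape

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, 'html.parser')

# extract the text of every <h2> tag on the page (adjust the tag/class to your target site)
titles = [tag.get_text(strip=True) for tag in soup.find_all('h2')]
print(titles)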

How to group your data: clustering analysis

Imagine a big database with many customers. It often happens that you have many different groups of customers. Clustering analysis can easily identify customers with affinities that you can address as a single target group, maybe because they are similar in order size, purchase needs or purchase attitude.

Clustering income vs education – example of clusters from www.dummies.com

This will help you or your firm set different pricing, product and marketing strategies, more focused on each particular target.

Using Python or R will help you identify clusters (see below a sketch that builds 3 clusters).
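A minimal scikit-learn sketch of k-means clustering with 3 clusters; the two features (income, years of education) and the data are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

# invented customer data: [income in k$, years of education]
customers = np.array([
    [25, 10], [27, 11], [30, 12],    # lower income / education
    [55, 14], [60, 15], [58, 16],    # middle group
    [95, 18], [100, 19], [110, 20],  # higher income / education
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)          # cluster assigned to each customer
print(kmeans.cluster_centers_) # the 3 cluster centers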

Other examples can be found here:

Cluster Analysis by JMP

 

Regression analysis: estimate future output based on historical data

Consider a dataset with ice cream sales for the last three years and one with temperature information. With regression you can build a model to estimate how much ice cream you will sell based on the expected temperature; a minimal sketch follows below.
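Here is that sketch with scikit-learn; the temperatures and sales figures are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# invented historical data: daily temperature (°C) and ice cream sales (units)
temperature = np.array([[18], [21], [24], [27], [30], [33]])
sales = np.array([120, 150, 190, 240, 300, 360])

model = LinearRegression().fit(temperature, sales)

print(model.predict([[29]])) # estimated sales for an expected temperature of 29 °C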

Here is an interesting article that clarifies regression further, especially for marketing.

Anomaly detection

How many times have you seen a dataset with errors like typos or duplicated info? Through specific tools and machine learning you can identify and prevent this kind of error by analyzing historical data and suggesting the correct values. A small sketch is shown below.
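One simple way to flag suspicious records is scikit-learn's IsolationForest; the prices here are invented and the contamination rate is an arbitrary choice, so treat this only as a sketch:

import numpy as np
from sklearn.ensemble import IsolationForest

# invented prices: mostly normal values plus one obvious outlier (a typo?)
prices = np.array([[20], [22], [25], [19], [23], [2500]])

detector = IsolationForest(contamination=0.2, random_state=0).fit(prices)
print(detector.predict(prices)) # -1 marks the anomalous record, 1 the normal ones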

How much time can you save with a more robust and clean dataset? Data scientists usually spend 70% to 90% of their time cleaning data.

Classification analysis: a powerful data mining technique

In this field, machine learning algorithms and chatbots are growing fast; in the future they could answer most of our questions, maybe about product features, by classifying each question based on common patterns.

It can also be interesting to identify the common words in books or other text, maybe through a word cloud, as in the sketch below.
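A minimal sketch with the wordcloud package (install it first with pip install wordcloud; the text here is just a placeholder):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "data science data mining clustering regression data wordcloud python" # placeholder text

cloud = WordCloud(width=600, height=300, background_color='white').generate(text)

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()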

Sign up to our newsletter to learn soon how to analyze any text with a word cloud in Python and to discover more about data mining tools and techniques.

Our mission: Why another website on datascience?


I started this website with a clear mission: to help everyone who manages data, like data analysts, financial controllers, IT/IS people, business partners and anyone else in the business who would like to base their decisions on robust data and analysis.

After many years as a finance controller, business partner and pricing manager, I have seen that people tend to maintain the status quo. Change is difficult, but what I have discovered during my experience is that change is easier when you have robust data behind you, helping you understand the reasons to change and the benefits of new solutions.

Following Fromzerotodatascience.com you will:

–  save time in your analysis using tips, tricks and the most powerful tools (e.g. Excel, Python, R)

–  enhance data quality (quickly understand your dataset, manage missing values, …)


What does a good data scientist look like?

Data scientist

Data scientist: what an appealing title, but what kind of skills and which tools do you need to start? What is a good definition of a data scientist?

Let's discover some good resources on the web that will help us.

Bernard Marr, in his “9 steps to become a data scientist from scratch”, makes a good synthesis. I report here what I like most:

  • Math & statistical skills: hmm, not very appealing in some cases. I remember my statistics studies and how theoretical they looked. Probably I will soon discover that they can be used in a more practical way.
  • Learn tools: most of the activity will be cleaning your data to make good analyses. Remember: garbage in, garbage out.
  • Community: the life of a data scientist is not always easy, with many different tools and languages to use. Having peers and community places where you can ask for help and support is fundamental.
  • Practice: someone said that 90% of what you learn comes from learning by doing. I don't know if the percentage is realistic, but certainly what I have learned most comes from my trials, successes and, fortunately, my mistakes.


Installing Python with no admin rights

4 easy steps to install Python with no admin rights

Ready to install Python? Yes, but what if I would like to have it in a portable way (maybe on an external hard disk)? You will soon discover how to manage this.

In a few seconds you will learn how to install Python on any PC. But the first question is: which Python version do I need to download?

SoDS 2018 – Summer of data science 2018

Hi all,

temperatures are rising and the holidays are coming. Yes, we are in summer. Why not use this fantastic time to learn more about data science? Today I will introduce you to a good initiative called SoDS.

The first way to become a data scientist is to start doing. Why not start with the Summer of Data Science (SoDS)?

#SoDS 2018 was created by Renee (@BecomingDataSci) with the goal of learning something new about data science. So, now you are wondering how to participate.

It is very easy.

In Week 1, you just start thinking about what you want to learn. Make a short list of things, maybe covering different topics (have a look at my article “What does a good data scientist look like?”).

Maybe you are interested in learning more about math and statistical skills, or you want to know more about tools like R, SAS and Python, or maybe you want to find teammates for a data competition on Kaggle.

So are you ready for SoDS Week 1?

You have one week to search the web and decide what you want to learn.

Write in the comments what you would like to learn. It will be a helpful way to commit to really becoming a data scientist, and to have a lot of fun.

My journey to become a Data scientist

Hi all,

I'm Frank, and I would like to share with you my journey to becoming a data scientist. Discover with me what we need to learn to become a successful data scientist and what lies behind this world.

I'm not an IT guy, but I like helping people make decisions based on numbers.

How many business or personal decisions are taken with wrong data, or no data at all?

So follow me, and together we will learn tools and techniques to handle information, present it and make a difference in your personal and business environment.

What do I want to do?

I would like to make a simple synthesis of the resources available on the net and see how we can use them. Through this exercise we will learn to be more productive and efficient in analyzing data.

What should you expect?

Simple, beginner-friendly language, keeping technical and abstract concepts to a minimum, and lots of interesting analyses.

See you soon for my first topic!

Write in the comments any topic you would like me to focus on, and I will do my best to reply.