Analyze data with Python in 5 minutes using Pandas

Ready to learn how to analyze data with Python in a few minutes, without knowing much about the Python language? With the pandas module for Python you can import 130,000 rows in a few seconds. And with fewer than 10 commands you can explore the number of records and columns, and get the mean, maximum, minimum and a lot more for your dataset.

 pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Let’s see the code to import a CSV. It takes less than a minute.

import pandas as pd #Import module pandas

#Use pandas to load the CSV

Location = r'C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv'

df = pd.read_csv(Location) #Read CSV in location

Attention: replace ‘C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv’ with the path to the file on your own machine. If you are not familiar with this, check it out: 3 simple way to change your path in Anaconda/Python
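
If you do not have the file at hand, here is a minimal sketch of the loading step that reads a tiny in-memory CSV instead. The three sample rows are made up for illustration; only the column names come from the dataset. It also suggests where the "Unnamed: 0" column listed by df.dtypes comes from: it is a row index that was saved into the file, and index_col=0 (an optional extra, not used in this article) turns it back into the index.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of the wine reviews file.
# The first header cell is empty because it holds a saved row index.
sample_csv = io.StringIO(
    ",country,points,price\n"
    "0,Italy,87,\n"
    "1,Portugal,87,15.0\n"
    "2,US,90,65.0\n"
)

df_raw = pd.read_csv(sample_csv)   # the empty header becomes "Unnamed: 0"
sample_csv.seek(0)                 # rewind the buffer to read it again
df_indexed = pd.read_csv(sample_csv, index_col=0)  # use that column as index

print(df_raw.columns.tolist())
print(df_indexed.columns.tolist())
print(df_indexed.shape)            # (rows, columns)
```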

WHAT KIND OF VARIABLES DO WE HAVE IN THE DATASET?

Using Anaconda, analyzing data with Python and pandas is simple. Our df (DataFrame) now has around 130k records and 14 variables: df.shape returns (129971, 14)

Using the command:

df.dtypes

you can easily check what kind of data you have. Here almost all variables are objects (text), while the first column (Unnamed: 0) and points are integers (int64) and price is a floating-point number (float64, with decimals)

Out[4]:

Unnamed: 0                 int64

country                   object

description               object

designation               object

points                     int64

price                    float64

province                  object

region_1                  object

region_2                  object

taster_name               object

taster_twitter_handle     object

title                     object

variety                   object

winery                    object

dtype: object
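
Once you know the types, you may want to split the numeric columns from the text ones. The sketch below does that with select_dtypes on a made-up three-row frame (only the column names follow the dataset), so take it as an illustration rather than part of the original walkthrough.

```python
import pandas as pd

# Made-up rows; only the column names follow the wine dataset.
df = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [None, 15.0, 65.0],
})

# Columns holding numbers (int64, float64, ...)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Everything else (text columns in this example)
text_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```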


If you want to learn more about the most common statistical and math skills, discover our dedicated page.

LEARN MAIN STATISTICS WITH ONE COMMAND: DF.DESCRIBE()

Let’s focus only on the numerical variables. Points: a score from 0 to 100 given by the wine taster. Price needs no explanation

We can see the main statistics if we type:

df.describe()

[Figure: output of df.describe()]

We can see that the points variable is available for all records (129,971) and has a minimum of 80 and a maximum of 100, with an average of 88.45

The 25% row is 86, which means a quarter of the dataset (about 32,492 records) scores 86 points or lower. The 50% row (the median) is 88

Of course you can do the same in Excel, but you need to create several cells and write several formulas. So Python helps you save some time here
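
To make the reading of that table concrete, here is a small sketch on made-up numbers (five invented wines, not the real dataset). It shows that describe needs its parentheses, that count skips missing values, and that you can pick single statistics out of the result.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "points": [80, 85, 90, 95, 100],
    "price": [10.0, 20.0, None, 40.0, 30.0],
})

stats = df.describe()   # note the parentheses: describe is a method
print(stats)

median_points = stats.loc["50%", "points"]       # the median score
first_quartile = stats.loc["25%", "points"]      # 25% of rows are at or below this
priced_rows = stats.loc["count", "price"]        # count ignores missing prices
```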

UNDERSTAND IF YOU HAVE MISSING VALUES IN YOUR DATASET

Another powerful way to analyze data with Python is to check the quality of your dataset. Do you have missing values? How many? In which variables?

Here, too, you can see the value of using Python instead of Excel

Write these two lines of code and you will find how many missing values you have in each column of your dataset

null_columns=df.columns[df.isnull().any()]

df[null_columns].isnull().sum()
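
As a small extension of the two lines above, this sketch also reports the share of missing values per column, which is often easier to judge than the raw count. The four rows are invented for the example, and the report table is just one possible way to lay out the result.

```python
import pandas as pd

# Invented rows with a few gaps.
df = pd.DataFrame({
    "country": ["Italy", None, "US", "Spain"],
    "points": [87, 87, 90, 85],
    "price": [None, 15.0, None, 20.0],
})

missing_count = df.isnull().sum()          # missing values per column
missing_share = df.isnull().mean() * 100   # same, as a percentage of rows

report = pd.DataFrame({"missing": missing_count, "percent": missing_share})
report = report[report["missing"] > 0]     # keep only columns with gaps
print(report)
```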

 

WHAT IS THE AVERAGE SCORE AND PRICE IN EACH COUNTRY?

One of the most common tasks when you analyze data with Python is computing averages, often grouped by one of your variables. If you want to know the average score and price by country, you can use

 .groupby().operation()

where you put the variable to group by inside the parentheses, followed by the operation that you want to apply, for example .mean() or .sum()

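df.groupby('country')[['points', 'price']].mean() #select the numeric columns first: recent pandas versions raise an error when averaging text columns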
[Figure: average points and price by country, output of groupby]

So we discover that in our dataset the average price in Argentina is $24.5 with an average score of 86.7 points, cheaper than Austria, which averages 90 points but costs about $30

Of course you can group by multiple columns by passing a list, for example ['country', 'region_1']
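
Here is a short sketch of both ideas on invented rows: one grouping by a single column, and one grouping by two columns with a different operation per variable. The result names avg_points, max_price and wines are made up for this example, and province is used as the second key simply because it is one of the dataset columns.

```python
import pandas as pd

# Invented rows; column names follow the wine dataset.
df = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "US", "US"],
    "province": ["Tuscany", "Tuscany", "California", "California", "Oregon"],
    "points": [88, 92, 90, 86, 91],
    "price": [20.0, 40.0, 30.0, 10.0, 50.0],
})

# Average score and price for each country.
by_country = df.groupby("country")[["points", "price"]].mean()

# Two grouping keys and several operations at once.
by_region = df.groupby(["country", "province"]).agg(
    avg_points=("points", "mean"),
    max_price=("price", "max"),
    wines=("points", "size"),   # number of rows in each group
)

print(by_country)
print(by_region)
```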

FILTER ONLY DATA WITH PRICE > $90

Maybe you want to quickly know how many records in your dataset meet a particular threshold. In this case we would like to know how many records have a price above $90. The result is more than 4,000 records

df[df.price>90]
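
The filter returns the matching rows, so to get the number of records you still have to count them. A minimal sketch on invented rows, which also shows how to combine two conditions with & (each condition needs its own parentheses):

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "country": ["Italy", "France", "US", "France"],
    "points": [88, 96, 90, 93],
    "price": [20.0, 150.0, 95.0, 60.0],
})

expensive = df[df["price"] > 90]                  # rows above the threshold
n_expensive = len(expensive)                      # count the filtered rows
n_expensive_alt = int((df["price"] > 90).sum())   # same count, without filtering

# Combine conditions with & and wrap each one in parentheses.
french_and_expensive = df[(df["price"] > 90) & (df["country"] == "France")]

print(n_expensive, n_expensive_alt)
print(french_and_expensive)
```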

HOW TO EXPORT TO CSV OR EXCEL WHEN YOU ANALYZE DATA WITH PYTHON?

Easy, just write

df.to_csv('first.csv') #creating a csv file called first

df.to_excel('first.xlsx', sheet_name='Sheet1') #creating a xlsx file called first, in sheet 1
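
A quick way to check an export is to read the file back. The sketch below does this in a temporary folder with invented data. The index=False argument is an optional extra not used above: without it pandas also writes the row index as a first unnamed column, which shows up as "Unnamed: 0" when you read the file again. Note that to_excel additionally needs an Excel engine such as openpyxl to be installed, so only the CSV part is shown here.

```python
import os
import tempfile
import pandas as pd

# Invented data for the round trip.
df = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 90]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "first.csv")
    df.to_csv(csv_path, index=False)   # do not write the row index
    df_back = pd.read_csv(csv_path)    # read the file back to check it

print(df_back)
```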

OTHER USEFUL COMMANDS TO ANALYZE DATA WITH PYTHON

df.head() = show the first rows of your dataset

df.columns = show your column names

df.tail(3) = show the last 3 records of your dataset

df.index = show the index range of your dataset
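
A tiny sketch of these commands on an invented one-column frame. Note the difference: head and tail are methods and need parentheses, while columns and index are attributes and do not.

```python
import pandas as pd

# Invented scores for illustration.
df = pd.DataFrame({"points": [80, 85, 90, 95, 100]})

first_rows = df.head(2)               # first 2 rows (default is 5)
last_rows = df.tail(3)                # last 3 rows
column_names = df.columns.tolist()    # attribute, no parentheses
row_range = df.index                  # attribute, no parentheses

print(first_rows)
print(last_rows)
print(column_names, row_range)
```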

Stay tuned: next time we will describe how to use wordcloud to describe string variables. Subscribe to our newsletter for more news

Don’t forget to leave your comment and to share the article if you like it. 

5 things you can do better with data mining

Today we will discover more about data mining. If you are not familiar with this concept, it is worth understanding what is behind it. We are talking about powerful tools and techniques that will help you get insight from your big data. 

A simple definition is: 

Data mining is the set of techniques and methodologies used to collect information from different sources and process it automatically through algorithms and logical patterns

How can data mining help you collect data?

Data is growing fast, not only on social media and in open databases, but everywhere. 

With data mining techniques like data scraping (taking data from the internet, such as ecommerce prices, weather data, stock exchange…), you can increase the number of data sources that you can use for your analysis.

Did you know that you can also get data from images? Discover more here: 

https://www.datacamp.com/community/tutorials/datasets-for-images

In a few minutes and with very few lines of code you can learn how to scrape data from the web using Python and R
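
To give an idea of what scraping looks like, here is a minimal sketch that pulls product names and prices out of a piece of HTML using only the Python standard library. The page fragment, the class names and the PriceParser helper are all hypothetical; a real scraper would first download the page and would need to match the structure of that specific site. In practice many people use dedicated libraries for this instead.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
PAGE = """
<ul>
  <li class="product"><span class="name">Chianti</span> <span class="price">19.90</span></li>
  <li class="product"><span class="name">Malbec</span> <span class="price">24.50</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collect (name, price) pairs from <span class="name"> / <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None   # which span we are inside, if any
        self.current_name = None    # last product name seen
        self.rows = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field == "name":
            self.current_name = data.strip()
        elif self.current_field == "price":
            self.rows.append((self.current_name, float(data)))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None

parser = PriceParser()
parser.feed(PAGE)
print(parser.rows)
```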

How to group your data: clustering analysis

Imagine a big database with many customers. It often happens that you have a lot of different groups of customers. Clustering analysis can identify customers with affinities that you can address as a single target group, for example because they are similar in order size, purchase needs or purchase attitude.

[Figure: clustering of income vs education. Example of clusters from www.dummies.com]

This will help you or your firm to set pricing, product and general marketing strategies that are more focused on that particular target.

Using Python or R will help you to identify clusters (see below an example of 3 clusters)

Other examples can be found here:

Cluster Analysis by JMP
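
As a rough sketch of how clusters can be found in Python, the code below runs a plain k-means on nine invented customers described by order size and orders per year. The data, the k_means helper and the choice of starting points are assumptions made for this example; k-means is only one of several clustering methods, and libraries such as scikit-learn offer ready-made versions.

```python
import numpy as np

# Hypothetical customers: (average order size, orders per year).
customers = np.array([
    [10.0, 2.0], [12.0, 3.0], [11.0, 2.5],        # small, infrequent buyers
    [50.0, 10.0], [52.0, 11.0], [49.0, 9.0],      # mid-size regular buyers
    [100.0, 30.0], [98.0, 28.0], [102.0, 31.0],   # large, frequent buyers
])

def k_means(points, centroids, n_iter=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids.

    Simplification: assumes no cluster ever ends up empty.
    """
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)   # nearest centroid for each point
        # Move each centroid to the mean of the points assigned to it.
        centroids = np.array(
            [points[labels == k].mean(axis=0) for k in range(len(centroids))]
        )
    return labels, centroids

# Assumption: start from three of the data points, one simple way to initialise.
start = customers[[0, 3, 6]]
labels, centroids = k_means(customers, start)

print(labels)      # cluster number assigned to each customer
print(centroids)   # the "typical" customer of each cluster
```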

 

Regression analysis: identify future output based on historical data

Consider a dataset with the ice cream sales of the last three years and one with temperature information. With regression you can fit a model to estimate how much ice cream you can sell based on the expected temperature
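
The ice cream example can be sketched in a few lines with a straight-line fit. The temperatures and sales below are invented and deliberately lie on a perfect line so the result is easy to check; real sales data would be noisier, and predict_sales is just a helper name chosen for this example.

```python
import numpy as np

# Hypothetical history: daily temperature (°C) and ice cream units sold.
temperature = np.array([18.0, 22.0, 25.0, 28.0, 31.0, 35.0])
sales = np.array([120.0, 160.0, 190.0, 220.0, 250.0, 290.0])

# Fit a straight line: sales = slope * temperature + intercept
slope, intercept = np.polyfit(temperature, sales, deg=1)

def predict_sales(expected_temperature):
    """Estimate units sold for a given expected temperature."""
    return slope * expected_temperature + intercept

print(slope, intercept)
print(predict_sales(30.0))
```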

An interesting article that explains regression further, especially in marketing 

Anomaly detection

How many times have you seen a dataset with errors like typos or duplicated info? Through specific tools and machine learning you can identify and prevent this kind of error by analyzing historical data and suggesting the correct value.

How much time can you save with a more robust and clean dataset? Data scientists are often said to spend 70% to 90% of their time cleaning data 
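
Before reaching for machine learning, two simple checks already catch many of these errors. The sketch below finds duplicated rows and flags suspicious prices with the interquartile range rule, a common rule of thumb (values more than 1.5 times the IQR outside the quartiles). The rows are invented, with one duplicate and one price that looks like a misplaced decimal point.

```python
import pandas as pd

# Invented rows: "Wine B" is duplicated and "Wine D" has a suspicious price.
df = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine B", "Wine C", "Wine D", "Wine E"],
    "price": [20.0, 25.0, 25.0, 22.0, 2400.0, 18.0],
})

duplicates = df[df.duplicated()]   # rows identical to an earlier row
clean = df.drop_duplicates()       # keep the first copy of each row

# Interquartile range rule of thumb for outliers.
q1 = clean["price"].quantile(0.25)
q3 = clean["price"].quantile(0.75)
iqr = q3 - q1
is_outlier = (clean["price"] < q1 - 1.5 * iqr) | (clean["price"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]

print(duplicates)
print(outliers)
```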

Classification analysis: a powerful data mining technique

This is a field where machine learning algorithms and chatbots are growing. In the future they could try to answer most of our questions, for example about product features, by classifying each question based on common patterns. 

It could also be interesting to identify common words in books and other texts, maybe through Wordcloud.

Sign up to our newsletter to find out soon how to analyze any text through wordcloud with Python and discover more info on data mining tools and techniques
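
Counting common words does not need any special tool to get started. This is a minimal sketch with the standard library on three invented review sentences; the short STOP_WORDS list is an assumption for the example, and a real analysis would use a much longer one.

```python
import re
from collections import Counter

# Invented review texts for illustration.
reviews = [
    "Ripe fruit and soft tannins with a long finish",
    "Fresh fruit aromas, crisp acidity and a clean finish",
    "Dark fruit, firm tannins and spice",
]

# Very short, hypothetical list of words to ignore.
STOP_WORDS = {"a", "and", "with", "the", "of"}

words = []
for review in reviews:
    # Lowercase the text and keep only runs of letters (and apostrophes).
    for word in re.findall(r"[a-z']+", review.lower()):
        if word not in STOP_WORDS:
            words.append(word)

word_counts = Counter(words)
top_words = word_counts.most_common(3)   # the 3 most frequent words
print(top_words)
```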

Our mission: Why another website on datascience?

I started this website with a clear mission: to help everyone who manages data, such as Data Analysts, Financial Controllers, IT/IS people, Business Partners and anyone else in the business who would like to base their decisions on robust data and analysis.

After many years as Finance Controller, Business Partner and Pricing Manager, I have seen that people tend to maintain the status quo, so change is difficult. What I have discovered during my experience is that change is easier if you have robust data behind you that helps you understand the reasons to change and the benefits of new solutions.

Following Fromzerotodatascience.com you will:

–  Save time in your analysis using tips and tricks and the most powerful tools (e.g. Excel, Python, R)

–  Enhance data quality (quickly understand dataset, manage missing values, …)
