Ready to learn how to analyze data with Python in a few minutes, without knowing too much about the Python language? In this brief Python tutorial, you will learn how easy it is to import 130,000 rows in a few seconds with the Pandas module for Python. With fewer than 10 lines of code you can check the number of records and the column names, and start getting to know your data with some exploratory analysis such as mean, maximum and minimum.
But let’s go step by step. What is Pandas?
pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Let’s see the code to import a CSV. It takes less than a minute:
    # Import the pandas module
    import pandas as pd

    # Use pandas to load the CSV
    Location = r'C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv'
    df = pd.read_csv(Location)  # read the CSV at Location
Attention: replace 'C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv' with the path to your own copy of the file. If you are not familiar with this, check out: 3 simple way to change your path in Anaconda/Python
What you will learn by the end of this Python tutorial on analyzing data in 5 minutes using Pandas:
- Understand the variables of your dataset with df.dtypes
- Get the main statistics with df.describe
- Find missing values
- Group info with .groupby, e.g. by country
- Filter/subset your data
- Export your data to Excel
- Learn other useful commands like df.head/tail
WHAT KIND OF VARIABLES DO WE HAVE IN THE DATASET?
In the first part of this Python tutorial, we will use Anaconda and Pandas to start exploring our data. We can see that our df (DataFrame) now has around 130k records and 14 variables: its shape is (129971, 14).
With df.dtypes you can easily check what kind of data you have. Here almost all variables are of type object (text), while the first variable (Unnamed: 0) and points are integers (int64) and price is a floating-point number (float64, with decimals).
Out:
    Unnamed: 0                 int64
    country                   object
    description               object
    designation               object
    points                     int64
    price                    float64
    province                  object
    region_1                  object
    region_2                  object
    taster_name               object
    taster_twitter_handle     object
    title                     object
    variety                   object
    winery                    object
    dtype: object
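The two checks described in this section can be sketched as follows. This is a minimal example that assumes a tiny made-up DataFrame in place of the real wine file, so that it runs without the CSV; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Tiny stand-in for the wine reviews file: the rows are made up,
# only the column names match some of the real columns.
df = pd.DataFrame({
    "country": ["Italy", "France", "US", None],
    "points": [87, 90, 85, 88],
    "price": [20.0, 35.5, None, 15.0],
})

# df.shape is a (rows, columns) tuple; the article reports (129971, 14)
# for the full file.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# df.dtypes lists the data type of every column.
types = df.dtypes
print(types)
```

On the real file you would skip the hand-built DataFrame and call df.shape and df.dtypes directly on the df loaded with pd.read_csv.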
If you want to learn more about the most common statistical and math skills for a data scientist, see our dedicated page.
PYTHON TUTORIAL: LEARN MAIN STATISTICS WITH ONE COMMAND: DF.DESCRIBE
Let’s focus only on the numerical variables.
- Points: a score from 0 to 100 given by the wine taster.
- Price: no need to explain.
Here we introduce a great command, df.describe(), which reports the count, mean, standard deviation, min, max and quartiles of our data.
We can see that the variable points is available for all records (129,971) and has a minimum of 80 and a maximum of 100, with an average of 88.45.
The 25th percentile is 86 points, so the lowest-scoring quarter of the dataset (about 32,492 records) has 86 points or less, and the median (50%) is 88 points.
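A minimal sketch of df.describe(), again assuming a small made-up DataFrame instead of the real file, shows where each of these figures comes from. The numbers below belong to the toy data, not to the wine dataset.

```python
import pandas as pd

# Made-up numeric data; one price is missing on purpose.
df = pd.DataFrame({
    "points": [80, 86, 88, 91, 100],
    "price": [10.0, 20.0, 30.0, 40.0, None],
})

# describe() summarises every numeric column: count, mean, std,
# min, 25%, 50%, 75% and max. Missing values are excluded from count.
stats = df.describe()
print(stats)

# Individual figures can be read with .loc[row_label, column_name].
mean_points = stats.loc["mean", "points"]
median_points = stats.loc["50%", "points"]
print(mean_points, median_points)
```

Note that count for price is lower than count for points here, which is already a first hint that some prices are missing.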
Of course you can do the same in Excel, but you would need to create several cells and write several formulas. So now you have learned in this Python tutorial how to save some time when exploring data.
PYTHON TUTORIAL: UNDERSTAND IF YOU HAVE MISSING VALUES IN YOUR DATASET
Another powerful way to analyze data with Python is to check the quality of your dataset. Do you have missing values? How many? In which variables?
Let’s continue our Python tutorial, where you will soon see the value of using Python instead of Excel.
With two lines of code you can find how many missing values you have in your dataset for every column.
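The original two lines are not shown here, so the following is only one possible way to do it, assuming the standard pandas calls isnull() and sum(). The DataFrame is a made-up stand-in for the wine file.

```python
import pandas as pd

# Made-up rows with some gaps in country and price.
df = pd.DataFrame({
    "country": ["Italy", None, "US", "France"],
    "points": [87, 90, 85, 88],
    "price": [20.0, None, None, 15.0],
})

# Line 1: count the missing values in every column.
missing_per_column = df.isnull().sum()
# Line 2: keep only the columns that actually have missing values.
only_missing = missing_per_column[missing_per_column > 0]

print(missing_per_column)
print(only_missing)
```

isnull() marks every empty cell as True, and sum() counts the True values column by column.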
WHAT IS THE AVERAGE SCORE AND PRICE IN EACH COUNTRY?
One of the most common tasks when you analyze data with Python is computing averages, often grouped by one of your variables. If you want to know the average score and price by country, you can use .groupby:
in the parentheses you put the variable to group by, followed by the operation you want to perform, for example .mean() or .sum().
So we discover that in our dataset the average price in Argentina is $24.5 with an average score of 86.7 points, cheaper than Austria, which averages 90 points but costs about $30.
Of course you can also group by multiple columns, for example country and region_1.
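A minimal sketch of both groupings, assuming a small made-up DataFrame (the countries are real, the rows and region labels are invented). Selecting the points and price columns before calling mean() is a choice made here to keep text columns out of the calculation.

```python
import pandas as pd

# Made-up rows; region_1 values are placeholder labels.
df = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Austria", "Austria"],
    "region_1": ["North", "North", "North", "South"],
    "points": [86, 88, 90, 92],
    "price": [20.0, 30.0, 25.0, 35.0],
})

# Average score and price for each country.
avg_by_country = df.groupby("country")[["points", "price"]].mean()
print(avg_by_country)

# Grouping by several columns: pass a list of column names.
avg_by_country_region = (
    df.groupby(["country", "region_1"])[["points", "price"]].mean()
)
print(avg_by_country_region)
```

Replacing .mean() with .sum(), .max() or .count() gives the corresponding statistic for each group.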
FILTER OR SUBSET WINES WITH PRICE > $90
Maybe you would like to know how many records in your dataset meet a particular threshold. In this case we want to know how many wines have a price above $90. The result is more than 4,000 records!
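Filtering of this kind is usually done with a boolean condition inside square brackets. Here is a minimal sketch, assuming a made-up DataFrame; the count it prints refers to the toy data only.

```python
import pandas as pd

# Made-up prices, including one missing value and one exactly at 90.
df = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D", "Wine E"],
    "price": [15.0, 95.0, 120.0, None, 90.0],
})

# df["price"] > 90 is True/False for each row; rows with a missing
# price compare as False and are left out.
expensive = df[df["price"] > 90]

print(expensive)
print(len(expensive))  # number of wines above the threshold
```

Conditions can be combined with & (and) and | (or), each wrapped in its own parentheses.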
HOW TO EXPORT TO CSV OR EXCEL WHEN YOU ANALYZE DATA WITH PYTHON?
Easy, just write
    df.to_csv('first.csv')  # create a CSV file called first
    df.to_excel('first.xlsx', sheet_name='Sheet1')  # create an xlsx file called first, in sheet 1
OTHER USEFUL COMMANDS TO ANALYZE DATA WITH PYTHON
df.head() = show the first rows of your dataset
df.columns = show your column names
df.tail(3) = show the last 3 records of your dataset
df.index = show the range of your dataset
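A quick sketch of these four commands on a made-up six-row DataFrame, so the effect of each one is easy to see:

```python
import pandas as pd

# Made-up data with six rows.
df = pd.DataFrame({
    "country": ["Italy", "France", "US", "Spain", "Chile", "Austria"],
    "points": [87, 90, 85, 88, 86, 91],
})

first_rows = df.head()      # first 5 rows by default
last_three = df.tail(3)     # last 3 rows
column_names = list(df.columns)
row_range = df.index        # RangeIndex from 0 up to the number of rows

print(first_rows)
print(last_three)
print(column_names)
print(row_range)
```

Both head() and tail() accept the number of rows you want, so df.head(10) shows the first ten.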
Stay tuned: next time we will describe how to use a word cloud to describe string variables.