## 3 fundamental statistics skills for data science

Today I would like to discuss the statistics skills you need to develop if you want to become a data scientist. Do you need a degree in Statistics to do this job? The short answer: some statistics is needed, but practice is more important.

## Statistics skills: what are we talking about?

I really loved this quotation by Manish Tripathi:

Data Science without Statistics is like owning a Ferrari without brakes. You can enjoy sitting in Ferrari, show off your newly owned car to others, but you can’t enjoy the drive for long because you would crash land soon!

http://qr.ae/TUpAIF

A good data scientist needs to:

- use statistics skills to explore and visualize data

- know the most important statistical theories (like hypothesis testing and Bayesian analysis)

- know the most common statistical models and decide which is best to use (like linear regression or time series analysis)

- evaluate whether the model works for the purpose of the analysis.

But theory is not everything: the best way to learn statistics skills is through a practical approach. Don’t expect to become a good data scientist only by reading books or learning theory.

## Use statistics skills to explore data:

If you are new to the world of data, datasets and graphs, you can start with this free course: Analyzing categorical data, provided by Khan Academy. There you will learn how to identify individuals and variables, read different types of graphs, and much more. If you are at a basic level, I suggest stopping after the first module.

Grouping & Visualization

This is a fundamental exercise to do with your dataset. Let’s use a free dataset: the Wine Reviews dataset from Kaggle.

This dataset contains about 130k wine reviews, with wines coming from all over the world, scored by wine tasters on a scale from 0 to 100, plus a lot of information on the qualitative features of each wine.

In a few minutes, we would like to understand:

• Which variables do we have in the dataset?
• What are their data types (numbers, and if so what kind of numbers, or strings of text)?
• How much data do we have (number of rows and columns)?
• For some variables, what are the minimum, average and maximum values?
• What is the median (the midpoint of the data set) and the mode (the most frequent observation)?
• How is our data distributed? For a given variable, what value marks the first 25% of the dataset, and what value marks 50%?
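As a minimal sketch of how these questions map to pandas calls, the snippet below uses a tiny made-up DataFrame. The column names mirror the wine dataset, but the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical mini dataset standing in for the wine reviews (values are invented)
df = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US", "France"],
    "points": [87, 90, 85, 87, 92],
    "price": [15.0, 40.0, 12.0, 20.0, 65.0],
})

n_rows, n_cols = df.shape                  # how much data: rows and columns
column_types = df.dtypes                   # which variables, and their data types

points_min = df["points"].min()            # minimum
points_mean = df["points"].mean()          # average
points_max = df["points"].max()            # maximum
points_median = df["points"].median()      # midpoint of the sorted values
points_mode = df["points"].mode().iloc[0]  # most frequent value
points_q1 = df["points"].quantile(0.25)    # value marking the first 25%

print(n_rows, n_cols)
print(column_types)
print(points_min, points_mean, points_max, points_median, points_mode, points_q1)
```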

To see this analysis in action, have a look at: Python: Analyze your data in 5 minutes with Pandas

## Take advantage of simple statistical concepts

Let’s briefly go over some simple statistical concepts that will be covered in more depth in separate posts.

• Descriptive statistics: you are probably familiar with mean, median, mode, range and quartiles. These values help you understand what your dataset looks like.

Coming back to our wine dataset, with just one command you can get much of this information. In this case you will see that the dataset has around 130,000 records, with an average score (points, coming from the reviews) of 88.45 and an average reported price of \$35.36.

The minimum is 80 for points and \$4 for price, and the maximum is 100 for points and \$3,300 for price (wow!).

Percentiles: the 25% value, also called the first quartile, is the value below which 25% of the observations fall once they are sorted in ascending order (for points, roughly observation 32,492 out of 129,971). Here the first quartile is 86 for points and \$17 for price.

Interestingly, moving from the first quartile to the median (50%) adds only 2 points to the score (88), while the price goes from \$17 to \$25, an increase of about 47%.
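To make the quartile idea concrete, here is a small sketch on an invented list of prices, chosen so that the first quartile and the median come out at 17 and 25 like the figures quoted for the wine dataset. The series itself is hypothetical, not taken from the real data.

```python
import pandas as pd

# Hypothetical prices, already in ascending order (invented for illustration)
prices = pd.Series([10, 12, 17, 20, 25, 30, 45, 60, 120])

q1 = prices.quantile(0.25)      # first quartile: 25% of values fall below it
median = prices.quantile(0.50)  # second quartile, i.e. the median

# Relative increase when moving from the first quartile to the median
increase_pct = (median - q1) / q1 * 100

print(q1, median, round(increase_pct, 1))
```

Note that quantile() interpolates between neighbouring values when the requested position does not fall exactly on an observation.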

More details on percentiles can be found at Statistics How To or in Statistics for Dummies.

• Distributions: they describe how likely the different values of your data are. The most famous is the normal distribution, also known as the “bell curve”, which appears often in nature. Another important distribution is the binomial, which counts outcomes that have only two states, e.g. the success or failure of a new drug across a number of trials. We will discuss distributions in a separate post.
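A quick way to get a feel for these two distributions is to simulate them. The sketch below uses NumPy's random generator; the parameters (a mean of 88, a standard deviation of 3, 20 patients with a 30% success rate) are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable

# Normal ("bell curve"): 10,000 simulated scores centred on 88 with spread 3
scores = rng.normal(loc=88, scale=3, size=10_000)

# Binomial: number of successes among 20 patients, each with a 30% chance,
# repeated for 10,000 simulated trials
successes = rng.binomial(n=20, p=0.3, size=10_000)

# Share of normal samples within one standard deviation of the mean
within_one_sd = np.mean(np.abs(scores - 88) < 3)

print(round(scores.mean(), 2), round(within_one_sd, 2))
print(round(successes.mean(), 2), successes.min(), successes.max())
```

For a normal distribution roughly 68% of values lie within one standard deviation of the mean, and the average of the binomial draws should sit close to n × p = 6.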

In the next posts we will also discuss hypothesis testing, regression models, time series analysis and other intermediate statistical concepts.

If you liked this post on fundamental statistics skills for a great data scientist, please share it using the social buttons. Let me know your thoughts by adding a comment!

## Python tutorial: Analyze data in 5 minutes using Pandas

Ready to learn how to analyze data with Python in a few minutes, without knowing much about the Python language? In this brief Python tutorial, you will learn how easy it is to import 130,000 rows in a few seconds with the pandas module for Python. With fewer than 10 lines of code you can explore the number of records and the column names, and start getting to know your data with some exploratory analysis such as mean, maximum and minimum.

But let’s go step by step. What is pandas?

pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Let’s see the code to import a CSV file. It takes less than a minute.

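```
# Import the pandas module
import pandas as pd

# Raw string, so the backslashes in the Windows path are kept as-is
Location = r'C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv'
df = pd.read_csv(Location)  # load the CSV into a DataFrame named df
```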

Attention: replace ‘C:\DATASET\WINE REVIEWS\winemag-data-130k-v2.csv’ with the path to the file on your machine. If you are not familiar with this, check out: 3 simple way to change your path in Anaconda/Python

What you will learn by the end of this Python tutorial on analyzing data in 5 minutes using pandas:

1. Understand the variables of your dataset with df.dtypes
2. Get the main statistics with df.describe
3. Find missing values
4. Group information with .groupby, e.g. by country
5. Filter your data, e.g. wines with a price above \$90
6. Export your data to CSV or Excel
7. Learn other useful commands like df.head/tail

## WHAT KIND OF VARIABLES DO WE HAVE IN THE DATASET?

In the first part of this Python tutorial, we will use Anaconda and pandas to start exploring our data. Our df (DataFrame) has around 130k records and 14 variables: its shape is (129971, 14).

Using the command:

`df.dtypes`

you can easily check what kind of data you have. Here almost all variables are of type object (text), while the first column (Unnamed: 0) and points are integers (int64) and price is a floating-point number (float64, with decimals).

```
Out[4]:
Unnamed: 0       int64
country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
taster_name     object
title           object
variety         object
winery          object
dtype: object
```
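If you want to separate the numeric columns from the others programmatically, select_dtypes is one option. The following is a small sketch on an invented three-column DataFrame, not on the real wine file:

```python
import pandas as pd

# Hypothetical sample with one text column and two numeric columns
df = pd.DataFrame({
    "country": ["Italy", "France", "US"],
    "points": [87, 90, 85],
    "price": [15.0, 40.0, 12.5],
})

# Columns holding numbers (int64, float64, ...)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else (text and other non-numeric types)
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```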

If you want to learn more about the most common statistics and math skills for a data scientist, visit our dedicated page.

## PYTHON TUTORIAL: LEARN MAIN STATISTICS WITH ONE COMMAND: DF.DESCRIBE

Let’s focus only on the numerical variables.

• Points: a number from 0 to 100 representing the score given by the wine taster.
• Price: no need to explain.

Here we introduce a great command, df.describe(), to analyze the count, mean, standard deviation, min, max and quartiles of our data:

```
df.describe()
```

We can see that the variable points is available for all records (129,971) and has a minimum of 80 and a maximum of 100, with an average of 88.45.

The first quartile (25%) of points is 86 and the median (50%) is 88.
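The result of describe() is itself a DataFrame, so individual statistics can be picked out by row and column label. Here is a minimal sketch on invented data, with one missing price added on purpose:

```python
import pandas as pd

# Hypothetical sample: one price is missing (None becomes NaN)
df = pd.DataFrame({
    "points": [85, 87, 87, 90, 92],
    "price": [12.0, 15.0, None, 40.0, 65.0],
})

summary = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max

points_mean = summary.loc["mean", "points"]
points_q1 = summary.loc["25%", "points"]
price_count = summary.loc["count", "price"]  # counts non-missing values only

print(summary)
print(points_mean, points_q1, price_count)
```

The count row only includes non-missing values, which is why a column with gaps can show a lower count than the number of rows.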

Of course you can do the same in Excel, but you would need to fill several cells and write several formulas. So in this Python tutorial you have now learned how to save some time when exploring data.

## PYTHON TUTORIAL: UNDERSTAND IF YOU HAVE MISSING VALUES IN YOUR DATASET

Another important step when you analyze data with Python is understanding the quality of your dataset. Do you have missing values? How many? In which variables?

Let’s continue our Python tutorial, where you will soon see the value of using Python instead of Excel.

Write these two lines of code and you will find out how many missing values you have in each column of your dataset:

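```
# Columns that contain at least one missing value
null_columns = df.columns[df.isnull().any()]

# Number of missing values in each of those columns
df[null_columns].isnull().sum()
```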
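The same check can be tried on a small invented DataFrame to see what the output looks like. This sketch also computes the share of missing values per column, which is an extra step not covered in the tutorial:

```python
import pandas as pd

# Hypothetical sample with gaps in country and price
df = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "points": [87, 90, 85],
    "price": [15.0, None, None],
})

missing_counts = df.isnull().sum()                  # missing values per column
missing_only = missing_counts[missing_counts > 0]   # keep columns that have gaps
missing_share = df.isnull().mean() * 100            # percentage missing per column

print(missing_only)
print(missing_share.round(1))
```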

## WHAT IS THE AVERAGE SCORE AND PRICE IN EACH COUNTRY?

One of the most common tasks when you analyze data with Python is computing averages, often grouped by one of your variables. If you want to know the average score and price by country, you can use

.groupby().operation(),

where inside the parentheses you put the variable to group by, followed by the operation you want to perform, for example .mean() or .sum():

`df.groupby('country')[['points', 'price']].mean()`

So we discover that, in our dataset, the average price in Argentina is \$24.50 with an average score of 86.7 points, while Austria averages about 90 points but costs about \$30.

Of course you can also group by multiple columns, for example ['country', 'province'].
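Grouping by one or several columns can be sketched as follows. The rows are invented, and the choice of province as the second grouping column is an assumption based on the column names listed by df.dtypes.

```python
import pandas as pd

# Hypothetical sample (values are invented for illustration)
df = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "France", "France"],
    "province": ["Tuscany", "Tuscany", "Piedmont", "Bordeaux", "Bordeaux"],
    "points": [88, 90, 92, 91, 89],
    "price": [20.0, 30.0, 50.0, 60.0, 40.0],
})

# Average score and price per country (numeric columns selected explicitly)
by_country = df.groupby("country")[["points", "price"]].mean()

# Group by two columns and compute several statistics at once
by_province = df.groupby(["country", "province"])[["points", "price"]].agg(["mean", "count"])

print(by_country)
print(by_province)
```

Selecting the numeric columns before calling .mean() keeps the text columns out of the calculation, which recent pandas versions would otherwise reject.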

## FILTER OR SUBSET WINES WITH PRICE > \$90

Maybe you want to quickly find out how many records in your dataset exceed a particular threshold. In this case we would like to know how many wines have a price above \$90. The result is more than 4,000 records!

`df[df.price>90]`
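To get the number of matching rows rather than the rows themselves, you can take the length of the filtered DataFrame. Below is a short sketch on invented data, which also shows how two conditions can be combined (the 95-point threshold is an arbitrary example):

```python
import pandas as pd

# Hypothetical sample; one price is missing
df = pd.DataFrame({
    "points": [85, 93, 96, 90, 98],
    "price": [15.0, 95.0, 120.0, None, 300.0],
})

expensive = df[df["price"] > 90]   # rows with a missing price are not selected
n_expensive = len(expensive)       # how many wines cost more than 90

# Combine conditions with & and wrap each one in parentheses
top_and_expensive = df[(df["price"] > 90) & (df["points"] >= 95)]

print(n_expensive)
print(top_and_expensive)
```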

## HOW TO EXPORT TO CSV OR EXCEL WHEN YOU ANALYZE DATA WITH PYTHON?

Easy, just write

```
df.to_csv('first.csv')  # creates a CSV file called first.csv

df.to_excel('first.xlsx', sheet_name='Sheet1')  # creates an Excel file called first.xlsx, with the data in Sheet1
```
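One detail worth knowing: by default to_csv also writes the row index as a first column without a name. When such a file is read back, that column appears as Unnamed: 0, which is probably where the Unnamed: 0 column in the wine dataset comes from. The sketch below writes a small invented DataFrame to a temporary folder both ways to show the difference; the file names are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample to export
df = pd.DataFrame({"country": ["Italy", "France"], "points": [87, 90]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Default behaviour: the index is written as an extra, unnamed column
    with_index_path = Path(tmp_dir) / "with_index.csv"
    df.to_csv(with_index_path)
    with_index = pd.read_csv(with_index_path)

    # index=False writes only the data columns
    no_index_path = Path(tmp_dir) / "no_index.csv"
    df.to_csv(no_index_path, index=False)
    no_index = pd.read_csv(no_index_path)

print(list(with_index.columns))
print(list(no_index.columns))
```

Note that to_excel relies on an Excel writer package such as openpyxl being installed, so it is left out of this sketch.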