28 March 2016

Exploratory Data Analysis With R – Part I

Exploratory Data Analysis: The First Statistical Glance of the Data

In this blog, we will learn about the basic analysis tasks that we should apply to our data before we go ahead and build complex models.

We will discuss the basic statistical properties that almost all data sets have and that can be used to extract information from the data. These steps are commonly known as Exploratory Data Analysis (EDA).

Exploratory Data Analysis

John Tukey suggested using EDA to collect and analyze data—not to confirm a hypothesis, but to form a hypothesis that could later be confirmed through other methods.

In statistics, EDA is an approach to analyzing data sets in order to summarize their main characteristics with the help of descriptive statistics and visual methods. Primarily, EDA is used to see what the data can tell us about itself before any complex modelling is applied to it.

Before making inferences from data, it is essential to examine all of its variables.

Why?

To listen to the data:

to maximize insight into a data set
to detect mistakes, i.e. outliers and anomalies
to extract important variables
to see patterns in the data
to find violations of statistical assumptions
to generate hypotheses
to test underlying assumptions

…because if you don’t, you may have trouble later.

Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data.

In this blog, we will discuss:

numerical summaries, or descriptive statistics
the density (shape) of the data, and
graphical analysis

Dataset

The components of a data set are as follows:

A data set is often represented as a matrix
There is a row for each unit
There is a column for each variable
A unit is an object that can be measured, such as a person, or a thing
A variable is a characteristic of a unit that can be assigned a number or a category
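
As a small illustration of the row-per-unit, column-per-variable layout described above, here is a minimal Python sketch. The names and values are hypothetical and only serve to show the structure.

```python
# Hypothetical data set: each row is a unit (a person),
# each column is a variable measured on that unit.
columns = ["name", "age", "grade"]
rows = [
    ["Asha", 21, "A"],
    ["Ben", 19, "B"],
    ["Chen", 23, "A"],
]

# Extract one variable (column) across all units (rows).
age_index = columns.index("age")
ages = [row[age_index] for row in rows]

print(len(rows), "units,", len(columns), "variables")
print(ages)
```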

Dimensionality of Data Sets

Univariate: Measurement made on one variable per subject
Bivariate: Measurement made on two variables per subject
Multivariate: Measurement made on many variables per subject

Type of variables

Qualitative: Variables take on values that are names or labels.

Example: The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier).

Types:
- Nominal: It does not matter which way the categories are ordered in tabular or graphical displays of the data — all orderings are equally meaningful. For example, a student’s religion (Atheist, Christian, Muslim, Hindu, …) is nominal.
- Ordinal: A categorical variable whose categories can be meaningfully ordered is called ordinal. For example, a student’s grade in an exam (A, B, C or Fail) is ordinal.

Quantitative: Variables that are measured on a numeric or quantitative scale.

Example: Age, count of anything etc.

Types:
- Discrete: A discrete variable is one that cannot take on all values within the limits of the variable. For example, number of children is a discrete numerical variable (a count); it cannot have the value 1.7.
- Continuous: If a variable can take on any value between two specified values, it is called a continuous variable. For example, the age of a human: 25 years, 10 months, 2 days, 5 hours.

Numerical Summaries of Data

Numerical measures are useful in situations which require decision making and inferences to be drawn based on data. The following measures are discussed below:

• Central Tendency measures

They are computed to give a “center” around which the measurements in the data are distributed
To check the central tendency of the data, compute the following:

mean
median
mode

• Variation or Variability measures

They describe the “data spread”, i.e. how far the measurements lie from the center
To check the variation or spread of the data, compute the following:

Range
Variance
Standard Deviation
Inter Quartile Range (IQR)

• Relative Standing measures

Percentile
Quartiles

Let us now discuss the above measures in more detail.

Central Tendency measures

The Mean: It is the average of the observations

To calculate the average x̄ of a set of n observations x1, x2, ..., xn, add their values and divide by the number of observations: x̄ = (x1 + x2 + ... + xn) / n

The Median: It is the value which is exactly in the middle

Calculation:
- If there is an odd number of observations, find the middle value of the ordered list
- If there is an even number of observations, find the middle two values of the ordered list and average them
  For example:
  Age of participants: 17 19 21 22 22 33 23 38
  In order: 17 19 21 22 22 23 33 38
  Median = (22+23)/2 = 22.5

Note: Which is the best Location Measure?

Mean is best for symmetric distributions without outliers

[Figure: dot plot on an axis from 0 to 10; Mean = 3, Median = 3]

Median is useful for skewed distributions or data with outliers

[Figure: dot plot on an axis from 0 to 10; Mean = 4, Median = 3]
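
To see why the median is more robust, here is a small Python sketch. The two data sets are assumptions chosen to reproduce the same summary values (Mean = 3, Median = 3 and Mean = 4, Median = 3); they are not the data behind the original plots.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]       # assumed data, no outlier
with_outlier = [1, 2, 3, 4, 10]   # same data with the largest value moved out to 10

for data in (symmetric, with_outlier):
    # The outlier pulls the mean upward; the median stays where it was.
    print(data, "mean =", statistics.mean(data), "median =", statistics.median(data))
```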

The Mode: The mode is the number that is repeated more often than any other

Example: 1, 1, 1, 1, 14, 14, 16, 18, 21

Mode = 1, since it occurs most often (four times)
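
A minimal Python sketch of the mode for the example above. The second list, `tied`, is a hypothetical addition showing that a data set can have more than one mode.

```python
import statistics

values = [1, 1, 1, 1, 14, 14, 16, 18, 21]
print(statistics.mode(values))  # 1

# Hypothetical data with a tie: multimode returns every most-frequent value.
tied = [1, 1, 2, 2, 3]
print(statistics.multimode(tied))  # [1, 2]
```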

The Minimum: The smallest value in the list of observations
The Maximum: The largest value in the list of observations

Variation or Variability measures

The Range:
- The complete spread of the data
- To calculate the range: Maximum - Minimum
- Gives the window within which all recorded observations fall

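The Variance: The average of the squared deviations of the values from the mean (the sample variance divides the sum by n - 1 instead of n)

Because the deviations are squared, an observation contributes more to the variance the farther it lies from the mean.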
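
The range and variance can be computed step by step, as in the following Python sketch. It reuses the ages from the median example as sample data (an assumption made for illustration) and shows both the population form (divide by n) and the sample form (divide by n - 1).

```python
import statistics

ages = [17, 19, 21, 22, 22, 33, 23, 38]

# Range: maximum minus minimum.
data_range = max(ages) - min(ages)

# Variance: average of squared deviations from the mean.
mean_age = statistics.mean(ages)
squared_deviations = [(x - mean_age) ** 2 for x in ages]
population_variance = sum(squared_deviations) / len(ages)    # divide by n
sample_variance = sum(squared_deviations) / (len(ages) - 1)  # divide by n - 1

print(data_range)  # 21
print(population_variance, sample_variance)
```

The largest squared deviation comes from the value 38, which illustrates how far-out observations dominate the variance.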

The Standard Deviation:

Variance is hard to interpret on its own
- It is expressed in squared units, so what does it mean to have a variance of 10.8 or 2.2 or 1459.092 or 0.000001?
- Very little by itself. But if you bring that value back to the units of the data, you can talk about spread in terms that are directly comparable with the observations.
The standard deviation is simply the square root of the variance
- Taking the square root rescales the variance to the original units of measurement, so that it can be used as a standard unit of spread
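
A short Python sketch of the relationship between the two measures, again using the ages from the median example as assumed sample data. The variance is in squared years, while the standard deviation is back in years.

```python
import math
import statistics

ages = [17, 19, 21, 22, 22, 33, 23, 38]

variance = statistics.pvariance(ages)  # population variance, in squared years
std_dev = math.sqrt(variance)          # standard deviation, in years

print(round(std_dev, 2))  # 6.78
```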

Note:

Empirical Rule

For data that are approximately normally distributed, i.e. whose histogram is bell-shaped:

About 68% of the observations are within 1 SD of the mean.
About 95% of the observations are within 2 SDs of the mean.

Nearly all (about 99.7%) of the observations are within 3 SDs of the mean.
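
The empirical rule can be checked by simulation. The Python sketch below draws normally distributed values (the mean of 50, SD of 10, sample size and seed are arbitrary assumptions) and counts the share that falls within 1, 2 and 3 SDs of the mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
samples = [random.gauss(50, 10) for _ in range(50_000)]

mean = statistics.fmean(samples)
sd = statistics.pstdev(samples)

def share_within(k):
    """Fraction of samples lying within k standard deviations of the mean."""
    inside = sum(1 for x in samples if abs(x - mean) <= k * sd)
    return inside / len(samples)

for k in (1, 2, 3):
    print(k, "SD:", round(share_within(k), 3))
```

The printed shares should come out close to 0.68, 0.95 and 0.997; the exact digits depend on the random sample.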

The IQR: The “Interquartile Range” is the range from the first quartile (Q1) to the third quartile (Q3): IQR = Q3 - Q1

Quartiles: Quartiles are the values that divide a list of numbers into quarters.

First put the list of numbers in order
Then divide the list into four equal parts
The Quartiles are at the “cuts”

Example: 17,19,21,22,27,33,23,38,40

Put them in order: 17 19 21 22 23 27 33 38 40

Divide the list into quarters:

17 19 21 22 23 27 33 38 40

And the result is:

Quartile 1 (Q1) = 21
Quartile 2 (Q2), which is also the Median, = 23
Quartile 3 (Q3) = 33

Sometimes a “cut” falls between two numbers; in that case the quartile is the average of the two numbers.

Example: 17 19 21 22 23 27 33 38

The numbers are already in order

Cut the list into quarters:

17 19 21 22 23 27 33 38

In this case Quartile 2 is halfway between 22 and 23:

Q2 = (22+23)/2 = 22.5

And the result is:

Quartile 1 (Q1) = (19+21)/2=20.0
Quartile 2 (Q2) = 22.5
Quartile 3 (Q3) = (27+33)/2 = 30.0
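
Both worked examples are consistent with one particular convention: take the median of the lower half and of the upper half, counting the middle value in both halves when the number of observations is odd. The Python sketch below implements that convention; the function name `quartiles_by_halves` is hypothetical, and other software may use a different quartile definition and give slightly different values.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole list and upper half.

    When the count is odd, the middle value is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

nine_values = [17, 19, 21, 22, 27, 33, 23, 38, 40]
eight_values = [17, 19, 21, 22, 23, 27, 33, 38]

for data in (nine_values, eight_values):
    q1, q2, q3 = quartiles_by_halves(data)
    print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

The last value printed on each line is the interquartile range, Q3 minus Q1.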

Relative Standing measures

Percentiles and Quartiles:

Measures of relative standing can be used to compare values from different data sets, or to compare values within the same data set.
To calculate quartiles and percentiles, the data must be ordered from smallest to largest.

Percentiles

A percentile is a measure used in statistics that indicates the value below which a given percentage of the observations in a group fall.
For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

Quartiles

The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3).
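
Percentiles and quartiles can be obtained from the same routine. The Python sketch below uses `statistics.quantiles` from the standard library with `method="inclusive"`, which interpolates linearly between ordered values; that choice of method is an assumption, and other conventions can give slightly different cut points.

```python
import statistics

data = [17, 19, 21, 22, 27, 33, 23, 38, 40]

# n=4 returns the three cut points Q1, Q2 and Q3.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
print(quartiles)  # [21.0, 23.0, 33.0]

# n=100 returns the 99 percentile cut points; index 19 is the 20th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p20 = percentiles[19]
print(p20)

# The 25th, 50th and 75th percentiles coincide with the quartiles.
print(percentiles[24], percentiles[49], percentiles[74])
```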

Other Attributes

Checking the relationship between the available variables

Covariance

The covariance of two variables x and y in a data sample measures how the two are linearly related.
A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.
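The sample covariance is defined in terms of the sample means x̄ and ȳ as: cov(x, y) = Σ (xi - x̄)(yi - ȳ) / (n - 1), where the sum runs over all n paired observations.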
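
A minimal Python sketch of the sample covariance: the sum of the products of paired deviations from the two sample means, divided by n - 1. The function name and the small data sets are hypothetical, chosen so that the sign of the result is easy to verify by hand.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long lists of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # increases with x
y_falling = [10, 8, 6, 4, 2]   # decreases with x

print(sample_covariance(x, y_rising))   # 5.0  (positive linear relationship)
print(sample_covariance(x, y_falling))  # -5.0 (negative linear relationship)
```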

Check the shape of the data

Skewness

Skewness is a measure of the asymmetry (lack of symmetry) of a distribution
A distribution or data set is symmetric if it looks the same to the left and right of the center point
A symmetrical data set will have a skewness equal to 0. So, a normal distribution will have a skewness of 0
Skewness essentially measures the relative size of the two tails

If the value is negative, the distribution of the data is skewed to the left (negatively skewed), i.e. the left tail is longer
If the value is positive, the distribution of the data is skewed to the right (positively skewed), i.e. the right tail is longer
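
The sign convention can be illustrated with a short Python sketch. It uses the moment-based definition (third central moment divided by the 1.5th power of the second central moment); this is one common definition, assumed here for illustration, and statistical packages may apply additional sample-size corrections. The three data sets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # mirror image around 3
right_skewed = [1, 2, 3, 4, 10]  # long tail to the right
left_skewed = [1, 7, 8, 9, 10]   # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(skewness(data), 3))
```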

Kurtosis

Measure of the “tailedness” of the probability distribution of a real-valued random variable
Kurtosis is a measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack outliers
The kurtosis of any univariate normal distribution is 3. It is common to compare the kurtosis of a distribution to this value.
Distributions with kurtosis less than 3 are said to be platykurtic. An example of a platykurtic distribution is the uniform distribution, which is bounded and therefore has no tails to produce outliers.
Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero slowly when compared with a Gaussian.
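
These reference values can be checked by simulation. The Python sketch below estimates kurtosis as the fourth central moment divided by the squared second central moment (the non-excess form, for which a normal distribution gives 3). The sample size and seed are arbitrary assumptions, and the Laplace samples are generated as the difference of two exponential draws.

```python
import random

def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (a normal distribution gives about 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

random.seed(1)  # fixed seed so the run is repeatable
size = 100_000

normal = [random.gauss(0, 1) for _ in range(size)]
uniform = [random.uniform(0, 1) for _ in range(size)]
# The difference of two independent exponential draws follows a Laplace distribution.
laplace = [random.expovariate(1) - random.expovariate(1) for _ in range(size)]

print("normal :", round(kurtosis(normal), 2))   # close to 3
print("uniform:", round(kurtosis(uniform), 2))  # close to 1.8, platykurtic
print("laplace:", round(kurtosis(laplace), 2))  # close to 6, leptokurtic
```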

The ‘Best’ way to summarize data sets:

First, summarize each variable in the data set on its own.
The best way to summarize a variable depends on its characteristics, i.e. whether it is qualitative or quantitative.
Then summarize each variable with respect to the other variables present in the data set.
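
The first two steps can be sketched in Python as follows. The records and the `summarize` helper are hypothetical: quantitative variables get measures of center and extremes, qualitative variables get frequency counts. Relating variables to each other (the third step) is not covered by this sketch.

```python
import statistics
from collections import Counter

# Hypothetical records: one quantitative variable (age), one qualitative (grade).
records = [
    {"age": 17, "grade": "A"},
    {"age": 19, "grade": "B"},
    {"age": 21, "grade": "A"},
    {"age": 22, "grade": "C"},
    {"age": 38, "grade": "A"},
]

def summarize(values):
    """Numeric variables get center and extremes; others get frequency counts."""
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return dict(Counter(values))

for variable in records[0]:
    print(variable, summarize([r[variable] for r in records]))
```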

We hope this overview of Exploratory Data Analysis was useful. In the next blog we will apply all these numerical summaries to a bank’s loan dataset with the help of a popular, open-source statistical tool called R.

AcadGild


Satyam
