13 April 2016

Exploratory Data Analysis: Graphical Data Analysis with R

In this blog, we will discuss how to visualize the most important attributes of data through graphical exploratory data analysis with R. We will also look at which visualizations suit which scenarios.

We recommend going through our previous blogs on Exploratory Data Analysis for a better understanding of the concepts discussed here.

EDA part – I

EDA part – II

Introduction to Data Visualization

Data visualization is the presentation of data through pictures and shapes. Data visualization can be considered as a modern equivalent of visual communication. It enables decision makers to grasp difficult concepts or identify new patterns in analytics.

The primary goal of visual representation of data is to communicate information clearly and efficiently to users through statistical graphics, plots, information graphics, tables, and charts. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable, and usable.

The human brain tends to learn much faster when data is presented using shapes and colorful images. Charts and graphs can be used instead of spreadsheets, reports or raw numbers to visualize large amounts of complex data.

Data visualization is a quick, easy way of conveying concepts in a universal manner – and you can experiment with different scenarios by making slight adjustments.

It is said that one meaningful picture is more powerful than a thousand words.

Data visualization can help us in the following ways:

• Identify areas that need attention or improvement

• Clarify which factors influence customer behavior

• Understand the right placement of products

• Predict sales volume

Dataset

The dataset used here is a bank's record of the loans it has given out.

The fields are defined as follows:

loan_id:
- Each new loan is identified by this number
- Treated here as a discrete variable, since it works like a running count of loans (strictly speaking, it is an identifier)

amount:
- The amount that has been given as a loan
- Continuous variable, since it can take fractional values within an interval, such as Rs. 2.5

duration (in months):
- Repayment period
- Discrete variable, since it is countable and restricted to whole months

payments:
- Total amount that has been repaid
- Continuous variable, since it can take fractional values within an interval, such as Rs. 2.5

status:
- The status of the loan, i.e. whether the customer has paid it on time or not
- Ordinal categorical (qualitative) variable
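To make these variable types concrete, here is a minimal sketch that builds a tiny data frame with the same five fields and inspects it with str(). All values are invented for illustration and are not taken from the bank data. The grade labels A to D and their ordering are assumptions.

```r
# Toy data frame with the same fields as the loan data (all values are invented).
toy_loan <- data.frame(
  loan_id  = c(1L, 2L, 3L, 4L),                 # identifier, stored as integer
  amount   = c(96396, 165960, 127080, 105804),  # continuous
  duration = c(12L, 36L, 60L, 36L),             # discrete, in months
  payments = c(8033, 4610, 2118, 2939),         # continuous
  status   = factor(c("B", "A", "A", "C"),
                    levels = c("A", "B", "C", "D"),
                    ordered = TRUE)             # ordinal categorical (assumed labels)
)

str(toy_loan)            # shows how each column is stored
levels(toy_loan$status)  # the ordered categories
```

Storing status as an ordered factor keeps the categories in a fixed order, which is the order in which tables and bar charts will later display them.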

We will see how the data can be visualized.

Central Tendency Measures

A central tendency measure is a value which can best describe an entire set of observations.

It can be measured by:

Mean
Median
Mode
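As a quick sketch of the three measures, the snippet below computes them on a small made-up vector of loan amounts (in thousands). The numbers are invented. The mode is found here by counting values with table(), which is one of several possible approaches.

```r
# Small made-up sample of loan amounts, in thousands.
x <- c(30, 30, 45, 60, 120, 150, 300)

sample_mean   <- mean(x)    # arithmetic average
sample_median <- median(x)  # middle value of the sorted data

# Count each distinct value and pick the most frequent one.
value_counts <- table(x)
sample_mode  <- as.numeric(names(value_counts)[which.max(value_counts)])

c(mean = sample_mean, median = sample_median, mode = sample_mode)
# mean = 105, median = 60, mode = 30
```

The single large value (300) pulls the mean well above the median, which is the same pattern we will see in the loan data.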

A histogram is a good choice for visualizing the central tendency of continuous data.

Histogram

Histograms are a special form of bar chart in which the bars represent intervals of a continuous variable rather than discrete categories.

There are no gaps between the columns representing the different categories.

In a bar chart, the length of the bar indicates the size of the category, but in a histogram it is the area of the bar that is proportional to the size of the category.

This difference is due to the fact that in a histogram both the x-axis and y-axis have a scale, whereas in a bar chart only the y-axis has a scale.
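The point about area can be checked numerically. The sketch below uses made-up ages and deliberately unequal bin widths. With plot = FALSE, hist() returns the computed bins without drawing anything.

```r
# Made-up ages with unequal-width bins, to show that area (not height) carries the count.
ages <- c(1, 2, 2, 3, 4, 6, 7, 9, 12, 15)
h <- hist(ages, breaks = c(0, 5, 10, 20), plot = FALSE)

bin_width <- diff(h$breaks)         # widths of the bins: 5, 5, 10
bar_area  <- h$density * bin_width  # area of each bar = height * width

h$counts       # observations per bin: 5, 3, 2
bar_area       # share of observations per bin: 0.5, 0.3, 0.2
sum(bar_area)  # the areas add up to 1
```

The last bin is twice as wide as the others, so its bar is drawn lower than its count alone would suggest, while its area still matches its share of the data.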

In the example below, a histogram has been used to show the average height of children of different ages in 1837. A histogram is used because age is a continuous rather than a discrete category.

Let us plot these on our loan data:

Load the data and check column names:

loan <- read.table("E:/mystuff/datasets/data_berka/loan.asc", header = TRUE, sep = ";")

names(loan)

[1] "loan_id" "account_id" "date" "amount" "duration" "payments" "status"

Attach the data frame for use in the current environment. This lets us refer to columns directly by their names.

attach(loan)
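attach() is convenient in a short script, but the attached names stay on the search path until detach() is called. As an alternative sketch, with() makes the columns visible only inside a single call. The toy data frame below is made up for illustration.

```r
# Made-up data frame standing in for the loan data.
toy <- data.frame(amount = c(100, 250, 400), duration = c(12, 24, 36))

# Columns can be used by name inside with(), without attaching anything.
avg_with   <- with(toy, mean(amount))
avg_dollar <- mean(toy$amount)  # equivalent, using explicit $ access

avg_with == avg_dollar  # TRUE
```

The rest of this blog keeps using attach(), as in the original workflow.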

Pick the relevant columns from the data; here they are amount, duration and status (columns 4, 5 and 7).

loan_date_loan_amt_payment_duration <- loan[, c(4:5, 7)]

Here amount is a continuous variable, duration is a discrete variable and status is an ordinal variable. Before we plot a histogram of amount, we scale the variable down by converting it into units of a thousand.

loan_date_loan_amt_payment_duration$amount_in_thousand <- loan_date_loan_amt_payment_duration$amount / 1000

amount_in_thousand <- loan_date_loan_amt_payment_duration$amount_in_thousand # standalone copy, so the plotting calls below can use the bare name

Plot the histogram of the amount variable.

hist(amount_in_thousand, # histogram of loan amount
     breaks = 10, # suggested number of bins (equal-width divisions of amount)
     col = "peachpuff", # fill color of the bars
     border = "black", # border color
     xlab = "loan amount (in thousand)",
     main = "loan amount")
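One detail worth knowing: when breaks is a single number, hist() treats it as a suggestion and rounds to convenient cut points, so the number of bins drawn may differ from the number requested. The sketch below uses made-up amounts and shows one way to force an exact number of equal-width bins by passing explicit cut points.

```r
# Made-up amounts (in thousands) spanning 5 to 590.
x <- seq(5, 590, by = 5)

# A single number is only a suggestion; R picks "pretty" cut points near it.
h_suggested <- hist(x, breaks = 10, plot = FALSE)
length(h_suggested$counts)  # number of bins actually used, may differ from 10

# Explicit cut points force exactly 10 equal-width bins.
h_exact <- hist(x, breaks = seq(min(x), max(x), length.out = 11), plot = FALSE)
length(h_exact$counts)      # 10
```

For exploratory work the suggested breaks are usually fine, since the rounded cut points are easier to read on the axis.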

Let's construct a density (pdf) histogram, where the y-axis shows probability density rather than frequency.

hist(amount_in_thousand, # histogram of loan amount
     breaks = 10,
     col = "peachpuff", # fill color of the bars
     border = "black",
     prob = TRUE, # show densities instead of frequencies
     xlab = "loan amount (in thousand)",
     main = "loan amount pdf")

Add a density line onto the pdf histogram.

lines(density(amount_in_thousand), # kernel density estimate
      lwd = 2, # thickness of line
      col = "chocolate3") # color for the line

Next, we'll add a line for the mean.

abline(v = mean(amount_in_thousand),
       col = "royalblue", # color for the line marking the mean of the distribution
       lwd = 2)

We'll add a line for the median.

abline(v = median(amount_in_thousand),
       col = "red", lwd = 2)

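Next, we will mark the mode of our distribution. We'll first write a function for the mode.

# R has no built-in function for the statistical mode (base mode() reports an object's storage mode), so we create one.
# Note that this definition masks base::mode() for the rest of the session.

mode <- function(v) {
  uniqv <- unique(v) # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))] # return the most frequent value
}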
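To see how this approach behaves, here is a small sketch on made-up vectors. It uses the hypothetical name stat_mode, chosen only so that the example does not mask base::mode(); the logic is the same as in the function above.

```r
# Same logic as above, under a hypothetical name that does not mask base::mode().
stat_mode <- function(v) {
  uniqv <- unique(v)                           # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the highest count
}

stat_mode(c(30, 45, 30, 60))  # 30 occurs twice, the others once, so 30
stat_mode(c(45, 30, 30, 45))  # tie between 45 and 30: the first one seen (45) is returned
```

When several values share the highest count, only one of them is reported, so for data with many ties the result is better read as "a mode" than "the mode".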

Now, use the above function to plot the mode line on the pdf histogram.

# And a line for the mode:
abline(v = mode(amount_in_thousand),
       col = "green", lwd = 2)

Now, we will add a legend to the graph and then interpret it:

# We add a legend, so it will be easy to tell which line is which.
legend(x = "topright", # location of legend within plot area
       legend = c("Density plot", "Mean", "Median", "Mode"), # names of the lines we plotted
       col = c("chocolate3", "royalblue", "red", "green"), # color for each line
       lwd = c(2, 2, 2, 2)) # line width for each plotted line

Interpretation: Here, we can see that the mean, median and mode are far apart from each other, which means the loan amount is not symmetrically distributed. The mode lies below the median and the median below the mean, so the distribution is skewed to the right. The mean lies somewhere around 150 thousand.

The middle loan amount (the median) is around 120 thousand. The most frequently given loan amount is around 30 thousand, as our mode value lies in that region.
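The link between a long right tail and a mean that sits above the median can be reproduced with simulated data. The sketch below draws made-up values from a log-normal distribution; the parameters are arbitrary assumptions and are not fitted to the loan data.

```r
set.seed(42)                                    # make the simulation reproducible
skewed <- rlnorm(1000, meanlog = 4, sdlog = 1)  # simulated right-skewed "amounts"

sim_mean   <- mean(skewed)    # pulled upward by the few very large values
sim_median <- median(skewed)  # much less affected by the long right tail

sim_mean > sim_median         # TRUE for this right-skewed sample
```

A few very large loans are enough to move the mean, which is why the median is often the more representative "typical" amount for skewed data.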

For categorical data, we use a bar chart to check the frequency of each category.

Bar Chart

Bar charts are used to display and compare the density, frequency or other measure (e.g. mean) for different discrete categories of data.

There are several variations of the standard bar chart including horizontal bar charts, grouped or component charts, and stacked bar charts.

Bar charts are useful for displaying data that are classified into nominal or ordinal categories.

Types of Bar Charts

The following are the different types of bar charts.

Vertical Bar Charts

Bar charts normally have vertical bars. The taller the bar, the larger the category.

# Simple vertical Bar Plot
counts <- table(status)
barplot(counts, main = "Loan repayment grades", xlab = "Grades", ylab = "no of customers")
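To show what table() hands to barplot(), here is a sketch on made-up repayment grades; the labels A to D are assumed for illustration. It also uses the fact that barplot() returns the x positions of the bar centers, which is handy for writing the counts above the bars.

```r
# Made-up repayment grades (labels A to D are assumed for illustration).
toy_status <- factor(c("A", "A", "B", "C", "A", "D", "C", "A"),
                     levels = c("A", "B", "C", "D"))

toy_counts <- table(toy_status)  # frequency of each grade: A = 4, B = 1, C = 2, D = 1
toy_counts

# barplot() returns the bar midpoints; reuse them to place a count label on each bar.
mids <- barplot(toy_counts, main = "Toy repayment grades",
                xlab = "Grades", ylab = "no of customers",
                ylim = c(0, max(toy_counts) + 1))
text(x = mids, y = as.vector(toy_counts), labels = as.vector(toy_counts), pos = 3)
```

Extending ylim by one unit leaves room for the label on top of the tallest bar.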

Horizontal Bar Charts

It is also possible to draw bar charts so that the bars are horizontal. The longer the bar, the larger the category.

It is useful when different categories have long titles that would be difficult to include below a vertical bar, or when there are a large number of different categories and there is insufficient space to fit all the columns required for a vertical bar chart across the page.

# Simple horizontal Bar Plot (with horiz = TRUE the counts run along the x-axis, so the axis labels are swapped)
counts <- table(status)
barplot(counts, main = "Loan repayment grades", xlab = "no of customers", ylab = "Grades", horiz = TRUE)

Grouped Bar Charts

They are used to display information about different sub-groups of the main category. A separate bar represents each of the sub-groups and these are usually colored or shaded differently to distinguish between them.

In such cases, a legend or key is usually provided to indicate the sub-group and color that it represents.

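# Grouped Bar Plot
counts <- table(duration, status) # rows: loan duration, columns: repayment grade
barplot(counts, main = "loan repayment grade vs loan duration", xlab = "loan repayment grade", ylab = "no of loans",
        col = c("red", "blue", "chocolate", "yellow", "green"), # one color per loan duration
        legend.text = rownames(counts), beside = TRUE)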

Stacked Bar Charts

They are similar to grouped bar charts in that they are used to display information about the sub-groups that make up the different categories.

In stacked bar charts, the bars representing the sub-groups are placed on top of each other to make a single column, or end to end to make a single horizontal bar.

The overall height or length of the bar shows the total size of the category, while different colors indicate the relative contribution of the different sub-groups.

# Stacked Bar Plot with Colors and Legend
counts <- table(duration, status) # rows: loan duration, columns: repayment grade
barplot(counts, main = "loan repayment grade vs loan duration", xlab = "loan repayment grade", ylab = "no of loans",
        col = c("red", "blue", "chocolate", "yellow", "green"), # one color per loan duration
        legend.text = rownames(counts))
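The relationship between the two-way table and the two chart types can be seen on a small made-up example. The durations and grades below are invented. In the stacked version, the total height of each bar equals the column sum of the table.

```r
# Made-up durations (months) and grades for eight loans.
toy_duration <- c(12, 12, 24, 24, 24, 36, 36, 12)
toy_status   <- c("A", "B", "A", "A", "B", "A", "B", "A")

toy_counts <- table(toy_duration, toy_status)  # rows: duration, columns: grade
toy_counts

grade_totals <- colSums(toy_counts)  # height of each stacked bar: A = 5, B = 3
grade_totals

# The same table drawn both ways: stacked (default) and grouped (beside = TRUE).
op <- par(mfrow = c(1, 2))
barplot(toy_counts, main = "Stacked", legend.text = rownames(toy_counts))
barplot(toy_counts, main = "Grouped", beside = TRUE)
par(op)  # restore the previous plot layout
```

The stacked form makes totals per grade easy to compare, while the grouped form makes it easier to compare individual durations within a grade.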

We hope this blog was useful. If you have any questions, feel free to contact us at support@acadgild.com.

AcadGild


Satyam
