05 April 2016

Exploratory Data Analysis -Part II

In this EBook, we will discuss Exploratory Data Analysis in detail. With the help of bank data, we will explore data analysis using a very popular open source tool called R.

The Dataset

Following is a bank data for the loan given out from bank:

	loan_id	amount	duration	payments	status
1	5314	96396	12	8033	B
2	5316	165960	36	4610	A
3	6863	127080	60	2118	A
4	5325	105804	36	2939	A
5	7240	274740	60	4579	A

The fields are defined as follows:

loan_id:

Each new loan is identified by this number
Discrete variable since it is like a count of the loan

amount:

The amount that has been given as a loan
Continuous variable since it can be fractioned between given intervals like R.s. 2.5

duration (in months):

Repayment period
Discrete since it is countable and is restricted to months

payments:

Total amount that has been repaid
Continuous variable since it can be fractioned between given intervals like Rs. 2.5

status:

What is the status of the loan i.e., has the customer has made the payment on time or has defaulted?
Ordinal categorical/qualitative variable

Before we perform any analysis on the data, it needs to be loaded into the tool. In R, the dataset is usually loaded in the form of data frames.

Let’s read the data into a data frame object

1	loan = read.table("D:/loan.asc", header=TRUE,sep=";")

Once we have the data loaded for analysis, let’s check the column headers

1	names(loan)

[1] “loan_id” “amount” “duration” “payments” “status”

Now, let’s have a look at the dimension of data.

Use dim() to obtain the dimensions of the data set (number of rows and number of columns).

dim(loan)

[1] 682 5

R does have option to check vertical and horizontal dimension separately.

Use nrow() and ncol() to get the number of rows and number of columns, respectively.

1	nrow(loan)

[1] 682

1	ncol(loan)

[1] 5

Before you move ahead, if you want to have a quick glimpse on the data structure, then use str() function.

The str() function returns useful pieces of information, including the above useful outputs and the type of data for each column.

str(loan)

‘data.frame’: 682 obs. of 5 variables:

$ loan_id : int 5314 5316 6863 5325 7240 6687 7284 6111 7235 5997 …

$ amount : int 96396 165960 127080 105804 274740 87840 52788 174744 154416 117024 …

$ duration: int 12 36 60 36 60 24 12 24 48 24 …

$ payments: num 8033 4610 2118 2939 4579 …

$ status : Factor w/ 4 levels “A”,”B”,”C”,”D”: 2 1 1 1 1 1 1 2 1 1 …

From the above summary, it is clear that:

the data is of type data frame
loan_id column has complete observation of type int
amount also has int values
payments are long integers known as numeric in R
status is of type factors having four levels A, B, C, D

Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6

1	head(loan)

loan_id amount duration payments status

1 5314 96396 12 8033 B

2 5316 165960 36 4610 A

3 6863 127080 60 2118 A

4 5325 105804 36 2939 A

5 7240 274740 60 4579 A

6 6687 87840 24 3660 A

However, you can specify any number with head in order to fetch the given top n rows. For example, to fetch top 10 rows of a data frame, we use n=10 as a parameter to head function

1	head(loan,n = 10)

loan_id amount duration payments status

1 5314 96396 12 8033 B

2 5316 165960 36 4610 A

3 6863 127080 60 2118 A

4 5325 105804 36 2939 A

5 7240 274740 60 4579 A

6 6687 87840 24 3660 A

7 7284 52788 12 4399 A

8 6111 174744 24 7281 B

9 7235 154416 48 3217 A

10 5997 117024 24 4876 A

1	tail(loan)

loan_id amount duration payments status

677 5027 160920 36 4470 C

678 4989 352704 48 7348 C

679 5221 52512 12 4376 C

680 6402 139488 24 5812 C

681 5346 55632 24 2318 C

682 6748 240900 60 4015 C

To fetch any number of bottom records, we can pass the specified number to tail() like we did in head. For example, to get last 10 records of a data frame pass n=10 to tail()

1	tail(loan,n = 10)

loan_id amount duration payments status

673 5 036 91248 48 1901 C

674 5644 276084 36 7669 C

675 6856 163332 36 4537 C

676 5428 230220 36 6395 C

677 5027 160920 36 4470 C

678 4989 352704 48 7348 C

679 5221 52512 12 4376 C

680 6402 139488 24 5812 C

681 5346 55632 24 2318 C

682 6748 240900 60 4015 C

In the above str() function, the status is of type factor. Let’s now discuss what this is and where it is used.

Factor

A “factor” is a categorical variable which takes a specific set of values. All the specific set of categories that a vector can take are called levels.

For Example:

1	loan$status

[1] B A A A A A A B A A A A B A A A A A A B A A A C A A A A B A C A A A A A A A C C A

[42] B C A A C A A A B A A A A A A A A B A A C B C A C A A B A A C D A B A A B C A A A

[83] A A A A A A A A A A A A B B A A A A A A A A A A A A A A B B A A C D A C C A A A A

[124] D B A A D C A A C D A C D C C A A B A C D D A A A C A A C A A A B C A B A C A C C

[165] C C A A A A C C A C A A A A C A C C C A C A C A A C A A A A B C C A C B A A C A A

[206] A C C C C C A C C C C A D A B B C C D C C C C C C D A D C D C D B C B A C C B C C

[247] C C C C C C C A C C C A C C C A C C A C C C C A C A A C C A A D C D C B C C C C C

[288] A C A A D C C A C C C C A C C C A A C A A A A C C C C C A C C A C C C A C D C C A

If you check the class of this column, you will find that it is as factor.

1	class(loan$status)

[1] “factor”

However, to get the character representation of the factors, use level function.

1	class(levels(loan$status))

[1] “character”

To obtain all of the categories or levels of a categorical variable, use the levels() function.

1	levels(loan$status)

[1] “A” “B” “C” “D”

3.Numerical Summary

Let us find the numerical summary of the loan data:

Find central tendency measure of each variable in data frame

Find the mean average of each variable:

1	mean(loan$amount)

[1] 151410.2

mean(loan$duration)

[1] 36.49267

mean(loan$payments)

[1] 4190.664

Apply mean function on all variables

1	sapply(loan[,c(2:4)],mean)

amount duration payments

151410.17595 36.49267 4190.66422

Similarly, apply median to find the central observation in all three variables

median(loan$amount)

median(loan$duration)

median(loan$payments)

sapply(loan[,c(2:4)],median)

R doesn’t have a built-in mode function; we can write our own mode function to find the most frequently occurring observation in all the variables:

# Create the function.

mode <- function(v) {

uniqv <- unique(v)

print(uniqv[which.max(tabulate(match(v, uniqv)))])

}

mode(loan$amount)

mode(loan$duration)

mode(loan$payments)

sapply(loan[,c(2:4)],mode)

mode(loan$status)

or,

We can tabulate the frequency of the variable, sort them in descending order and pick the first value to get the most frequent item.

1	names(sort(-table(loan$status)))[1]

Check variation in the variable observations:

## variation measure

#variance

var(loan$amount)

var(loan$duration)

var(loan$payments)

sapply(loan[,c(2:4)],var)

#standard deviation

sd(loan$amount)

> sd(loan$duration)

> sd(loan$payments)

> sapply(loan[,c(2:4)],sd)

#range

> range(loan$amount)

> range(loan$duration)

> range(loan$payments)

> sapply(loan[,c(2:4)],range)

Find the relative standing measure

#quartile

# Q1=25% implies that, 25% of the amount falls below or equal to 66696, 25% of repayment done within 24 months, 1/4^th of the payment done to the bank is of amount 2477 or less

while $Q_2=50% is the median, and thus half of the amount are below or equal to 116928, ½ of the loan duration is 36 months and ½ of the payment done to bank is 3934 or less and so on…

And the minimum and maximum values for amount are 4980 and 59020 respectively, minimum and maximum repayment duration in months are 12 and 60 respectively. Similarly, the minimum and maximum installment paid to the bank are 304 and 9910 respectively

quantile(loan$amount,type=2)

quantile(loan$duration,type=2)

quantile(loan$payments,type=2)

sapply(loan[,c(2:4)],quantile,type=2)

amount duration payments

0% 4980 12 304

25% 66696 24 2477

50% 116928 36 3934

75% 210744 48 5814

100% 590820 60 9910

1	##percentile

The fifteenth percentile $P_{15}$=15% is interpreted as 15% of the samples fall below or equal to the given value in the variable while 85% fall above it. The thirty-fifth percentile $P_{35}$=35%, implies that 35% of the values fall below or equal to the given value in the variable while 65% fall above it.

quantile(loan$amount,prob = c(0.15, 0.25, 0.35))

quantile(loan$duration,prob = c(0.15, 0.25, 0.35))

quantile(loan$payments,prob = c(0.15, 0.25, 0.35))

sapply(loan[,c(2:4)],quantile,prob = c(0.15, 0.25, 0.35))

Check the shape of the data in order to know the spread as well as outlier location.

library(moments)

#skewness

skewness(loan$amount)

skewness(loan$duration)

skewness(loan$payments)

sapply(loan[,c(2:4)],skewness)

amount duration payments

1.11175782 -0.02952005 0.35359475

i.e., amount and payments are right skewed, however payment distribution is less skewed and is almost symmetric, whereas duration is slightly left skewed which again can be neglected.

To check the tail of distribution we can use kutosis()

1	sapply(loan[,c(2:4)],kurtosis)

amount duration payments

3.822919 1.681499 2.298750

Hope this blog was useful. In the next blog, we will learn the how to summarize data in terms of graphs in R along with their interpretation.

AcadGild

Exploratory Data Analysis -Part II

The Dataset

Factor

Related

Satyam

Related Posts

Leave a Reply

Big Data and Hadoop Developer 2016 | Big Data as Career Path | Introduction to Big Data and Hadoop

The Dataset

Factor

Share this:

Related

Satyam

Related Posts

Leave a Reply