In this EBook, we will discuss Exploratory Data Analysis in detail. With the help of bank data, we will explore data analysis using a very popular open source tool called R.
Following is a bank data for the loan given out from bank:
loan_id | amount | duration | payments | status | |
1 | 5314 | 96396 | 12 | 8033 | B |
2 | 5316 | 165960 | 36 | 4610 | A |
3 | 6863 | 127080 | 60 | 2118 | A |
4 | 5325 | 105804 | 36 | 2939 | A |
5 | 7240 | 274740 | 60 | 4579 | A |
The fields are defined as follows:
- loan_id:
-
Each new loan is identified by this number
-
Discrete variable since it is like a count of the loan
-
amount:
-
The amount that has been given as a loan
-
Continuous variable since it can be fractioned between given intervals like R.s. 2.5
-
duration (in months):
-
Repayment period
-
Discrete since it is countable and is restricted to months
-
payments:
-
Total amount that has been repaid
- Continuous variable since it can be fractioned between given intervals like Rs. 2.5
- status:
- What is the status of the loan i.e., has the customer has made the payment on time or has defaulted?
- Ordinal categorical/qualitative variable
Before we perform any analysis on the data, it needs to be loaded into the tool. In R, the dataset is usually loaded in the form of data frames.
-
Let’s read the data into a data frame object
1 |
loan = read.table("D:/loan.asc", header=TRUE,sep=";") |
- Once we have the data loaded for analysis, let’s check the column headers
1 |
names(loan) |
[1] “loan_id” “amount” “duration” “payments” “status” |
- Now, let’s have a look at the dimension of data.
- Use dim() to obtain the dimensions of the data set (number of rows and number of columns).
1 |
dim(loan) |
[1] 682 5 |
- R does have option to check vertical and horizontal dimension separately.
- Use nrow() and ncol() to get the number of rows and number of columns, respectively.
1 |
nrow(loan) |
[1] 682 |
1 |
ncol(loan) |
[1] 5 |
-
Before you move ahead, if you want to have a quick glimpse on the data structure, then use str() function.
-
The str() function returns useful pieces of information, including the above useful outputs and the type of data for each column.
1 |
str(loan) |
‘data.frame’: 682 obs. of 5 variables:
$ loan_id : int 5314 5316 6863 5325 7240 6687 7284 6111 7235 5997 … $ amount : int 96396 165960 127080 105804 274740 87840 52788 174744 154416 117024 … $ duration: int 12 36 60 36 60 24 12 24 48 24 … $ payments: num 8033 4610 2118 2939 4579 … $ status : Factor w/ 4 levels “A”,”B”,”C”,”D”: 2 1 1 1 1 1 1 2 1 1 … |
From the above summary, it is clear that:
- the data is of type data frame
- loan_id column has complete observation of type int
- amount also has int values
- payments are long integers known as numeric in R
- status is of type factors having four levels A, B, C, D
- Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6
1 |
head(loan) |
loan_id amount duration payments status
1 5314 96396 12 8033 B 2 5316 165960 36 4610 A 3 6863 127080 60 2118 A 4 5325 105804 36 2939 A 5 7240 274740 60 4579 A 6 6687 87840 24 3660 A |
However, you can specify any number with head in order to fetch the given top n rows. For example, to fetch top 10 rows of a data frame, we use n=10 as a parameter to head function
1 |
head(loan,n = 10) |
loan_id amount duration payments status
1 5314 96396 12 8033 B 2 5316 165960 36 4610 A 3 6863 127080 60 2118 A 4 5325 105804 36 2939 A 5 7240 274740 60 4579 A 6 6687 87840 24 3660 A 7 7284 52788 12 4399 A 8 6111 174744 24 7281 B 9 7235 154416 48 3217 A 10 5997 117024 24 4876 A |
1 |
tail(loan) |
loan_id amount duration payments status
677 5027 160920 36 4470 C 678 4989 352704 48 7348 C 679 5221 52512 12 4376 C 680 6402 139488 24 5812 C 681 5346 55632 24 2318 C 682 6748 240900 60 4015 C |
- To fetch any number of bottom records, we can pass the specified number to tail() like we did in head. For example, to get last 10 records of a data frame pass n=10 to tail()
1 |
tail(loan,n = 10) |
loan_id amount duration payments status
673 5 036 91248 48 1901 C 674 5644 276084 36 7669 C 675 6856 163332 36 4537 C 676 5428 230220 36 6395 C 677 5027 160920 36 4470 C 678 4989 352704 48 7348 C 679 5221 52512 12 4376 C 680 6402 139488 24 5812 C 681 5346 55632 24 2318 C 682 6748 240900 60 4015 C |
In the above str() function, the status is of type factor. Let’s now discuss what this is and where it is used.
A “factor” is a categorical variable which takes a specific set of values. All the specific set of categories that a vector can take are called levels.
For Example:
1 |
loan$status |
[1] B A A A A A A B A A A A B A A A A A A B A A A C A A A A B A C A A A A A A A C C A
[42] B C A A C A A A B A A A A A A A A B A A C B C A C A A B A A C D A B A A B C A A A [83] A A A A A A A A A A A A B B A A A A A A A A A A A A A A B B A A C D A C C A A A A [124] D B A A D C A A C D A C D C C A A B A C D D A A A C A A C A A A B C A B A C A C C [165] C C A A A A C C A C A A A A C A C C C A C A C A A C A A A A B C C A C B A A C A A [206] A C C C C C A C C C C A D A B B C C D C C C C C C D A D C D C D B C B A C C B C C [247] C C C C C C C A C C C A C C C A C C A C C C C A C A A C C A A D C D C B C C C C C [288] A C A A D C C A C C C C A C C C A A C A A A A C C C C C A C C A C C C A C D C C A |
- If you check the class of this column, you will find that it is as factor.
1 |
class(loan$status) |
[1] “factor” |
- However, to get the character representation of the factors, use level function.
1 |
class(levels(loan$status)) |
[1] “character” |
- To obtain all of the categories or levels of a categorical variable, use the levels() function.
1 |
levels(loan$status) |
[1] “A” “B” “C” “D” |
3.Numerical Summary
Let us find the numerical summary of the loan data:
- Find central tendency measure of each variable in data frame
- Find the mean average of each variable:
1 |
mean(loan$amount) |
[1] 151410.2 |
1 2 3 |
mean(loan$duration) mean(loan$duration) |
[1] 36.49267 |
1 2 3 |
mean(loan$payments) mean(loan$payments) |
[1] 4190.664 |
- Apply mean function on all variables
1 |
sapply(loan[,c(2:4)],mean) |
amount duration payments
151410.17595 36.49267 4190.66422 |
- Similarly, apply median to find the central observation in all three variables
1 2 3 4 5 6 7 |
median(loan$amount) median(loan$duration) median(loan$payments) sapply(loan[,c(2:4)],median) |
-
R doesn’t have a built-in mode function; we can write our own mode function to find the most frequently occurring observation in all the variables:
# Create the function. mode <- function(v) { uniqv <- unique(v) print(uniqv[which.max(tabulate(match(v, uniqv)))]) } |
1 2 3 4 5 6 7 8 9 |
mode(loan$amount) mode(loan$duration) mode(loan$payments) sapply(loan[,c(2:4)],mode) mode(loan$status) |
or,
- We can tabulate the frequency of the variable, sort them in descending order and pick the first value to get the most frequent item.
1 |
names(sort(-table(loan$status)))[1] |
- Check variation in the variable observations:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
## variation measure #variance var(loan$amount) var(loan$duration) var(loan$payments) sapply(loan[,c(2:4)],var) #standard deviation sd(loan$amount) > sd(loan$duration) > sd(loan$payments) > sapply(loan[,c(2:4)],sd) #range > range(loan$amount) > range(loan$duration) > range(loan$payments) > sapply(loan[,c(2:4)],range) |
- Find the relative standing measure
1 |
#quartile |
# Q1=25% implies that, 25% of the amount falls below or equal to 66696, 25% of repayment done within 24 months, 1/4th of the payment done to the bank is of amount 2477 or less
while $Q_2=50% is the median, and thus half of the amount are below or equal to 116928, ½ of the loan duration is 36 months and ½ of the payment done to bank is 3934 or less and so on…
And the minimum and maximum values for amount are 4980 and 59020 respectively, minimum and maximum repayment duration in months are 12 and 60 respectively. Similarly, the minimum and maximum installment paid to the bank are 304 and 9910 respectively
1 2 3 4 5 6 7 |
quantile(loan$amount,type=2) quantile(loan$duration,type=2) quantile(loan$payments,type=2) sapply(loan[,c(2:4)],quantile,type=2) |
amount duration payments
0% 4980 12 304 25% 66696 24 2477 50% 116928 36 3934 75% 210744 48 5814 100% 590820 60 9910 |
1 |
##percentile |
The fifteenth percentile $P_{15}$=15% is interpreted as 15% of the samples fall below or equal to the given value in the variable while 85% fall above it. The thirty-fifth percentile $P_{35}$=35%, implies that 35% of the values fall below or equal to the given value in the variable while 65% fall above it.
1 2 3 4 5 6 7 |
quantile(loan$amount,prob = c(0.15, 0.25, 0.35)) quantile(loan$duration,prob = c(0.15, 0.25, 0.35)) quantile(loan$payments,prob = c(0.15, 0.25, 0.35)) sapply(loan[,c(2:4)],quantile,prob = c(0.15, 0.25, 0.35)) |
Check the shape of the data in order to know the spread as well as outlier location.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
library(moments) #skewness skewness(loan$amount) skewness(loan$duration) skewness(loan$payments) sapply(loan[,c(2:4)],skewness) sapply(loan[,c(2:4)],skewness) |
amount duration payments
1.11175782 -0.02952005 0.35359475 |
i.e., amount and payments are right skewed, however payment distribution is less skewed and is almost symmetric, whereas duration is slightly left skewed which again can be neglected.
To check the tail of distribution we can use kutosis()
1 |
sapply(loan[,c(2:4)],kurtosis) |
amount duration payments
3.822919 1.681499 2.298750 |
Hope this blog was useful. In the next blog, we will learn the how to summarize data in terms of graphs in R along with their interpretation.
Leave a Reply