In this post, we will be working on a dataset from a bank and try to find some patterns using Exploratory Data Analysis.
Before we go ahead let’s get acquainted with the data set.
Dataset:
Source Information
Professor Dr. Hans Hofmann
Institut für Statistik und Ökonometrie
Universität Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13
Use Case:
When a bank receives a loan application, the bank has to decide, based on the applicant’s profile, whether or not to approve the loan. The two types of risk associated with the bank’s decision are:
- If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank.
- If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank.
Objective of Analysis:
Minimization of Risk and Maximization of Profit on Behalf of the Bank.
To minimize loss from the bank’s perspective, the bank needs a decision rule for determining which applicants should be approved and which should not. An applicant’s demographic and socio-economic profile is considered by the loan managers before a decision is taken on his/her loan application.
The German Credit Data contains records for 1000 loan applicants, described by 20 variables plus a classification of whether each applicant is considered a Good or a Bad credit risk.
A predictive model developed on this data is expected to provide the bank manager with guidance for deciding whether to approve a loan to a prospective applicant, based on his or her profile.
Attributes:
Creditability,
Account Balance, Duration of Credit (month), Payment Status of Previous Credit, Purpose, Credit Amount, Value Savings/Stocks, Length of current employment, Instalment per cent, Sex & Marital Status, Guarantors, Duration in Current address, Most valuable available asset, Age (years), Concurrent Credits, Type of apartment, No of Credits at this Bank, Occupation, No of dependents, Telephone, Foreign Worker
Let’s begin by loading the data into Spark. This is a structured dataset, so, as per our initial discussion, a DataFrame is the best choice for such data.
We can load the data into a DataFrame in two ways:
- Load it through an RDD, i.e. construct an RDD, define the schema, and then convert it into a DataFrame.
raw_data_file = hdfs_project_root + 'german_credit.csv'

from pyspark.sql import Row
from pyspark.sql.types import *

# Read the file from HDFS
raw_data = sc.textFile(raw_data_file)

# Read the first line as the header and replace spaces with underscores
header = raw_data.first().replace(' ', '_')

# Remove the header row from the actual data
only_data = raw_data.filter(lambda r: r.replace(' ', '_') != header)

# By default everything is read as a string; convert each field to float
def to_type(data):
    return [float(i) for i in data]

# Split each line on commas and convert the fields to numbers
parsed_data = only_data.map(lambda line: to_type(line.split(',')))

# Use these two lines instead if you want the default header in the schema
# fields = [StructField(field_name, DoubleType(), True) for field_name in header.split(',')]
# schema = StructType(fields)

# Define shorter column names, as the default headers are lengthy
schema = StructType([
    StructField('creditability', DoubleType(), True),
    StructField('balance', DoubleType(), True),
    StructField('duration', DoubleType(), True),
    StructField('history', DoubleType(), True),
    StructField('purpose', DoubleType(), True),
    StructField('amount', DoubleType(), True),
    StructField('savings', DoubleType(), True),
    StructField('employment', DoubleType(), True),
    StructField('instPercent', DoubleType(), True),
    StructField('sexMarried', DoubleType(), True),
    StructField('guarantors', DoubleType(), True),
    StructField('residenceDuration', DoubleType(), True),
    StructField('assets', DoubleType(), True),
    StructField('age', DoubleType(), True),
    StructField('concCredit', DoubleType(), True),
    StructField('apartment', DoubleType(), True),
    StructField('credits', DoubleType(), True),
    StructField('occupation', DoubleType(), True),
    StructField('dependents', DoubleType(), True),
    StructField('hasPhone', DoubleType(), True),
    StructField('foreign', DoubleType(), True)
])

# Map the schema to the parsed data and create the DataFrame
Customers = sqlContext.createDataFrame(parsed_data, schema)
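Before moving on, it is worth running a couple of quick sanity checks on the newly created DataFrame. count() and show() are standard DataFrame actions, and the dataset description above tells us to expect 1000 rows:

# Quick sanity checks on the DataFrame built from the RDD
print(Customers.count())   # should print 1000 for this dataset
Customers.show(5)          # peek at the first five rows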
- Load the data through the spark_csv module
Customers = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferschema", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("./german_credit.csv")
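If you happen to be running Spark 2.0 or later, the CSV source is built into the DataFrameReader, so the external spark_csv package is not needed. A roughly equivalent call (assuming a SparkSession named spark) would be:

# Spark 2.x+: CSV support is built in, no external package required
Customers = spark.read.csv("./german_credit.csv",
                           header=True,
                           inferSchema=True,
                           mode="DROPMALFORMED")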
We can choose either of the above procedures, but the spark_csv module is the better option, as it lets us skip all the boilerplate code for parsing the columns and separating out the header.
As of now, the schema and the headers are both inferred from the data. This is a very powerful feature of DataFrames. However, in our data the headers are too long, which makes them cumbersome to use.
We can set the headers and the data types of the columns explicitly by defining each column as a StructField, enclosing them all within a StructType, and then passing that schema to the CSV loader.
from pyspark.sql.types import *
schema = StructType([
    StructField('creditability', DoubleType(), True),
    StructField('balance', DoubleType(), True),
    StructField('duration', DoubleType(), True),
    StructField('history', DoubleType(), True),
    StructField('purpose', DoubleType(), True),
    StructField('amount', DoubleType(), True),
    StructField('savings', DoubleType(), True),
    StructField('employment', DoubleType(), True),
    StructField('instPercent', DoubleType(), True),
    StructField('sexMarried', DoubleType(), True),
    StructField('guarantors', DoubleType(), True),
    StructField('residenceDuration', DoubleType(), True),
    StructField('assets', DoubleType(), True),
    StructField('age', DoubleType(), True),
    StructField('concCredit', DoubleType(), True),
    StructField('apartment', DoubleType(), True),
    StructField('credits', DoubleType(), True),
    StructField('occupation', DoubleType(), True),
    StructField('dependents', DoubleType(), True),
    StructField('hasPhone', DoubleType(), True),
    StructField('foreign', DoubleType(), True)
])

# No inferschema option is needed here, since the schema is supplied explicitly
Customers = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("mode", "DROPMALFORMED") \
    .load("./german_credit.csv", schema=schema)
Once the data has been loaded as a DataFrame, we can check the schema as shown below.
Customers.printSchema()
root
 |-- creditability: double (nullable = true)
 |-- balance: double (nullable = true)
 |-- duration: double (nullable = true)
 |-- history: double (nullable = true)
 |-- purpose: double (nullable = true)
 |-- amount: double (nullable = true)
 |-- savings: double (nullable = true)
 |-- employment: double (nullable = true)
 |-- instPercent: double (nullable = true)
 |-- sexMarried: double (nullable = true)
 |-- guarantors: double (nullable = true)
 |-- residenceDuration: double (nullable = true)
 |-- assets: double (nullable = true)
 |-- age: double (nullable = true)
 |-- concCredit: double (nullable = true)
 |-- apartment: double (nullable = true)
 |-- credits: double (nullable = true)
 |-- occupation: double (nullable = true)
 |-- dependents: double (nullable = true)
 |-- hasPhone: double (nullable = true)
 |-- foreign: double (nullable = true)
Now, let’s do a quick check on the average account balance, credit amount, and loan duration for each creditability class.
# register the Customers frame as table
Customers.registerTempTable("credit")

# Compute the averages for each class of customer, i.e. creditability 1 and 0
results = sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, \
                          avg(duration) as avgdur FROM credit GROUP BY creditability")
results.show()
+-------------+------------------+------------------+------------------+
|creditability|        avgbalance|            avgamt|            avgdur|
+-------------+------------------+------------------+------------------+
|          1.0|2.8657142857142857| 2985.442857142857|19.207142857142856|
|          0.0|1.9033333333333333|3938.1266666666666|             24.86|
+-------------+------------------+------------------+------------------+
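For completeness, the same grouped averages can also be computed with the DataFrame API instead of SQL. A minimal sketch using pyspark.sql.functions would look like this:

from pyspark.sql import functions as F

# The same grouped averages, expressed with the DataFrame API instead of SQL
results_df = Customers.groupBy('creditability').agg(
    F.avg('balance').alias('avgbalance'),
    F.avg('amount').alias('avgamt'),
    F.avg('duration').alias('avgdur'))
results_df.show()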
Similarly, we can do a quick check on the statistical summary of all the numerical columns in the DataFrame.
Customers.describe().show()
With every column displayed at once, it becomes very difficult to read the summary for each column. We can make it easier by describing the columns in smaller groups.
Customers.describe('creditability', 'balance', 'duration', 'history', 'purpose').show()
Customers.describe('amount', 'savings', 'employment', 'instPercent', 'sexMarried').show()
Customers.describe('guarantors', 'residenceDuration', 'assets', 'age', 'concCredit').show()
Customers.describe('apartment', 'credits', 'occupation', 'dependents', 'hasPhone').show()
Customers.describe('foreign').show()
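If typing out the column groups by hand feels tedious, the same grouped summaries can be produced with a small loop over Customers.columns; this is just a convenience sketch:

# Describe the columns five at a time instead of listing them manually
cols = Customers.columns
for i in range(0, len(cols), 5):
    Customers.describe(*cols[i:i + 5]).show()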
Now we have the data loaded into Spark as a DataFrame. In the next blog we will employ various EDA techniques to take a closer look at the data and also define its data dictionary.