22 March 2016

HealthCare Use Case With Apache Spark

Spark supports a variety of data use cases, such as ETL (Extract, Transform and Load), analysis (both interactive and batch) and streaming.

In this blog, we will see how Spark can be used for ETL and descriptive analysis. We will use the patient data sets to compute a statistical summary of the data sample.

How can Spark help healthcare?

A number of use cases in healthcare institutions are well suited to a big data solution. Some academic or research-oriented healthcare institutions are either experimenting with big data or using it in advanced research projects. The healthcare industry generates a large volume of data. Electronic Medical Records (EMRs) alone collect a huge amount of data, and there are various other sources of data in the industry as well.

Over the last decade, pharmaceutical companies have been aggregating years of research and development data into medical databases, and because of this, patient records have been digitized. In parallel, recent technical advances have made it easier to collect and analyze information from multiple sources, which is a major benefit for healthcare institutions, since data for a single patient may come from various hospitals, laboratories and physician offices.

With so much medical data coming from various sources, guided decisions can be made from the insights gained by applying machine learning algorithms to big data. Traditionally, physicians have relied on their judgment when making treatment decisions, but in the last few years there has been a shift towards evidence-based medicine. This involves a systematic review of clinical data and making treatment decisions based on the best available information. Aggregating individual data sets for big data algorithms often provides the most robust evidence, since nuances in subpopulations (such as the presence of patients with gluten allergies) may be so rare that they are not readily apparent in small samples.

About the data

It is difficult and expensive to access Electronic Medical Records (EMRs) due to privacy concerns and technical problems. In healthcare, HIPAA compliance is non-negotiable. Nothing is more important than the privacy and security of patient data.

Hence, to overcome this problem, the data used here is machine-generated according to pre-defined criteria.

The database has the same characteristics as an actual medical database, such as patients’ admission details, demographics, socioeconomic details, labs, medications, etc.

The database records and features are customizable. The generated data set is around 2 GB of simulated EMR data.

If you require the data, feel free to contact support@acadild.com

Patients.csv

Field_name	patient_id	DOB	Gender	marital_status	smoking_status	city
Row1	1	2001-09-22	F	Divorced	Once	Achhnera

The fields in this data set are defined as follows:

patient_id: Each new patient is identified by this number
DOB: The patient’s date of birth. We have considered patients born on or after 28-12-1950
Gender: F-Female; M-Male
marital_status: Divorced, Single or Married
smoking_status: Smoking habit of the patient
city: The city to which the patient belongs
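
As a minimal sketch, assuming the file is plain comma-separated text with the columns listed above, one row can be mapped to named, typed fields as follows. The Patient type and the parse_patient helper are illustrative names and are not part of the code used later in this blog.

```python
from collections import namedtuple
from datetime import date

# Hypothetical record type mirroring the columns of patients.csv.
Patient = namedtuple(
    "Patient",
    ["patient_id", "dob", "gender", "marital_status", "smoking_status", "city"],
)

def parse_patient(line):
    """Split one comma-separated row and convert the typed fields."""
    pid, dob, gender, marital, smoking, city = [f.strip() for f in line.split(",")]
    year, month, day = (int(part) for part in dob.split("-"))
    return Patient(int(pid), date(year, month, day), gender, marital, smoking, city)

# The sample row from the table above.
row = parse_patient("1,2001-09-22,F,Divorced,Once,Achhnera")
print(row.city, row.dob.year)
```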

Diagnosis.csv

Field_name	diagnosis_id	admission_id	patient_id	diagnosis_ICD10_code
Row1	1	17477	3082	I2781

The fields in this data set are defined as follows:

diagnosis_id: A unique id for each diagnosis
admission_id: A unique id for each admission of a patient to the hospital
patient_id: Each patient is identified by this number
diagnosis_ICD10_code: The standard code for a diagnosis, as standardized across the healthcare industry. This code is independent of the hospital and hence can be used to identify a diagnosis across hospitals.

Admission.csv

Patient encounters are recorded in the hospital database each time a patient visits the hospital. The following data set is thus generated:

Field_name	admission_id	patient_id	admission_date	discharge_date
Row1	1	9062	2014-10-07	2015-09-19

The fields in this data set are defined as follows:

admission_id: Each time a patient comes to the hospital for a consultation, he/she is assigned a new number
patient_id: Each patient is identified by this number
admission_date: The day the patient is admitted to the hospital
discharge_date: The day the patient is discharged

ICD-10-diagnosis-cleaned.txt

Field_name	ICD_10_Code	Diagnosis_description
Row1	A000	Cholera due to Vibrio cholerae 01, biovar cholerae

The fields in the data set are defined as follows:

ICD_10_Code: ICD-10 (International Classification of Disease version 10) code is assigned for each standard diagnosis
Diagnosis_description: Description of the diagnosis
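
The ICD-10 file acts as a lookup table for the codes that appear in the diagnosis data. Here is a minimal sketch, assuming each line holds a code and its description separated by a tab (the delimiter is not stated in this blog, so treat it as an assumption). build_icd_lookup is a hypothetical helper, and the sample line is the row shown above.

```python
def build_icd_lookup(lines, sep="\t"):
    """Map each ICD-10 code to its description.

    Assumes one 'code<sep>description' pair per line; the separator
    is an assumption, so adjust it to match the actual file.
    """
    lookup = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        code, description = line.rstrip("\n").split(sep, 1)
        lookup[code.strip()] = description.strip()
    return lookup

sample = ["A000\tCholera due to Vibrio cholerae 01, biovar cholerae\n"]
icd = build_icd_lookup(sample)
print(icd.get("A000"))
print(icd.get("I2781", "unknown code"))  # not present in the one-line sample
```

Splitting with a maximum of one split keeps any separators inside the description intact.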

The Scenario

We will consider a scenario in which we use a hypothetical EMR, similar to the ones that exist in actual healthcare institutions. Each patient’s data has a variety of parameters associated with it, for example basic demographic information (gender, location, etc.) and the patient’s identified diagnoses.

A typical data science project flow is shown below:

We’ll use Python, PySpark and MLlib to compute some basic statistics for our dashboard. This involves some of the typical steps in a Spark job, which you can follow to get started with your own use case:

Reading data from the file system into a Spark RDD
Applying transformations to “massage” the data into a pair RDD
Computing summary statistics for each user and checking the distribution of the data

Scenario I:

Calculate each patient’s age and age group from the date of birth given in the EMR

Load the data from patients.csv

patientfile = sc.textFile('file:///opt/spark_usecases/medical/datasets/patients.csv')

Check the number of records to be processed (the count includes the header line)

patientfile.count()

1001

Calculate each patient’s age and age group, then cache the result in memory, since we will be running repeated operations on this RDD.

# Create new features (age and age group) from the existing attributes

from datetime import date

# Columns: patient_id,DOB,gender,marital_status,smoking_status,city
def patient_attributes(line):
    l = line.split(",")
    age = prepare_date(l[1])
    return [l[0], l[1], l[2], l[3], l[4], l[5], age, age_group(age)]

def prepare_date(date_form):
    year, month, day = [int(x) for x in date_form.split("-")]
    try:
        born = date(year, month, day)
    except ValueError:
        # Raised when the date does not exist in the birth year,
        # e.g. February 29 in a non-leap year; fall back to the previous day
        born = date(year, month, day - 1)
    return calculate_age(born)

def calculate_age(born):
    today = date.today()
    # Subtract 1 if this year's birthday has not occurred yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

def age_group(age):
    if age < 10:
        return '0-10'
    elif age < 20:
        return '10-20'
    elif age < 30:
        return '20-30'
    elif age < 40:
        return '30-40'
    elif age < 50:
        return '40-50'
    elif age < 60:
        return '50-60'
    elif age < 70:
        return '60-70'
    elif age < 80:
        return '70-80'
    else:
        return '80+'

# Drop the header row, then derive the attributes for every patient
patient_demographics = patientfile.filter(lambda line: 'patient_id' not in line).map(patient_attributes)

# Cache the RDD, since it is reused in the steps below
patient_demographics.persist()
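
The age calculation relies on tuple comparison: the (month, day) of the reference date is compared with the (month, day) of the birth date, and the resulting boolean (which counts as 0 or 1) is subtracted from the difference in years. The sketch below restates this with an explicit reference date so that the result is reproducible. age_on is an illustrative name, and the reference dates are arbitrary examples.

```python
from datetime import date

def age_on(born, today):
    """Age in completed years on the given reference date."""
    # True counts as 1 when the birthday has not yet occurred this year.
    had_no_birthday_yet = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - had_no_birthday_yet

born = date(2001, 9, 22)                  # DOB of the sample patient row
print(age_on(born, date(2016, 3, 22)))    # birthday still ahead: 14
print(age_on(born, date(2016, 9, 22)))    # birthday reached: 15
```

Passing the reference date as a parameter, rather than calling date.today() inside the function, also makes the calculation easier to test.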

Scenario II

Find the distribution of data for each patient attribute

Find the distribution of male and female patients

patient_gender = patientfile.filter(lambda line: 'patient_id' not in line).map(lambda line: (line.split(',')[2].strip(), 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(524, u'F'), (476, u'M')]
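
The chain above is a count-by-key pattern: map each row to a (key, 1) pair, sum the ones per key, swap each pair to (count, key), and sort in descending order. Here is a plain-Python sketch of the same steps that runs without Spark. The sample rows are made up for illustration, and count_by_column is a hypothetical helper.

```python
from collections import defaultdict

def count_by_column(lines, index, header_marker="patient_id"):
    """Mimic filter -> map -> reduceByKey -> swap -> sortByKey(False)."""
    totals = defaultdict(int)
    for line in lines:
        if header_marker in line:
            continue                      # filter: drop the header row
        key = line.split(",")[index].strip()
        totals[key] += 1                  # map to (key, 1) and reduce by key
    swapped = [(count, key) for key, count in totals.items()]
    return sorted(swapped, reverse=True)  # descending by count

# Made-up rows in the patients.csv layout, for illustration only.
sample = [
    "patient_id,DOB,Gender,marital_status,smoking_status,city",
    "1,2001-09-22,F,Divorced,Once,Achhnera",
    "2,1990-01-15,M,Single,No,Sikar",
    "3,1985-06-30,F,Married,No,Sikar",
]
print(count_by_column(sample, 2))
```

The swap is needed because sortByKey orders on the first element of each pair. In PySpark, countByValue() on the mapped keys is another way to get the same counts, returned as a dictionary on the driver.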

Find the distribution of marital_status

patient_married_status = patientfile.filter(lambda line: 'patient_id' not in line).map(lambda line: (line.split(',')[3].strip(), 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(372, u'Divorced'), (321, u'Single'), (307, u'Married')]

Find the distribution across age groups

patient_age_group_wise = patient_demographics.map(lambda line: (line[7], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(166, '10-20'), (162, '50-60'), (152, '40-50'), (151, '30-40'), (139, '0-10'), (138, '20-30'), (92, '60-70')]

Find the top 5 cities with the most patients, along with their patient counts

patient_city_wise = patient_demographics.map(lambda line: (line[5], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).take(5)

[(5, u'Talegaon Dabhade'), (5, u'Adityapur'), (5, u'Mandamarri'), (5, u'Sikar'), (4, u'Pratapgarh')]
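
Because several cities share the same count, which of the tied cities make it into a top-N result depends on ordering details. Here is a plain-Python sketch of a top-N query using collections.Counter, with made-up city values. top_cities is a hypothetical helper and is not part of the Spark job above.

```python
from collections import Counter

def top_cities(cities, n):
    """Return the n most frequent cities as (count, city) pairs."""
    counts = Counter(city.strip() for city in cities)
    # most_common sorts by count, descending; ties keep first-seen order.
    return [(count, city) for city, count in counts.most_common(n)]

# Illustrative values only, not taken from the real data set.
cities = ["Sikar", "Adityapur", "Sikar", "Achhnera", "Adityapur", "Sikar"]
print(top_cities(cities, 2))
```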

Find the distribution of smoking_status (smoking habit)

patient_smoking_wise = patient_demographics.map(lambda line: (line[4], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(256, u'Frequently'), (256, u'No'), (247, u'Once'), (241, u'Occasionally')]

We hope this blog gave you an overview of the benefits of Spark in the healthcare industry.

In the next blog, we will create a profile of each user with the various diagnoses, procedures and other attributes that can be obtained from the data.

Satyam

Satyam Kumar is a Big Data professional at AcadGild with rich experience in Big Data technologies such as Hadoop, Spark and NoSQL. He codes in programming languages such as Java and Python and has been responsible for the development of various projects and blogs related to the Hadoop ecosystem and Spark.
