Spark supports a variety of data workloads, such as ETL (Extract, Transform and Load), interactive and batch analysis, and streaming.
In this blog, we will see how to use Spark for ETL and descriptive analysis. We will use a set of patient data sets to compute a statistical summary of the data sample.
How can Spark help healthcare?
A number of use cases in healthcare institutions are well suited to a big data solution. Some academic and research-oriented healthcare institutions are either experimenting with big data or already using it in advanced research projects. The healthcare industry generates a large volume of data. Electronic Medical Records (EMRs) alone account for a huge amount of it, and there are various other sources of data in the industry as well.
Over the last decade, pharmaceutical companies have been aggregating years of research and development data into medical databases, and patient records have been digitized. In parallel, recent technical advances have made it easier to collect and analyze information from multiple sources. This is a major benefit for healthcare institutions, since data for a single patient may come from various hospitals, laboratories, and physician offices.
With so much medical data coming from various sources, machine learning algorithms can turn it into insights that guide decisions. Traditionally, physicians have relied on their own judgment when making treatment decisions, but in the last few years there has been a shift towards evidence-based medicine. This involves a systematic review of clinical data and making treatment decisions based on the best available information. Aggregating individual data sets for big data analysis often provides the most robust evidence, since nuances in subpopulations (such as the presence of patients with gluten allergies) may be so rare that they are not readily apparent in small samples.
About the data
Electronic Medical Records (EMRs) are difficult and expensive to access because of privacy concerns and technical problems. In healthcare, HIPAA compliance is non-negotiable, and nothing is more important than the privacy and security of patient data.
To work around this, the data used here is machine-generated according to predefined criteria.
The generated database has the same characteristics as an actual medical database, such as patients’ admission details, demographics, socioeconomic details, labs, medications, etc.
The database records and features are customizable. The simulated EMR data is around 2 GB in size.
If you require the data, feel free to contact support@acadild.com
Patients.csv
Field_name | patient_id | DOB | Gender | marital_status | smoking_status | city
Row1 | 1 | 2001-09-22 | F | Divorced | Once | Achhnera
The fields in this data set are defined as follows:
- patient_id: Each new patient is identified by this number
- DOB: The patient’s date of birth. We have considered patients born on or after 28-12-1950
- Gender: F-Female; M-Male
- marital_status: Divorced, Single or Married
- smoking_status: Smoking habit of the patient
- city: The city to which the patient belongs
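To make the field order concrete, here is a minimal sketch that parses the sample row into a named record. The `Patient` tuple and the `parse_patient` helper are illustrative names that do not appear in the original code, and the sketch assumes that no field contains a comma.

```python
from collections import namedtuple
from datetime import date

# Hypothetical record type mirroring the Patients.csv column order.
Patient = namedtuple(
    "Patient",
    ["patient_id", "dob", "gender", "marital_status", "smoking_status", "city"],
)

def parse_patient(line):
    """Split one CSV line into a Patient; assumes no quoted or embedded commas."""
    pid, dob, gender, marital, smoking, city = [f.strip() for f in line.split(",")]
    year, month, day = (int(x) for x in dob.split("-"))
    return Patient(int(pid), date(year, month, day), gender, marital, smoking, city)

# The sample row shown for Patients.csv.
sample = parse_patient("1,2001-09-22,F,Divorced,Once,Achhnera")
print(sample.city, sample.dob.year)
```

Named fields are only a readability aid; the Spark code in this post indexes the split line by position instead.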
Diagnosis.csv
Field_name | diagnosis_id | admission_id | patient_id | diagnosis_ICD10_code
Row1 | 1 | 17477 | 3082 | I2781
The fields in this data set are defined as follows:
- diagnosis_id: A unique id for each diagnosis
- admission_id: A unique id for each admission of a patient to the hospital
- patient_id: Each patient is identified by this number
- diagnosis_ICD10_code: Standard code for every diagnosis that has been standardized in the healthcare industry. This code is independent of the hospital and hence can be used to identify a diagnosis across hospitals.
Admission.csv
Each patient encounter is recorded in the hospital database whenever the patient visits the hospital. The following data set is generated as a result:
Field_name | admission_id | patient_id | admission_date | discharge_date
Row1 | 1 | 9062 | 2014-10-07 | 2015-09-19
The fields in this data set are defined as follows:
- admission_id: Each time a patient comes to the hospital for a consultation, he/she is assigned a new number
- patient_id: Each patient is identified by this number
- admission_date: The day on which the patient is admitted to the hospital
- discharge_date: The day on which the patient is discharged
ICD-10-diagnosis-cleaned.txt
Field_name | ICD_10_Code | Diagnosis_description
Row1 | A000 | Cholera due to Vibrio cholerae 01, biovar cholerae
The fields in the data set are defined as follows:
- ICD_10_Code: ICD-10 (International Classification of Disease version 10) code is assigned for each standard diagnosis
- Diagnosis_description: Description of the diagnosis
The Scenario
We will work with a hypothetical EMR, similar to those used in actual healthcare institutions. Each patient’s data has a variety of parameters associated with it, for example basic demographic information (gender, location, etc.) and the patient’s identified diagnoses.
A typical data science project flow is shown below:
We’ll use Python, PySpark and MLlib to compute some basic statistics for our dashboard. The work follows a few typical Spark steps, which you can reuse to get started with your own use case:
- Reading data from the file system into a Spark RDD
- Applying transformations to “massage” the data into a pair RDD
- Computing summary statistics for each patient attribute and checking the distribution of the data
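The shape of these three steps can be tried out without a cluster. The snippet below is a plain-Python analogue of the read, map-to-pairs and reduce-by-key pattern; it is not Spark code. The in-memory `lines` list stands in for the file, and every row after the first data row is made up purely for illustration.

```python
from collections import defaultdict

# Stand-in for the file contents: header, the sample row from Patients.csv,
# and two hypothetical rows added only for this illustration.
lines = [
    "patient_id,DOB,Gender,marital_status,smoking_status,city",
    "1,2001-09-22,F,Divorced,Once,Achhnera",
    "2,1975-03-14,M,Single,No,Sikar",
    "3,1988-11-02,F,Married,No,Sikar",
]

# Step 1: "read" the data and drop the header line.
rows = [line for line in lines if "patient_id" not in line]

# Step 2: map each row to a (key, 1) pair, here keyed on gender.
pairs = [(row.split(",")[2].strip(), 1) for row in rows]

# Step 3: reduce by key (sum the 1s), then sort by count, largest first.
totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value
summary = sorted(((count, key) for key, count in totals.items()), reverse=True)

print(summary)
```

In PySpark the same shape appears as `filter`, `map`, `reduceByKey` and `sortByKey` calls on an RDD, as in the scenarios that follow.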
Scenario I
Calculate each patient’s age and age group from the date of birth given in the EMR
- Load the data from patients.csv
patientfile = sc.textFile('file:///opt/spark_usecases/medical/datasets/patients.csv')
- Check the number of records to be processed (the count includes the header line)
patientfile.count()
1001
- Calculate each patient’s age and age group, then cache the result in memory, because we will be repeating operations on this RDD.
# Create new features (age and age group) from the existing attributes.
from datetime import date

def calculate_age(born):
    today = date.today()
    # Subtract one year if this year's birthday has not occurred yet.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

def prepare_date(date_form):
    year, month, day = [int(x) for x in date_form.split("-")]
    try:
        born = date(year, month, day)
    except ValueError:
        # Raised for an invalid calendar date, e.g. February 29 in a non-leap year.
        born = date(year, month, day - 1)
    return calculate_age(born)

def age_group(age):
    if age < 10:
        return '0-10'
    elif age < 20:
        return '10-20'
    elif age < 30:
        return '20-30'
    elif age < 40:
        return '30-40'
    elif age < 50:
        return '40-50'
    elif age < 60:
        return '50-60'
    elif age < 70:
        return '60-70'
    elif age < 80:
        return '70-80'
    else:
        return '80+'

# Columns: patient_id,DOB,gender,marital_status,smoking_status,city
def patient_attributes(line):
    l = line.split(",")
    age = prepare_date(l[1])
    return [l[0], l[1], l[2], l[3], l[4], l[5], age, age_group(age)]

patient_demographics = patientfile.filter(lambda line: 'patient_id' not in line).map(patient_attributes)
patient_demographics.persist()
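The age calculation relies on comparing (month, day) tuples: the comparison is True (which counts as 1) when this year's birthday has not happened yet, so one year is subtracted. The sketch below works this through for the sample patient. Passing `today` as a parameter and bucketing with integer division are assumptions made here to keep the example reproducible and compact; the functions above use `date.today()` and an if/elif chain.

```python
from datetime import date

def calculate_age(born, today):
    """Age in whole years on `today`."""
    # The tuple comparison is True (1) if the birthday is still ahead this year.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

def age_group(age):
    """Ten-year bucket label such as '10-20'; ages of 80 and over share '80+'."""
    if age >= 80:
        return "80+"
    lower = (age // 10) * 10
    return "{}-{}".format(lower, lower + 10)

born = date(2001, 9, 22)   # DOB of the sample row in Patients.csv
as_of = date(2016, 4, 9)   # fixed reference date, chosen only for this example

age = calculate_age(born, as_of)
print(age, age_group(age))  # 14 10-20
```

Because the reference date falls before the September birthday, the raw year difference of 15 is reduced to 14, which lands in the '10-20' group.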
Scenario II
Find the distribution of data for each patient attribute
- Find the distribution of male and female patients
patient_gender = patientfile.filter(lambda line: 'patient_id' not in line).map(lambda line: (line.split(',')[2].strip(), 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()
[(524, u'F'), (476, u'M')]
- Find the distribution of marital_status
patient_married_status = patientfile.filter(lambda line: 'patient_id' not in line).map(lambda line: (line.split(',')[3].strip(), 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()
[(372, u'Divorced'), (321, u'Single'), (307, u'Married')]
- Find the distribution across age groups
patient_age_group_wise = patient_demographics.map(lambda line: (line[7], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()
[(166, '10-20'), (162, '50-60'), (152, '40-50'), (151, '30-40'), (139, '0-10'), (138, '20-30'), (92, '60-70')]
- Find the 5 cities with the most patients, along with the number of patients from each
patient_city_wise = patient_demographics.map(lambda line: (line[5], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).take(5)
[(5, u'Talegaon Dabhade'), (5, u'Adityapur'), (5, u'Mandamarri'), (5, u'Sikar'), (4, u'Pratapgarh')]
- Find the distribution of smoking_status (smoking habit)
patient_smoking_wise = patient_demographics.map(lambda line: (line[4], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()
[(256, u'Frequently'), (256, u'No'), (247, u'Once'), (241, u'Occasionally')]
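Raw counts are easier to compare as shares of the total. The following is a minimal plain-Python sketch that converts the smoking_status counts printed above into percentages; `to_percentages` is a hypothetical helper name introduced only for this example.

```python
def to_percentages(counts):
    """Convert [(count, label), ...] into [(label, percent), ...]."""
    total = sum(count for count, _ in counts)
    return [(label, round(100.0 * count / total, 1)) for count, label in counts]

# Counts copied from the smoking_status output above.
smoking_counts = [(256, "Frequently"), (256, "No"), (247, "Once"), (241, "Occasionally")]
shares = to_percentages(smoking_counts)
print(shares)
```

The same helper applies to any of the collected (count, label) lists in this scenario, since they all share that layout.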
We hope this blog gave you a useful overview of the benefits of Spark in the healthcare industry.
In the next blog, we will create a profile of each patient from the diagnoses, procedures and other attributes that can be obtained from the data.