In this post, we will discuss how to convert data from XML format to JSON format using Hadoop MapReduce.
Note: For this procedure to work, your XML data should be in proper record format — the mapper processes its input one record at a time.
Java MapReduce Program
```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;

public class xml_parsing {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, NullWritable> {

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String xml_data = value.toString();
            try {
                // Convert the XML record into a JSON object
                JSONObject xml_to_json = XML.toJSONObject(xml_data);
                String json_string = xml_to_json.toString();
                context.write(new Text(json_string), NullWritable.get());
            } catch (JSONException je) {
                System.err.println(je.toString());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "xml_to_json");
        job.setJarByClass(xml_parsing.class);
        job.setNumReduceTasks(0); // map-only job, no reducer needed
        job.setMapperClass(TokenizerMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Let’s have a quick walk-through of the above MapReduce code.
To convert XML data to JSON, we need the org.json jar file, which lets us read XML data and convert it into JSON format. You can download this file from this link.
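Before wiring the conversion into Hadoop, it helps to see what it does for a single record. The sketch below is a hypothetical, dependency-free stand-in for the org.json conversion, handling only a trivial single flat element via a regex; the real `XML.toJSONObject()` handles nesting, attributes, and numeric type coercion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive stand-in for org.json's XML-to-JSON conversion (illustration only).
// Handles one flat element like <name>dfs.replication</name>; anything else
// collapses to an empty JSON object.
public class XmlElementToJsonSketch {

    // <tag>text</tag> with a matching close tag and no nested markup
    private static final Pattern ELEMENT =
            Pattern.compile("<(\\w+)>([^<]*)</\\1>");

    static String toJson(String xml) {
        Matcher m = ELEMENT.matcher(xml.trim());
        if (!m.matches()) {
            return "{}"; // unmatched fragments collapse to an empty object
        }
        return "{\"" + m.group(1) + "\":\"" + m.group(2) + "\"}";
    }

    public static void main(String[] args) {
        // → {"name":"dfs.replication"}
        System.out.println(toJson("<name>dfs.replication</name>"));
    }
}
```

This is only meant to show the shape of the mapping from an element to a JSON key/value pair; in the actual program the org.json jar does all of this work.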
After downloading the jar file, add it to your project's build path. Also copy the jar into your Hadoop installation's MapReduce lib directory, at the following path:

$HADOOP_HOME/share/hadoop/mapreduce/lib
Mapper Class
Since we are only converting one data format into another, there is no aggregation logic involved, so the Map phase alone is enough — this is a map-only job (hence the setNumReduceTasks(0) call). Let's see the mapper class of the program.
Map Method for Converting XML to JSON Format:
The map() method for converting XML to JSON is as follows:
```java
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String xml_data = value.toString();
    try {
        JSONObject xml_to_json = XML.toJSONObject(xml_data);
        String json_string = xml_to_json.toString();
        context.write(new Text(json_string), NullWritable.get());
    } catch (JSONException je) {
        System.err.println(je.toString());
    }
}
```
In the above method, value holds one XML record as a Text object, and calling its toString() method converts it into the Java String xml_data.
This XML record is then converted into a JSON object using the XML.toJSONObject() method from the org.json jar file we added earlier.
The obtained JSON object is converted back into a String using its toString() method. With this, our conversion is complete.
We write this converted JSON string to the context as the key. Since this program has no meaningful value to emit, we use NullWritable for the value.
On running the program, the XML input is converted into JSON output.
The input given for this particular program is as follows:
Sample Input
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/kiran/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/kiran/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
  <property>
    <name>dfs.support.broken.append</name>
    <value>true</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
```
Sample Output
After running the above program with the given input, we got the following output in JSON format. Note that TextInputFormat feeds the mapper one line of input at a time, so each line is converted independently — lines containing only an opening or closing tag come out as empty objects ({}).
```json
{"name":"dfs.replication"}
{"value":1}
{}
{"name":"dfs.namenode.name.dir"}
{"value":"/home/kiran/hadoop/namenode"}
{}
{"name":"dfs.datanode.data.dir"}
{"value":"/home/kiran/hadoop/datanode"}
{}
{"name":"dfs.block.size"}
{"value":67108864}
{}
```
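The shape of this output can be reproduced outside Hadoop by converting each input line in isolation, which is effectively what the mapper does. The sketch below uses a naive regex conversion as a hypothetical stand-in for org.json (so values come out as strings rather than coerced numbers); it is an illustration of the per-line behavior, not the library itself.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simulates the mapper's per-line processing: each input line is converted
// to JSON on its own, so bare <property> / </property> lines yield {}.
// The regex is a naive stand-in for org.json's XML.toJSONObject().
public class PerLineConversionSketch {

    private static final Pattern ELEMENT =
            Pattern.compile("<(\\w+)>([^<]*)</\\1>");

    static String convertLine(String line) {
        Matcher m = ELEMENT.matcher(line.trim());
        return m.matches()
                ? "{\"" + m.group(1) + "\":\"" + m.group(2) + "\"}"
                : "{}"; // lone opening/closing tags collapse to {}
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<property>",
                "<name>dfs.replication</name>",
                "<value>1</value>",
                "</property>");
        for (String line : lines) {
            System.out.println(convertLine(line));
        }
    }
}
```

Running this prints an empty object for the bare tags and one JSON object per complete element, mirroring the fragmentary structure of the sample output above. This is also why the next post's whole-record approach matters for XML that spans multiple lines.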
Hope this post has helped you learn how to convert XML to JSON in Hadoop MapReduce. Keep visiting our blog for more updates on Big Data and other technologies.
In our next post, we will be discussing the procedure for converting XML to JSON in which records are not split properly.