This blog focuses on converting XML data into CSV format using Pig commands.
We will start with a sample XML file. A Hadoop installation comes with many configuration files in XML format, and in this case we take hdfs-site.xml as our input data.
Our hdfs-site.xml file looks like this:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/kiran/hadoop/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/kiran/hadoop/datanode</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>
Now we will convert the data inside this file to CSV format using Pig.
A = load '/hdfs-site.xml' using org.apache.pig.piggybank.storage.XMLLoader('property') as (x:chararray);
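Here we load the XML file with the XMLLoader from Pig's piggybank library. The argument 'property' tells the loader which tag delimits a record (it is the record element, not the document root, which is configuration), so each <property>...</property> block becomes one row, stored as a chararray under the alias x. If Pig cannot resolve the class in your installation, register the piggybank jar first.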
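To see what one loaded row contains, the record splitting can be approximated locally. The sketch below is Python rather than Pig and is only a stand-in: the real XMLLoader is a streaming reader inside Pig, while this version uses a regular expression on an in-memory string. The names XML_TEXT and split_records are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical in-memory copy of part of the hdfs-site.xml shown above.
XML_TEXT = """<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>"""


def split_records(xml_text, tag):
    """Return every <tag>...</tag> block as one string, tags included."""
    # DOTALL lets '.' cross newlines; '.*?' stops at the first closing tag.
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(xml_text)


records = split_records(XML_TEXT, "property")
print(len(records))
print(records[0])
```

Each element of records plays the role of one value of x in relation A: a multi-line string that still carries its enclosing property tags.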
B = foreach A generate REPLACE(x,'[\\n]','') as x;
Here we remove the newline characters so that the content of each property tag sits on a single line. Each record now looks like this:
<property><name>dfs.replication</name><value>1</value></property>
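The same newline removal can be tried outside Pig. This is a minimal Python sketch, assuming the record has no indentation (as in the listing above); the variable names are made up for the example.

```python
import re

# One record as it might arrive from the loader, spread over several lines.
record = "<property>\n<name>dfs.replication</name>\n<value>1</value>\n</property>"

# Same character class as the Pig call REPLACE(x, '[\\n]', '').
one_line = re.sub(r"[\n]", "", record)
print(one_line)
```

If the real file used indentation, the spaces would survive this step, because only newline characters are removed.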
C = foreach B generate REGEX_EXTRACT_ALL(x,'.*(?:<name>)([^<]*).*(?:<value>)([^<]*).*');
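REGEX_EXTRACT_ALL applies the regular expression above to the whole line and returns the two captured groups, the text inside the name tag and the text inside the value tag, as a tuple.
Before the flatten statement, each row of the output is therefore a tuple nested inside the record, for example ((dfs.replication,1)).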
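The regular expression can be checked locally before running the Pig job. The Python sketch below uses the same pattern with re.fullmatch, on the assumption that REGEX_EXTRACT_ALL requires the pattern to match the entire string. The helper name extract_all is hypothetical.

```python
import re

# Same pattern as in the Pig statement.
PATTERN = re.compile(r".*(?:<name>)([^<]*).*(?:<value>)([^<]*).*")


def extract_all(line):
    """Return the captured groups as a tuple, or None if the line does not match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


one_line = "<property><name>dfs.replication</name><value>1</value></property>"
multi_line = "<property>\n<name>dfs.replication</name>\n<value>1</value>\n</property>"

print(extract_all(one_line))    # ('dfs.replication', '1')
print(extract_all(multi_line))  # None
```

The second call shows why the newline removal comes first: by default '.' does not match a newline, so the leading '.*' cannot reach the name tag in a multi-line record.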
D = FOREACH C GENERATE FLATTEN($0);
Here FLATTEN un-nests the tuple, which removes the remaining brackets. Each row of the final result now holds the name and the value as two separate fields, for example (dfs.replication,1).
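As a rough picture of what flattening does to a row, here is a small Python sketch. It only mimics the un-nesting of a tuple field and is not how Pig implements FLATTEN; nested_rows and flatten_first_field are names invented for the example.

```python
# Rows as they look before FLATTEN: each row has one field, itself a tuple.
nested_rows = [
    (("dfs.replication", "1"),),
    (("dfs.block.size", "67108864"),),
]


def flatten_first_field(row):
    """Promote the items of the first (tuple) field to top-level fields."""
    inner, *rest = row
    return tuple(inner) + tuple(rest)


flat_rows = [flatten_first_field(row) for row in nested_rows]
print(flat_rows)  # [('dfs.replication', '1'), ('dfs.block.size', '67108864')]
```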
The above output is then written out as CSV with the CSVExcelStorage function available in Pig's piggybank, using the command below:
STORE D INTO '/pig_conversions/xml_to_csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
The output is stored in HDFS under the directory /pig_conversions/xml_to_csv, in a file named part-m-00000. We can download this file and view its contents.
This is the final output which is in CSV format. We can now easily perform analysis on this data.
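For comparison, the same four name and value pairs can be written as CSV locally. This Python sketch uses the standard csv module and is only an approximation of the stored file: it assumes plain comma-separated fields, which should hold for these values since none of them contains a comma or a quote, but CSVExcelStorage's own quoting and line-ending options are not reproduced here. The helper to_csv is hypothetical.

```python
import csv
import io

# The name/value pairs taken from the hdfs-site.xml sample above.
rows = [
    ("dfs.replication", "1"),
    ("dfs.namenode.name.dir", "/home/kiran/hadoop/namenode"),
    ("dfs.datanode.data.dir", "/home/kiran/hadoop/datanode"),
    ("dfs.block.size", "67108864"),
]


def to_csv(rows):
    """Serialize rows to CSV text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text, end="")
```

The first line printed is dfs.replication,1, followed by one line for each of the remaining properties.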
Hope this blog helped you in learning how to convert XML data into CSV.
Keep visiting our site for more updates on BigData and other technologies.
Hi,
What is “org.apache.pig.piggybank.storage.XMLLoader”? Is it predefined in Pig, or do we have to register piggybank first?
Thanks.
Hi Nitin,
org.apache.pig.piggybank.storage.XMLLoader is an XML loader that is available by default in the latest versions of Pig. Here we have used pig-0.15.0. You do not need to register piggybank; it is already present in the Pig library itself.
Hi
How can we convert multiple XML files to separate CSV files using Apache Pig?