04 January 2016

Converting XML into CSV Using Pig

This post shows how to convert XML data into CSV format using Pig commands.

We will start with some sample XML data. A Hadoop installation comes with many configuration files in XML format, and in this case we take hdfs-site.xml as our input data.

Our hdfs-site.xml file looks like this.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/kiran/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/kiran/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.block.size</name>
  </property>
</configuration>

Now we will convert the data inside this file to CSV format using Pig.

A = load '/hdfs-site.xml' using org.apache.pig.piggybank.storage.XMLLoader('property') as (x:chararray);

Here we load the XML file with the XMLLoader class from Piggybank. The argument 'property' tells the loader which element to extract, so each <property>...</property> block becomes one record. We read each record as a single chararray field with the alias x.
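To picture what the loader hands to the next step, here is a rough Python sketch of the same idea: pull every <property>...</property> block out of the text as one string. This is only an illustration, not how XMLLoader is implemented; the sample text and the helper name load_elements are made up for this example.

```python
import re

# Hypothetical stand-in for a small hdfs-site.xml; nothing is read from HDFS.
SAMPLE_XML = """<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/kiran/hadoop/namenode</value>
</property>
</configuration>"""


def load_elements(xml_text, tag):
    """Return every <tag>...</tag> block as one string, one record per element."""
    # DOTALL lets '.' cross line breaks; '.*?' stops at the first closing tag.
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(xml_text)


records = load_elements(SAMPLE_XML, "property")
for record in records:
    print(repr(record))
```

Each record still contains its line breaks at this point, which is why the next Pig statement removes them.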

B = foreach A generate REPLACE(x,'[\\n]','') as x;

Here we remove the line breaks so that everything between the opening and closing property tags sits on one line. A record now looks like this:

<property><name>dfs.replication</name><value>1</value></property>
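The same substitution can be tried outside Pig. The sketch below applies the character class [\n] with Python's re module to one hand-written record; the record string is an assumption made for the example.

```python
import re

# One record as the loader might hand it over, line breaks included.
record = "<property>\n<name>dfs.replication</name>\n<value>1</value>\n</property>"

# Same pattern as the Pig REPLACE call: the class [\n] matches a newline.
one_line = re.sub(r"[\n]", "", record)
print(one_line)
```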

C = foreach B generate REGEX_EXTRACT_ALL(x,'.*(?:<name>)([^<]*).*(?:<value>)([^<]*).*');

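The regular expression captures the text inside the <name> and <value> tags as two groups, which drops the XML tags themselves. REGEX_EXTRACT_ALL returns the captured groups as a tuple.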
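The capture groups are easier to see on a single line of input. The following Python sketch assumes the pattern reads .*(?:<name>)([^<]*).*(?:<value>)([^<]*).* and uses fullmatch, because the Java regex engine behind Pig also has to match the whole string. The helper name extract_name_value is invented for this example.

```python
import re

# Two capture groups: the text after <name> and the text after <value>.
PATTERN = re.compile(r".*(?:<name>)([^<]*).*(?:<value>)([^<]*).*")


def extract_name_value(line):
    """Return (name, value) when the whole line matches, else None."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


line = "<property><name>dfs.replication</name><value>1</value></property>"
pair = extract_name_value(line)
print(pair)

# A property without a <value> tag does not match at all.
missing = extract_name_value("<property><name>dfs.block.size</name></property>")
print(missing)
```

Because [^<]* stops at the next opening angle bracket, each group ends right before the closing tag.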

Before the FLATTEN statement, each record of C holds a single nested tuple, for example ((dfs.replication,1)) for the first property.

D = FOREACH C GENERATE FLATTEN($0);

FLATTEN un-nests that tuple and so removes the remaining brackets. Each record now consists of two plain fields, the property name and its value, for example (dfs.replication,1).
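As a loose analogy in Python, flattening turns a record whose only field is a (name, value) tuple into a record with two top-level fields. The data and the helper flatten_first_field below are made up for illustration and only mimic the case used in this post.

```python
# Each record holds one field, and that field is itself a (name, value) tuple.
nested = [
    (("dfs.replication", "1"),),
    (("dfs.namenode.name.dir", "/home/kiran/hadoop/namenode"),),
]


def flatten_first_field(record):
    """Promote the items of the first (tuple) field to top-level fields."""
    inner = record[0]
    return tuple(inner) + tuple(record[1:])


flat = [flatten_first_field(record) for record in nested]
for row in flat:
    print(row)
```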

The above output is then written to a file with the CSVExcelStorage function from Piggybank, using the command below:

STORE D INTO '/pig_conversions/xml_to_csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage();

The output is stored in HDFS under /pig_conversions/xml_to_csv in a file named part-m-00000. We can download the file and view its contents.

This is the final output, in CSV format. We can now easily perform analysis on this data.
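For a rough idea of what comma-separated rows of this kind look like, the sketch below writes a few (name, value) pairs with Python's csv module. The rows are taken from the sample file above; the function name to_csv is an assumption for this example, and the exact quoting rules of CSVExcelStorage may differ from Python's defaults.

```python
import csv
import io

# (name, value) pairs as they would come out of the FLATTEN step.
rows = [
    ("dfs.replication", "1"),
    ("dfs.namenode.name.dir", "/home/kiran/hadoop/namenode"),
    ("dfs.datanode.data.dir", "/home/kiran/hadoop/datanode"),
]


def to_csv(rows):
    """Serialise rows as comma-separated text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text, end="")
```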

We hope this post helped you learn how to convert XML data into CSV.

Kiran Krishna

Kiran Krishna Innamuri is a passionate Big Data enthusiast with expertise in Hadoop and Spark development. He is a passionate Java and Scala programmer.

3 Comments

Nitin Kashyap

January 7, 2016 at 11:05 am

Hi,

What is “org.apache.pig.piggybank.storage.XMLLoader”? Is it predefined in Pig, or do we have to register Piggybank first?

Thanks.
- AcadGild
  
  January 8, 2016 at 1:05 pm
  
  Hi Nitin,
  
  org.apache.pig.piggybank.storage.XMLLoader is a default XML loader available in the latest versions of Pig. Here we have used pig-0.15.0. You do not need to register Piggybank; it is present in the Pig library itself.
Bala

March 3, 2016 at 3:23 pm

Hi,
How can we convert multiple XML files to separate CSV files using Apache Pig?
