Use Talend Open Studio for Big Data to ETL to Hadoop

Talend Open Studio for Big Data is a powerful, open-source ETL tool. You can download it and use it to move data to and from Hadoop, including both HDFS and Hive.

Talend Install steps

Download the free Talend Open Studio for Big Data from https://www.talend.com/products/big-data/big-data-open-studio/

Extract the download to c:\temp; if you extract the zip file under a longer subdirectory path, Windows can throw a "filename too long" error.

Now run the file C:\temp\TOS_BD-20170623_1246-V6.4.1\TOS_BD-20170623_1246-V6.4.1\TOS_BD-win-x86_64.exe

Select the project Local_Project_bigdata_demo – java in the startup prompt. Review the blog below to understand how to load XML data into Hive:

XML Data Processing Using Talend Open Studio

http://www.kalyanhadooptraining.com/2014/10/xml-data-processing-using-talend-open.html

1. Open Talend Open Studio and open a project.

2. Right-click on Job Designs and select Create job; name the job gcXMLFeedTest.

3. Right-click on Metadata -> File xml in the left menu and select Create file xml.

In step 2 of the wizard, select Input XML and click Next.

In step 3 of the wizard select the input XML file.

Step 4 is where you start to see the real power of Talend Open Studio. The Source Schema list on the left displays the schema of the XML file. The Target Schema section lets you define an output schema for the XML. Using XPath, you can now pick out the required elements from the input XML file. Drag the element that repeats in the XML to the XPath loop expression section. In this case, catalog_title is the element that embeds all the information for a single movie/title.

Next, traverse the nodes on the left and drag the required elements to the right, under the Fields to extract section. You can also provide custom column names under the Column Name section. Once you are done dragging all the required fields, click Refresh Preview to see a preview of the data; this gives you a quick idea of how the data will be parsed. Click Finish.
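Under the hood, the XPath loop plus the Fields to extract list amounts to: iterate over each repeating element, and read one value per field path. Here is a rough sketch of that logic in plain Python (not Talend's actual engine); the sample document and the field names title and year are hypothetical stand-ins for your own feed's schema:

```python
# Sketch of what tFileInputXML does conceptually: loop over the element
# named in the "Xpath loop expression" (catalog_title here) and pull one
# value per field XPath. Child field names (title, year) are hypothetical.
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE = """
<catalog>
  <catalog_title><title>Movie A</title><year>1999</year></catalog_title>
  <catalog_title><title>Movie B</title><year>2004</year></catalog_title>
</catalog>
"""

def extract_rows(xml_text, loop_path, fields):
    """Yield one row per loop element, one value per field path."""
    root = ET.fromstring(xml_text)
    for node in root.iterfind(loop_path):
        yield [node.findtext(f, default="") for f in fields]

# Write the extracted rows as CSV, like tFileOutputDelimited would.
out = io.StringIO()
writer = csv.writer(out)
for row in extract_rows(SAMPLE, "catalog_title", ["title", "year"]):
    writer.writerow(row)
print(out.getvalue())
```

The Refresh Preview pane in the wizard is essentially showing you the result of this loop before you commit to the schema.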

NEXT STEP:

Double-click the job gcXMLFeedTest to open it in the workspace. Drag the newly created XML metadata item onto the workspace. Also drag the tFileOutputDelimited component from the Palette on the right.

Right-click on the XML input file node, select Row -> Main, and join it to the tFileOutputDelimited_1 node.

Select the tFileOutputDelimited_1 node and go to the Component tab at the bottom of the workspace. Set the Field Separator to "," and set the File Name to the desired path and name.

We are now ready to test the job. Click the Run icon on the toolbar to execute it. Once it finishes, you should see a processing status of OK.

The above job reads the XML file, extracts the fields, and generates a comma-separated text file with the extracted data.

As you can see, the big XML record is now far more readable as a simple comma-separated record. This was a simple one-to-one mapping from XML to CSV; Talend Open Studio is far more powerful than this. You can add a new component to the job to apply transformations to the data coming in from the XML.
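For instance, a transformation component placed between the XML input and the CSV output could normalize fields or derive new columns. A minimal sketch of that kind of row-level transform in plain Python (this is an illustration of the idea, not Talend's tMap API; the field names are hypothetical):

```python
# Hypothetical row transform: trim and uppercase the title, and derive
# an "era" column from the year. In Talend this logic would live in a
# mapping component between the input and output nodes.
def transform(row):
    title, year = row
    era = "OLD" if int(year) < 2000 else "NEW"
    return [title.strip().upper(), year, era]

print(transform([" Movie A ", "1999"]))
```

Keeping transforms in a dedicated step like this means the XML extraction and the CSV writing stay reusable as-is.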

After running the job, we get the XML data in a CSV file, out.csv.

NEXT: set up the Hive connection and load the data to a Hive table.

Drag and drop the following four components from the Palette on the right to the designer:

tHiveConnection_1

tHDFSPut_4

tHiveRow_1

tHiveClose_1

Select Cloudera as the distribution and Cloudera CDH5.10 (YARN) as the version, and install the external jars when prompted.

Set the connection properties for Hadoop and Hive.

In the tHDFSPut component you will have to set:

Local Directory: "c:/temp" (note the forward slash, not a backslash)

HDFS Directory: “/user/hive/mydata”

Files:

Filemask      New Name
"out.csv"     "out.csv"

In the tHiveRow_1 component you will need to set the query to:

"LOAD DATA INPATH '/user/hive/mydata/out.csv' OVERWRITE INTO TABLE sandbox.xmlmetadata"

Make sure the sandbox.xmlmetadata table was already created in Hive.
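Since LOAD DATA simply moves the file into the table's storage location, the table's layout must match out.csv. A sketch of the DDL, with hypothetical columns title and year standing in for whatever fields you extracted; note that the FIELDS TERMINATED BY clause must match the "," separator configured in tFileOutputDelimited:

```sql
-- Hypothetical DDL for the target table; replace the columns with the
-- fields you actually extracted from the XML. The field terminator must
-- match the Field Separator set on tFileOutputDelimited.
CREATE TABLE IF NOT EXISTS sandbox.xmlmetadata (
  title STRING,
  year  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```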

Run the ETL pipeline to load the data to Hive.

Below are some more good blogs that show how to process XML data in Talend and load it to Hive.

https://www.talendbyexample.com/talend-xml-reference.html

http://talend.tips/category/big-data/