Talend Open Studio for Big Data is a powerful, open-source ETL tool. You can download it and use it to run ETL to and from Hadoop, including both HDFS and Hive.
Talend Install steps
Download the free Talend Open Studio for Big Data from https://www.talend.com/products/big-data/big-data-open-studio/
Extract the download to C:\temp; extracting the zip under longer subdirectory names can trigger a "filename too long" error.
Now run the file C:\temp\TOS_BD-20170623_1246-V6.4.1\TOS_BD-20170623_1246-V6.4.1\TOS_BD-win-x86_64.exe
At the startup prompt, select the project Local_Project_bigdata_demo – java. Review the blog below to understand how to load XML data into Hive:
XML Data Processing Using Talend Open Studio
1. Open up Talend Open Studio and a project.
2. Right-click on Job Designs and select Create job – gcXMLFeedTest.
3. Right-click on Metadata->File xml in the left menu and select Create file xml.
4. In step 2 of the wizard, select Input XML and click Next.
5. In step 3 of the wizard, select the input XML file.
Step 4 is where you start to see the real power of Talend Open Studio. The Source Schema list on the left displays the schema of the XML file. The Target Schema section provides a way of defining an output schema for the XML. Using XPath, you can now pick out the required elements from the input XML file. Drag the element that repeats in the XML to the XPath loop expression section. In this case, catalog_title is the element that embeds all information for a single movie/title.
Next, traverse the nodes on the left and drag the required elements to the right under the Fields to extract section. You can also provide custom column names under the Column Name section. Once you are done dragging all the required fields, click Refresh Preview to see a preview of the data. This preview gives you a quick idea of how the data will be parsed. Click Finish.
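Conceptually, the XPath loop expression plus Fields to extract works like the following minimal Python sketch. The catalog_title loop element comes from the feed above, but the sample XML and the title/director field names here are hypothetical stand-ins for whatever fields you actually extract:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-record feed; the real file carries many more fields per title.
XML = """
<catalog>
  <catalog_title><title>Alpha</title><director>Smith</director></catalog_title>
  <catalog_title><title>Beta</title><director>Jones</director></catalog_title>
</catalog>
"""

def xml_to_csv(xml_text):
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    # XPath loop expression: iterate over each repeating catalog_title element.
    for node in root.iterfind(".//catalog_title"):
        # Fields to extract: relative paths under the loop element, one CSV row per loop.
        writer.writerow([node.findtext("title"), node.findtext("director")])
    return out.getvalue()

print(xml_to_csv(XML), end="")
```

Each iteration of the loop element becomes one delimited row, which is exactly what the job hands to tFileOutputDelimited in the next step.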
Double-click the job gcXMLFeedTest to open it in the workspace. Drag in the newly created XML metadata. Also drag a tFileOutputDelimited component from the Palette on the right.
Right-click the XML input file node, select Row->Main, and join it to the tFileOutputDelimited_1 node.
Select the tFileOutputDelimited_1 node and go to the "Component" tab at the bottom of the workspace. Set the Field Separator to "," and set the File Name to the desired path and name.
We are now ready to test our job. Click the Run icon on the toolbar to execute it. Once executed, you should see the processing status OK.
The above job reads the XML file, extracts the fields, and generates a comma-separated text file with the extracted data.
As you can see, the big XML node is now much more readable as a simple comma-separated record. This was a simple one-to-one mapping from XML to CSV; Talend Open Studio is far more powerful than this. You can add a new component to the job to apply transformations to the data coming in from the XML.
After running the job, we get the XML data in a CSV file, out.csv.
Next, set up the Hive connection and load into the Hive table:
Drag and drop the four components from the Palette on the right into the designer:
Select Cloudera as the distribution and Cloudera CDH5.10 (YARN) as the version, and install the external jars.
Set the connection properties for Hadoop and Hive.
In tHDFSPut you will have to set:
Local Directory: "c:/temp" (note the forward slash, not a backslash)
HDFS Directory: "/user/hive/mydata"
In the Files table, set the Filemask to "out.csv" (the file generated earlier); New name can be left blank.
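In effect, tHDFSPut copies the local file into HDFS. For reference, the equivalent step from a configured Hadoop client would be a sketch like this, using the paths above (run from an edge node the local path would normally be a Unix path):

```shell
# Create the target HDFS directory if needed, then upload the CSV produced earlier.
hdfs dfs -mkdir -p /user/hive/mydata
hdfs dfs -put -f c:/temp/out.csv /user/hive/mydata/out.csv
```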
In the tHiveRow_1 component you will need the query:
"LOAD DATA INPATH '/user/hive/mydata/out.csv' OVERWRITE INTO TABLE sandbox.xmlmetadata"
Make sure the table sandbox.xmlmetadata was already created in Hive earlier.
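A table suitable for the comma-separated output could be created beforehand with DDL along these lines; the column names here are hypothetical and should match the fields you extracted in the XML wizard:

```sql
-- Hypothetical schema; align the columns with the fields extracted from the XML.
CREATE DATABASE IF NOT EXISTS sandbox;
CREATE TABLE IF NOT EXISTS sandbox.xmlmetadata (
  title STRING,
  director STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

With a TEXTFILE table delimited by commas, the LOAD DATA INPATH statement above simply moves out.csv into the table's HDFS location.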
Run the ETL pipeline to load data to Hive.
Below are some more good blogs that show how to process XML data in Talend and load it into Hive.