
Cloudera Search (Solr) install steps

The following steps install Cloudera Search, which is based on Apache Solr.

Environment:

Cloudera CDH 5.12.x

solr-spec 4.10.3

Deploying Cloudera Search

Cloudera Search (powered by Apache Solr) is included in CDH 5. If you have installed CDH 5.0 or higher, you do not need to perform any additional actions to install Search.

When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes, and uses ZooKeeper to coordinate and simplify management, resulting in a cluster of coordinating Apache Solr servers.

For Cloudera Manager installations, if you have not yet added the Solr service to your cluster, do so now from the Cloudera Manager->Home->Add Service dropdown.  The Add a Service wizard automatically configures and initializes the Solr service.

On the next screen in CMgr, select the hosts where you want to deploy the Solr servers. I selected two hosts that also had ZooKeeper running on them. Keep the default ZooKeeper Znode /solr and HDFS Data Directory /solr on the next screen.

However, while running the Add Service wizard, I got the error “Sentry requires that authentication be turned on for Solr.”

I clicked Back on the page and this time removed the Sentry dependency, keeping only HDFS and ZooKeeper. In another browser tab I also went to Cloudera Manager->Solr->Configuration and set Sentry Service to None. After that I retried the install and it ran successfully.

Next, in CMgr go to Hue->Configuration->Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini and remove search from the blacklist:

[desktop]

app_blacklist=hbase,impala

 

NEXT STEP:

Log in to Hue and click Query->Dashboard. But this produced an error:

Error!

HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/cores?user.name=hue&doAs=hive&wt=json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fba0805bad0>: Failed to establish a new connection: [Errno 113] No route to host',))
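The root cause here is that Hue is trying to reach a Solr server on localhost:8983, where nothing is listening. Before changing any configuration, a quick TCP reachability check from the Hue host can confirm this. The snippet below is a minimal sketch using only the Python standard library; the host and port are whatever your hue.ini points at:

```python
import socket

def solr_reachable(host, port=8983, timeout=3):
    """Return True if a TCP connection to host:port succeeds (illustrative helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the host/port that Hue is configured to use for Solr
print(solr_reachable("localhost", 8983))
```

If this prints False, Hue's Solr URL is pointing at the wrong host, or the port is blocked/closed, which matches the “No route to host” above.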

To resolve this issue, go to Cloudera Manager->Hue->Configuration and search for solr.

In Solr Service->Hue (Service-Wide), select the Solr radio button if None is selected.

This automatically updates these lines in hue.ini:

-app_blacklist=spark,zookeeper,hbase,impala,search
+app_blacklist=spark,zookeeper,hbase,impala
+[search]
+solr_url=http://solrhostxyz.com:8983/solr

Click Restart in CMgr. After that, clicking Hue->Query->Dashboard works and shows a new message:

It seems there is nothing to search on..

What about creating a new index?

This indicates that Hue is now successfully configured with Solr search.

NEXT STEPS:

A little about SolrCores and Collections

On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. All of the SolrCores that make up one logical index are called a collection. A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy.
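As a rough mental model (my own illustration, not Solr's actual routing code), a collection behaves like a logical index whose documents are assigned to one of N shards, each backed by a SolrCore that may live on a different machine. Solr's compositeId router does this by hashing the document id; the sketch below mimics the idea with a simple hash (the collection name, shard count, and core names here are hypothetical):

```python
import hashlib

NUM_SHARDS = 4  # hypothetical collection with 4 shards spread over 2 Solr servers
cores = {s: "collection1_shard%d_replica1" % (s + 1) for s in range(NUM_SHARDS)}

def shard_for(doc_id):
    # Hash the document id and map it onto a shard. Solr actually uses
    # MurmurHash over a hash ring; modulo arithmetic just conveys the idea.
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % NUM_SHARDS

for doc_id in ("review-1", "review-2", "review-3"):
    print(doc_id, "->", cores[shard_for(doc_id)])
```

Every document id maps deterministically to one core, so a query against the collection fans out to all shards and merges the results.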

EXAMPLE:  Easy indexing of data into Solr with ETL operations

http://gethue.com/easy-indexing-of-data-into-solr/

Selecting data

We need to create a new Solr collection from business review data. To start let’s put the data file somewhere on HDFS so we can access it.

Click the Hue->Indexes menu option on the left (Note: Index and Collection are mostly the same thing), then click Add Index.

Give the index a name and the location of the .csv file that was uploaded to HDFS using Hue. It will recognize the fields and data types, and we can create the index by clicking Finish.
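Under the hood the indexer has to guess a field type for each CSV column. The sketch below is my own illustration of that kind of inference (not Hue's actual code, and the type names are generic rather than exact Solr field types; the sample data is made up):

```python
import csv, io

def guess_type(values):
    """Guess a field type for a column from its sample values (illustrative only)."""
    def parses_as(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if parses_as(int):
        return "long"
    if parses_as(float):
        return "double"
    return "text_general"

# Hypothetical sample rows from the business review .csv
sample = "name,stars,review\nJoes Diner,4,Great burgers\nCafe Blue,3.5,Nice coffee\n"
rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]
fields = {col: guess_type([row[i] for row in data]) for i, col in enumerate(header)}
print(fields)  # {'name': 'text_general', 'stars': 'double', 'review': 'text_general'}
```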

After the index is created we can see it on the left under Collections. Click on the index then click on Search. This will display a Dashboard and you can create various Dashboards from the data in Hue.

NEXT STEP:

If you go to the Solr admin page you can get a lot of details about Solr and the index (collection) that was created:

http://mysolrhost:8983

 

Next, running a list command on the Solr server shows the instance directories, including the one backing our new collection:

/root>solrctl instancedir --list

index1_reviews

managedTemplate

managedTemplateSecure

predefinedTemplate

predefinedTemplateSecure

schemalessTemplate

schemalessTemplateSecure

 

Reference:

https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cm_ig_install_search.html

http://gethue.com/easy-indexing-of-data-into-solr/

 

 


Business Intelligence, ETL and Data Science tools

Open-source / free BI and ETL tools:

Talend = ETL tool, leader in Gartner Magic Quadrant

HUE = Hadoop Analytics server

Jupyter Notebook = data science / BI notebook tool

Pentaho = desktop and server version

KNIME = data science leader in the 2017 Gartner Magic Quadrant, desktop version

Power BI = free desktop version

Oracle SQL Developer = Free SQL tool

 

Commercial BI tools:

Tableau = desktop and server version

MicroStrategy = desktop and server version

Qlik = desktop and server version

Microsoft SSIS/SSAS/SSRS

RapidMiner = Data Science tool

 

 

 

Install the Anaconda Python parcel on Cloudera CDH.

 

This blog will show how to install the Anaconda parcel in CDH to enable pandas and other Python libraries in the Hue pySpark notebook.

Install Steps:

Installing the Anaconda Parcel

1. From the Cloudera Manager Admin Console, click the “Parcels” indicator in the top navigation bar.

2. Click the “Configuration” button on the top right of the Parcels page.

3. Click the plus symbol in the “Remote Parcel Repository URLs” section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/

4. Click the “Save Changes” button.

NOTE: The Anaconda parcel still didn’t show up in the Parcels list. There was a NullPointerException in the Cloudera Manager log:

2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:com.cloudera.server.web.cmf.ParcelController: Synchronizing repos based on user request admin

2017-11-17 12:43:08,880 WARN ParcelUpdateService:com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: Error while executing CmfEntityManager task

java.lang.NullPointerException

at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)

at com.google.common.collect.Collections2.filter(Collections2.java:92)

at com.cloudera.parcel.components.ParcelDownloaderImpl$RepositoryInfo.getParcelsWithValidNames(ParcelDownloaderImpl.java:673)

 

The workaround is to remove the following Sqoop URLs from Cloudera Manager Parcels->Configuration->Remote Parcel Repository URLs (both the http and https entries). After that, the Anaconda parcel showed up in the Parcels list.

http://archive.cloudera.com/sqoop-connectors/parcels/latest/

https://archive.cloudera.com/sqoop-connectors/parcels/latest/

 

5. Click the “Download” button to the right of the Anaconda parcel listing.

6. After the parcel is downloaded, click the “Distribute” and “Activate” buttons to distribute and activate the parcel on all of the cluster nodes.

7. After the parcel is activated, Anaconda is available on all of the cluster nodes.

8. The Cloudera Manager Spark status will show a stale configuration. Restart and deploy the stale configurations by clicking the button in CM.

9. Anaconda is now deployed and can be used with the Hue pySpark notebook. Test a small example in Hue->Query->Editor->pySpark:

Example: Pandas test

#!/usr/bin/env python

def import_pandas(x):
    import pandas  # imported on the worker to verify Anaconda is available on the executors
    return x + 10

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x)).collect()

 


Result

[11, 12, 13, 14]

 

Now you can start using the Python libraries in the Anaconda package, such as pandas, NumPy, and SciPy, from the Hue pySpark notebook.

References:

http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/