Thursday, 31 October 2013

HBase – Overview of Architecture and Data Model


HBase is a column-oriented database that’s an open-source implementation of Google’s Big Table storage architecture. It can manage structured and semi-structured data and has some built-in features such as scalability, versioning, compression and garbage collection. Since its uses write-ahead logging and distributed configuration, it can provide fault-tolerance and quick recovery from individual server failures. HBase built on top of Hadoop / HDFS and the data stored in HBase can be manipulated using Hadoop’s MapReduce capabilities.
Let’s now take a look at how HBase (a column-oriented database) is different from some other data structures and concepts that we are familiar with Row-Oriented vs. Column-Oriented data stores. As shown below, in a row-oriented data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in a column is stored together and hence quickly retrieved.

Row-oriented data stores:

  • Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the data in a row is required.
  • Easy to read and write records
  • Well suited for OLTP systems
  • Not efficient in performing operations applicable to the entire dataset and hence aggregation is an expensive operation
  • Typical compression mechanisms provide less effective results than those on column-oriented data stores

Column-oriented data stores:

  • Data is stored and retrieved in columns and hence can read only relevant data if only some data is required
  • Read and Write are typically slower operations
  • Well suited for OLAP systems
  • Can efficiently perform operations applicable to the entire dataset and hence enables aggregation over many rows and columns
  • Permits high compression rates due to few distinct values in columns

Introduction Relational Databases vs. HBase

When talking of data stores, we first think of Relational Databases with structured data storage and a sophisticated query engine. However, a Relational Database incurs a big penalty to improve performance as the data size increases. HBase, on the other hand, is designed from the ground up to provide scalability and partitioning to enable efficient data structure serialization, storage and retrieval.
Differences between a Relational Database and HBase are:

Relational Database:
  • Is Based on a Fixed Schema
  • Is a Row-oriented datastore
  • Is designed to store Normalized Data
  • Contains thin tables
  • Has no built-in support for partitioning.
  • Is Schema-less
  • Is a Column-oriented datastore
  • Is designed to store Denormalized Data
  • Contains wide and sparsely populated tables
  • Supports Automatic Partitioning

HDFS vs. HBase

HDFS is a distributed file system that is well suited for storing large files. It’s designed to support batch processing of data but doesn’t provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide access to single rows of data in large tables. 

Differences between HDFS and HBase are

  • Is suited for High Latency operations batch processing
  • Data is primarily accessed through MapReduce
  • Is designed for batch processing and hence doesn’t have a concept of random reads/writes
  • Is built for Low Latency operations
  • Provides access to single rows from billions of records
  • Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift 

HBase Architecture

The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown below. Typically, the HBase cluster has one Master node, called HMaster and multiple Region Servers called HRegionServer. Each Region Server contains multiple Regions – HRegions.
Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.

The HMaster in the HBase is responsible for
  • Performing Administration
  • Managing and Monitoring the Cluster
  • Assigning Regions to the Region Servers
  • Controlling the Load Balancing and Failover
On the other hand, the HRegionServer perform the following work
  • Hosting and managing Regions
  • Splitting the Regions automatically
  • Handling the read/write requests
  • Communicating with the Clients directly
Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families (explained below). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Server is kept in a system table called .META. When trying to read or write data from HBase, the clients read the required Region information from the .META table and directly communicate with the appropriate Region Server. Each Region is identified by the start key (inclusive) and the end key (exclusive) 

HBase Data Model

The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster. The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.

Tables – The HBase Tables are more like logical collection of rows stored in separate partitions called Regions. As shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of a Table.

Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are always treated as a byte[].

Column Families – Data in a row are grouped together as Column Families. Each Column Family has one more Columns and these Columns in a family are stored together in a low level storage file known as HFile. Column Families form the basic unit of physical storage to which certain HBase features like compression are applied. Hence it’s important that proper care be taken when designing Column Families in table. The table above shows Customer and Sales Column Families. The Customer Column Family is made up 2 columns – Name and City, whereas the Sales Column Families is made up to 2 columns – Product and Amount.

Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon – example: columnfamily:columnname. There can be multiple Columns within a Column Family and Rows within a table can have varied number of Columns.

Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column (Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].

Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of versions of data retained in a column family is configurable and this value by default is 3.

Sunday, 27 October 2013

Tarball installation of CDH4 with Yarn on RHEL 5.7

Step: 1

Download the tarball from cloudera Site or simple click  here

Step: 2

Untar the tarball on anyplace or you ca do in home directory as I did

$  tar –xvzf  hadoop-2.0.0-cdh4.1.2.tar.gz

Step: 3

Set the different home directory in /etc/profile

export JAVA_HOME=/usr/java/jdk1.6.0_22
export PATH=/usr/java/jdk1.6.0_22/bin:"$PATH"
export HADOOP_HOME=/home/hadoop/hadoop-2.0.0-cdh4.1.2

Step: 4

Create hadoop directory in /etc , Create a softlink in /etc/hadoop/conf  - > $HADOOP_HOME/etc/hadoop

$ ln –s  /home/hadoop/hadoop-2.0.0-cdh4.1.2/etc/hadoop  /etc/hadoop/conf

Step : 5

Create different directories listed here

For datanode

$ mkdir  ~/dfs/dn1   ~/dfs/dn2
$ mkdir   /var/log/hadoop

For Namenode

$ mkdir  ~/dfs/nn   ~/dfs/nn1

For SecondaryNamenode

$  mkdir  ~/dfs/snn

for Nodemanager

$ mkdir  ~/yarn/local-dir1  ~/yarn/local-dir2
$ mkdir ~/yarn/apps

for Mapred

$ mkdir  ~/yarn/tasks1  ~/yarn/tasks2

Step: 6

After Creating all the Directories now its time for setting up Hadoop Conf File, Add following properties in there xml files.

1:-  Core-Site.xml

2:- mapred-site.xml

3:- Hdfs-site.xml

4: Yarn-Site.xml

Step: 7

After completion of xmls edits, now we need to bit modify the hadoop-en.sh and yarn-env.sh ,


Replace this line


motive is to increase the memory requirement of hadop clients , to run the jobs.


If  you are running jobs with different user then yarn , change here


Step : 8

Now you completed the hadoop installation , this is the time to format and run the daemons process.

Format hadoop filesystem

$  $HADOOP_HOME/bin/hdfs  namenode –format

Once all the necessary configuration is complete, distribute the files to theHADOOP_CONF_DIR directory on all the machines.

export HADOOP_CONF_DIR=/etc/hadoop/conf
$ $HADOOP_HOME/bin/hadoop fs –mkdir /user

Step : 9

Hadoop Startup

Start the HDFS with the following command, run on the designated
  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs $1 secondarynamenode

Run a script to start DataNodes on all slaves:

  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Start the YARN with the following command, run on the designated

  $ $HADOOP_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Run a script to start NodeManagers on all slaves:

  $ $HADOOP_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

Start the MapReduce JobHistory Server with the following command, run on the designated server:
  $ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Hadoop Shutdown

Stop the NameNode with the following command, run on the designated NameNode:

  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode

Run a script to stop DataNodes on all slaves:

  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode

Stop the ResourceManager with the following command, run on the designated ResourceManager:

  $ $HADOOP_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager

Run a script to stop NodeManagers on all slaves:

  $ $HADOOP_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager

Stop the WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:

  $ $HADOOP_HOME/bin/yarn stop proxyserver --config $HADOOP_CONF_DIR

Stop the MapReduce JobHistory Server with the following command, run on the designated server:

  $ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR

Now After starting successfully you can Check the namenode Url at


Yarn URl at


Step: 10

Running any Example to check if its working or not

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar pi 5 100

Run the Pi example to verify that and you can see it on yarn url, if its working or not.

Some Tweaks:

You can set alias in .profile

alias  hd=”$HADOOP_HOME/bin/hadoop ”

or create small shell script to start and stop those process like

$ vi dfs.sh

  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs $1 namenode
  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs $1 datanode
  $ $HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs $1 secondarynamenode

after adding these lines you can start the process just by running dfs.sh

$ sh dfs.sh start/stop

Saturday, 26 October 2013

what are the challenges of cloud computing?

Ensuring adequate performance. The inherent limitations of the Internet apply to cloud computing. These performance limitations can take the form of delays caused by demand and traffic spikes, slow downs caused by malicious traffic/attacks, and last mile performance issues, among others.

Ensuring adequate security. Many cloud-based applications involve confidential data and personal information. Therefore, one of the key barriers cloud providers have had to overcome is the perception that cloud-based services are less secure than desktop-based or data center-based services.

Ensuring the costs of cloud computing remain competitive. As a leading provider of infrastructure for cloud computing, we are uniquely positioned to help customers overcome the challenges of cloud computing and fully realize its many benefits. Our Intelligent Platform consists of more than 100,000 servers all over the world, running securely and delivering a significant percentage of the world’s cloud computing applications and services.

Disadvantages of Cloud Computing

While cloud computing service is a great innovation in the field of computing but still, there are a number of reasons why people not want to adopt cloud computing for their particular need.


One major disadvantages of cloud computing is user’s dependency on the provider. Internet users don’t have their data stored with them.


Cloud computing services means taking services from remote servers. There is always insecurity regarding stored documents because users does not have control over their software. Nothing can be recovered if their servers go out of service.

Requires a Constant internet connection

The most obvious disadvantage is that Cloud computing completely relies on network connections.

It makes your business dependent on the reliability of your Internet connection. When it’s offline, you’re offline. If you do not have an Internet connection, you can't access anything, even your own data. A dead internet connection means no work. Similarly, a low-speed Internet connection, such as that found with dial-up services, makes cloud computing painful at best and often impossible. Web-based apps often require a lot of bandwidth to download,. In other words, cloud computing isn't for the slow connection.


Security and privacy are the biggest concerns about cloud computing. Companies might feel uncomfortable knowing that their data is stored in a virtual server which makes responsibility on the security of the data difficult to determine and even users might feel uncomfortable handing over their data to a third party.

Privacy is another big issue with the cloud computing server. To make cloud servers more secure to ensure that a clients data is not accessed by any unauthorized users, cloud service providers have developed password protected accounts, security servers through which all data being transferred must pass and data encryption technique.

Migration Issue

Migration problem is also a big concern about cloud computing. If the user wants to switch to some other Provider then it is not easy to transfer huge data from one provider to another.

Cloud Computing – Types of Cloud

Cloud computing is usually described in one of two ways. Either based on the cloud location, or on the service that the cloud is offering.

Based on a cloud location, we can classify cloud as:

  1. public,
  2. private,
  3. hybrid
  4. community cloud

Based on a service that the cloud is offering, we are speaking of either:

  1. IaaS (Infrastructure-as-a-Service)
  2. PaaS (Platform-as-a-Service)
  3. SaaS (Software-as-a-Service)
  4. or, Storage, Database, Information, Process, Application, Integration, Security, Management, Testing-as-a-service

Where Do I Pull the Switch: Cloud Location

public cloud mean that the whole computing infrastructure is located on the premises of a cloud computing company that offers the cloud service. The location remains, thus, separate from the customer and he has no physical control over the infrastructure.

As public clouds use shared resources, they do excel mostly in performance, but are also most vulnerable to various attacks.

  • Virtually unlimited resources – You can instantly provision virtually unlimited amount of resources
  • Scalability & Elasticity – You can scale up or down your resources to meet demand peaks and lows, and you pay only for what you use
  • Pay as you go – No Capex, you pay a monthly bill

  • Lack of Perceived Amortization Benefits on Investment – I have come across, large enterprises which shy away from Public Cloud as it is perceived that the amortization benefits on Capital Investments are higher than the Operational Expense benefits incurred on Cloud.
  • Compliance – Public Cloud providers may not be following all the regulatory compliance required by an organization.

If you are a startup or a small enterprise, Public Cloud is the best option to adapt Cloud Computing. You get access to the best in class resources on a pay as you go basis without any initial investments. Also you save on the amount to be spent on maintenance of the resources.
Private cloud means using a cloud infrastructure (network) solely by one customer/organization. It is not shared with others, yet it is remotely located. If the cloud is externally hosted. The companies have an option of choosing an on-premise private cloud as well, which is more expensive, but they do have a physical control over the infrastructure.

The security and control level is highest while using a private network. Yet, the cost reduction can be minimal, if the company needs to invest in an on-premise cloud infrastructure.

  • Security – There is a sense of security among organizations as the data resides on premise
  • Compliance – Enterprises can comply to the compliance standards required for their industries and follow the required corporate governance structure for their organizations
  • Costs – Organizations need to own the hardware, storage and networking resources upfront and also spend on the maintenance of all the resources
  • Complexity – Private Clouds are complex to deploy and maintain because of the complex virtualization of the hardware resources

If you are a large enterprise, it makes sense to capitalize on your existing investments which are already made on hardware infrastructure and have a private cloud deployment on top of it.

Hybrid cloud means, using both private and public clouds, depending on their purpose.

  • Flexibility – Organizations can make use of various Public and Private Clouds to utilize the advantages of both the deployment models.
  • Cloud Bursting – You can run an application on private cloud and burst it to public cloud to meet demand peaks
  • Complexity – To deploy a hybrid model is quite complex because of the varying standards of each provider

Large enterprises are adopting this model and using it in multiple ways like Storage & Archiving, Cloud bursting, development and test on Public Cloud & Production on Private Cloud among others.

Community cloud implies an infrastructure that is shared between organizations, usually with the shared data and data management concerns. For example, a community cloud can belong to a government of a single country. Community clouds can be located both on and off the premises.

What Can I Do With It: Cloud Service

What do we mean by cloud computing services? Cloud computing comes in three basic flavors: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS).

Software as a Service (SaaS)

SaaS is far and away the most common model of cloud service: Companies buy access to an application but have no responsibility for (and no control over) its implementation. More than 60% of companies that Nemertes works with already use at least one (and often several ) applications that they get via SaaS, ranging from horizontally useful tools such as customer relationship management (as with Salesforce.com) to more vertically specific tools for such tasks as insurance claims adjustment, classroom scheduling and medical billing management.

Platform as a Service (PaaS)

PaaS involves providing a platform on which a customer can run its own applications. For example, a small company might have a Java application to which it has trouble providing enough resources during holiday peak loads. The company might go to a platform provider, such as Akamai, to run the system on its Java application server framework. Microsoft, Force.com and Google also provide platforms on which customers can run applications.

Infrastructure as a Service (Iaas)

IaaS allows an organization to run entire data center application stacks, from the operating system up to the application, on a service provider's infrastructure. Amazon's Elastic Compute Cloud is perhaps the most famous public cloud infrastructure available.

Friday, 25 October 2013

Benifits of Cloud Computing

Achieve economies of scale – increase volume output or productivity with fewer people. Your cost per unit, project or product plummets.

Reduce spending on technology infrastructure. Maintain easy access to your information with minimal upfront spending. Pay as you go (weekly, quarterly or yearly), based on demand.

Globalize your workforce on the cheap. People worldwide can access the cloud, provided they have an Internet connection.

Streamline processes. Get more work done in less time with less people.

Reduce capital costs. There’s no need to spend big money on hardware, software or licensing fees.

Improve accessibility. You have access anytime, anywhere, making your life so much easier!

Monitor projects more effectively. Stay within budget and ahead of completion cycle times.

Less personnel training is needed. It takes fewer people to do more work on a cloud, with a minimal learning curve on hardware and software issues.

Minimize licensing new software. Stretch and grow without the need to buy expensive software licenses or programs.

Improve flexibility. You can change direction without serious “people” or “financial” issues at stake.

Introduction to Cloud Computing

Cloud computing, or something being in the cloud, is an expression used to describe a variety of different types of computing concepts that involve a large number of computers connected through a real-time communication network such as the Internet

cloud computing is a synonym for distributed computing over a network and means the ability to run a program on many connected computers at the same time.

The phrase is also more commonly used to refer to network-based services which appear to be provided by real server hardware, which in fact are served up by virtual hardware, simulated by software running on one or more real machines. Such virtual servers do not physically exist and can therefore be moved around and scaled up (or down) on the fly without affecting the end user

Wednesday, 23 October 2013

Why do I Need Hadoop

Too Much Data

Hadoop provides storage for Big Data at reasonable cost

Storing Big Data using traditional storage can be expensive. Hadoop is built around commodity hardware. Hence it can provide fairly large storage for a reasonable cost. Hadoop has been used in the field at Peta byte scale.

Hadoop allows to capture new or more data

Some times organizations don't capture a type of data, because it was too cost prohibitive to store it. Since Hadoop provides storage at reasonable cost, this type of data can be captured and stored.

One example would be web site click logs. Because the volume of these logs can be very high, not many organizations captured these. Now with Hadoop it is possible to capture and store the logs

With Hadoop, you can store data longer

To manage the volume of data stored, companies periodically purge older data. For example only logs for the last 3 months could be stored and older logs were deleted. With Hadoop it is possible to store the historical data longer. This allows new analytics to be done on older historical data.

For example, take click logs from a web site. Few years ago, these logs were stored for a brief period of time to calculate statics like popular pages ..etc. Now with Hadoop it is viable to store these click logs for longer period of time.

Hadoop provides scalable analytics

There is no point in storing all the data, if we can't analyze them. Hadoop not only provides distributed storage, but also distributed processing as well. Meaning we can crunch a large volume of data in parallel. The compute framework of Hadoop is called Map Reduce. Map Reduce has been proven to the scale of peta bytes.

Hadoop provides rich analytics

Native Map Reduce supports Java as primary programming language. Other languages like Ruby, Python and R can be used as well.

Of course writing custom Map Reduce code is not the only way to analyze data in Hadoop. Higher level Map Reduce is available. For example a tool named Pig takes english like data flow language and translates them into Map Reduce. Another tool Hive, takes SQL queries and runs them using Map Reduce.

Business Intelligence (BI) tools can provide even higher level of analysis. Quite a few BI tools can work with Hadoop and analyze data stored in Hadoop.

Big Data

What is Big Data

Big Data is very large, loosely structured data set that defies traditional storage.

"Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time”. – wiki

So big that a single data set may contains few terabytes to many petabytes of data.

Human Generated Data and Machine Generated Data

Human Generated Data is emails, documents, photos and tweets. We are generating this data faster than ever. Just imagine the number of videos uploaded to You Tube and tweets swirling around. This data can be Big Data too.

Machine Generated Data is a new breed of data. This category consists of sensor data, and logs generated by 'machines' such as email logs, click stream logs, etc. Machine generated data is orders of magnitude larger than Human Generated Data.

Before 'Hadoop' was in the scene, the machine generated data was mostly ignored and not captured. It is because dealing with the volume was NOT possible, or NOT cost effective.

Where does Big Data come from

Original big data was the web data -- as in the entire Internet! Remember Hadoop was built to index the web. These days Big data comes from multiple sources.

  • Web Data -- still it is big data
  • Social media data : Sites like Facebook, Twitter, LinkedIn generate a large amount of data
  • Click stream data : when users navigate a website, the clicks are logged for further analysis (like navigation patterns). Click stream data is important in on line advertising and and E-Commerce
  • sensor data : sensors embedded in roads to monitor traffic and misc. other applications generate a large volume of data

Examples of Big Data in the Real world

  • Facebook : has 40 PB of data and captures 100 TB / day
  • Yahoo : 60 PB of data
  • Twitter : 8 TB / day
  • EBay : 40 PB of data, captures 50 TB / day

Challenges of Big Data

Size of Big Data

Big data is... well... big in size! How much data constitute Big Data is not very clear cut. So lets not get bogged down in that debate. For a small company that is used to dealing with data in gigabytes, 10TB of data would be BIG. However for companies like Facebook and Yahoo, peta bytes is big.

Just the size of big data, makes it impossible (or at least cost prohibitive) to store in traditional storage like databases or conventional filers.

We are talking about cost to store gigabytes of data. Using traditional storage filers can cost a lot of money to store Big Data.

Big Data is unstructured or semi structured

A lot of Big Data is unstructured. For example click stream log data might look like time stamp, user_id, page, referrer_page 
Lack of structure makes relational databases not well suited to store Big Data.

Plus, not many databases can cope with storing billions of rows of data.

No point in just storing big data, if we can't process it

Storing Big Data is part of the game. We have to process it to mine intelligence out of it. Traditional storage systems are pretty 'dumb' as in they just store bits -- They don't offer any processing power.

The traditional data processing model has data stored in a 'storage cluster', which is copied over to a 'compute cluster' for processing, and the results are written back to the storage cluster.

This model however doesn't quite work for Big Data because copying so much data out to a compute cluster might be too time consuming or impossible. So what is the answer?

One solution is to process Big Data 'in place' -- as in a storage cluster doubling as a compute cluster.

How Hadoop solves the Big Data problem

Hadoop clusters scale horizontally

More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and expensive hardware.

Hadoop can handle unstructured / semi-structured data

Hadoop doesn't enforce a 'schema' on the data it stores. It can handle arbitrary text and binary data. So Hadoop can 'digest' any unstructured data easily.

Hadoop clusters provides storage and computing

We saw how having separate storage and processing clusters is not the best fit for Big Data. Hadoop clusters provide storage and distributed computing all in one.

Sunday, 20 October 2013

HDInsight Installation on Windows Platform

First Install Windows 8 then after install the HDInsight. HDInsight installer is powered by Microsoft Web Platform Installer. To download it you can use the following link:

After installing Microsoft WPI (Web Platform installer), run it with administrator privileges.
Search Hadoop in the search box. It will locate the HDInsight preview service installer for windows.

Click add to add installer to Microsoft WPI installation cart.

Accept the license and click install and wait for WPI to do the installation for you.

The installer also includes the hadoop package and IIS components.

Verifications of HDInsight Installation Success:

Four shortcuts will be created on the desktop, when the installer is finished.

Double-click on the Hadoop Command Line icon on the desktop. It should open to the C:\Hadoop\hadoop-1.1.0-SNAPSHOT> prompt. Type cd .. to navigate to the Hadoop directory and type dir to see what it contains. It should have the items shown below.

Double Click on the Microsoft HDInsight Dashboard icon on the desktop to open the local dashboard that provides the starting point for using the Developer Preview. This takes you to the Your Clusters dashboard.

Click on the local (hdfs) tile to open the main HDInsight dashboard.

There are six green tiles on the HDInsight Dashboard:

Getting Started: Links to hello world tutorial, training, support forums, feature voting channels, release notes, .NET SDK for Hadoop, and related links for HDInsight and Hadoop.

Interactive Console: The interactive console provided by Microsoft to simplify configuring and running Hadoop jobs and interacting with the deployed clusters. This simplified approach using JavaScript and Hive consoles enables IT professionals and a wider group of developers to deal with big data management and analysis by providing an accessible path into the Hadoop framework.

Remote Desktop: Remote into your HDFS cluster.

Job History: Records the jobs that have been run on your cluster.

Downloads: Downloads for Hive ODBC driver and Hive Add-in for Excel that enable Hadoop integration with Microsoft BI Solutions. Connect Excel to the Hive data warehouse framework in the Hadoop cluster via the Hive ODBC driver to access and view data in the cluster.

Documentation: Links to the Windows Azure HDInsight Service documentation, most of which is also valid for the HDInsight Server Developer Preview.

There are also three purple tiles that link to tools for authoring and submitting jobs from the dashboard, which are discussed in the next section.

Create Map Reduce Jobs: Takes you to a UI for specifying Map Reduce jobs and the cluster on which you are submitting them to run.

Create Hive Jobs: Takes you to the interactive Hive Console from which HiveQL, a dialect of SQL, can be used to query data stored in you HDFS cluster. Hive is for analysts with strong SQL skills providing an SQL-like interface and a relational data model. Hive uses a language called HiveQL; a dialect of SQL.

Deploy Samples: Takes you to the gallery of samples that ship with the HDInsight DInsight Server Developer Preview. This is a set of sample jobs that you can run to quickly and easily get started with HDInsight.


After you have created your cluster, click on the Deploy Samples tile in the main dashboard to see the four samples that ship in the Hadoop Sample Gallery.

To run one of these samples, simply click on the tile, follow any instructions provided on the particular sample page, and click on the Deploy to your Cluster button. The simplest example is provided by the WordCount sample. Note that the relevant java and jar files are provided for each sample and can be downloaded and opened for inspection.

Thursday, 17 October 2013

What's in the HDP Sandbox and Installing the Sandbox

Use the HDP Sandbox to Develop Your Hadoop Admin and Development Skills
Unless you have your own Hadoop Cluster to play with, I strongly recommend you get the HDP Sandbox up and running on your laptop.   What's nice about the HDP Sandbox is that it is 100% open source.  The features and frameworks are free, you're not learning from some vendor's proprietary Hadoop version that has features they will charge you for.   With the Sandbox and HDP you are learning Hadoop from a true open source perspective.

The Sandbox contains:

  • A fully functional Hadoop cluster running Ambari to play with.  You can run examples and sample code. Being able to use the HDP Sandbox is a great way to get hands on practice as you are learning.
  • Your choice of Type 2 Hypervisors (VMware, VirtualBox or Hyper-V) to install Hadoop on.
  • Hadoop is running on Centos 6.4 and using Java 1.6.0_24 (in VMware VM).
  • MySQL and Postgres database servers for the Hadoop cluster.
  • Ability to log in as root in the Centos OS and have command line access to your Hadoop cluster.
  • Ambari the management and monitoring tool for Apache Hadoop and Openstack.
  • Hue is included in the HDP Sandbox.  Hue is a GUI containing:

  1. Query editors for Hive, Pig and HCatalog
  2. File Browser for HDFS,
  3. Job Designer/Browser for MapReduce
  4. Oozie editor/dashboard
  5. Pig, HBase and Bash shells
  6. A collection of Hadoop APIs.

With the Hadoop Sandbox you can:

  • Point and click and run through the tutorials and videos.  Hit the Update button to get the latest tutorials.
  • Use Ambari to manage and monitor your Hadoop cluster.
  • Use the Linux bash shell to log into Centos as root and get command line access to your Hadoop environment.
  • Run a jps command and see all the master servers, data nodes and HBase processes running in your Hadoop cluster.  
  • At the Linux prompt get access to your configuration files and administration scripts. 
  • Use the Hue GUI to run pig, hive, hcatalog commands.
  • Download tools like Datameer and Talend and access your Hadoop cluster from popular tools in the ecosystem.
  • Download data from the Internet and practice data ingestion into your Hadoop cluster.
  • Use Sqoop and the MySQL database that is running to practice moving data between a relational database and a Hadoop cluster.  (Reminder: This MySQL database is a meta-database for your Hadoop cluster so be careful playing with this.  In real life you would not use a meta-database to play, you'd create a separate MySQL database server.
  • If using VMware Fusion you can create snapshots of your VM, so you can always roll back.

Downloading the HDP Sandbox and Working with an OVA File

The number one gotcha when installing the HDP Sandbox on a laptop is virtualization is not turned on in the BIOS.  If you have problems this is the first thing to check.

I choose the VMware VM, which downloads the Hortonworks+Sandbox+1.3+VMware+RC6.ova file. An OVA (open virtual appliance) is a single file distribution of a OVF stored in the TAR format.  A OVF (Open Virtualization Format) is a portable package created to standardize the deployment of a virtual appliance.  An OVF package structure has a number of files: a descriptor file, optional manifest and certificate files, optional disk images, and optional resource files (i.e. ISO’s). The optional disk image files can be VMware vmdk’s, or any other supported disk image file. 

VMware Fusion converts the virtual machine from OVF format to VMware runtime (.vmx) format.
I went to the VMware Fusion menu bar and selected File - Import and imported the OVA file. Fusion performs OVF specification conformance and virtual hardware compliance checks.  Once complete you can start the VM.

When you start the VM, if you are asked to upgrade the VM, I choose yes. You'll then be prompted to initiate your Hortonworks Sandbox session, and to open a browser and enter a URL address like: This will take you to a registration page.  When you finish registration it brings up the Sandbox.
  • Instructions are provided for how to start Ambari (management tool), how to login to the VM as root and how to set up your hosts file.
  • Instructions are provided on how to get your cursor back from the VM.
In summary, you download the Sandbox VM file, import it, start the VM and instructions will lead you down the Hadoop yellow brick road.  When you start the VM, the initial screen will show you the URL for bringing up the management interface and also how to log in as root in a terminal window. Accessing Ambari Mgmt Interface, 
  • The browser URL was (yours may be different) to get to Videos, Tutorials, Sandbox and Ambari setup instructions.
  • Running on Mac OS X, hit   Ctrl-Alt-F5 to get a root terminal window. Log in as root/hadoop.
  • Make sure you know how to get out of the VM window.  On Mac it is Ctrl-Alt-F5.
  • Get access to Ambari interface with port 8080, i.e.

Getting Started with the HDP Sandbox
Start with the following steps:
  • Get Ambari up and running.  Follow all the instructions.
  • Bring up Hue.  Look at all the interfaces and shells you have access to.
  • Log in as root using a terminal interface. In Sandbox 1.3 service accounts are root/hadoop for superuser and hue/hadoop for ordinary user.
  • Watch the videos.
  • Run through the tutorials. 
Here is the Sandbox welcome screen.  You are now walking into the light of Big Data and Hadoop.  :) 

A few commands to get you familiar with the Sandbox environment:
# java -version
# ifconfig
# uname -a
# tail /etc/redhat-release
# ps -ef | grep mysqld
# ps -ef | grep postgres
# jps

You can run a jps command and see key Hadoop processes running such as the NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker, HMaster, RegionServer and AmbariServer.

If you cd to the /etc/hadoop/conf directory, you can see the Hadoop configuration files.  Hint: core-site.xml, mapred-site.xml and hdfs-site.xml are good files to learn for the HDP admin certification test.  :)  

If you cd to the /usr/lib/hadoop/bin  directory, you can see a number of the Hadoop admin scripts.