Sunday 29 January 2017

Hadoop Fundamentals

As more and more data becomes available, companies and organisations want to embark on big data projects.
- data volumes run into terabytes and petabytes
- it's very complex to query that much data
- an RDBMS struggles with scalability, speed (real-time), and other needs (queryability, sophisticated processing) at that volume

How did Hadoop come into significance?

Database choices we have:
  •  Filesystem:
    •  Hadoop filesystem (there is a level of sophistication around the management of the filesystem)
    • xml, etc.
  • Databases: NoSQL (key/value, column store, etc.)
  • Hadoop itself is not a database but an alternative filesystem with a processing library (most commonly in Hadoop implementations you will see NoSQL implementations)
- Hadoop is most commonly implemented with HBase on top of HDFS (HDFS is modelled on the GFS (Google File System), which Google built to index the entire internet). HBase is a NoSQL database that is very commonly used in Hadoop solutions. It is a wide column store, which means it is a database that consists of a key and then one to n values.
What makes a NoSQL database NoSQL is that this wide-column layout can vary: the next row might not have some of the columns the previous row had, because the width of a row varies with the amount of data inserted for it.
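
To make the wide-column idea concrete, here is a minimal sketch using the HBase Java client. The table name "patients", the column family "readings", and the column names are made-up assumptions for illustration; the point is that two rows in the same table can carry different sets of columns.

```java
// Minimal sketch: two rows with different column sets in one HBase table.
// Assumes a reachable HBase cluster and an existing table "patients" with
// column family "readings" (both hypothetical names).
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WideColumnExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("patients"))) {

            // Row "user1" stores three values under the "readings" family.
            Put row1 = new Put(Bytes.toBytes("user1"));
            row1.addColumn(Bytes.toBytes("readings"), Bytes.toBytes("steps"), Bytes.toBytes("8123"));
            row1.addColumn(Bytes.toBytes("readings"), Bytes.toBytes("heartRate"), Bytes.toBytes("72"));
            row1.addColumn(Bytes.toBytes("readings"), Bytes.toBytes("sleepHours"), Bytes.toBytes("6.5"));

            // Row "user2" stores only one value; no schema forces the other columns to exist.
            Put row2 = new Put(Bytes.toBytes("user2"));
            row2.addColumn(Bytes.toBytes("readings"), Bytes.toBytes("steps"), Bytes.toBytes("10457"));

            table.put(row1);
            table.put(row2);
        }
    }
}
```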



CAP Theory

Consistency: very important for transactions
Availability: uptime; keep copies of the data as backups
Partitioning: scalability; you can split your data across multiple machines and keep growing the amount of data you work with.
Traditional databases focus on, and are good at, consistency and availability.

CAP theory says that a distributed system can only guarantee two of the three aspects, and this is where Hadoop comes into play.
Hadoop covers:
Scalability (Partitioning): commodity hardware for data storage
Flexibility (Availability): you can keep scaling the data store out (commodity hardware for distributed processing)
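
To see the partitioning and availability side in practice, here is a minimal sketch, assuming a running HDFS cluster, that writes a file into HDFS with the Hadoop FileSystem Java API and asks for three replicas of every block. The file names, paths, and replication factor are illustrative assumptions, not values from these notes.

```java
// Minimal sketch: a file written to HDFS is split into blocks, and each
// block is replicated across commodity nodes for availability.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("sensor-readings.csv");  // hypothetical local file
            Path remote = new Path("/data/sensor-readings.csv");

            // Copy the file into HDFS; it is chunked into blocks behind the scenes.
            fs.copyFromLocalFile(local, remote);

            // Ask HDFS to keep 3 copies of every block of this file.
            fs.setReplication(remote, (short) 3);
        }
    }
}
```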

Companies like Facebook and Yahoo use Hadoop to manage their large amounts of data, and they also save cost by using commodity hardware.

What kind of Data is good for Hadoop?

Line-of-business data: it's usually transactional and usually not a good fit for Hadoop.
Behavioural data: this is data we batch process, and it is often a great fit for Hadoop. For example, in healthcare, more and more information is becoming available from the devices we use and wear, like a Fitbit or some other monitoring device. It is a good idea to integrate it with other types of solutions, like medical records or medical diagnostics. Medical diagnostics would be kept in a line-of-business system because they are transactional: you want to know which prescription you took, and you don't want any data inconsistency there. But in the case of the Fitbit, you just want to integrate all the readings, and it's okay to lose one or two, because the point is only to capture your behaviour.
So Hadoop is not a replacement for an RDBMS but a supplement to it.

What is Hadoop?

It consists of two components, and oftentimes it is deployed with other projects as well.
1st component: open-source data storage, HDFS (Hadoop Distributed File System)
2nd component: processing API, MapReduce
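
To show how the two components fit together, here is the classic word-count sketch against the MapReduce Java API: the input sits in HDFS, the map phase emits (word, 1) pairs, and the reduce phase sums them per word. The input and output paths are command-line arguments and are assumptions for illustration.

```java
// Minimal word-count sketch using the Hadoop MapReduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path (assumed)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path (assumed)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```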

Most Hadoop projects also include other projects or libraries such as HBase, Hive, Pig, etc.


Hadoop Distributions

The first set of distributions is 100% open source; you can find those under the Apache Foundation, and the core distribution is called Apache Hadoop. There are also commercial distributions such as Cloudera, Hortonworks, and MapR. They wrap around some version of the open-source distribution and provide additional tooling, monitoring, and management, along with other libraries.
It is also very common for companies to run Hadoop clusters in the cloud, on AWS (Amazon Web Services) or Microsoft Azure. On AWS you can use Amazon's distribution, the plain Apache Hadoop distribution, or the MapR distribution. But not all commercial versions are available in the cloud, so this should be considered when selecting a particular cloud-based Hadoop distribution.
Hadoop is:
  • Cheaper: scales to petabytes or more
  • Faster: parallel data processing
  • Better: suited to particular types of 'big data', for example:
Risk modelling (an insurance company or credit card company looking for patterns of fraudulent transactions)
Customer churn analysis: collect as much information as possible, and also watch users' activity so you can keep them from leaving.
Recommendation engine (Netflix)
Ad-Targeting
Transactional analysis
Threat analysis
Search Analysis

Examples: Facebook, eBay, Amazon, Yahoo!, American Airlines, The New York Times, IBM, Orbitz, etc.

3 roles -> Hadoop Developer, Administrator, and Architect.
