Daniel @ nowhere: Big Data and Hadoop

Spring.Taiwan

Read the book "Understanding Big Data" today. Half way done.
Here's some excerpts from the book:

Understanding Big Data

2013-02-28 15:45:37

the term Big Data applies to information that can’t be processed or analyzed using tradi-tional processes or tools

2013-02-28 15:46:12

An IBM survey found that over half of the busi-ness leaders today realize they don’t have access to the insights they need to do their jobs

2013-02-28 15:48:45

Even if every bit of this data was relational (and it’s not), it is all going to be raw and have very different for-mats, which makes processing it in a traditional relational system impractical or impossible

2013-02-28 15:49:05

variety combining to create the Big Data problem

2013-02-28 15:51:31

Three characteristics define Big Data: vol-ume, variety, and velocity.

2013-02-28 15:54:17

the opportunity exists, with the right technology platform, to ana-lyze almost all of the data (or at least more of it by identifying the data that’s useful to you) to gain a better understanding of your business, your customers, and the marketplace

2013-02-28 15:55:35

a fundamental shift in analysis require-ments from traditional structured data to include raw, semis-tructured, and unstructured data as part of the decision-making and insight process

2013-02-28 15:56:27

To capi-talize on the Big Data opportunity, enterprises must be able to analyze all types of data, both re-lational and nonrelational: text, sensor data,

2013-02-28 15:56:50

Dealing effectively with Big Data requires that you perform analytics against the volume and variety of data while it is still in motion, not just after it is at rest

2013-02-28 15:58:11

Hadoop-based platform is well suited to deal with semistructured and unstructured data, as well as when a data discovery process is needed

2013-02-28 16:07:26

it is about dis-covery and making the once near-impossible possible from a scalability and analysis per-spec-t

2013-02-28 16:09:19

creator Doug Cutting’s son gave to his stuffed toy elephant

2013-02-28 16:09:44

duce)—more on these in a

2013-02-28 16:10:23

this redundancy provides fault toler-ance and a capability for the Hadoop cluster to heal itself

2013-02-28 16:10:55

Some of the more notable Hadoop-related projects include: Apache Avro (for data serializa-tion), Cassandra and HBase (databases), Chukwa (a monitoring sys-tem spe-cifically designed with large distributed systems in mind), Hive (provides ad hoc SQL-like queries for data aggregation and summariza-tion), Mahout (a machine learning library), Pig (a high-level Hadoop programming language that provides a data-flow language and execution framework for parallel computation), ZooKeeper (provides coordination services for distributed ap-plications), and more

2013-02-28 16:12:23

throughout the cluster

2013-02-28 16:13:54

For Hadoop deployments using a SAN or NAS, the extra network commu-nica-tion overhead can cause performance bottle

2013-02-28 16:14:59

an individual file is actually stored as smaller blocks that are repli-cated across multiple servers in the entire cluster

2013-02-28 16:16:09

default size of these blocks for Apache Hadoop is 64 MB

2013-02-28 16:21:41

All of Hadoop’s data placement logic is managed by a special server called NameNode

2013-02-28 16:28:14

All of the NameNode’s infor-mation is stored in memory, which allows it to provide quick response times

2013-02-28 16:30:33

Any data loss in this metadata will result in a permanent loss of corresponding data in the cluster

2013-02-28 16:32:25

devel-oper doesn’t have to deal with the concepts of the NameNode and where data is stored—Hadoop does that for you

2013-02-28 16:38:42

The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples

2013-02-28 16:38:51

reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples

2013-02-28 16:41:20

All five of these output streams would be fed into the reduce tasks, which combine the in-put results and output a single value for each city, producing a final result set

2013-02-28 16:43:08

MapReduce program is referred to as a job. A job is executed by subse-quently breaking it down into pieces called tasks

2013-02-28 16:43:45

An application submits a job to a specific node in a Hadoop cluster, which is running a daemon called the JobTracker

2013-02-28 16:44:39

In a Hadoop cluster, a set of continually running daemons, referred to as TaskTracker agents, monitor the status of each task

2013-02-28 16:46:39

This direct-ing of records to reduce tasks is known as a Shuffle, which takes input from the map tasks and directs the output to a specific re-duce task

2013-02-28 16:47:29

under Hadoop are written in Java, and it is the Java Archive file (jar) that’s distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks

2013-02-28 16:48:00

BigDataUniversity.com and download Info-Sphere BigInsights Basic Edi-tion (www.ibm.com/software/data/infosphere/ biginsights/basic.html

2013-02-28 16:49:58

Hadoop Common Components are a set of libraries that support the var-ious Hadoop subprojects

2013-02-28 16:50:50

When you delete an HDFS file, the data is not actually gone (think of your MAC or Windows-based home computers, and you’ll get the point). Deleted HDFS files can be found in the trash, which is automatically cleaned at some later point in time

2013-02-28 16:52:46

we cover three of the more popular ones, which admit-tedly sound like we’re at a zoo: Pig, Hive, and Jaql

2013-02-28 16:53:09

Pig was initially developed at Yahoo

2013-02-28 16:56:02

you can FIL-TER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggrega-tions, ORDER results, and much more

2013-02-28 16:58:01

There are three ways to run a Pig program: embedded in a script, embedded in a Java program, or from the Pig command line, called Grunt

2013-02-28 16:58:51

Some folks at Face-book developed a runtime Hadoop sup-port structure that allows anyone who is already fluent with SQL (which is commonplace for rela-tional data-base developers) to leverage the Hadoop platform right out of the gate. Their cre-ation, called Hive

2013-02-28 16:59:50

You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby

2013-02-28 17:01:19

Hive is read-based and therefore not appropriate for transaction processing

2013-02-28 17:03:11

JSON is built on top of two types of struc-tures. The first is a collection of name/value pairs

2013-02-28 17:03:35

The sec-ond JSON structure is the ability to create an or-dered list of values much like an array, list, or se-quence you might have in your existing applica-tion

2013-02-28 17:09:13

The operand used to signify flow from one operand to another is an arrow: ->. Unlike SQL, where the output comes first (for example, the SELECT list)

2013-02-28 17:15:16

Jaql is a flexible infras-tructure for managing and analyzing many kinds of semistructured data such as XML, CSV data, flat files, relational data, and so on

REF:
http://www.ibm.com/developerworks/wikis/display/db2oncampus/FREE+ebook+-+Understanding+Big+Data

Daniel @ nowhere

2013/02/28

Big Data and Hadoop

Understanding Big Data

沒有留言:

張貼留言

乩童警探一二集

2013/02/28

Big Data and Hadoop

Understanding Big Data

沒有留言:

張貼留言

乩童警探 一二集

乩童警探一二集