2013/02/28

Big Data and Hadoop

 Spring.Taiwan

Read the book "Understanding Big Data" today. Halfway done.
Here are some excerpts from the book:

Understanding Big Data

2013-02-28 15:45:37
the term Big Data applies to information that can’t be processed or analyzed using traditional processes or tools
2013-02-28 15:46:12
An IBM survey found that over half of the business leaders today realize they don’t have access to the insights they need to do their jobs
2013-02-28 15:48:45
Even if every bit of this data was relational (and it’s not), it is all going to be raw and have very different formats, which makes processing it in a traditional relational system impractical or impossible
2013-02-28 15:49:05
variety combining to create the Big Data problem
2013-02-28 15:51:31
Three characteristics define Big Data: volume, variety, and velocity.
2013-02-28 15:54:17
the opportunity exists, with the right technology platform, to analyze almost all of the data (or at least more of it by identifying the data that’s useful to you) to gain a better understanding of your business, your customers, and the marketplace
2013-02-28 15:55:35
a fundamental shift in analysis requirements from traditional structured data to include raw, semistructured, and unstructured data as part of the decision-making and insight process
2013-02-28 15:56:27
To capitalize on the Big Data opportunity, enterprises must be able to analyze all types of data, both relational and nonrelational: text, sensor data,
2013-02-28 15:56:50
Dealing effectively with Big Data requires that you perform analytics against the volume and variety of data while it is still in motion, not just after it is at rest
2013-02-28 15:58:11
Hadoop-based platform is well suited to deal with semistructured and unstructured data, as well as when a data discovery process is needed
2013-02-28 16:07:26
it is about discovery and making the once near-impossible possible from a scalability and analysis perspect
2013-02-28 16:09:19
creator Doug Cutting’s son gave to his stuffed toy elephant
2013-02-28 16:09:44
duce)—more on these in a
2013-02-28 16:10:23
this redundancy provides fault tolerance and a capability for the Hadoop cluster to heal itself
2013-02-28 16:10:55
Some of the more notable Hadoop-related projects include: Apache Avro (for data serialization), Cassandra and HBase (databases), Chukwa (a monitoring system specifically designed with large distributed systems in mind), Hive (provides ad hoc SQL-like queries for data aggregation and summarization), Mahout (a machine learning library), Pig (a high-level Hadoop programming language that provides a data-flow language and execution framework for parallel computation), ZooKeeper (provides coordination services for distributed applications), and more
2013-02-28 16:12:23
throughout the cluster
2013-02-28 16:13:54
For Hadoop deployments using a SAN or NAS, the extra network communication overhead can cause performance bottle
2013-02-28 16:14:59
an individual file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster
2013-02-28 16:16:09
default size of these blocks for Apache Hadoop is 64 MB
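
A quick sketch of my own (not from the book) to make the block arithmetic concrete. The 64 MB block size comes from the excerpt above; the replication factor of 3 and the function names are my assumptions for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted in the excerpt
REPLICATION = 3     # assumption: the excerpts do not state a replication factor


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def stored_replicas(file_size_mb: float, replication: int = REPLICATION) -> int:
    """Total block replicas kept across the cluster for one file."""
    return block_count(file_size_mb) * replication


if __name__ == "__main__":
    print(block_count(1000))      # 16 blocks for a 1000 MB file
    print(stored_replicas(1000))  # 48 replicas when each block is stored 3 times
```

The last block of a file is usually only partly full, which is why the count rounds up.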
2013-02-28 16:21:41
All of Hadoop’s data placement logic is managed by a special server called NameNode
2013-02-28 16:28:14
All of the NameNode’s information is stored in memory, which allows it to provide quick response times
2013-02-28 16:30:33
Any data loss in this metadata will result in a permanent loss of corresponding data in the cluster
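
To see why losing the metadata means losing the data, here is a toy model of my own. The two dictionaries, the block ids, and the node names are all hypothetical; the real NameNode structures are far richer than this.

```python
# Hypothetical in-memory metadata, invented for illustration:
# file path -> ordered block ids, and block id -> servers holding a replica.
file_blocks = {"/logs/2013-02-28.txt": ["blk_1", "blk_2"]}
block_locations = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
}


def locate(path):
    """Return (block id, servers) pairs for a file, or None if the file is unknown."""
    blocks = file_blocks.get(path)
    if blocks is None:
        return None
    return [(blk, block_locations[blk]) for blk in blocks]


found = locate("/logs/2013-02-28.txt")  # every block can be located
file_blocks.clear()                     # simulate losing the metadata
lost = locate("/logs/2013-02-28.txt")   # blocks still sit on disks, but nothing maps the file to them
```

The block replicas themselves are untouched in this model, yet the file is gone because nothing records which blocks belong to it or in what order.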
2013-02-28 16:32:25
developer doesn’t have to deal with the concepts of the NameNode and where data is stored; Hadoop does that for you
2013-02-28 16:38:42
The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples
2013-02-28 16:38:51
reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples
2013-02-28 16:41:20
All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set
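
A minimal sketch of my own of the map and reduce steps described above, in plain Python rather than real Hadoop code. I am assuming the "single value for each city" is a maximum temperature; the sample records are made up, and I use two input files instead of five to keep it short.

```python
from collections import defaultdict

# Hypothetical input: each inner list stands in for one file of (city, temperature) records.
FILES = [
    [("Toronto", 20), ("Whitby", 25), ("Toronto", 18)],
    [("Whitby", 27), ("Rome", 32), ("Toronto", 22)],
]


def map_task(records):
    """Map: turn one file into (city, local maximum) tuples."""
    local = {}
    for city, temp in records:
        local[city] = max(temp, local.get(city, temp))
    return list(local.items())


def reduce_task(map_outputs):
    """Reduce: combine the tuples from all map tasks into one value per city."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for city, temp in output:
            grouped[city].append(temp)
    return {city: max(temps) for city, temps in grouped.items()}


result = reduce_task([map_task(records) for records in FILES])
print(result)  # {'Toronto': 22, 'Whitby': 27, 'Rome': 32}
```

Each map task only ever sees its own file, so in a real cluster the map tasks could run on different servers at the same time.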
2013-02-28 16:43:08
MapReduce program is referred to as a job. A job is executed by subsequently breaking it down into pieces called tasks
2013-02-28 16:43:45
An application submits a job to a specific node in a Hadoop cluster, which is running a daemon called the JobTracker
2013-02-28 16:44:39
In a Hadoop cluster, a set of continually running daemons, referred to as TaskTracker agents, monitor the status of each task
2013-02-28 16:46:39
This directing of records to reduce tasks is known as a Shuffle, which takes input from the map tasks and directs the output to a specific reduce task
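
One way to picture the Shuffle, as a sketch of my own: hash each key and use the result to pick a reduce task, so every record with the same key lands in the same place. The CRC32 hash and the function names are my stand-ins for illustration, not Hadoop's actual implementation.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick a reduce task for a key; the same key always maps to the same task."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from all map tasks into one reducer's input."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return [dict(bucket) for bucket in buckets]


if __name__ == "__main__":
    outputs = [[("Toronto", 20), ("Whitby", 25)], [("Toronto", 22)]]
    for task_id, reducer_input in enumerate(shuffle(outputs, 2)):
        print(task_id, reducer_input)
```

Because the routing depends only on the key, no reduce task ever needs to ask another one for missing values.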
2013-02-28 16:47:29
under Hadoop are written in Java, and it is the Java Archive file (jar) that’s distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks
2013-02-28 16:48:00
BigDataUniversity.com and download InfoSphere BigInsights Basic Edition (www.ibm.com/software/data/infosphere/biginsights/basic.html
2013-02-28 16:49:58
Hadoop Common Components are a set of libraries that support the various Hadoop subprojects
2013-02-28 16:50:50
When you delete an HDFS file, the data is not actually gone (think of your Mac or Windows-based home computers, and you’ll get the point). Deleted HDFS files can be found in the trash, which is automatically cleaned at some later point in time
2013-02-28 16:52:46
we cover three of the more popular ones, which admittedly sound like we’re at a zoo: Pig, Hive, and Jaql
2013-02-28 16:53:09
Pig was initially developed at Yahoo
2013-02-28 16:56:02
you can FILTER out rows that are not of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more
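
To get a feel for what those four operations do, here is my own walk-through in plain Python. This is not Pig Latin syntax, only the same data flow on a made-up data set; all field names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical data sets, invented for illustration.
orders = [
    {"user": "ann", "amount": 30},
    {"user": "bob", "amount": 5},
    {"user": "ann", "amount": 12},
    {"user": "cat", "amount": 50},
    {"user": "bob", "amount": 15},
]
users = [
    {"user": "ann", "city": "Taipei"},
    {"user": "bob", "city": "Tainan"},
    {"user": "cat", "city": "Taipei"},
]

# FILTER: drop rows that are not of interest (here, small orders).
big_orders = [o for o in orders if o["amount"] >= 10]

# JOIN: match each remaining order with its user record on the "user" field.
city_of = {u["user"]: u["city"] for u in users}
joined = [{**o, "city": city_of[o["user"]]} for o in big_orders]

# GROUP: collect rows per city and build an aggregate (total amount).
totals = defaultdict(int)
for row in joined:
    totals[row["city"]] += row["amount"]

# ORDER: sort the aggregated result, largest total first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('Taipei', 92), ('Tainan', 15)]
```

Each step takes the previous step's output as its input, which is the data-flow style the book attributes to Pig.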
2013-02-28 16:58:01
There are three ways to run a Pig program: embedded in a script, embedded in a Java program, or from the Pig command line, called Grunt
2013-02-28 16:58:51
Some folks at Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate. Their creation, called Hive
2013-02-28 16:59:50
You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby
2013-02-28 17:01:19
Hive is read-based and therefore not appropriate for transaction processing
2013-02-28 17:03:11
JSON is built on top of two types of structures. The first is a collection of name/value pairs
2013-02-28 17:03:35
The second JSON structure is the ability to create an ordered list of values much like an array, list, or sequence you might have in your existing application
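
A small example of my own showing both JSON structures with Python's standard json module. The sample document is made up.

```python
import json

# Hypothetical document: an object (name/value pairs) that contains an ordered list.
text = '{"name": "Hadoop", "projects": ["Pig", "Hive", "Jaql"]}'

doc = json.loads(text)

# Name/value pairs become a dict; look values up by name.
print(doc["name"])         # Hadoop

# The ordered list becomes a list; position is preserved.
print(doc["projects"][1])  # Hive

# Writing it back out and parsing again gives the same structure.
round_trip = json.loads(json.dumps(doc))
```

The two structures nest freely: a list can hold objects, and an object's values can be lists.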
2013-02-28 17:09:13
The operand used to signify flow from one operand to another is an arrow: ->. Unlike SQL, where the output comes first (for example, the SELECT list)
2013-02-28 17:15:16
Jaql is a flexible infrastructure for managing and analyzing many kinds of semistructured data such as XML, CSV data, flat files, relational data, and so on

REF:
http://www.ibm.com/developerworks/wikis/display/db2oncampus/FREE+ebook+-+Understanding+Big+Data