Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The project includes the following modules: Hadoop Common (the common utilities that support the other Hadoop modules), the Hadoop Distributed File System (HDFS™) (a distributed file system that provides high-throughput access to application data), Hadoop YARN (a framework for job scheduling and cluster resource management), and Hadoop MapReduce (a YARN-based system for parallel processing of large data sets).
The Hadoop Distributed File System (HDFS) is an open-source system currently being used in situations where massive amounts of data need to be processed. Based on experience with the largest deployment of HDFS, K. V. Shvachko ("HDFS scalability: the limits to growth", ;login:, April 2010) analyzed how the amount of RAM of a single namespace server correlates with the storage capacity of Hadoop clusters, outlined the advantages of the single-node namespace server architecture for linear performance scaling, and established practical limits of growth for this architecture. This study proved to be applicable to issues with other distributed file systems.
Being part of the Hadoop core and serving as the storage layer for the Hadoop MapReduce framework, HDFS is also a stand-alone distributed file system like Lustre, GFS, PVFS, Panasas, GPFS, Ceph, and others. HDFS is optimized for batch processing, focusing on overall system throughput rather than individual operation latency.
As with most contemporary distributed file systems, HDFS is based on an architecture in which the namespace is decoupled from the data. The namespace forms the file system metadata, which is maintained by a dedicated server called the name-node. The data itself resides on other servers called data-nodes. The file system data is accessed via HDFS clients, which first contact the name-node for data locations and then transfer data to (write) or from (read) the specified data-nodes. The main motivation for decoupling the namespace from the data is the scalability of the system. Metadata operations are usually fast, whereas data transfers can last a long time. If a combined operation is passed through a single server (as in NFS), the data transfer component dominates the response time of the server, making it a bottleneck in a highly distributed environment. In the decoupled architecture, fast metadata operations from multiple clients are addressed to the (usually single) namespace server, and the data transfers are distributed among the data servers, utilizing the throughput of the whole cluster.
The namespace consists of files and directories. Directories define the hierarchical structure of the namespace. Files, the data containers, are divided into large blocks (128 MB each). The name-node's metadata consist of the hierarchical namespace and a block-to-data-node mapping, which determines physical block locations. In order to keep the rate of metadata operations high, HDFS keeps the whole namespace in RAM. The name-node persistently stores the namespace image and its modification log (the journal) in external memory such as a local or a remote hard drive. The namespace image and the journal contain the HDFS file and directory names and their attributes (modification and access times, permissions, quotas), including block IDs for files, but not the locations of the blocks. The locations are reported by the data-nodes via block reports during startup and are then updated periodically (once an hour by default). If the name-node fails, its latest state can be restored by reading the namespace image and replaying the journal.
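The split between persisted namespace metadata and volatile block locations can be illustrated with a simplified model (a sketch in Python; the class and field names are illustrative, not taken from the HDFS source):

```python
# Simplified model of the name-node's two metadata structures: the
# namespace (file -> block IDs) is persisted in the image/journal,
# while block locations live only in RAM and are rebuilt from the
# block reports that data-nodes send at startup and then hourly.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

class NameNode:
    def __init__(self):
        # file path -> list of block IDs (persisted metadata)
        self.namespace = {}
        # block ID -> set of data-node IDs holding a replica
        # (RAM only; never written to the image or journal)
        self.block_locations = {}

    def create_file(self, path, size_bytes):
        """Allocate one block ID per 128 MB of file data."""
        n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
        block_ids = [f"{path}#blk{i}" for i in range(n_blocks)]
        self.namespace[path] = block_ids
        return block_ids

    def block_report(self, datanode_id, block_ids):
        """A data-node reports which blocks it currently stores."""
        for b in block_ids:
            self.block_locations.setdefault(b, set()).add(datanode_id)

    def get_locations(self, path):
        """What an HDFS client asks for before contacting data-nodes."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.namespace[path]]

nn = NameNode()
blocks = nn.create_file("/logs/a.log", 300 * 1024 * 1024)  # 300 MB -> 3 blocks
nn.block_report("dn1", blocks)        # dn1 holds all three blocks
nn.block_report("dn2", blocks[:1])    # dn2 holds a replica of the first
print(len(blocks))                            # 3
print(nn.get_locations("/logs/a.log")[0])
```

If the name-node restarts, `namespace` is recovered from the image and journal, while `block_locations` starts empty and fills up as block reports arrive, which mirrors the recovery behavior described above.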
HDFS is built upon the single-node namespace server architecture. Since the name-node is a single container of the file system metadata, it naturally becomes a limiting factor for file system growth. In order to make metadata operations fast, the name-node loads the whole namespace into its memory, and therefore the size of the namespace is limited by the amount of RAM available to the name-node. Estimates show [12] that the name-node uses fewer than 200 bytes to store a single metadata object (a file inode or a block). According to statistics on our clusters, a file on average consists of 1.5 blocks, which means that it takes 600 bytes (1 file object + 2 block objects) to store an average file in the name-node's RAM. This estimate does not include transient data structures, which the name-node creates for replicating or deleting blocks, etc., removing them when finished.
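These figures translate into a simple back-of-the-envelope capacity estimate. The per-object and per-file numbers come from the text; the 60 GB heap size below is an assumed example, not a figure from the source:

```python
# Name-node RAM estimate from the figures in the text:
# < 200 bytes per metadata object, and an average file counted as
# 1 file object + 2 block objects (1.5 blocks/file, rounded up),
# i.e. about 600 bytes of name-node RAM per average file.

BYTES_PER_OBJECT = 200          # upper bound per file inode or block
OBJECTS_PER_FILE = 3            # 1 file object + 2 block objects
BYTES_PER_FILE = BYTES_PER_OBJECT * OBJECTS_PER_FILE   # 600 bytes

heap_bytes = 60 * 10**9         # assumed name-node heap: 60 GB (example)
files_supported = heap_bytes // BYTES_PER_FILE
print(BYTES_PER_FILE)           # 600
print(files_supported)          # 100000000 -> ~100 million files
```

Under these assumptions, a 60 GB name-node heap caps the namespace at roughly 100 million files, which is the sense in which the single name-node bounds file system growth.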
Ambari
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Ambari enables System Administrators to provision a Hadoop cluster by providing a step-by-step wizard for installing Hadoop services across any number of hosts and by handling the configuration of Hadoop services for the cluster. It lets them manage a Hadoop cluster by providing central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster. It also lets them monitor a Hadoop cluster by providing a dashboard for the health and status of the cluster, leveraging the Ambari Metrics System for metrics collection and the Ambari Alert Framework for system alerting, which notifies you when your attention is needed (e.g., a node goes down or remaining disk space is low).
Ambari enables Application Developers and System Integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
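As a minimal sketch of such an integration, the snippet below builds an authenticated request against Ambari's versioned REST endpoint (`/api/v1`). The host, port, and credentials are placeholder assumptions; the `X-Requested-By` header is the one Ambari requires on state-changing requests:

```python
# Sketch: constructing (not sending) an authenticated GET request to
# the Ambari REST API. Host, port (8080 is a common Ambari default),
# and credentials are placeholders; adapt them to your deployment.
import base64
import urllib.request

AMBARI = "http://ambari.example.com:8080"    # placeholder server
USER, PASSWORD = "admin", "admin"            # placeholder credentials

def ambari_request(path):
    """Build a request for an Ambari resource under /api/v1."""
    req = urllib.request.Request(AMBARI + "/api/v1" + path)
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Ambari requires this header on modifying requests (PUT/POST/DELETE);
    # it is harmless to include on reads as well.
    req.add_header("X-Requested-By", "ambari")
    return req

# List the clusters managed by this Ambari server:
req = ambari_request("/clusters")
print(req.full_url)   # http://ambari.example.com:8080/api/v1/clusters
# Sending it would be: urllib.request.urlopen(req).read()
```

The same pattern extends to the provisioning and management endpoints; only the path and HTTP method change.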