Apache Hadoop is an open-source, Java-based distributed computing platform, originally built out of the necessity to scale the construction of search indices. It quickly became apparent that Hadoop’s core abstraction was more generic and broadly applicable, and after years of use and revision it has grown into a software ecosystem that forms the backbone of a data center operating system designed from the ground up for scalable data processing and analytics.
Failure tolerance is one of the few core principles of Hadoop that have persisted since its inception. To achieve scalability, the system was designed to take node or component failure, such as the loss of a hard drive, in stride, with the underlying system taking responsibility for retrying failed jobs. This software-level resiliency has attractive economic characteristics. In particular, less reliable, and therefore less expensive, hardware can be assembled into a system that is nonetheless highly reliable, because resiliency is established at the software layer rather than the hardware layer. It also lowers the operator-to-node ratio, since repairs can be queued and handled in bulk rather than demanding an immediate response.
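The retry behavior described above can be illustrated with a minimal sketch. This is not Hadoop's actual scheduler code; it is a generic, hypothetical example of the principle that the framework, not the hardware, absorbs failures by re-running a task until it succeeds or a retry budget is exhausted:

```java
import java.util.concurrent.Callable;

// A minimal sketch of software-level failure tolerance: the framework
// retries a failed task rather than assuming the hardware is reliable.
// (Illustrative only; not Hadoop's actual scheduling code.)
public class RetryExample {
    static <T> T runWithRetries(Callable<T> task, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();   // in a real cluster: reschedule on another node
            } catch (Exception e) {
                last = e;             // a node or disk failure surfaces as an exception
            }
        }
        throw last;                   // give up only after exhausting the retry budget
    }

    public static void main(String[] args) throws Exception {
        int[] failuresLeft = {2};     // simulate two transient "node failures"
        String result = runWithRetries(() -> {
            if (failuresLeft[0]-- > 0) throw new RuntimeException("simulated node failure");
            return "done";
        }, 3);
        System.out.println(result);   // succeeds on the third attempt
    }
}
```

In Hadoop itself this pattern is applied at the task level: a failed map or reduce task is rescheduled, typically on a different node, transparently to the job's author.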
The core technologies have expanded greatly since the initial commit in 2005. At its base, however, there are only a handful of components: