The legacy of Hadoop: no more fear of data


Colleague Andrew Brust’s recent article on spring cleaning of Hadoop projects clearly hit a nerve, judging by readership numbers that have gone off the charts. The Apache Hadoop family of projects is no longer the epicenter of “Big Data” that it was ten years ago. In fact, post-mortems on the death of Hadoop have been floating around for so long that they look like the latest incarnation of the old catchphrase that Francisco Franco is still dead.

And if you want further evidence, just check the job postings. A recent survey (pictured below), compiled by web scraping more than 15,000 data scientist job postings and published last month by Terence Shin, shows that demand for Hadoop skills is declining sharply, along with C++, Hive, and several legacy proprietary languages. By the way, Spark and Java are on the same list. Would the results have been different if the same question had been asked about data engineers?

Agreed, “Hadoop” is so 2014. And as for big data, well, the world has changed there too. Big data was so labeled because, at the time, working through multiple terabytes or petabytes was exceptional, as was the ability to extend analysis to non-relational data. Today, multi-model databases have become mainstream, while most relational data warehouses have added the ability to parse JSON data and overlay graph data views. The ability to directly query data in cloud storage and/or federate queries from data warehouses has also become common. No wonder we now just call it “Data”.

As Andrew pointed out, the spring cleaning was all about clearing out the cobwebs. Contrary to popular belief, Hadoop is not dead. A number of key projects in the Hadoop ecosystem live on in the Cloudera Data Platform, a product that is very much alive. We no longer call it Hadoop because what has survived is the packaged platform, which did not exist before CDP. The zoo animals are now safely caged.


Credit: Terence Shin

What is dead is the idea of assembling your own cluster from half a dozen or more discrete open source projects. Now that there are alternatives (and we’re not just talking about CDP), why waste time manually implementing Apache MapReduce, Hive, Ranger, or Atlas? Packaging has been the norm in the database world for at least 30 years; when you purchased Oracle, you did not have to install the query optimizer and storage engines separately. Why should it be any different with the data platform we used to call Hadoop?

In the 2020s, your organization is more likely than not to consider cloud services rather than installing packaged software for new projects. While the move to the cloud was originally about cost shifting, today it is more likely to be about operational simplification and agility under a common control plane.

Today, there are several ways to analyze data in what was once called the three V’s. You can easily access data residing in cloud object storage, the de facto data lake. You can do this through an ad hoc query, using a service such as Amazon Athena; take advantage of the federated query capabilities that are now checkbox items for most cloud data warehousing services; or run Spark against the data using a dedicated service such as Databricks or a cloud data warehousing service such as Azure Synapse Analytics. Recognizing that the lines between data warehouses and data lakes are blurring, many are now embracing the fuzzy term data lakehouse, which either consolidates access across the data warehouse and the data lake or turns the data lake into an 80% version of the data warehouse.


And we haven’t even touched on the topic of AI and machine learning. Just as Hadoop was in the beginning the exclusive domain of data scientists (with the help of data engineers), so were machine learning and AI at large. Today, data scientists have a multitude of tools and frameworks to manage the life cycle of the models they create. And for citizen data scientists, AutoML services have put building ML models at their fingertips, while cloud data warehouses increasingly add their own prepackaged ML models that can be triggered via SQL commands.

In all of this, it is easy to forget that just ten years ago hardly any of it seemed possible. Pioneering Google research got the ball rolling. With Google File System, the internet giant developed an append-only file system that broke the limits of conventional storage networks by taking advantage of inexpensive disks. With MapReduce, Google cracked the code on near-linear scalability, also on commodity hardware, an achievement that had been elusive with the large-scale SMP architectures that were, at the time, the most common route to scaling.
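For readers who never wrote a MapReduce job, the programming model can be sketched in a few lines. The following is a minimal single-process illustration in Python, not Google's or Hadoop's implementation; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example. In a real cluster, the map and reduce calls would run in parallel across many machines, which is where the near-linear scaling comes from.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence in a single process."""
    # Each document stands in for an input split handled by one mapper.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    # Each key could be handled by a different reducer; here it is one loop.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data is data"]))
```

Because each map call touches only its own split and each reduce call touches only its own key, the work can be spread over as many inexpensive machines as there are splits and keys.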

Google published the papers, which was a good thing for Doug Cutting and Mike Cafarella, who at the time were working on a search engine project that could index at least a billion pages and saw a path through open source that could drastically reduce the cost of implementing such a system. The rest of the community then picked up where Cutting and Cafarella left off; for example, Facebook developed Hive to provide a SQL-like language for working through datasets at petabyte scale.

It is easy to forget today, with adoption of classic Hadoop projects in decline, that the breakthroughs of the Hadoop project led to a virtuous cycle in which innovation ate its young. When Hadoop came along, data had grown so big that we had to bring the computation to the data. With the emergence of cloud-native architectures, made possible by cheap, plentiful bandwidth and more tiers of storage, compute and data have been separated again. It’s not that either approach was right or wrong; rather, each was suited to the technology that existed at the time it was designed. Such is the cyclical nature of technological innovation.

By breaking the boundaries of large-scale processing, the lessons learned from Hadoop helped spawn a cycle in which many old assumptions, such as the notion that GPUs were strictly for graphics processing, were abandoned. Hadoop’s legacy is not just the virtuous cycle of innovation it spawned, but the fact that it got companies over their fear of processing data, and lots of it.

