Big Data Open Source Landscape: Processing Technologies

July 15th, 2014

Hadoop is a well-established software framework that analyzes structured and unstructured big data and distributes applications across thousands of servers. Hadoop was created in 2005, and several projects have since appeared in the Hadoop space that try to complement it. Some of those technologies overlap with each other and some are only partially complementary. I will try to sketch a brief map of them.

Programming Model

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. The Apache Hadoop project provides an open source MapReduce implementation.

Management layer

The scalability that is needed for big data processing is supported [...]
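To make the Map/Reduce paradigm from the excerpt above more concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and the grouping step that a real framework performs across the cluster is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one fragment of input.

    The function depends only on its input, so a fragment can be
    re-executed (for example after a node failure) with the same result.
    """
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group emitted values by key (simulated in memory for this sketch)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(fragments):
    """Run map over every fragment, group by key, then reduce each group."""
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each call to `map_phase` is independent of the others, the fragments could in principle be processed on different machines and retried individually, which is the property the paradigm relies on.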

Databricks Cloud: Next Step For Spark

July 1st, 2014

This morning, during the Spark Summit, Databricks announced a new step forward that will allow users to leverage Apache Spark to build the end-to-end pipelines that underlie advanced analytics, running on Amazon AWS. The name is Databricks Cloud. Spark is already deployable on AWS, but Databricks Cloud is a managed service based on Spark that will be supported directly by Databricks. They showed us an impressive demo of the platform. The Databricks Workspace (photo taken with my iPhone :-) ) is composed of:

Notebooks. Provide a rich interface that allows users to perform data discovery and exploration and to plot the results interactively.

Dashboards. Create and host dashboards quickly and easily. Users can pick any outputs from previously created notebooks, assemble these outputs in a one-page dashboard with a WYSIWYG editor, [...]