Google launches Dataflow (a successor to MapReduce)

June 30th, 2014

I'm in San Francisco, ready to attend the 2014 Spark Summit tomorrow. As I already mentioned in this blog, Apache Spark is one technology that has emerged as a potential alternative to MapReduce/Hadoop. But it seems that it is not the only one. Last week, also here in San Francisco, at its Google I/O 2014 conference, Google unveiled its successor to MapReduce, called Dataflow, which it is selling through its hosted cloud service (comparable to Amazon's Data Pipeline service, and to Kinesis for real-time data processing). Urs Hölzle (Google's senior vice president of technical infrastructure and a Google Fellow) introduced how Dataflow is used for analytics during the keynote address at Google I/O 2014 (minute 2:06:30 in this video of the keynote). The service lets you construct an analytics workflow and then send it [...]

Adaptive MapReduce Scheduling in Shared Environments

May 31st, 2014

Jordà Polo presented our latest research on MapReduce at the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, held in Chicago. In this paper we present a MapReduce task scheduler for shared environments in which MapReduce is executed along with other resource-consuming workloads, such as transactional applications. All workloads may potentially share the same data store, with some of them consuming data for analytics purposes while others act as data generators. This kind of scenario is becoming increasingly important in data centers, where improved resource utilization can be achieved through workload consolidation, and it is especially challenging because workloads of a different nature interact and compete for limited resources. The proposed scheduler aims to improve resource utilization across machines while observing completion time goals. Unlike [...]

Is Hadoop showing its age?

May 22nd, 2014

In my opinion, yes! The Hadoop framework is showing its age and new processing models are a must, not only for performance but also because of its lack of flexibility. In some ways, the same thing is happening in Big Data management. Because their queries lack flexibility, NoSQL databases are adding new query features based on SQL; conversely, SQL databases are bringing some measure of NoSQL performance to relational models. Recently, together with some colleagues, we decided to explore the Spark ecosystem. Spark is a Hadoop MapReduce alternative that improves on the performance of Hadoop, in part due to its ability to cache intermediate results in memory. Additionally, Spark addresses the lack of flexibility of the MapReduce model. Spark also [...]

Spark Ecosystem

April 21st, 2014

In a previous post we introduced Spark, a framework that will play an important role in the Big Data area. This page from Databricks is a good starting point for understanding what Spark is; however, let me reproduce an overview in this post. Spark runs on top of existing Hadoop clusters to provide enhanced and additional functionality. Although Hadoop is effective for storing vast amounts of data cheaply, the computations it enables with MapReduce are highly limited. MapReduce is only able to execute simple computations and uses a high-latency batch model. Spark provides a more general and powerful alternative to Hadoop's MapReduce, offering rich functionality such as stream processing, machine learning, and graph computations. Spark provides out-of-the-box support for deploying within an existing Hadoop [...]

Spark: Big Data Analytics Beyond Hadoop

April 20th, 2014

Hadoop is definitely the de facto standard for large-scale data processing across nearly every industry and enterprise. However, as the "Volume", "Variety" and "Velocity" of data increase, Hadoop, as a batch processing framework, cannot cope with the requirement for real-time analytics. As we saw in our Technology Basics for Data Scientist course, the scientific community is offering alternatives such as Storm, a framework open sourced by Twitter that provides event processing and distributed computation capabilities. Storm uses custom-created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. A Storm application is designed as a topology of interfaces which create a "stream" of transformations. It provides functionality similar to that of a MapReduce job, with the exception that the topology will theoretically run indefinitely until it is manually terminated. Hortonworks, one of the [...]

Big Data: An opportunity for entrepreneurs and companies

December 13th, 2011

The arrival of Linux empowered innovative developers, and the Linux, Apache, MySQL and PHP software stack (LAMP, which completely changed the web application landscape) let them build powerful web servers from open source code. All of this led to the creation of new companies in the ICT sector and became the foundation of what is known as Web 2.0. In my view, MapReduce, the cornerstone of this new world called Big Data, could have the same effect, as the centerpiece of an ecosystem of open source tools for large-scale analysis of the flood of data available today, both private and public. A whole sea of opportunities for entrepreneurs [...]