Building a Unified Data Pipeline in Spark

November 24th, 2014

The latest session of the Barcelona Spark Meetup was very well received by attendees. It featured Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) speaking about 'Building a Unified Data Pipeline in Spark'. If you missed the presentation or want to revisit it, check out the video recording here (talk in English). Below you will find some pictures of the session. Many thanks to Aaron Davidson for accepting our invitation, and to Paco Nathan, Alex Sicoe, Sameer Farooqui and Olivier Girardot for their support of this meetup. I hope you enjoyed Barcelona and that you come back soon.

Strata + Hadoop World in Barcelona 2014: Videos & Slides

November 22nd, 2014

The conference is over, and from my point of view it was a great success. The conference program was very good, with great networking opportunities and a good sponsor pavilion. I really enjoyed it. Let me say to the organisers that Barcelona is delighted to welcome conferences like Strata + Hadoop World. All the attendees I spoke with were excited to be in Barcelona. Congratulations on choosing Barcelona! If you missed the conference or want to revisit the main presentations or keynotes, check out the keynote videos or speaker slides. You can also check out the official photos.

Get certified for Apache Spark in Barcelona

November 13th, 2014

As all my students know, I think that Hadoop is showing its age and Apache Spark is exploding. Let me share with you an important opportunity to get the Developer Certification for Apache Spark in Barcelona. Yes, I said in Barcelona! It will be offered at the upcoming Strata + Hadoop World next week at the CCIB - Centre de Convencions Internacional de Barcelona. If you want to learn more, you can visit this web page. It is a good opportunity! I hope to see you at the Strata + Hadoop World event. You are also invited to attend the next meeting of the Barcelona Spark Meetup. This fourth meeting will feature Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks) speaking about 'Building a Unified Data Pipeline in Spark' (talk in English). The talk will start next [...]

Databricks-Spark comes to Barcelona!

October 9th, 2014

We did it: a meetup with engineers coming over from the USA to tell us first-hand what is cooking with Spark at Databricks! This fourth meeting will feature Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks), who will speak about 'Building a Unified Data Pipeline in Spark' (talk in English). The talk will take place next Thursday, November 20th, at 18:30, in the FIB auditorium (sala de actos) on the UPC's Campus Nord. We look forward to seeing you all; it is sure to be impressive! If you are interested, it is very important that you sign up as soon as possible on the meetup's list of confirmed attendees, since the capacity of the auditorium is 80 people and [...]

Big Data Open Source Landscape: Processing Technologies

July 15th, 2014

Hadoop is a well-established software framework that analyses structured and unstructured big data and distributes applications across thousands of servers. Hadoop was created in 2005, and several projects have since appeared in the Hadoop space that try to complement it. Sometimes those technologies overlap with each other and sometimes they are partially complementary. I will try to draw a brief map of them.

Programming Model

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. The Apache Hadoop project provides an open-source MapReduce implementation.

Management layer

The scalability that is needed for big data processing is supported [...]
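To make the Map/Reduce paradigm mentioned in this excerpt more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only for illustration. Each input string stands in for one of the "small fragments of work" that a real cluster would hand to different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(fragments):
    """Run map, shuffle and reduce in sequence over a list of fragments."""
    # Each fragment is mapped independently, so on a cluster any fragment
    # could be executed (or re-executed after a failure) on any node.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input: three fragments of a larger data set.
fragments = ["big data on Hadoop", "Hadoop and Spark", "big clusters"]
counts = word_count(fragments)
print(counts)
```

Because `map_phase` only looks at its own fragment and `reduce_phase` only looks at one key, both steps can in principle be distributed and retried independently. That independence is the property the paradigm relies on.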

Is Hadoop showing its age?

May 22nd, 2014

In my opinion, yes! The Hadoop framework is showing its age and new processing models are a must, not only for performance but also because of its lack of flexibility. In a way, it is the same as what is happening in Big Data management: because of their limited query flexibility, NoSQL databases are adding new query features based on SQL, while SQL databases are bringing some measure of NoSQL performance to relational models. Recently, together with some colleagues, we decided to explore the Spark ecosystem. Spark is a Hadoop MapReduce alternative that improves on the performance of Hadoop, in part due to its ability to cache intermediate results in memory. Additionally, Spark addresses the lack of flexibility of the MapReduce model. Spark also [...]

Spark Ecosystem

April 21st, 2014

In a previous post we introduced Spark, a framework that will play an important role in the Big Data area. A good starting point for understanding what Spark is can be found on this page from Databricks; however, let me reproduce an overview in this post. Spark runs on top of existing Hadoop clusters to provide enhanced and additional functionality. Although Hadoop is effective for storing vast amounts of data cheaply, the computations it enables with MapReduce are highly limited. MapReduce is only able to execute simple computations and uses a high-latency batch model. Spark provides a more general and powerful alternative to Hadoop's MapReduce, offering rich functionality such as stream processing, machine learning, and graph computations. Spark provides out-of-the-box support for deploying within an existing Hadoop [...]

Spark: Big Data Analytics Beyond Hadoop

April 20th, 2014

Hadoop is definitely the de facto standard for large-scale data processing across nearly every industry and enterprise. However, as the "Volume", "Variety" and "Velocity" of data increase, Hadoop as a batch processing framework cannot cope with the requirement for real-time analytics. As we saw in our Technology Basics for Data Scientist course, the scientific community is offering alternatives such as the Storm framework, open-sourced by Twitter, which provides event processing and distributed computation capabilities. Storm uses custom-created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. A Storm application is designed as a topology of interfaces which create a "stream" of transformations. It provides similar functionality to a MapReduce job, except that the topology will theoretically run indefinitely until it is manually terminated. Hortonworks, one of the [...]

Hadoop distribution: Main Players

April 5th, 2014

MAIN PLAYERS

Apache Hadoop is the most popular framework used for processing large amounts of data in the Big Data arena. It is clear that Hadoop is here to stay. That is why I always tell my students that it is important to know how it works. For the courses I teach where we do not have lab sessions, I produced this hands-on exercise for a quick glimpse. If you are interested in learning more about Hadoop, you can start with this hands-on exercise, which includes some bibliographic references. Some former students and friends who are already in the industry have asked me to recommend one of the distributions available on the market. Each distribution is different and as a researcher I do not have an in-depth [...]

Scholarship to pursue a PhD in our research group in Barcelona

December 13th, 2013

"la Caixa" scholarship to pursue a PhD in our research group in Barcelona in the area of advanced data analytics (ref. BSC-Autonomic 01/2014). The call for scholarships for doctoral studies at Spanish universities from the Obra Social "la Caixa" has just opened, and our research group has a researcher position for pursuing a PhD at our university (which holds the required quality mention) for a candidate who obtains this scholarship. Spanish nationality is required to apply for this scholarship. The doctoral work would focus on advanced analytics techniques for geopositioned streaming data (Twitter, Instagram, Foursquare, ...) and smart sensors (smartphones, cameras, ...) to predict in real time the impact (social, economic, ...) [...]

How to get started programming in Big Data-Hadoop?

November 15th, 2013

Hadoop is one of the most popular platforms in the Big Data world and probably the best gateway into programming in this new world. However, because of its characteristics, many programmers may not find it easy to start working with it. For this reason, when I finished writing this take-home lab assignment for one of the courses I teach (in which we have no in-person lab sessions and students have to get started on their own), I thought of making it public for everyone who wants to get into this incredible Big Data universe on their own. That is why I set out to write a tutorial with simple, detailed steps so that anyone reasonably comfortable with Linux can enter this world. I hope that [...]