Last week, the large lecture hall at the IESEG School of Management, in the Grande Arche de la Défense, was packed for the meeting of the Hadoop User Group. A success for this user group that shows the interest French companies now take in the Big Data phenomenon.
A success that also shows that Big Data is no longer just a buzzword: it is indeed a reality for many companies. And those companies are now facing the practical realities of these technologies.
Admittedly, the cost of storage has dropped considerably and the Hadoop framework has matured. But business expectations have risen just as sharply: companies must handle massive amounts of data, cope with data that changes rapidly and whose quality is not always guaranteed, and, at the end of the chain, few users are willing to wait 30 seconds for an analysis, even one that involves processing billions of records.
The Data Lake, a worthy successor to the Data Warehouse?
Grandpa’s Business Intelligence was a data warehouse with star or snowflake modeling, its ETL running every night to feed users with fresh data in the morning. An approach that Big Data calls entirely into question.
Just “dump all your raw data into a Data Lake” and let the Data Scientists get to work… A vision far less simple in the reality of companies, as pointed out by Cyrille Coqueret, Technical Director Business Intelligence & Big Data at EDIS Consulting.
“The goal of setting up a Data Lake is to handle very large volumes of data. We can store structured and unstructured data, data internal or external to the company. We can finally ‘de-silo’ the data, which the Data Warehouse never fully achieved. The Data Lake also brings agility, with the ability to add new data very quickly, and finally gives users the means to be autonomous in their analyses.”
According to him, many Data Lake projects currently carried out by French companies are still limited to this raw data dump, leaving the Data Scientists to carry out their analyses and to “make do” when cross-referencing the data put at their disposal. An approach essentially focused on technology.
For Cyrille Coqueret, the approach should not only be technical, but above all methodological and functional. “Companies often neglect the issues of putting these Data Lakes into production and maintaining them. You can start a project quickly, but projects often stall at the PoC (Proof of Concept) stage without ever making it into production. Big Data is not magic. You ingest unstructured, semi-structured or raw data, but if you want to analyze it and get something out of it, you have to structure it. Yet, for the moment, there are no established best practices for structuring a Data Lake.”
Hadoop, its strengths and weaknesses
The expert contrasts this with the traditional Business Intelligence approach, in which a data warehouse is built from internal data, generally loaded in nightly batches, and in which the data is perfectly structured and standardized. This structuring gave birth to the traditional star and snowflake schemas, designed for relational databases. The model had its strengths, but also real limitations: some in terms of flexibility, with a schema that is difficult to evolve, others in terms of storage cost and scalability.
In this respect, Big Data and its Data Lake approach provide a real answer. Hadoop has a flaw, however: its highly distributed architecture struggles when it comes to joining data.
“If we try to transpose the star schema onto Hadoop, we end up with a huge fact file split across several servers on one side, and small dimension tables on another machine. Launching a query that requires a join generates several jobs that are then reconciled in the reduce phase to produce the desired result. The load is badly distributed: all the servers are solicited because the data is very scattered, heavy network traffic is generated when the data is reconciled, and memory consumption spikes when the results are consolidated.”
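To make the shuffle cost concrete, here is a minimal, purely illustrative Python sketch of a reduce-side join, the pattern MapReduce falls back on for the star schema described above. The table names and values are invented:

```python
from collections import defaultdict

# Invented sample data: a large fact table and a small dimension table,
# as they would be scattered across cluster nodes.
sales = [("p1", 100), ("p2", 250), ("p1", 75)]        # fact: (product_id, amount)
products = [("p1", "laptop"), ("p2", "phone")]        # dimension: (product_id, label)

# Map phase: tag every record with its origin so it can be shuffled by join key.
mapped = [(pid, ("sale", amount)) for pid, amount in sales]
mapped += [(pid, ("product", label)) for pid, label in products]

# Shuffle: group by key. On a real cluster this is the network-heavy step,
# since every record must travel to the reducer owning its key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: perform the actual join per key, buffering values in memory,
# which is where the memory-consumption spike occurs.
joined = []
for pid, values in groups.items():
    labels = [v for tag, v in values if tag == "product"]
    amounts = [v for tag, v in values if tag == "sale"]
    for label in labels:
        for amount in amounts:
            joined.append((pid, label, amount))

print(sorted(joined))
```

Every fact record crosses the network once just to meet its dimension row, which is exactly the overhead the denormalized layout described next avoids.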
Denormalize to better distribute the load on the cluster
An inefficient approach, according to the expert, who suggests another way of organizing the data.
“We have to denormalize the data as much as possible, enriching the facts with the dimension attributes in the same file. We end up with big files that contain all the data we will be able to query.”
In this way, the data is better distributed across the cluster and the joins are, de facto, already done. “We get a better load distribution, much less network traffic because the aggregations are already done, and less memory consumed in the reduce phase.”
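The denormalization he describes can be sketched as follows. In this invented example, a one-off enrichment pass folds the dimension attributes into every fact record; a later query is then a pure map/aggregate, with no join and no shuffle of dimension data:

```python
# Invented sample data, reusing the star-schema shape from above.
sales = [("p1", 100), ("p2", 250), ("p1", 75)]   # fact: (product_id, amount)
products = {"p1": "laptop", "p2": "phone"}        # dimension: product_id -> label

# One-off enrichment pass, done once when loading the Data Lake:
# each fact record now carries its dimension attributes.
wide_rows = [
    {"product_id": pid, "label": products[pid], "amount": amount}
    for pid, amount in sales
]

# A query such as "revenue per label" becomes a simple scan-and-aggregate
# over one file: each cluster node can process its own block independently.
revenue = {}
for row in wide_rows:
    revenue[row["label"]] = revenue.get(row["label"], 0) + row["amount"]

print(revenue)
```

The price, as the article notes next, is that the dimension values are repeated on every row.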
This denormalization duplicates data and mechanically increases the storage volume. But with Big Data, storage volume is no longer a blocking criterion.
Nevertheless, some consequences are more troublesome. “With this approach, the modeling is defined a priori and the number of columns increases, which makes the model harder for the Data Scientist to use. We already knew this problem with Business Objects universes, for example, where you could end up with 400 different objects. Users were lost, and in the end only IT could handle them.”
The expert discussed several ways to overcome these limitations: think carefully about data partitioning, use data compression, or adopt an optimized storage format such as Parquet, created by Twitter.
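As an illustration of the partitioning idea, here is a small sketch using the Hive-style `key=value` directory convention that Parquet-based stacks typically rely on. The records and paths are invented:

```python
import json
import os
import tempfile

# Invented sensor records to be partitioned by date.
records = [
    {"event_date": "2015-06-01", "sensor": "t1", "value": 21.5},
    {"event_date": "2015-06-02", "sensor": "t1", "value": 22.0},
    {"event_date": "2015-06-01", "sensor": "t2", "value": 19.8},
]

# Write one directory per partition value (the Hive convention: event_date=...).
root = tempfile.mkdtemp()
for rec in records:
    part_dir = os.path.join(root, "event_date=" + rec["event_date"])
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-0000.json"), "a") as f:
        f.write(json.dumps(rec) + "\n")

# Partition pruning: a query filtered on one date only scans one directory,
# instead of the whole dataset.
target = os.path.join(root, "event_date=2015-06-01", "part-0000.json")
with open(target) as f:
    day_rows = [json.loads(line) for line in f]

print(len(day_rows))
```

Compression and columnar formats such as Parquet then reduce what is read within each partition; the pruning shown here reduces how many partitions are read at all.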
Each company finds its own recipes to reconcile large data volumes, performance and agility
Another pitfall of a priori modeling mentioned by Cyrille Coqueret: the loss of flexibility for analysts. “Data Scientists need to be able to build their own datasets, which is why it is recommended to keep the detail data in the datasets, keep the raw information and favor scalable modeling. The JSON format is well suited to this because it is self-describing: it carries its own structure and allows information to be added without any problem.”
The expert shared a few tips with his peers, such as pivoting the indicators usually stored as columns into rows.
EDIS Consulting applied this recipe in a project for the research center of an industrial company, to store sensor data. With row-based storage, adding a new sensor simply adds rows rather than columns, so the storage model becomes independent of the number of sensors “lit” during a test.
While keeping good modeling and good performance, the consultants managed to preserve flexibility, at the cost of larger volumes, since the contextual data must be repeated on each row.
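The column-to-row pivot described above can be sketched as follows, with invented sensor readings:

```python
# Wide layout: one row per test run, one column per sensor.
# Adding a sensor means adding a column to every row.
wide = [
    {"test_id": 1, "sensor_a": 21.5, "sensor_b": 19.8},
    {"test_id": 2, "sensor_a": 22.0},  # sensor_b was not "lit" in this test
]

# Long layout: one row per (test, sensor) measurement.
# Adding a sensor now just adds rows; the schema never changes.
long_rows = [
    {"test_id": row["test_id"], "sensor": key, "value": value}
    for row in wide
    for key, value in row.items()
    if key != "test_id"
]

print(long_rows)
```

Note the duplication the article mentions: `test_id` (and any other test metadata) is repeated on every measurement row.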
Finally, faced with the growing number of columns that makes datasets less readable for Data Scientists, Cyrille Coqueret suggests creating views with a smaller number of columns, views that cover 80% of user needs. “Instead of files with 200, 300 or 400 columns, we generate smaller files with 20 to 40 columns at most, so that they remain manageable for an end user.”
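The narrow-view idea amounts to a column projection over the wide file; a hypothetical sketch, with invented column names:

```python
# A hypothetical wide dataset: a few useful columns plus 200 rarely-used ones.
wide_columns = ["id", "date", "amount"] + [f"extra_{i}" for i in range(200)]
wide_rows = [{**dict.fromkeys(wide_columns, 0), "id": i} for i in range(3)]

# The "view": project only the subset that covers most user needs.
view_columns = ["id", "date", "amount"]
view = [{c: row[c] for c in view_columns} for row in wide_rows]

print(len(view[0]))  # columns per row in the view
```

The wide file stays in the lake for the remaining 20% of needs; users work against the narrow files day to day.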
Another avenue explored by EDIS Consulting for its client: coupling the Data Lake with a search engine. “The idea is to index the data in the dataset to allow full-text or per-column search. The user enters his query in natural language, can then select the datasets and columns he will need, and finally exports the subset of data corresponding to the analysis he wants to conduct.”
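As a toy illustration of the indexing idea, here is a minimal inverted index mapping tokens to the (dataset, column) pairs where they appear, so a free-text query can point the user at the right data. A real deployment relies on an engine such as Sinequa; the catalog below is invented:

```python
from collections import defaultdict

# Invented catalog: sample values per (dataset, column).
catalog = {
    ("sensors", "label"): ["temperature probe", "pressure probe"],
    ("sales", "product"): ["laptop", "phone"],
}

# Build the inverted index: token -> set of (dataset, column) locations.
index = defaultdict(set)
for (dataset, column), values in catalog.items():
    for value in values:
        for token in value.lower().split():
            index[token].add((dataset, column))

# A full-text lookup tells the user which datasets/columns to export.
print(sorted(index["probe"]))
```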
For this client, Cyrille Coqueret used the Sinequa engine to let the Data Scientists create the data files they need for their analyses.
A series of recipes, while waiting for best practices to emerge. “There is not yet enough experience in the field of Data Lake modeling. We believe this is currently holding back the move of Big Data projects into production,” concludes the expert.
Communities are working on new solutions… technical ones
Faced with this modeling problem, several avenues were raised by the audience at this edition of HUG France: the use of Impala on Parquet storage for some, the Kylin solution for others.
An Apache project since late 2014, Kylin is an OLAP engine built on top of Hive/Hadoop. Its pre-computed data cubes (cuboids in eBay terminology) partly overlap with Cyrille Coqueret’s approach, introducing redundancy into the data to gain performance.
Kylin was originally developed by eBay for its internal needs and is now open to everyone on GitHub.
Another innovative solution presented during this evening dedicated to Big Data was the one implemented by Mappy. Faced with the exponential growth of its users’ needs, with storage that grew from 10 million aggregated data points in October 2012 to 2.7 billion today, the French company first turned to Hadoop Map/Reduce, then to in-memory Big Data with Spark SQL, and is now testing the brand-new real-time Big Data solution from the French vendor DataBig.