When Big Data comes up, terms and concepts related to data exploration and analysis in business contexts often get mixed together, which leads to misunderstandings about what each one means. In this post I will present the main differences among them, to clarify the kind of data analysis each one focuses on and when it is correct to use one or the other.
The oldest concept linking data capture and analysis to business decision making is Business Intelligence. In fact, the first person to use this expression was Hans Peter Luhn, an IBM engineer, in 1958 (decades before the massive adoption of household computers, as we can see). The term was later popularized by Howard Dresner in the late 1980s and early 1990s, while working as an analyst at Gartner Group. Since then, the definition of Business Intelligence has been narrowed to a descriptive type of data analysis, in which historical data are queried and visualized in aggregate and indicators are cross-referenced to get a clearer view of what has happened and is happening in the organization. This definition therefore leaves aside the predictive type of analysis, which aims to extract knowledge from data in the form of patterns, trends or models that provide a degree of certainty about the outcome of potential future actions and the factors behind that outcome.
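To make the descriptive versus predictive distinction concrete, here is a minimal sketch in Python. The sales records, the function names and the choice of a straight-line trend are all assumptions made for illustration; real Business Intelligence and predictive tools are far richer than this.

```python
from collections import defaultdict

# Hypothetical monthly sales records: (month index, region, units sold).
sales = [
    (1, "north", 100), (1, "south", 80),
    (2, "north", 110), (2, "south", 90),
    (3, "north", 120), (3, "south", 100),
]

def total_by_region(records):
    """Descriptive: aggregate historical data to see what has happened."""
    totals = defaultdict(int)
    for _month, region, units in records:
        totals[region] += units
    return dict(totals)

def total_by_month(records):
    """Descriptive: aggregate the same data along the time axis."""
    totals = defaultdict(int)
    for month, _region, units in records:
        totals[month] += units
    return dict(sorted(totals.items()))

def fit_linear_trend(points):
    """Predictive: ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Descriptive view: what happened, by region and by month.
regional_totals = total_by_region(sales)
monthly_totals = total_by_month(sales)

# Predictive view: extrapolate the monthly trend to a month not yet observed.
slope, intercept = fit_linear_trend(list(monthly_totals.items()))
forecast_month_4 = slope * 4 + intercept
```

The first two functions only summarize data that already exists, which is the kind of question Business Intelligence answers. The fitted line goes one step further and produces an estimate for a future month, which is the kind of output a predictive analysis is after.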
This domain of predictive analytics is where we find terms such as Data Science and Data Mining. The difference between these two concepts is more subtle. Data Science refers to a set of principles and fundamentals, both scientific and applied, that guide the extraction of knowledge from data (the patterns and models mentioned above). It encompasses and integrates principles from statistics, mathematics and computer science, together with fundamentals from the specific application field in which we want to extract that knowledge. Data Mining, in turn, is the actual extraction of such knowledge, using tools and techniques and following a defined process of data extraction and analysis, all of which are based on Data Science principles.
At this point we can bring in the definition of Big Data as a set of specific technologies (among those used in Data Mining or Business Intelligence) that make it possible to process and analyze data whose volume or processing complexity exceeds the computing capabilities of conventional computers. This is the problem Google faced when it created a first set of tools to efficiently analyze large volumes of data distributed across a cluster of machines working in parallel, each of them storing a partition of the overall data. This approach gave rise to a full ecosystem of tools that has progressively brought Big Data within reach of more people.
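The idea of analyzing partitioned data in parallel can be sketched on a single machine. The example below is only a simulation under stated assumptions: the partitions, the word-count task and the function names are hypothetical, and threads stand in for the machines of a real cluster. It follows the general map-then-reduce pattern, not any specific tool's API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical data set, already split into partitions (one per "machine").
partitions = [
    ["big data", "data mining"],
    ["data science", "business intelligence"],
    ["big data tools"],
]

def count_words(partition):
    """Map step: each worker counts words in its own partition only."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

def word_count(parts):
    # One worker thread per partition simulates machines working in parallel.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_counts = list(pool.map(count_words, parts))
    # Merge the partial results into the global answer.
    return reduce(merge_counts, partial_counts, Counter())

totals = word_count(partitions)
```

The point of the design is that no worker ever needs to see the whole data set: each one processes its local partition, and only the small partial results are brought together at the end.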