You’ve probably heard the term “Big Data” used by the media and press, and you may have wondered what it actually refers to. Is it merely another marketing buzzword to keep selling middleware? Is it a way for your database vendor to sell you a new system for data persistence? Perhaps it is something you should be scared of? Large roaming gangs of oversized ones and zeros bent on destruction and mayhem?
What if it is a legitimate movement that can be defined in a way that conveys its meaning and also lets us recognize what is not big data? The latter is really the core requirement when defining a term: a useful definition must allow you to disqualify things that do not belong to it. NoSQL failed this requirement. “Not Only SQL” does not really define anything; it admits both SQL and non-SQL systems, so nothing can ever be disqualified from membership. For Big Data, then, the task at hand is to define criteria by which you can state with unwavering certainty that something is, or is not, big data.
Sadly, there is no definition commonly accepted by everyone. In the absence of an industry standard, we at Hot Tomali offer our own, based on our experience with large scale integration projects.
Big Data is a solution requirements context in which the data models and instance data visible to an observer are so complex, so massive, and spread across so many data sets that existing data tools and software cannot easily integrate them all. The data sets are visible data sets, meaning the observer can view and retrieve copies of the data.
This disqualifies a single database or a single data set. The scenario described in the definition is one where the observer has access to multiple data sets distributed over multiple domains, each with some relevance to the problem or task at hand.
An example of the big data scenario: someone attempting to build financial markets prediction software that can digest many data sets and predict market futures. There are primary data sets such as interest rates and historical stock market data, yet each of these may be augmented with an almost open-ended collection of others: weather data, employment data, housing market data, energy futures indexes, government statistics, and World Bank data on living conditions for every country in the supply chain of every product built by a company on the Dow Jones index.
The data sets above would number in the thousands, if not tens of thousands. An integration project of this magnitude would be almost overwhelming.
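To make the integration challenge concrete, here is a deliberately tiny sketch in Python of the kind of join such a project performs over and over: combining a few feeds on a shared date key to build one feature table for a predictor. All data set names, dates, and figures here are invented for illustration.

```python
# Toy stand-ins for three of the thousands of feeds described above.
# Each is a simple {date: value} mapping; real feeds would differ in
# format, granularity, and update cadence. Values are invented.
interest_rates = {"2013-01-02": 0.25, "2013-01-03": 0.25}
stock_close    = {"2013-01-02": 13412.55, "2013-01-03": 13391.36}
weather_temp_c = {"2013-01-02": -3.0, "2013-01-03": -1.5}

def merge_by_date(*datasets):
    """Inner-join any number of {date: value} data sets on their dates."""
    shared = set(datasets[0])
    for ds in datasets[1:]:
        shared &= set(ds)          # keep only dates present in every feed
    return {d: tuple(ds[d] for ds in datasets) for d in sorted(shared)}

features = merge_by_date(interest_rates, stock_close, weather_temp_c)
# Each date now maps to one row of features: (rate, close, temperature).
```

Joining three clean feeds on one key is trivial; the big data problem appears when this must be repeated across thousands of feeds with incompatible keys, formats, and owners.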
Now consider an integration project with similar ambitions within the domain of a single company and the data sets it can access internally. Such a project could easily be completed using existing database and ETL tooling, so it would not qualify as a big data project under our definition.
Our definition is not an industry standard, nor are we claiming it is a normative reference model for the rest of the world to use. It is merely our attempt to help our customers and others understand what we mean when we refer to Big Data.
Good luck with your big data projects and don’t forget to call Hot Tomali when your projects require the benefits of a third party, independent review. As always, we are here to help.