The dirty secrete of Big Data is exposed in this very good posting. Forget the Algorithms and Start Cleaning Your Data.
Failure of big data projects is not in technology but the ability to wire the data together. The lack of success is driven by:
- Poor data quality and or inadequate data error handing
- Incompatible or poorly understood semantics from different data sources
- Complex matching rules between data sources
The blog suggests that the big data tooling therefore needs to focus on the burden of integrating, cleaning, and transforming the data prior to analysis. Example: RapidMiner has 1,250 algorithms for this purpose. That might be good, but also very complex for the average human.
Sounds like a classic case of the need for separation of concerns, right? Untangling and designing solutions to these first order problems is data modeling. Given big data’s fluid data structures that means datapoint modelling. Solutioning with 1,250 data manipulation algorithms, Hadoop, algorithms and huge databases etc can then be based on visible logic and good design. With the alternative, jumping right into build, best of luck!