Where Are You Implementing Your Big Data Algorithms?

It sounds like a simple question. You have to load several data sets, implement some data cleansing, perform some matching to third-party data, compute several aggregates, develop some rankings, group several dimensions, benchmark against another data set, analyze for trends, and then normalize the data for multiple data visualizations.

In all likelihood, the algorithms that perform these functions are going to be implemented by different people, in different technologies, and perhaps at different stages in the analysis. End to end, they represent a complex data flow from data sources through computations and analysis to delivery.

Key Data Architecture Considerations


So my question is: where are you implementing these data processing functions? Where are your algorithms stored? How are they documented? How do you answer the question, "Where should I do this data processing?" And what is your big data culture: are you more likely to let data scientists determine which tool to use for different needs, or are you centralizing these data architecture decisions?

Once implemented, how do you review your data processing to determine which parts need to be refactored? Maybe a step isn't performing well? Maybe a data visualization required some last-mile data cleansing that should be moved upstream to benefit other analyses? Perhaps some algorithm fails the "KT" (Knowledge Transfer) test and is so complex that it will be impossible to maintain?

Or maybe you've implemented something in a Big Data tool that has just released a major upgrade requiring substantial changes to the implementation? Or even worse, perhaps the tool you selected is in decline, having never achieved critical mass, and now you have to explore alternatives and weigh switching costs.

The reverse question is equally important. Perhaps you're bundling some activity in the wrong tool and should consider expanding your technical architecture? Perhaps you are spending too many cycles getting SQL to perform and should consider a NoSQL store? Maybe the Python scripts you developed for data integration are becoming unmanageable and an ETL tool is needed?

Managing the Evolving Big Data Landscape and Growing Business Need


So the business need is growing, the technology landscape is changing quickly, access to talent is volatile, and both standards and best practices are evolving. What does this mean for Big Data specialists and Digital Transformation leaders who need to prove results today while managing an evolving practice?

My simple answer is to rely on the basic practices that have carried application development through significant changes in demand, technologies, and methodologies. Some specifics:

  • Invest in basic version control so that you can track changes to implementations across platforms and practices.

  • Evolve a data governance practice that starts with basic data dictionaries and documentation on algorithms.

  • Build an agile data practice to make sure participants focus on the problems of highest business value and demo their results.

  • Develop operational KPIs covering development cost, implementation complexity and system performance to sense when an implementation shows signs of becoming a pain point.

  • Capture technical debt, data quality barriers, and other things that need improvement.

And most important:

  • Invest time/resources to perform R&D and experiment.
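
To make the documentation and KPI points above a little more concrete, here is a minimal Python sketch of a lightweight registry of data processing steps. Everything in it is an assumption for illustration: the `PipelineStep` fields, the `pain_points` helper, the example step names, and the thresholds are hypothetical, not a standard or a recommendation for specific values. The idea is only that once each step records where it runs, who owns it, and a few basic measures, you can start sensing which ones deserve a second look.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PipelineStep:
    """One documented data processing step (hypothetical structure)."""
    name: str
    tool: str                      # where the algorithm is implemented
    owner: str                     # who can answer questions about it
    description: str = ""          # plain-language documentation
    runtime_minutes: float = 0.0   # system performance KPI
    lines_of_code: int = 0         # rough proxy for implementation complexity
    known_issues: List[str] = field(default_factory=list)  # technical debt


def pain_points(
    steps: List[PipelineStep],
    max_runtime_minutes: float = 30.0,   # assumed threshold, tune to your needs
    max_lines_of_code: int = 500,        # assumed threshold, tune to your needs
) -> Dict[str, List[str]]:
    """Return the steps that show signs of becoming a pain point, with reasons."""
    flagged: Dict[str, List[str]] = {}
    for step in steps:
        reasons = []
        if not step.description.strip():
            reasons.append("undocumented")
        if step.runtime_minutes > max_runtime_minutes:
            reasons.append("slow")
        if step.lines_of_code > max_lines_of_code:
            reasons.append("complex")
        if step.known_issues:
            reasons.append("open issues")
        if reasons:
            flagged[step.name] = reasons
    return flagged


# Example registry with two made-up steps.
steps = [
    PipelineStep(
        name="cleanse_customers",
        tool="Python",
        owner="data-eng",
        description="Trims and deduplicates customer records.",
        runtime_minutes=4,
        lines_of_code=120,
    ),
    PipelineStep(
        name="match_third_party",
        tool="SQL",
        owner="analytics",
        runtime_minutes=45,
        lines_of_code=300,
        known_issues=["fuzzy match misses abbreviations"],
    ),
]

flagged = pain_points(steps)
for name, reasons in flagged.items():
    print(f"{name}: {', '.join(reasons)}")
```

Kept in version control next to the code it describes, even a simple file like this gives you a starting data dictionary for your algorithms and a place to record the debt you find along the way.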


Thanks to Matt Turck: Is Big Data Still a Thing
