Why is Data Sooo Messy and How to Avoid Data Landfills

I was surprised this morning to see an article about the "janitorial" work data scientists have to perform to be able to find "nuggets" in big data. Actually, my only surprise is that the story is in the NY Times and that they are covering the least glamorous side of the "sexiest" job.

Why is data so messy?

Let's start with the past. The history of data science starts with complicated data warehouses, expensive BI tools, hundreds if not thousands of ETLs moving data all over the place, and bloated expectations. At the other extreme, many organizations have siloed databases, DBAs largely skilled at keeping the lights on (future post?), and spreadsheet jockeys performing analytics. The janitorial work data scientists are performing exists partly because of the mess of databases and derivative data sources previous generations left behind.

And I'm not sure this generation will do any better. As I reported just a couple of months ago, with great power comes even greater responsibility. All the technologies and tools data scientists have at their fingertips also have the power to create a new set of data stashes - informal places where data is aggregated - or buried data mines - places where analytics are performed, but not automated or transparent to future scientists.

If data scientists, DBAs, and CIOs are not careful, the data stashes and buried data mines can slowly transform into full-blown data landfills.

DBAs know what I'm talking about. It's a combination of data warehouses, reports, dashboards, and ETLs that no one wants to touch. No one understands who is using what reports or dashboards in what business process for what purpose or benefit. ETLs look like a maze of buried unlabeled pipes developed using a myriad of materials (programming approaches) and with no standards to help future workers separate out plumbing from filters and valves.
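One way to keep the pipes labeled is to record, for every warehouse, ETL, and dashboard, who owns it, what it is for, and who consumes it. The following is only a minimal sketch in Python under stated assumptions: the `DataAsset` record, the `landfill_candidates` helper, and the sample inventory are all hypothetical names invented for illustration, not part of any real catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One entry in a hypothetical inventory of data sources, ETLs, and reports."""
    name: str
    kind: str                 # e.g. "warehouse", "etl", "dashboard"
    owner: str = ""           # person or team accountable for the asset
    purpose: str = ""         # business process the asset supports
    consumers: list[str] = field(default_factory=list)  # known downstream users


def landfill_candidates(assets: list[DataAsset]) -> list[str]:
    """Return names of assets missing an owner, a purpose, or any known consumer."""
    return [
        asset.name
        for asset in assets
        if not asset.owner or not asset.purpose or not asset.consumers
    ]


# Illustrative inventory; every name here is made up.
inventory = [
    DataAsset("sales_warehouse", "warehouse", owner="data-eng",
              purpose="monthly revenue reporting",
              consumers=["revenue_dashboard"]),
    DataAsset("nightly_customer_etl", "etl", owner="data-eng"),
    DataAsset("revenue_dashboard", "dashboard", owner="finance",
              purpose="monthly revenue reporting",
              consumers=["finance team"]),
]

print(landfill_candidates(inventory))
```

Anything the helper flags is an asset that nobody can explain, which is exactly the kind of unlabeled pipe that ends up in a landfill. A real inventory would live in a shared catalog rather than a script, but the fields to capture are the same.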

Build Foundations, Not Landfills!

Data scientists and their partners (data stewards, DBAs, business analysts, developers, and testers) need to instill some discipline - dare I say data governance - and balance their time mining for nuggets with practices that establish data and analytics foundations. More on that in an upcoming post. Remember, big data is a journey.

Until then, here are a few things one can learn about data science from a fourth-grade class, and think twice before creating another data source!

About Isaac Sacolick

Isaac Sacolick is President of StarCIO, a technology leadership company that guides organizations on building digital transformation core competencies. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO, a digital transformation influencer, and has over 900 articles published at InfoWorld, CIO.com, his blog Social, Agile, and Transformation, and other sites. You can find him sharing new insights @NYIke on Twitter, his Driving Digital Standup YouTube channel, or during the Coffee with Digital Trailblazers.