Three Steps to Big Data Discovery

Data discovery efforts are particularly challenging in an enterprise or even a medium-sized business with many disparate databases. It's easy to get stuck before you start because you don't know enough about what data exists, where it is stored, how to get access to it, how to interpret it, what assumptions and rules govern its creation and transformation, how significant its data quality issues are, and ultimately how to relate entities and metrics from multiple data sources.

A top-down approach would attempt to document databases, data flows, business rules, data quality, transformations, and application integration. While some of this is clearly needed and important for IT organizations to develop, the task of developing and maintaining rigorous documentation is daunting. In my experience, I seldom find IT organizations with this documentation and practice in place.

I'm an advocate of bottom-up, business-driven discovery efforts. So if you're successful and have found people in the organization who have asked good data-driven questions, what's next? How do you start answering questions with only a limited enterprise guide to your data?

Who are your Data Explorers?

You're going to need to build a small team that has the leadership, people, and technology skills to perform a data exploration. Why leadership and people skills first? It's because much of the knowledge about what data exists, where it lives, and how it is created lies with other people, whom we often label as "subject matter experts". How data comes to be is often embedded in existing business processes, some of which will be structured and documented while others will not be. So the first skills your team needs are people skills to make connections, ask the right questions, build trust, learn, and document.

Eventually, this team will want to explore the data, which is where technical skills will be needed. Don't immediately assume that this is your DBA - it depends. If your DBA largely keeps the databases humming, provides access, and occasionally creates views or data reports, then this person might not be skilled at data discovery tools and practices. You'll need someone who can develop entity relationship diagrams, use visualization tools to provide top-down access to data, and use data quality tools to perform a dimensional data quality analysis. She will also have to know how to quickly research ETLs, stored procedures, and application access to identify the most relevant ones and to provide sufficient business insight. In my experience, few DBAs possess these skills. A very technical data scientist should have these skills. A database architect or advanced developer will also be able to help, but they are often dedicated to projects and hard to assign to data exploration projects.
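As a rough illustration of what one small slice of a data quality analysis might look like, here is a minimal Python sketch that profiles columns for completeness and duplicate values. The `profile_column` helper, the `customers` sample, and its column names are all hypothetical and made up for this example; a real effort would more likely rely on a dedicated data quality tool run against the actual databases.

```python
from collections import Counter


def profile_column(rows, column):
    """Return simple quality metrics for one column of a list-of-dicts table."""
    values = [row.get(column) for row in rows]
    total = len(values)
    # Treat None and empty strings as missing values.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        # Share of rows that have a usable value in this column.
        "completeness": len(present) / total if total else 0.0,
        # Number of different non-missing values.
        "distinct": len(counts),
        # Values that appear more than once (a red flag for a key column).
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


# Hypothetical sample data standing in for a real table extract.
customers = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C2", "email": "b@example.com"},
    {"customer_id": "C3", "email": None},
]

report = {col: profile_column(customers, col) for col in ("customer_id", "email")}
for col, metrics in report.items():
    print(col, metrics)
```

Even a crude profile like this gives the team concrete questions to bring back to subject matter experts, such as why a supposed key is duplicated or why half the email values are missing.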

Iterative Data Discovery

Once you've assembled your team, they have to be comfortable working iteratively. Today's discovery efforts will lead to more questions, shifts in priorities, new strategies, and probably more issues to overcome. This team needs to collaborate efficiently so all members can learn from each other and participate in discussions around "what should we do next".

Document Discoveries

It's not good enough for the Data Explorers to navigate and learn - they have to use their efforts to document and teach others. This team will need to agree on which data sources to focus on, what areas to document, and what tools or templates to use to develop this documentation.

The combination of these three steps describes who should perform discovery efforts, what their process is, and what their deliverables are. Create milestones for this team to do readouts and provide direction. Decide when you've achieved a "good enough" exploration.

1 comment:

  1. Documenting the value added is important. The bottom-up, decentralized approach is the fastest way to turn data into useful information--especially if the work supports immediate operational needs or provides insight that leads to growth or savings.



About Isaac Sacolick

Isaac Sacolick is President of StarCIO, a technology leadership company that guides organizations on building digital transformation core competencies. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO and digital transformation influencer, with over 900 articles published at InfoWorld, his blog Social, Agile, and Transformation, and other sites. You can find him sharing new insights @NYIke on Twitter, on his Driving Digital Standup YouTube channel, or during the Coffee with Digital Trailblazers.