Thursday, January 24, 2013

Three Steps to Big Data Discovery

Data discovery efforts are particularly challenging in an enterprise, or even a medium-sized business, with many disparate databases. It's easy to get stuck before you start because you don't know enough about what data exists, where it is stored, how to get access to it, how to interpret it, what assumptions and rules govern its creation and transformation, how significant its data quality issues are, and ultimately how to relate entities and metrics from multiple data sources.

A top-down approach would attempt to document databases, data flows, business rules, data quality, transformations, and application integration. While some of this is clearly needed and important for IT organizations to develop, the task of developing and maintaining rigorous documentation is daunting. In my experience, I seldom find IT organizations with this documentation and practice in place.

I'm an advocate of bottom-up, business-driven discovery efforts. So if you're successful and have found people in the organization who have asked good data-driven questions, what's next? How do you start answering questions with only a limited enterprise guide to your data?

Who are your Data Explorers?

You're going to need to build a small team that has the leadership, people, and technology skills to perform a data exploration. Why leadership and people skills first? Because much of the knowledge about what data exists, where it lives, and how it is created lies with other people, whom we often label "subject matter experts". How data comes to be is often embedded in existing business processes, some of which will be structured and documented while others will not be. So the first skills your team needs are people skills to make connections, ask the right questions, build trust, learn, and document.

Eventually, this team will want to explore the data, which is where technical skills will be needed. Don't immediately assume that this is your DBA - it depends. If your DBA largely keeps the databases humming, provides access, and occasionally creates views or data reports, then this person might not be skilled at data discovery tools and practices. You'll need someone who can develop entity relationship diagrams, use visualization tools to provide top-down access to data, and use data quality tools to perform a dimensional data quality analysis. This person will also have to know how to quickly research ETLs, stored procedures, and application access to identify the most relevant ones and to provide sufficient business insight. In my experience, few DBAs possess these skills. A very technical data scientist should have them. A database architect or advanced developer will also be able to help, but they are often dedicated to projects and hard to assign to data exploration work.
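
As a rough illustration of what a first-pass data quality check might look like, here is a minimal Python sketch. It is not a prescribed tool or method: the function name `profile_columns`, the choice of metrics (completeness, distinct values, duplicated values), and the sample records are all assumptions made up for this example. A real analysis would more likely use a dedicated data quality tool against the actual databases.

```python
from collections import Counter


def profile_columns(rows):
    """Return simple per-column quality metrics for a list of dict records.

    Hypothetical helper: treats None and empty strings as missing values.
    """
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        counts = Counter(present)
        profile[col] = {
            # Share of records that have a usable value in this column
            "completeness": len(present) / total if total else 0.0,
            # Number of different values seen
            "distinct": len(counts),
            # Values that appear more than once (a hint for key columns)
            "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
        }
    return profile


# Made-up sample records standing in for an extract from one data source
sample_rows = [
    {"customer_id": "C1", "email": "a@example.com", "region": "East"},
    {"customer_id": "C2", "email": "", "region": "East"},
    {"customer_id": "C2", "email": "b@example.com", "region": None},
    {"customer_id": "C4", "email": "c@example.com", "region": "West"},
]

report = profile_columns(sample_rows)
for column, metrics in report.items():
    print(column, metrics)
```

Even a small profile like this tends to raise the kind of questions worth taking back to the subject matter experts, such as why a supposed identifier repeats or why a field is often blank.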

Iterative Data Discovery

Once you've assembled your team, they have to be comfortable working iteratively. Today's discovery efforts will lead to more questions, shifts in priorities, new strategies, and probably more issues to overcome. This team needs to collaborate efficiently so all members can learn from each other and participate in discussions around "what should we do next".

Document Discoveries

It's not good enough for the Data Explorers to navigate and learn - they have to use their efforts to document and teach others. This team will need to agree on which data sources to focus on, what areas to document, and what tools or templates to use to develop this documentation.

The combination of these three steps describes who should perform discovery efforts, what their process is, and what their deliverables are. Create milestones for this team to do readouts and provide direction. Decide when you've achieved a "good enough" exploration.

1 comment:

  1. Documenting the value added is important. The bottom-up, decentralized approach is the fastest way to turn data into useful information--especially if the work supports immediate operational needs or provides insight that leads to growth or savings.
