Thursday, May 16, 2013

Transforming to a Big Data Organization

What's Driving your Big Data Transformation Priorities?
What's more important? Big Data infrastructure including cloud computing, storage, Hadoop and other data processing engines? Is it the algorithms used to automate and derive intelligence from the data? Or is it the talent of the data scientists, the visualizations that they develop, the stories that they tell, and the questions that they answer?

If you believe the technology companies marketing Big Data solutions, or a large number of technology media journalists, you will focus on Big Data infrastructure. Organizations that already have a strong practice in analytics and reporting are more likely to focus on improving the talent, tools, and capabilities of their data scientists. Companies that already have a strong software practice and develop algorithms for other strategic needs are more likely to pursue machine learning and other data mining algorithms.

 


 

From my perspective, it appears that organizational strengths and dynamics are driving the priorities in Big Data capability investments. This isn't necessarily a bad thing, and it is probably smart for organizations to invest in their strengths. But it isn't sufficient, and at some point a balanced set of investments and changes over a longer period will likely be required.

In other words, while leveraging your organizational strengths may be an easy on-ramp to Big Data insights, it probably isn't a complete, holistic approach. At some point, many organizations that truly want to become "data driven" must transform or invest in new capabilities.

So in thinking about this transformation, here are some simple questions and guidelines on where to focus efforts:

  • How big is your data? - The bigger the data, the more likely you will need to look at infrastructure to store, process, and manage larger data sets.
  • How fast do you need results? - The more your business derives value from presenting results faster, whether for direct revenue, real time decision making, or competitive advantage, the more likely you will need algorithms that churn through data directly to drive other systems or decisions. Examples include Hadoop and machine learning APIs.
  • How complex is your data? - Complexity comes in many forms. It may be unstructured data, data that has rich relational metadata requiring subject matter expertise, or data that's sparse and has other quality issues. In these situations, organizations will likely look to their subject matter experts and ideally data scientists to help provide insights, but they need to work differently. Data scientists are not glorified spreadsheet jockeys and produce different results than statisticians or BI analysts.

But you can see these are not mutually exclusive capabilities. If your data is big, you certainly are going to need algorithms to process the data and people to interpret the results. If you need fast results, you will need to consider the architecture to best process the data. If your data is complex, eventually you'll want to develop algorithms to help automate some of the results. Then, there are other challenges beyond the volume, variety, and velocity of data, and organizations need to consider how to scale their big data practices. I believe that this transformation is a big opportunity for CIOs looking to bring Big Data intelligence and analytics to their organizations.

So Big Data is not a solution or a capability. It is a transformation journey of developing new capabilities through people, process, and technology.

Wednesday, May 08, 2013

What Data Scientists Can Learn From Moneyball

A colleague asked me about how to get started with data science and how to influence the organization to understand the analytics and utilize it in decision making.

You can get many answers to these questions just by watching Moneyball, the movie that put Data Science on the map. Starting with the basics, organizations need executive sponsors to recognize that utilizing data and analytics in decision making is a game changer. Billy Beane, GM of the Oakland A's and played by Brad Pitt, is that type of sponsor. Second, you need talent and tools. Assistant GM Peter Brand, "Yale, Economics, Baseball", is what the team needed.

The movie doesn't go into the technology, but provides a view into the challenges and organizational changes both men faced in trying to implement their strategy. Below are some memorable quotes:
No. No. Baseball thinking is medieval. They are asking all the wrong questions. And if I say it to anybody, I'm-I'm ostracized. I'm-I'm-I'm a leper
Peter points out that the art of data science is to ask good questions. He also shows, and feels, the difficulty of explaining and selling analytics and data-based conclusions to his peers.

Billy Beane: No. What's the problem?
John Poloni: Same as it's ever been. We've gotta replace these guys with what we have existing.
Billy Beane: No! What's the problem, Barry?
Scout Barry: We need thirty-eight home runs, a hundred twenty R.B.I.'s and forty-seven...
Billy Beane: ... We got to think differently.
Are you analyzing the right metrics? It's not about replacing players or getting wins - it's about scoring runs. Again, if you're asking the wrong questions, it can bring you to the wrong conclusions. And more importantly, the whole organization needs to change the way it thinks.
... Of the 20,000 notable players for us to consider, I believe that there is a championship team of twenty-five people that we can afford, because everyone else in baseball undervalues them. 
This is a classic needle-in-a-haystack data mining problem. It also shows that data scientists need to consider economic factors when performing their analytics, so in this example there are two constraints. It's not just twenty-five people that they can afford; those players must also be talented, which means they must be undervalued by other teams.
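To make the two constraints concrete, here is a minimal sketch of a budget-limited selection. It is not the method used in the movie or by any real team: the `pick_undervalued` function, the runs-per-dollar ranking, and every number below are assumptions made up for illustration, and a greedy pass like this is not guaranteed to find the best possible roster.

```python
def pick_undervalued(players, roster_size, budget):
    """Greedy sketch: take the best runs-per-dollar players that fit the budget."""
    ranked = sorted(players, key=lambda p: p["runs"] / p["salary"], reverse=True)
    roster, spent = [], 0
    for player in ranked:
        if len(roster) == roster_size:
            break
        # Skip anyone who would push total salary over the budget.
        if spent + player["salary"] <= budget:
            roster.append(player["name"])
            spent += player["salary"]
    return roster, spent


# Made-up numbers for illustration only; salary is in millions of dollars.
players = [
    {"name": "A", "runs": 90, "salary": 9},
    {"name": "B", "runs": 70, "salary": 2},
    {"name": "C", "runs": 65, "salary": 1},
    {"name": "D", "runs": 40, "salary": 4},
]
roster, spent = pick_undervalued(players, roster_size=2, budget=5)
```

The point of the sketch is that the highest producer (player A here) is not selected: the ranking rewards production relative to cost, which is what "undervalued" means in this context.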
Major league baseball and its fans, they're gonna be more than happy to throw you and Google boy under the bus if you keep doing what you're doing here. You don't put a team together with a computer, Billy.
This is another reminder that change management is hard. There are a good number of people in the organization who are used to making decisions based on their intuition and experience. In some cases, they may be using data, but they may not be using the right analytics or asking the right questions. When the data scientist comes out with a new perspective, these people will be the first to challenge the analysis, followed by its messenger, and then finally the overall strategy.
I'm saying it doesn't matter what moves I make if you don't play the team the way they're designed to be played.
Bottom line: you can have a great sponsor, smart data scientists, and the right analysis, but if line managers don't utilize the strategy, tools, and data provided to them, then it is all for naught.



Monday, April 29, 2013

Data Visualization Examples

In my last post, I reviewed Five Types of Data Visualizations and broke them down into discovery, quality, storytelling, dashboards/tools and trends/predictive. In this post, I will share some examples of Discovery, Quality, and Dashboard Visualizations. There are too many good examples of Trends/Predictive and Storytelling visualizations, and it was too hard to select one for this post.

Discovery

My Network on LinkedIn

 

LinkedIn provides this fun tool to help its users visualize and navigate their networks. Data discovery tools have to provide mechanisms to visualize large data sets and help identify relationships. This particular tool leverages color, distance relationships, and zoom in/out to help the user find clusters and potential connections.

Quality

There are many tools on the market to help Data Stewards and Data Scientists identify and address data quality issues. My favorite tools help users identify quality issues and drill down into the data by its dimensions to identify causes and fixes. The tool below is featured in Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment.

Quality issues highlighted in this dashboard

Dashboards

Tableau Public has numerous examples of interesting visualizations. The one below is a good example of a Dashboard that lets the user see the visualization, review the detailed data, and use filters to drill into results.


Tuesday, April 23, 2013

Five Types of Data Visualizations

Data visualization tools help Data Scientists explore data, find patterns, and provide organizations tools to leverage data for making decisions. Data Visualization tools stand at the top of my Top Five Tools of Big Data Analytics because other tools such as infrastructure, data quality, semantic engines, predictive analytics and data mining all require good visualizations to demonstrate results. Data visualization tools are also a key way Data Scientists can differentiate themselves.

Data visualization tools are a key component of Self Service Business Intelligence (BI) tools, and I found one of the better overviews of these tools in Gartner's Business Value of Self Service BI. My personal favorite slide is #11, where they classify data dumps in Microsoft Excel and Access as "The Dark Side" of self service BI. See my previous posts on the perils of abusing these tools: Please Stop Creating Microsoft Access Databases, The Problems with Siloed Databases, and Dear Spreadsheet Jockey, Welcome to BigData. The problem with these tools is that they focus too much on data manipulation and too little on providing understanding and insight. If they are abused, they can lead to a data management nightmare.

The Gartner report provides a chart of many of the top tools in this category and I only have experience using several of them. If you're looking for a tool, I would suggest reading, downloading, and experimenting.

Where I can provide some insight is in how Data Scientists should use these tools. Harvard Business just published their Elements of Successful Visualizations covering what makes a successful visualization. The other aspect is to recognize the type of problem being solved with the visualization, the audience, and its life expectancy. Some examples:


  • Data Discovery visualizations are created by Data Scientists to explore the data, understand the various dimensions, and search for patterns. Scientists use a variety of visualizations, filters, calculations, and other tools to look for correlations and determine what the data says and what may be of interest.

  • Data Quality visuals allow the Data Scientist to explore single or groups of dimensions. They will look for data that should be normalized such as "New York", "NY" and "N.Y." or grouped together to form hierarchies or segments. They will investigate sparse dimensions (ones with many blank or null values) or develop logic to merge common dimensions aggregated from multiple data sources.

  • Storytelling visualizations often filter into a specific data set and utilize color, size, and other tools to highlight or provide insight to the reader. A scientist may create storytelling visualizations to show correlations or patterns in the data or to identify outliers. Storytelling often requires some narrative such as text, video, or presentation to help the audience understand the visualization.

  • Dashboards and Tools are developed by Data Scientists as decision making tools for a selected audience. It may be a dashboard for Sales to better understand their pipeline, or a set of reports for Operations to better understand quality and productivity factors. These tools often require the Scientist to develop some documentation or training materials so that the intended audience understands the data and knows how to use the tool.

  • Trends and Predictive visualizations can be used for Storytelling or may be deployed as Dashboards, but often have broader audiences. They demonstrate the collective results of decisions and activities, some of which the audience cannot directly control. In that regard, the Scientist must use the real estate and visual tools to display as much direct and related data as possible so that the audience can have a complete understanding of the trends and predictions.
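The Data Quality checks described in the list above (normalizing values such as "New York", "NY" and "N.Y.", and spotting sparse dimensions) can be sketched in a few lines before any visualization is built. This is only an illustration: the `CITY_ALIASES` table, the function names, and the sample values are hypothetical, and a real data set would need a much larger alias table or a dedicated data quality tool.

```python
from collections import Counter

# Hypothetical alias table: maps cleaned-up spellings to one canonical label.
CITY_ALIASES = {"new york": "New York", "ny": "New York"}


def normalize_city(raw):
    """Return a canonical city label, or None for blank or missing values."""
    if raw is None or not raw.strip():
        return None
    key = raw.replace(".", "").strip().lower()
    # Unknown spellings pass through trimmed so a steward can review them later.
    return CITY_ALIASES.get(key, raw.strip())


def null_ratio(values):
    """Share of values that are missing; a high ratio marks a sparse dimension."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


# Hypothetical raw values for a single "city" dimension.
raw_cities = ["New York", "NY", "N.Y.", "", None, "Boston"]
cities = [normalize_city(c) for c in raw_cities]
counts = Counter(c for c in cities if c is not None)
sparsity = null_ratio(cities)
```

Counts per normalized value and the null ratio per dimension are the kind of numbers a Data Quality visual would then plot, so that the Scientist can see at a glance which dimensions need cleanup.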



Once the Data Scientist understands the audience and the type of problem the visualization will solve, they can then select and utilize the best visual (and often visuals), whether it includes bar graphs, trees, maps or others. There are many articles covering types of visuals, including this Introduction to Data Visualization, but Data Scientists should first recognize their intended audiences and needs before diving into a visual approach.



Wednesday, April 10, 2013

Dark Data - A Business Definition

I mentioned dark data in one of my recent posts covering the issues with creating Microsoft Access Databases and with siloed databases in general. I have a broad definition of dark data, more expansive than some posts that I've found -


All of these definitions are true, but somewhat limited. In What Is Big Data? The Real Challenges Beyond Volume, Velocity and Variety, I provided my definition of Big Data:

Big Data Defined - Big Data is not defined by its data management challenges, but by the organization's capabilities in analyzing the data, deriving intelligence from it, and leveraging it to make forward looking decisions. It should also be defined by the organization's capability in creating new data streams and aggregating them into its data warehouses. 
As such, my definition of Dark Data is fairly expansive:
Dark Data Defined - Dark data is data and content that exists and is stored, but is not leveraged and analyzed for intelligence or used in forward looking decisions. It includes data that is in physical locations or formats that make analysis complex or too costly, or data that has significant data quality issues. It also includes data that is currently stored and can be connected to other data sources for analysis, but the Business has not dedicated sufficient resources to analyze and leverage it. Finally (and this may be debatable), dark data also includes data that currently isn't captured by the enterprise, or data that exists outside of the boundary of the enterprise.
This basically demonstrates three conditions where data could be, but is not, leveraged in Big Data analytics: (i) it exists in a sufficient format, but the Business hasn't leveraged it yet, (ii) it exists, but it is too costly to clean or process, or (iii) it doesn't exist and needs to be captured or acquired.

In some ways, Dark Data is the opposite of Big Data.




Friday, April 05, 2013

The Problems with Siloed Databases Part 2

I received several comments on my last post, Please, Stop Creating Microsoft Access Databases, and thought I'd use today's post to respond to some of them.

  • "It's more often poor planning, lack of knowledge transfer and / or changes of use over time." - I completely agree that this is what makes ongoing database and application support complex. I accept all of these as reality, so the question is: are non-developers or business users developing databases with structures and documentation that simplify changes? If they are performing a one-time data analysis, then maybe this isn't a concern, but databases that will be used and updated over time should be managed by database developers and DBAs who are trained to support enhancements and changes.

  • "The underlying problem here isn't MS Access." - MS Access is unique in that it is widely available to business users, it easily allows saving databases to desktop hard drives that are difficult to administer, and has application functionality such as forms and reports. So yes, one of the underlying problems is MS Access because of its capabilities and how it is deployed.

  • "Can we arrange a process of promotion, where ad hoc dbs get promoted to proper data in due course?" - Yes, this is possible and ideal, but hard to govern and sometimes difficult to staff. It depends on how much database development is in practice and the size of the organization. My policy would read something like:
    • Register all non-IT database development in a directory.
    • Allow databases to be created for one time data analysis, but archive them in three months or less.
    • Prototype databases for single user use, but if multiple users need access or if forms/reports are needed, then the prototype should be transferred to IT so that it can be properly developed and managed.

  • "Non-developers create Access databases because they need to get some work done" - I agree, and relying on IT isn't always the answer. However, most non-developers don't have an objective to create a database - they are usually looking to develop a workflow or to perform some analytics or reporting and realize they need a database to store the data. To that end, I think it is better for IT organizations to provide "self service" tools to manage departmental workflows (see my post: In my CIO toolkit), or tools for self service analytics/reporting.

  • "It turns out it is about dark data and how organizations should better consider their enterprise data handling." "Dark data can be a problem." - Indeed, that is really what the post is about. My definition of dark data is "Data that isn't documented or easily understood, data that can't easily be connected to other data sources, or data that can't easily be used in analytics." So when you have poorly planned, siloed databases, this is a dark data issue.
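The three-rule promotion policy proposed in the list above (register, archive one-time analysis databases, transfer shared prototypes to IT) is simple enough to check mechanically once a registry exists. The sketch below is one possible shape for that check, not an existing tool: the `RegisteredDatabase` fields, the 90-day approximation of "three months", and the sample entries are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# "Three months or less" is approximated here as 90 days (an assumption).
ARCHIVE_AFTER = timedelta(days=90)


@dataclass
class RegisteredDatabase:
    """One entry in a hypothetical directory of non-IT databases."""
    name: str
    owner: str
    created: date
    one_time_analysis: bool
    user_count: int = 1
    has_forms_or_reports: bool = False


def recommended_action(db, today):
    """Apply the policy rules to one registry entry."""
    # Shared use, or forms/reports, means the prototype belongs with IT.
    if db.user_count > 1 or db.has_forms_or_reports:
        return "transfer to IT"
    # One-time analysis databases are due for archiving once they age out.
    if db.one_time_analysis and today - db.created >= ARCHIVE_AFTER:
        return "archive"
    return "keep as prototype"


# Hypothetical registry entries.
registry = [
    RegisteredDatabase("q1_pricing", "finance", date(2013, 1, 2), True),
    RegisteredDatabase("lead_tracker", "sales", date(2013, 3, 1), False, user_count=4),
    RegisteredDatabase("survey_scratch", "marketing", date(2013, 3, 20), True),
]
today = date(2013, 4, 5)
actions = {db.name: recommended_action(db, today) for db in registry}
```

The hard part of the policy is still governance, that is, getting every database registered in the first place. The code only shows that once the directory exists, the follow-up decisions are cheap to automate.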

Monday, April 01, 2013

Please, Stop Creating Microsoft Access Databases!

It all starts very simply and innocently with someone needing a place to store data that is a little bit more than what is convenient to store in Microsoft Excel. She thinks, "It's just a couple of tables and I already have MS Access on my desktop, so this shouldn't be too hard." The bad news is that if this database is "successful", it will likely draw others to it, forcing the SadBA (self-appointed database business analyst) to consider granting access to her desktop stored database, developing forms, and producing reports. Even worse is when new opportunities present themselves and she decides to create additional MS Access databases. She only calls in IT if she needs something scripted, such as more advanced forms or jobs that can load and transform new data.

Flash forward a few years, consider this behavior repeated across multiple organizations and locations, and you have a classic database mess. IT will probably be asked to perform heroics when a desktop fails and there isn't a sufficient backup, or when there is an MS Office upgrade being planned and these databases need testing, or when the SadBA is leaving the company and no one understands how to support these databases.

As big a database mess as this is, the underlying data mess can be a daunting maze to unwind. Consider even a single database: a trained DBA would need to understand the underlying data model, document any scripts or procedures loading data, and itemize reporting needs. If any forms were developed, and especially if multiple people are using the database as part of a workflow, then you'll need a Business Analyst and possibly an Application Developer to consider how these business processes are accomplished.

Perhaps you've never had to read someone else's code?


Rebuilding a database when it likely has poor naming conventions, missing data relationships, and a complete lack of referential integrity requires a DBA with the skills of a linguistic anthropologist. Now tell this DBA that there are multiple databases that contain duplicate and related data, and they'll need some special software tools to normalize the data model, load in data from multiple sources, and match, merge and de-duplicate records before even considering how to replicate existing functionality.
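To give a feel for what "match, merge and de-duplicate" involves, here is a deliberately small sketch. It assumes records can be matched on a normalized email address, which is a simplification; real tools typically rely on fuzzy matching across several fields. The function names and sample rows are hypothetical.

```python
def match_key(record):
    """Build a crude match key from a normalized email address."""
    return record["email"].strip().lower()


def merge_records(records):
    """Merge records sharing a match key; earlier non-empty values win."""
    merged = {}
    for record in records:
        target = merged.setdefault(match_key(record), {})
        for field, value in record.items():
            # Only fill fields that are still empty in the merged record.
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


# Hypothetical rows pulled from two separate departmental databases.
sales_db = [{"email": "Ann@Example.com", "name": "Ann Lee", "phone": ""}]
support_db = [
    {"email": "ann@example.com ", "name": "", "phone": "555-0100"},
    {"email": "bob@example.com", "name": "Bob Ray", "phone": ""},
]

customers = merge_records(sales_db + support_db)
```

Even in this toy case the merged customer only becomes complete by combining both sources, and the rule "earlier non-empty values win" is a business decision someone has to make and document for every field.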


Why is this a Big Concern?


Even smaller companies are recognizing the benefits of analytics and Big Data processing. It's relatively easy for a business user to perform analysis on a single data source, or even a handful if the data relationships are understood. This can easily be done in MS Excel or even better, by selecting and correctly leveraging a self service BI tool. But if there are numerous databases stored all over the place with undocumented data dictionaries, unknown data quality, and little understanding of how to relate data sources, then it is virtually impossible to perform broad analytics on it. It is part of the company's dark data - data that exists but can't easily be analyzed for intelligence or insight.

Is this your company's sales data, customer data, marketing data, or financial data? More than likely, the answer is yes, because it's this data that business users work with the most. If the business user needed to perform a quick analysis and IT wasn't accessible, wasn't available, or lacked the agility to deliver a solution, then it is likely that a SpreadSheet Jockey or a SadBA established one.

What is the first step to solving this issue? Please, stop creating MS Access Databases!
