Monday, June 17, 2013

The Internet of Things Will Deliver Big Data's Promise

The Internet of Things
I have mixed success as a futurist, but I'm fairly convinced that the Internet of Things and Machine to Machine technologies will be one of the next waves of major technology advances.

My track record as a futurist is limited and based on personal experience. In the mid '90s, when Web 1.0 became the next platform for media, I joined a SaaS startup that helped newspapers develop new digital revenue streams from their editorial and classified ad content. When Web 2.0 and social media technologies made it easier for users to share information, my partner and I developed TripConnect, a travel site for sharing travel reviews and questions with your connections and groups, believing that the more personal experience would gain followers and generate better leads for travel agencies. Social is not just for personal relationships, so I joined BusinessWeek to develop Business Exchange, a website for sharing articles under business topics, all prioritized by user activity.

At McGraw Hill Financial, I continue to bring a startup culture, innovation, and an entrepreneurial mindset to its businesses by transforming IT using a structured Agile Planning and Development practice. At McGraw Hill Construction, we've developed a new set of Big Data Analytics capabilities and products that let customers (largely building product manufacturers, general contractors, and subcontractors) size their market, target relationships, and prioritize prospects.

I also have some misses. I thought cloud computing was just glorified hosting. My initial impression of the iPad was that it was an iPhone with a larger screen.

IOT Predictions


But when SearchCIO-Midmarket recently asked me to predict the next big technology, I responded with the Internet of Things. Several advancements are making this happen, including:
  • The cost of off-the-shelf smart, network-enabled chips and starter boards has dropped, and engineers have options. Examples range from Broadcom's BCM4390 chip to Ayla Networks' IOT starter kit.
  • There are many options to connect devices to cloud-based data collection services such as ThingWorx, OpenIOT, Osiot, and ThingSpeak.
  • The Machine to Machine standards and development tools are improving, including Konetki and OMA-DM.
  • Big Data platforms are available to process the data, and expertise is growing in how to develop algorithms and analytics from the data collected.
  • IOT applications span multiple industries and domains, everything from wearable computing, to health care, to smarter cities.
This is my intro post to this topic. So while I've covered areas including Agile, CIO Advice, Innovation, Organizational change, Enterprise 2.0, and BigData, today I add IOT to the list.

Thursday, May 16, 2013

Transforming to a Big Data Organization

What's Driving your Big Data Transformation Priorities?
What's more important? Big Data infrastructure including cloud computing, storage, Hadoop and other data processing engines? Is it the algorithms used to automate and derive intelligence from the data? Or is it the talent of the data scientists, the visualizations that they develop, the stories that they tell, and the questions that they answer?

If you believe the technology companies marketing Big Data solutions, or a large number of technology media journalists, you will focus on Big Data infrastructure. Organizations that already have a strong practice in analytics and reporting are more likely to focus on improving the talent, tools, and capabilities of their data scientists. Companies that already have a strong software practice and develop algorithms for other strategic needs are more likely to pursue machine learning and other data mining algorithms.

 


From my perspective, it appears that organizational strengths and dynamics are driving the priorities in Big Data capability investments. This isn't necessarily a bad thing, and it is probably smart for organizations to invest in their strengths. But it isn't sufficient, and at some point a balanced set of investments and changes over a longer period of activity will likely be required.

In other words, while leveraging your organizational strengths may be an easy on-ramp to Big Data insights, it probably isn't a complete, holistic approach. At some point, many organizations that truly want to become "data driven" must transform or invest in new capabilities.

So in thinking about this transformation, here are some simple questions and guidelines on where to focus efforts:

  • How big is your data? - The bigger the data, the more likely you will need to look at infrastructure to store, process, and manage larger data sets.
  • How fast do you need results? - The more value your business derives from presenting results faster, whether for direct revenue, real-time decision making, or competitive advantage, the more likely you will need algorithms that process data directly to drive other systems or decisions. Examples include processing on Hadoop and machine learning APIs.
  • How complex is your data? - Complexity comes in many forms. It may be unstructured data, data that has rich relational metadata requiring subject matter expertise, or data that's sparse and has other quality issues. In these situations, organizations will likely look to their subject matter experts, and ideally data scientists, to help provide insights, but they need to work differently. Data scientists are not glorified spreadsheet jockeys, and they produce different results than statisticians or BI analysts.

But you can see these are not mutually exclusive capabilities. If your data is big, you certainly are going to need algorithms to process the data and people to interpret the results. If you need fast results, you will need to consider the architecture to best process the data. If your data is complex, eventually you'll want to develop algorithms to help automate some of the results. Then, there are other challenges beyond the volume, variety, and velocity of data, and organizations need to consider how to scale their big data practices. I believe that this transformation is a big opportunity for CIOs looking to bring Big Data intelligence and analytics to their organizations.

So Big Data is not a solution or a capability. It is a transformation journey of developing new capabilities through people, process, and technology.

Wednesday, May 08, 2013

What Data Scientists Can Learn From Moneyball

A colleague asked me how to get started with data science and how to influence the organization to understand analytics and utilize them in decision making.

You can get many answers to these questions just by watching Moneyball, the movie that put Data Science on the map. Starting with the basics, organizations need executive sponsors who recognize that utilizing data and analytics in decision making is a game changer. Billy Beane, GM of the Oakland A's (played by Brad Pitt), is that type of sponsor. Second, you need talent and tools. Assistant GM Peter Brand ("Yale, Economics, Baseball") is what the team needed.

The movie doesn't go into the technology, but provides a view into the challenges and organizational changes both men faced in trying to implement their strategy. Below are some memorable quotes:
No. No. Baseball thinking is medieval. They are asking all the wrong questions. And if I say it to anybody, I'm-I'm ostracized. I'm-I'm-I'm a leper
Peter points out that the art of data science is to ask good questions. He also shows, and feels, the difficulty of explaining and selling analytics and data-based conclusions to his peers.

Billy Beane: No. What's the problem?
John Poloni: Same as it's ever been. We've gotta replace these guys with what we have existing.
Billy Beane: No! What's the problem, Barry?
Scout Barry: We need thirty-eight home runs, a hundred twenty R.B.I's and forty seven...
Billy Beane: ... We got to think differently.
Are you analyzing the right metrics? It's not about replacing players or getting wins - it's about scoring runs. Again, if you're asking the wrong questions, it can bring you to the wrong conclusions. And more importantly, the whole organization needs to change the way it thinks.
... Of the 20,000 notable players for us to consider, I believe that there is a championship team of twenty-five people that we can afford, because everyone else in baseball undervalues them. 
This is a classic needle-in-a-haystack data mining problem. It also shows that data scientists need to consider economic factors when performing their analytics, so in this example there are two constraints. It's not just twenty-five people that they can afford; they also must be talented, and therefore must be undervalued by other teams.
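
The two constraints above (affordable and undervalued) can be sketched as a simple ranking problem. This is only an illustration: the `Player` fields, the runs-per-dollar measure of "undervalued", and the sample numbers are all assumptions made up for the example, and the greedy pass is a heuristic, not the method used by the A's.

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    salary: float          # cost to the team, assumed > 0 (hypothetical units)
    projected_runs: float  # hypothetical estimate of runs contributed

def pick_roster(players, budget, roster_size=25):
    """Greedy heuristic: take the best runs-per-dollar players that still fit the budget."""
    ranked = sorted(players, key=lambda p: p.projected_runs / p.salary, reverse=True)
    roster, spent = [], 0.0
    for p in ranked:
        if len(roster) == roster_size:
            break
        if spent + p.salary <= budget:
            roster.append(p)
            spent += p.salary
    return roster, spent

# Made-up sample data, not real statistics.
players = [
    Player("A", 8.0, 80),
    Player("B", 1.0, 40),
    Player("C", 2.0, 50),
    Player("D", 0.5, 30),
]
roster, spent = pick_roster(players, budget=4.0, roster_size=3)
picked_names = [p.name for p in roster]
```

The expensive star "A" produces the most runs but the fewest runs per dollar, so the sketch passes over it in favor of three cheaper players who fit the budget together.
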
Major league baseball and its fans they're gonna be more than happy to throw you and Google boy under the bus if you keep doing what you're doing here. You don't put a team together with a computer, Billy.
This is another reminder that change management is hard. There are a good number of people in the organization who are used to making decisions based on their intuition and experience. In some cases, they may be using data, but they may not be using the right analytics or asking the right questions. When the data scientist comes out with a new perspective, they will be the first to challenge the analysis, then its messenger, and finally the overall strategy.
I'm saying it doesn't matter what moves I make if you don't play the team the way they're designed to be played.
Bottom line is you can have a great sponsor, smart data scientists, and the right analysis, but if line managers don't utilize the strategy, tools, and data provided to them, then it is all for naught.



Monday, April 29, 2013

Data Visualization Examples

In my last post, I reviewed Five Types of Data Visualizations and broke them down into discovery, quality, storytelling, dashboards/tools, and trends/predictive. In this post, I will share some examples of Discovery, Quality, and Dashboard Visualizations. There are too many good examples of Trends/Predictive and Storytelling visualizations, and it was too hard to select one for this post.

Discovery

My Network on LinkedIn

 

LinkedIn provides this fun tool to help its users visualize and navigate their networks. Data discovery tools have to provide mechanisms to visualize large data sets and help identify relationships. This particular tool leverages color, distance relationships, and zoom in/out to help the user find clusters and potential connections.

Quality

There are many tools on the market to help Data Stewards and Data Scientists identify and address data quality issues. My favorite tools help users identify quality issues and drill down into the data by its dimensions to identify causes and fixes. The tool below is featured in Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment.

Quality issues highlighted in this dashboard

Dashboards

Tableau Public has numerous examples of interesting visualizations. The one below is a good example of a Dashboard that lets the user see the visualization, review the detailed data, and use filters to drill into results.


Tuesday, April 23, 2013

Five Types of Data Visualizations

Data visualization tools help Data Scientists explore data, find patterns, and provide organizations with tools to leverage data for making decisions. Data Visualization tools stand at the top of my Top Five Tools of Big Data Analytics because other tools such as infrastructure, data quality, semantic engines, predictive analytics, and data mining all require good visualizations to demonstrate results. Data visualization tools are also a key way Data Scientists can differentiate themselves.

Data visualization tools are a key component of Self Service Business Intelligence (BI) tools, and I found one of the better overviews of these tools in Gartner's Business Value of Self Service BI. My personal favorite slide is #11, where they classify data dumps in Microsoft Excel and Access as "The Dark Side" of self service BI. See my previous posts on the perils of abusing these tools: Please Stop Creating Microsoft Access Databases, The Problems with Siloed Databases, and Dear Spreadsheet Jockey, Welcome to BigData. The problem with these tools is that they focus too much on data manipulation and too little on providing understanding and insight. If they are abused, they can lead to a data management nightmare.

The Gartner report provides a chart of many of the top tools in this category and I only have experience using several of them. If you're looking for a tool, I would suggest reading, downloading, and experimenting.

Where I can provide some insight is in how Data Scientists use these tools. Harvard Business just published their Elements of Successful Visualizations, covering what makes a successful visualization. The other aspect is to recognize the type of problem being solved with the visualization, the audience, and its life expectancy. Some examples:


  • Data Discovery visualizations are created by Data Scientists to explore the data, understand the various dimensions, and search for patterns. Scientists use a variety of visualizations, filters, calculations, and other tools to look for correlations and determine what the data says and what may be of interest.

  • Data Quality visuals allow the Data Scientist to explore single or groups of dimensions. They will look for data that should be normalized such as "New York", "NY" and "N.Y." or grouped together to form hierarchies or segments. They will investigate sparse dimensions (ones with many blank or null values) or develop logic to merge common dimensions aggregated from multiple data sources.

  • Storytelling visualizations often filter into a specific data set and utilize color, size, and other tools to highlight or provide insight to the reader. A scientist may create storytelling visualizations to show correlations or patterns in the data or to identify outliers. Storytelling often requires some narrative such as text, video, or presentation to help the audience understand the visualization.

  • Dashboards and Tools are developed by Data Scientists as decision making tools for a selected audience. It may be a dashboard for Sales to better understand their pipeline, or a set of reports for Operations to better understand quality and productivity factors. These tools often require the Scientist to develop some documentation or training materials so that the intended audience understands the data and knows how to use the tool.

  • Trends and Predictive visualizations can be used for Storytelling or may be deployed as Dashboards, but often have broader audiences. They demonstrate the collective results of decisions and activities, some of which the audience cannot directly control. In that regard, the Scientist must use the real estate and visual tools to display as much direct and related data as possible so that the audience can have a complete understanding of the trends and predictions.
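
The Data Quality checks described in the list above (normalizing variants such as "New York", "NY", and "N.Y.", and spotting sparse dimensions) can be sketched in a few lines before any visualization is built. This is a minimal sketch, assuming a hand-maintained alias table and a list of dictionaries as the data set; the names `CITY_ALIASES`, `normalize_city`, and `sparsity` and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical alias table: known variants mapped to one canonical label.
CITY_ALIASES = {
    "new york": "New York",
    "ny": "New York",
    "n.y.": "New York",
}

def normalize_city(raw):
    """Return the canonical label for a raw value, or None if it is blank."""
    if raw is None or not raw.strip():
        return None
    cleaned = raw.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)

def sparsity(records, field):
    """Fraction of records whose field is missing, None, or blank."""
    if not records:
        return 0.0
    blanks = sum(1 for r in records if not str(r.get(field) or "").strip())
    return blanks / len(records)

# Made-up sample records for illustration only.
records = [
    {"city": "New York", "segment": "contractor"},
    {"city": "NY", "segment": ""},
    {"city": "N.Y.", "segment": None},
    {"city": "Boston"},
]

city_counts = Counter(normalize_city(r.get("city")) for r in records)
segment_sparsity = sparsity(records, "segment")
```

Here the three spellings collapse into one "New York" bucket, and the `segment` dimension shows up as mostly blank, which is the kind of signal a quality visualization would surface for further investigation.
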



Once the Data Scientist understands the audience and the type of problem the visualization will solve, they can then select and utilize the best visual (and often visuals), whether it includes bar graphs, trees, maps, or others. There are many articles covering types of visuals, including this Introduction to Data Visualization, but Data Scientists should first recognize their intended audiences and needs before diving into a visual approach.



Wednesday, April 10, 2013

Dark Data - A Business Definition

I mentioned dark data in one of my recent posts covering the issues with creating Microsoft Access Databases and with siloed databases in general. My definition of dark data is broad, and more expansive than some posts that I've found:


All of these definitions are true, but somewhat limited. In What Is Big Data? The Real Challenges Beyond Volume, Velocity and Variety, I provided my definition of Big Data:

Big Data Defined - Big Data is not defined by its data management challenges, but by the organization's capabilities in analyzing the data, deriving intelligence from it, and leveraging it to make forward looking decisions. It should also be defined by the organization's capability in creating new data streams and aggregating them into its data warehouses. 
As such, my definition of Dark Data is fairly expansive:
Dark Data Defined - Dark data is data and content that exists and is stored, but is not leveraged and analyzed for intelligence or used in forward looking decisions. It includes data that is in physical locations or formats that make analysis complex or too costly, or data that has significant data quality issues. It also includes data that is currently stored and can be connected to other data sources for analysis, but that the Business has not dedicated sufficient resources to analyze and leverage. Finally (and this may be debatable), dark data also includes data that currently isn't captured by the enterprise, or data that exists outside of the boundary of the enterprise.
This basically demonstrates three conditions where data could be, but is not leveraged in Big Data analytics: (i) It could exist in a sufficient format, but the Business hasn't leveraged it yet, (ii) it exists, but it is too costly to clean or process, or (iii) it doesn't exist and it needs to be captured or acquired.
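
The three conditions above can be sketched as a small triage function for a data inventory. This is only an illustration under stated assumptions: the `DataSource` fields, the enum names, and the sample inventory are hypothetical, and a real assessment of "too costly" would need more than a single flag.

```python
from dataclasses import dataclass
from enum import Enum

class DarkDataCondition(Enum):
    NOT_DARK = "already leveraged in analytics"
    UNLEVERAGED = "(i) sufficient format, but not yet leveraged"
    TOO_COSTLY = "(ii) exists, but too costly to clean or process"
    NOT_CAPTURED = "(iii) does not exist yet; must be captured or acquired"

@dataclass
class DataSource:
    name: str
    captured: bool           # does the enterprise hold this data at all?
    usable_format: bool      # is it clean and accessible enough to analyze?
    used_in_analytics: bool  # is it already feeding decisions?

def classify(source: DataSource) -> DarkDataCondition:
    """Map one inventory entry to one of the three dark data conditions."""
    if not source.captured:
        return DarkDataCondition.NOT_CAPTURED
    if not source.usable_format:
        return DarkDataCondition.TOO_COSTLY
    if not source.used_in_analytics:
        return DarkDataCondition.UNLEVERAGED
    return DarkDataCondition.NOT_DARK

# Made-up inventory for illustration only.
inventory = [
    DataSource("crm_exports", captured=True, usable_format=True, used_in_analytics=False),
    DataSource("scanned_contracts", captured=True, usable_format=False, used_in_analytics=False),
    DataSource("partner_sensor_feed", captured=False, usable_format=False, used_in_analytics=False),
    DataSource("sales_warehouse", captured=True, usable_format=True, used_in_analytics=True),
]
report = {s.name: classify(s) for s in inventory}
```

Checking the conditions in this order (captured, then usable, then used) means each source lands in exactly one bucket, which makes it easier to decide whether the next step is acquisition, cleanup, or simply assigning analysts.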

In some ways, Dark Data is the opposite of Big Data.




Friday, April 05, 2013

The Problems with Siloed Databases Part 2

I received several comments on my last post, Please, Stop Creating Microsoft Access Databases, and thought I'd use today's post to respond to some of them.

  • "It's more often poor planning, lack of knowledge transfer and / or changes of use over time." - I completely agree that this is what makes ongoing database and application support complex. I accept all of these as reality, so the question is: are non-developers or business users developing databases with structures and documentation that simplify changes? If they are performing a one-time data analysis, then maybe this isn't a concern, but databases that will be used and updated over time should be managed by database developers and DBAs who are trained to support enhancements and changes.

  • "The underlying problem here isn't MS Access." - MS Access is unique in that it is widely available to business users, it easily allows saving databases to desktop hard drives that are difficult to administer, and has application functionality such as forms and reports. So yes, one of the underlying problems is MS Access because of its capabilities and how it is deployed.

  • "Can we arrange a process of promotion, where ad hoc dbs get promoted to proper data in due course?" - Yes, this is possible and ideal, but hard to govern and sometimes difficult to staff. It depends on how much database development is in practice and the size of the organization. My policy would read something like:
    • Register all non-IT database development in a directory.
    • Allow databases to be created for one-time data analysis, but archive them in three months or less.
    • Prototype databases for single user use, but if multiple users need access or if forms/reports are needed, then the prototype should be transferred to IT so that it can be properly developed and managed.

  • "Non-developers create Access databases because they need to get some work done" - I agree, and relying on IT isn't always the answer. However, most non-developers don't have an objective to create a database - they are usually looking to develop a workflow or to perform some analytics or reporting and realize they need a database to store the data. To that end, I think it is better for IT organizations to provide "self service" tools to manage departmental workflows (see my post: In my CIO toolkit), or tools for self service analytics/reporting.

  • "It turns out it is about dark data and how organizations should better consider their enterprise data handling." "Dark data can be a problem." - Indeed, that is really what the post is about. My definition of dark data is "Data that isn't documented or easily understood, data that can't easily be connected to other data sources, or data that can't easily be used in analytics." So when you have poorly planned, siloed databases, this is a dark data issue.
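
The three-rule promotion policy listed above (register, archive one-time analyses within three months, hand shared or form-driven prototypes to IT) could be checked automatically against a registry. This is a minimal sketch, assuming a simple in-memory directory; `RegisteredDatabase`, `policy_actions`, the purpose labels, and the 90-day approximation of "three months" are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# "Three months or less" is approximated here as 90 days (an assumption).
ARCHIVE_AFTER = timedelta(days=90)

@dataclass
class RegisteredDatabase:
    name: str
    owner: str
    created: date
    purpose: str                 # "one_time_analysis" or "prototype" (hypothetical labels)
    user_count: int = 1
    has_forms_or_reports: bool = False

def policy_actions(db, today):
    """Return the policy actions that are due for one registered database."""
    actions = []
    if db.purpose == "one_time_analysis" and today - db.created >= ARCHIVE_AFTER:
        actions.append("archive")
    if db.purpose == "prototype" and (db.user_count > 1 or db.has_forms_or_reports):
        actions.append("transfer_to_it")
    return actions

# Made-up registry entries for illustration only.
directory = [
    RegisteredDatabase("q1_pricing", "analyst_a", date(2013, 1, 2), "one_time_analysis"),
    RegisteredDatabase("lead_tracker", "sales_ops", date(2013, 3, 1), "prototype", user_count=3),
    RegisteredDatabase("scratch_model", "analyst_b", date(2013, 3, 20), "prototype"),
]
today = date(2013, 4, 5)
due = {db.name: policy_actions(db, today) for db in directory}
```

Even a lightweight registry like this makes the policy enforceable: the directory itself satisfies the first rule, and the other two become a periodic report rather than something IT has to discover by accident.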