What Are the Seven Types of Big Data Debt?

As more organizations have adopted agile software development over the last five years as part of digital transformation programs, the term technical debt has become more widely understood. Teams that develop code leave behind artifacts that require improvements, reengineering, refactoring, or wholesale rewriting.

Some technical debt is incurred purposefully to deliver applications faster, while other forms emerge over time with age and increased usage. Developers focus on fixing technical debt in their code. At the same time, CIOs often apply the term at a macro level to include legacy systems, monolithic applications, or any technology that needs an architecture upgrade.

But what about data?

More organizations are trying to be data-driven, investing in self-service BI programs, or seeking competitive advantages through machine learning and AI. How much does enterprise data debt inhibit accurate, reliable, and unbiased analytics?

How Should Organizations Define Big Data Debt?

Let's try to define some types of big data debt:

  1. Dark data is data that hasn't been cataloged or thoroughly analyzed. It is the lowest form of big data debt because it represents data mostly unknown and unused in analytics or machine learning experiments.
  2. Dirty data is data that has known and often unknown data quality issues. This includes fields that need to be normalized or are missing values across a large percentage of data. Data quality and profiling tools are potential options for cleansing dirty data.
  3. Duplicate data comes from data sources that are full or partial copies of primary data sources. They represent everything from spreadsheets to database temporary tables that were created for single or temporary use and often have derivative data added to them. 
  4. Murky data is complex data that is only used by a small number of business analysts, data scientists, or citizen data scientists in the organization. It is not well documented or understood, so subject matter experts are required to interpret the data before it can be applied accurately. Building data catalogs and defining data dictionaries are two ways to address murky data.
  5. Dysfunctional data occurs when the current data management tools are the wrong ones, or when the data is too poorly structured to enable optimal use in applications, data analytics, or machine learning. Data stored in file systems, unstructured data stored as CLOB database fields, and media data that hasn't been tagged are all examples.
  6. Unsecured data includes data that hasn't been classified, hasn't been scanned for privacy and other controls, or lacks well-defined policies on who can use it and for what purposes. It also includes data that isn't adequately encrypted or that has other unresolved data security issues.
  7. Masterless data is data that is not joined or connected to enterprise master data sources on customers, products, and other primary business entities. It prevents creating full customer data profiles and other analyses that benefit from having a central master source and connected data attributes.

Most data sources and repositories will suffer from more than one type of big data debt. And like technical debt, big data debt is difficult to quantify or even measure. Measures of data management issues and debt should ideally quantify the business impact, which is nontrivial to define and compute.
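
Business impact is the hard part, but rough technical indicators can at least start the conversation. The sketch below is only an illustration, not a standard method: it assumes records arrive as simple dictionaries, and the function name `debt_indicators`, the field names, and the sample data are all hypothetical. It estimates how much of a small dataset looks dirty (missing required values), duplicated (exact copies), or masterless (keys that don't match a master customer list).

```python
from collections import Counter


def debt_indicators(records, required_fields, key_field, master_keys):
    """Return rough, illustrative indicators for three types of data debt."""
    total = len(records)
    if total == 0:
        return {"dirty_rate": 0.0, "duplicate_rate": 0.0, "masterless_rate": 0.0}

    # Dirty: a record is counted when any required field is missing or blank.
    dirty = sum(
        1
        for r in records
        if any(r.get(f) is None or str(r.get(f)).strip() == "" for f in required_fields)
    )

    # Duplicate: every record whose full contents repeat an earlier record.
    fingerprints = Counter(
        tuple(sorted((k, str(v)) for k, v in r.items())) for r in records
    )
    duplicates = sum(count - 1 for count in fingerprints.values())

    # Masterless: records whose key is absent from the master source.
    masterless = sum(1 for r in records if r.get(key_field) not in master_keys)

    return {
        "dirty_rate": dirty / total,
        "duplicate_rate": duplicates / total,
        "masterless_rate": masterless / total,
    }


# Hypothetical sample data, for illustration only.
records = [
    {"customer_id": "C1", "email": "a@example.com", "country": "US"},
    {"customer_id": "C1", "email": "a@example.com", "country": "US"},  # exact copy
    {"customer_id": "C2", "email": "", "country": "US"},               # blank email
    {"customer_id": "C9", "email": "z@example.com", "country": None},  # not in master
]
master_customer_ids = {"C1", "C2", "C3"}

indicators = debt_indicators(
    records, ["email", "country"], "customer_id", master_customer_ids
)
print(indicators)
```

Rates like these say nothing about business impact on their own, and a single record can count toward more than one indicator. They are a starting point for asking which defects actually matter to the business.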

Why Classify Big Data Debt?

To better address big data debt, organizations need to adopt nomenclature that business leaders understand and that separates the problem from the solution. We shouldn't be prescribing data governance, standardizing data visualizations, defining data dictionaries, adopting data catalogs, instituting data quality procedures, or investing in master data management - all solutions - until business leaders better understand the types of problems and their business impacts.

There are a few reasons:

  • Requires an agile approach - Addressing big data debt is an iterative process, as addressing one set of issues often exposes new ones. Business leaders need to be long-term sponsors of remediations, and so they need a firm understanding of the underlying problems.
  • Tech solutions require problem statements - Data technology solutions often have overlapping capabilities, so it's best to identify the business-impacting problem areas before shopping for solutions.
  • Addressing data debt requires change management - Addressing the issues goes beyond tools! It requires assigning new responsibilities and training employees. Establishing cross-functional teams that follow agile methodologies to address big data debt is a best practice. These people and teams benefit from having a common language on problem types, impacts, and solutions.
  • Fixing requires business process changes - Addressing big data debt often requires changes in the business processes and tools used to collect or create data. Forms need validations, data integrations need exception handling, and business processes require metrics that illustrate data defects. On a large scale, this is an organizational change that requires educating a larger number of people.
  • Gaining traction with leaders - As you can see from the chart below, "data debt" queries are not on anyone's radar. Yet. And yet addressing data debt is critical to becoming a data-driven organization or winning at AI experimentation.

What's your plan to address big data debt? Please reach out to me if you'd like to discuss!





About Isaac Sacolick

Isaac Sacolick is President of StarCIO, a technology leadership company that guides organizations on building digital transformation core competencies. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO, a digital transformation influencer, and has over 900 articles published at InfoWorld, CIO.com, his blog Social, Agile, and Transformation, and other sites. You can find him sharing new insights @NYIke on Twitter, his Driving Digital Standup YouTube channel, or during the Coffee with Digital Trailblazers.