A Global Data Panacea

It’s hard to create corporate or master data strategies when every user group you talk to has a different idea of what “good” should look like from their perspective.

July 24, 2024

Data Science and Digital Engineering


What is today's data panacea? In my many decades of work in the data disciplines, dozens of technologies have emerged claiming they will "fix" bad data. Creating a data solution that works for everyone is incredibly difficult. The trouble comes, of course, when you think about “data” both tactically and strategically. It’s hard to create corporate or master data strategies when every user group you talk to has a different idea of what “good” should look like from their perspective.

Attempting to satisfy many diverse expectations often produces a solution that pleases no one, confuses everyone, and leads to poor decision making and increased risk.

Decades ago, relational databases did a great job of keeping data structures aligned, and they remain a solid workhorse for data management, even if newcomers find them unexciting "grandma's technology." A mere four decades ago, we (a pretty ubiquitous "we") thought that maybe, if we all shared the same container design, all of our data woes would vanish. Not so, alas.

From relational containers, we leapt with increasing vigor (desperation?) to NoSQL and NewSQL technologies. The problem stayed the same, and data quality and integrity became even worse. Will a rinse and repeat with JSON, YAML, Avro, Parquet, or Optimized Row Columnar (ORC) solve this problem? Technology is, of course, not magic, so success hinges on other factors.

Shared container designs expose, but don’t resolve, differences in semantics or content. Through the years, we discovered (but sadly didn’t learn the lesson) that data problems are not solved by a “standard format,” regardless of the technology the data container format uses.
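To make the point concrete, here is a minimal sketch, not drawn from any real system. The field names, the records, and the `conforms` helper are all hypothetical. It assumes two systems that share one container layout but disagree on what a field means.

```python
# Hypothetical shared container design: field names and their types.
REQUIRED_FIELDS = {"item_id": str, "length": float}

def conforms(record: dict) -> bool:
    """Check structure only: the expected field names and value types."""
    return (
        record.keys() == REQUIRED_FIELDS.keys()
        and all(isinstance(record[name], ftype)
                for name, ftype in REQUIRED_FIELDS.items())
    )

# Made-up records for the same physical item.
# Assumption: system A stores length in feet, system B in meters.
record_a = {"item_id": "X-001", "length": 9.84}
record_b = {"item_id": "X-001", "length": 3.0}

both_valid = conforms(record_a) and conforms(record_b)
values_agree = record_a["length"] == record_b["length"]

print(both_valid)    # True: both records fit the shared container
print(values_agree)  # False: the container says nothing about units or meaning
```

Both records pass the structural check, yet neither the format nor the check can say which "length" is right or even that the two are comparable. That knowledge lives with the people and processes around the data.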

An assumption of success grounded in the use of data container technology is pure fallacy. "We" know that but keep trying again and again with pretty much the same result.

"We" have tried methodologies. Data warehouses, data lakes, data lake houses, and more have been tried. All have the potential to provide significant benefits but only when commensurate attention is spent on the data being inserted into the system.

The trouble, of course, is that data requires constant attention to stay good, something budgets and budgeters don't like to see as repeating line items year after year. An assumption of success grounded in the use of methodology is a fallacy. Methodology is part of the equation but can’t stand alone.

We tried data mastering, data governance, and data quality. These are all extremely powerful methods. However, the majority of implementations were based on software tools (technology) in the hope that the tools alone would make the data “good.” Many implementations struggled with the people, culture, and process aspects of these methods. It is people (hard work) and process (solid thinking, planning, and execution) that spell success, not software.
