Data is Never Clean

Lixar I.T. Tech

Here’s the dirty little secret to Data Science: it’s challenging … and it’s not.


With teachers, time, and help, the data math part “makes sense.” With the multitude of online courses, it has never been easier to learn how data works. When the data is organized, it’s miraculous.

But it’s never a snap and here’s why:

  • Data is never clean.
  • It’s never organized.
  • It’s never complete.

There are often data irregularities.  Here are some examples:

  • How are there more flights departing than planes on the tarmac?
  • Why are there 27 hours in a day?
  • A company sold a mobile phone for negative dollars? Twice?
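
Irregularities like these can often be caught with simple sanity checks before any modelling begins. The following is only a minimal sketch: the `Sale` record, its field names, and the sample values are hypothetical, invented here for illustration, and real rules would come from the client's business.

```python
from dataclasses import dataclass


@dataclass
class Sale:
    # Hypothetical record layout, for illustration only.
    product: str
    price: float      # sale price in dollars
    hour_of_day: int  # hour the sale was recorded, expected 0-23


def find_irregularities(sale: Sale) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    if sale.price < 0:
        problems.append("negative price")
    if not 0 <= sale.hour_of_day <= 23:
        problems.append("hour outside 0-23")
    return problems


# Made-up sample data: one clean record and two irregular ones.
sales = [
    Sale("mobile phone", 499.0, 14),
    Sale("mobile phone", -20.0, 9),
    Sale("mobile phone", 499.0, 27),
]

# Map each record's position to its problems, skipping clean records.
flagged = {}
for index, sale in enumerate(sales):
    problems = find_irregularities(sale)
    if problems:
        flagged[index] = problems
```

Flagging a record rather than silently dropping it keeps the question open: a negative price may be a data-entry error, or it may be a refund that was logged as a sale.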

What if the data is in six different databases across three continents, and usernames in the French office accept accented letters but those in the American office don’t? There are different date formats, units of measurement, and currencies to reconcile. How do you link up customer databases when usernames are different for the same user?
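
One common first step, sketched below under some assumptions, is to reduce usernames and dates to a canonical form before comparing them. The function names and the two date formats are hypothetical choices made for this example; only Python's standard `unicodedata` and `datetime` modules are used.

```python
import unicodedata
from datetime import date, datetime


def normalize_username(name: str) -> str:
    """Strip accents, case, and surrounding spaces so variants compare equal."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().strip()


def parse_date(text: str, formats=("%Y-%m-%d", "%d/%m/%Y")) -> date:
    """Try each assumed format in order; raise if none of them match."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


french_user = normalize_username("Hélène.Côté")
american_user = normalize_username(" HELENE.COTE ")
same_user = french_user == american_user
```

A matching key like this should be treated as a candidate link, not proof: two different people can collapse to the same normalized name. Likewise, a date such as 03/04/2021 is ambiguous between day-first and month-first conventions, so the format list has to be chosen per source system rather than guessed.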

What is the fastest and most efficient way to store decades’ worth of data on thousands of products with millions of transactions?

Having veteran Data Engineers properly architect and deploy the system is key to keeping it streamlined and effective. The results from Machine Learning are only as good as the data you use. It’s as much Data Art as it is Data Science. To get that extra 5% improvement you need to work with the client to truly understand not only their data but their industry, their business, incentives, and risks.

The ability to quickly empathize with a client and truly understand their needs is a key requirement for a powerful Data Science team. Many Machine Learning algorithms are “pattern matchers.” Provide them with enough data from different types of scenarios and the system can learn to detect the inherent patterns in the information. Because the algorithm has seen the pattern and trend in historical data, it can produce specific answers for *this* particular case:

  • Sales of this Telco product this month will be about 45 units.
  • The optimal speed to win this car race should be 155 +/- 5 mph.
  • Here’s a report on the seven particular customers that are at risk of cancelling their subscription next quarter.
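
To make the "pattern matcher" idea concrete, here is a deliberately simple sketch: fitting a straight line to past monthly sales and extending it one month ahead. The sales figures are made up for illustration, and a least-squares trend line is just one stand-in for the far richer models a real project would use.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def forecast_next(values):
    """Extend the fitted trend by one step beyond the last observation."""
    slope, intercept = fit_trend(values)
    return intercept + slope * len(values)


# Made-up monthly unit sales for a single product.
monthly_units = [30, 33, 36, 39, 42]
next_month = forecast_next(monthly_units)
```

With this perfectly linear toy history the forecast for the next month comes out at 45 units. Real sales data is noisier, which is why the quality and richness of the input matter so much.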

Experience means knowing the best way to edit, twist, roll, and massage the data to make the “patterns” and “signal” more visible to the algorithms. This may mean linking external data (historical foreign exchange rates, census reports, or published papers on adoption rates), or taking the data you have and getting more out of it.

A great example is sentiment analysis, using your customers’ reviews, surveys, and social media. There is more “pattern” to match in what the tweets *mean* than in how many times someone posted. Happy tweets could mean a Raving Fan; upset tweets following a recent purchase signal an increased risk of a lost client.
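
As a rough illustration of scoring what a message means instead of counting how often someone posts, the sketch below uses a tiny hand-written word list. The word lists, function names, and sample tweets are all hypothetical; production sentiment analysis would normally rely on a trained model, not a lexicon this small.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"love", "great", "happy", "amazing"}
NEGATIVE = {"broken", "refund", "terrible", "cancel"}


def sentiment_score(text: str) -> int:
    """Each positive word adds 1, each negative word subtracts 1."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def at_risk(tweets_after_purchase: list) -> bool:
    """Flag a customer whose post-purchase tweets are negative overall."""
    return sum(sentiment_score(t) for t in tweets_after_purchase) < 0


happy = sentiment_score("Love my new phone, it is great!")
upset = sentiment_score("Screen arrived broken. I want a refund.")
risk = at_risk(["Screen arrived broken.", "Still waiting on my refund."])
```

Here the happy tweet scores 2, the upset one scores -2, and the customer with two negative post-purchase tweets is flagged as at risk. A lexicon like this misses sarcasm and negation ("not great"), which is one reason the approach is only a starting point.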

Providing real-time visualization dashboards to executive, marketing, and customer retention teams can mean the difference between saving and losing long-term customers. How many customers are at risk? How much monthly recurring revenue is on the table?

Connecting with a strong data science team and a robust data practice helps businesses discover patterns, trends and answers to business questions that perhaps they didn’t even think of asking.

The right Data Science team will find value and relevance to elevate business optimization and opportunity.

Interested?  Connect with the data team to learn more:

Thank you to Jim Provost, Lixar Data Scientist, for writing this blog.