The Human Side Of Data Sanitation

Data gathering is an important process in any business. For the gathered data to be useful to an organisation, it should be in pristine condition. Making decisions based on dirty data can be detrimental to the company.

Now what is dirty data anyway? Put simply, dirty data is data that is duplicated, inaccurate, incomplete or flawed in any other way. It is therefore essential that quality control is applied in the process of data gathering.

Now how do we do that, exactly? There’s an old saying that goes: “Straight from the horse’s mouth”. A lot of organisations put significant weight on what people say about their products and services. They invest millions in data gathering campaigns using surveys, questionnaires and feedback forms.

Granted, these are globally accepted and practised methods of data gathering. In fact, if you persevere, you can probably get a significant pool of respondents providing a wealth of business data. However, if you believe in strength in numbers, you’d probably need to rethink that approach when it comes to data gathering. Bear in mind that effective data gathering is not built around a single pillar of strength.

Personally, I feel that trustworthy data must be verifiable and quantifiable. Now what does this mean in real-world terms? Verifiable data means that the data can be traced back to its exact source. Quantifiable data means that it can be valued. Let’s apply this concept by looking at a few examples:

  1. A well-dressed female customer bought a few expensive pieces of jewellery
  2. Mrs Basu bought a few expensive pieces of jewellery
  3. Mrs Basu bought a diamond ring, a gold bracelet and a pair of pearl earrings
  4. Mrs Basu, a regular customer, who happens to be the General Manager of Accounts at Company X, bought a diamond ring worth RM1200, a gold bracelet worth RM2000 and a pair of pearl earrings at RM800 (after getting a 20% discount of the original sales price) at 3:30pm today

The first statement is neither verifiable nor quantifiable. We don’t know who exactly the well-dressed female customer is, nor do we know what “a few” or “expensive” means. Those two terms are highly subjective.

The second statement is better; at least we can verify who the customer is. Nevertheless, we still don’t know what exactly she bought.

The third statement makes the picture much clearer. We know who bought what. However, it can be better as the next statement shows.

The fourth and last statement pretty much hits the nail on the head. We know exactly who bought exactly what, for how much, and at what exact point in time.
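To make the four statements above concrete, here is a minimal Python sketch of how each might look as a structured record. The class and field names (SaleRecord, LineItem, customer_name, price_rm, sold_at) are hypothetical, and the date is a placeholder because the statement only says “3:30pm today”. The two checks are just one simple reading of “verifiable” and “quantifiable”, not a standard definition.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional, Tuple


@dataclass
class LineItem:
    description: str
    price_rm: Optional[Decimal] = None  # None means no value was recorded


@dataclass
class SaleRecord:
    customer_name: Optional[str] = None  # who the data traces back to
    sold_at: Optional[datetime] = None   # when the sale happened, if known
    items: Tuple[LineItem, ...] = ()

    def is_verifiable(self) -> bool:
        # Assumption: a named customer is enough to trace the record to its source.
        return self.customer_name is not None

    def is_quantifiable(self) -> bool:
        # Assumption: quantifiable means every item is listed with a price.
        return bool(self.items) and all(i.price_rm is not None for i in self.items)

    def total_rm(self) -> Decimal:
        if not self.is_quantifiable():
            raise ValueError("cannot total a record with missing items or prices")
        return sum((i.price_rm for i in self.items), Decimal("0"))


# Statement 1: an anonymous customer and "a few expensive" items.
statement_1 = SaleRecord()

# Statement 2: the customer is named, the purchase is still vague.
statement_2 = SaleRecord(customer_name="Mrs Basu")

# Statement 3: the items are named but not valued.
statement_3 = SaleRecord(
    customer_name="Mrs Basu",
    items=(
        LineItem("diamond ring"),
        LineItem("gold bracelet"),
        LineItem("pair of pearl earrings"),
    ),
)

# Statement 4: named customer, priced items and a timestamp.
statement_4 = SaleRecord(
    customer_name="Mrs Basu",
    sold_at=datetime(2000, 1, 1, 15, 30),  # placeholder date, real time of day
    items=(
        LineItem("diamond ring", Decimal("1200")),
        LineItem("gold bracelet", Decimal("2000")),
        LineItem("pair of pearl earrings", Decimal("800")),
    ),
)

for number, record in enumerate(
    (statement_1, statement_2, statement_3, statement_4), start=1
):
    print(number, record.is_verifiable(), record.is_quantifiable())
```

Only the fourth record passes both checks, so it is the only one that can be totalled (RM1200 + RM2000 + RM800 = RM4000). The 20% discount is deliberately left out of the sketch.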

Now, let’s say your company has invested in a state-of-the-art IT solution that can handle tons of data and output it in so many graphs that you can get an epileptic seizure just by watching the rendering process. How sure are you that this implementation will contribute significant ROI? Do you even have a complete picture of what the investment was in the first place? Can you even quantify the exact breakeven point of the implementation? Can you verify that?

I’ve been in IT long enough to know that there is no magic bullet when it comes to data analysis. Often enough, improperly sanitised data contributes a lot to the perceived “failure” of an analytical IT implementation. It’s just too easy to blame the system when the fault is much, much more ingrained.

Computerised systems can pretty much handle any kind of duplicate detection or completeness check. However, accuracy is pretty much a hit-and-miss thing.

I’m not saying that IT solutions can’t add, subtract or perform more complex mathematical processes properly. The fact of the matter is, the data in the system is only as accurate as what the user keys in. None of the computer systems I’ve seen make presumptions about their operators. Therefore, if the entered data matches the acceptable pattern, it’s allowed to go through.
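Here is a rough Python sketch of that point: duplicates, missing fields and malformed values are easy to flag mechanically, but a mistyped value that still fits the pattern goes straight through. The field names, the price pattern and the sample rows are all made up for illustration, and treating identical rows as duplicates is a simplification, since a customer could legitimately buy the same item twice.

```python
import re

# Assumed format: whole ringgit, optionally followed by two decimal places.
PRICE_PATTERN = re.compile(r"\d+(\.\d{2})?")
REQUIRED_FIELDS = ("customer", "item", "price_rm")


def find_problems(rows):
    """Return (row index, problem) pairs for issues a system can detect."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
        if missing:
            problems.append((index, "incomplete: " + ", ".join(missing)))
            continue
        # Duplicates: identical values after trimming and lower-casing.
        key = tuple(row[field].strip().lower() for field in REQUIRED_FIELDS)
        if key in seen:
            problems.append((index, "duplicate"))
            continue
        seen.add(key)
        # Pattern: the price must look like a number.
        if not PRICE_PATTERN.fullmatch(row["price_rm"]):
            problems.append((index, "bad price format"))
    return problems


rows = [
    {"customer": "Mrs Basu", "item": "diamond ring", "price_rm": "1200"},
    {"customer": "Mrs Basu", "item": "diamond ring", "price_rm": "1200"},   # keyed in twice
    {"customer": "Mrs Basu", "item": "gold bracelet", "price_rm": ""},      # price left blank
    {"customer": "Mrs Basu", "item": "pearl earrings", "price_rm": "8000"}, # meant 800
]

print(find_problems(rows))
```

The second and third rows get flagged. The last row does not: “8000” is a perfectly well-formed price, so nothing in the system can tell that the operator meant 800.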

The human factor is an often overlooked aspect of the data gathering process. Training mostly covers procedure rather than the importance of ensuring accurate input. Put simply, systems training is more focused on the how than the why.

Humans are intelligent creatures. More importantly, they’re also selfish. So if they are made to understand why making proper use of an IT implementation will affect their income, job security and ultimately their survival in the organisation, you can be pretty darn sure that they’ll strive to do the best job possible.

And that, my friends, is the bottom line.

6 responses to “The Human Side Of Data Sanitation”.

  1. Kay Kastum Says:

    Very true, Azmeen. Having the most advanced technology is no guarantee of a company’s success.

  2. Site Admin Azmeen Says:

    It’s funny sometimes, Kay.

    Companies invest significantly in data mining tech, but it can all go down the drain when most of what you’re mining is dirt.

    Technology helps, but it’s not the be all and end all in improving business processes.

  3. Bryan Says:

    Data? It’s garbage in, garbage out for me. I would add one thing to your excellent writeup: the element of time. I’ve always imagined managing data is like managing perishables at the supermarket. Everything has an expiry date. If you can’t get the customers to update their info, no matter how good a system is, you end up with a yardful of decayed veggies, or dirty data as you put it.

    In the US, data mining is a common thing in large companies. Asian companies will balk at the cost of data gathering and maintenance and will question the cost benefit. They might have a system but most will have absolutely no clue what to do with the data. They invariably fall back on gut feel, intuition and feng shui. In its attitude towards data management, I think Asia is where the US was in the late 70s.

  4. Site Admin Azmeen Says:

    Ah yes Bryan, I think I should have emphasised the timeliness part as well.

    How right you are about the Asian attitude towards data mining. However, this area is also where those with experience can shine. Furthermore, we can catch up pretty fast once we know how to tie the knots.

    That’s the wonderful thing about technology: the hard part is invention. Replication and improvisation will spread pretty fast after that.

  5. Sung Says:

    This is very useful information. I would like to share this information if you do not mind. Thank you.

  6. Site Admin Azmeen Says:

    Hello Sung,

    By all means, please do.

    All HTNet content is licensed under the Creative Commons Attribution license.

    Basically, it means that you can use the content in its entirety or partially, for commercial or non-commercial purposes, as long as you provide credit to me (the original author).

    Credits can be in the form of a link to the original article if you’re publishing in an online format. Alternatively you can display the URL to this page for printed publications.

    Thank you.