You’re talking about data that doesn’t back the initial hypothesis. That isn’t bad data in this context, and you’re correct that it is still valuable for reforming hypotheses and re-running the experiment.
Bad data in this context refers to data quality: things like inconsistent collection, inadequate or missing data, free text instead of controlled input, etc. In those cases the data can become almost useless (and this is usually known by the people working on the project, but not necessarily by their management). That creates pressure to turn shit into gold when that just isn't possible.
Imagine that your boss wants you to predict the temperature next Tuesday. To do this, your company has provided you the temperature from every Tuesday for the past 12 years. If that weren't bad enough, at first they recorded the date in DDMMYY format, but 10 years ago they switched to MMDDYY. However, some records were still collected in the legacy DDMMYY format due to lack of training in the temperature collection department, and there is no way to tell which reading of a given record is correct. Also, one employee who was close to retirement only recorded the temperature as "Hot" or "Cold," because that's how he was trained when he was first hired 50 years ago and he never bothered to learn the new system.

Now, you can probably build a model that tracks weekly temperature over time and approximates next Tuesday's temperature from something like seasonality, the historical average, and the most recent Tuesday. But you'll know it's not the best estimate, you'll know far better data exists, and you could probably make a simpler, more accurate estimate just by averaging the temperatures from Saturday, Sunday, and Monday.
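To make the date ambiguity concrete, here's a rough Python sketch (names and the 6-digit input format are my assumptions, not anything from an actual system) showing why some records are unrecoverable: any record where both the first and second pair of digits are 12 or less parses validly under both DDMMYY and MMDDYY, so you literally cannot tell which date it is.

```python
from datetime import datetime

def parse_record_date(raw):
    """Try to parse a 6-digit date that may be DDMMYY or MMDDYY.

    Returns every plausible interpretation. When both the day and
    month fields are <= 12, the record is ambiguous and both
    readings come back.
    """
    candidates = []
    for fmt in ("%d%m%y", "%m%d%y"):
        try:
            candidates.append(datetime.strptime(raw, fmt))
        except ValueError:
            pass  # this format doesn't fit, e.g. month of 13
    # Deduplicate: something like 050505 reads the same either way.
    return sorted(set(candidates))

print(parse_record_date("130590"))  # day 13 > 12, so only DDMMYY fits: one answer
print(parse_record_date("040790"))  # ambiguous: 4 July 1990 or 7 April 1990
```

Any pipeline built on this data has to either throw the ambiguous records away or guess, and neither option gets you back to good data.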
That’s bad data.
This guy datas.