Here at EMI, we conduct a large amount of Research-on-Research (ROR). While the focus of the research is to gain an understanding of the sample landscape, including differences in attitudes and behaviors across sample providers, we also do quite a bit of work to evaluate the quality of the data. For this research, we interviewed nearly 12,000 respondents and evaluated nearly 36,000 open-ended responses across 23 different sample sources. We utilized our internal proprietary software (SWIFT) to help with this.
Keep in mind that for the purposes of this research, we did not remove anyone – we’re only evaluating respondents to help determine the ideal quality measure or set of quality measures to use in the future.
There are countless ways to evaluate and measure data quality. As part of the most recent waves of Research-on-Research, we embedded 19 different data quality checks in the survey that account for and examine a variety of data quality measures.
The 19 quality checks are broken down into the following categories.
We generally include three open-ended questions in each survey. The first quality check we perform is looking for garbage or irrelevant responses in open-ended questions. Open-ended responses require more in-depth thought and attention to detail and can help determine how thorough a respondent is being in their answers. The second check looks for suspicious open-ended responses. These responses may be irrelevant to the question or may be similar to responses given to other questions.
Time spent in a survey is a necessary measure of quality. While it is important that surveys be an appropriate length, a well-designed survey should take a predictable amount of time when completed attentively. The first quality check in the time category flags respondents who take more than 600 minutes (the mean survey length is 21.93 minutes) to complete the survey. The second quality check looks for speeders: respondents who finish in less than the cutoff time (median survey length / 3). For example, if the median length of the survey is 15 minutes and a respondent took less than 5 minutes to complete it, they would be considered a speeder.
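The time-based checks above can be sketched in a few lines of code. This is a minimal illustration, not EMI's actual implementation; the function name and the data layout (a list of completion times in minutes) are assumptions.

```python
# Hypothetical sketch of the time-based checks: speeders (< median/3)
# and extreme overtime (> 600 minutes). Names and data shapes are assumed.
from statistics import median

def flag_speeders(durations_minutes, max_minutes=600):
    """Flag each respondent as 'speeder', 'overtime', or 'ok'.

    Speeders finish in under one third of the median survey length;
    overtime respondents exceed max_minutes.
    """
    cutoff = median(durations_minutes) / 3
    flags = []
    for d in durations_minutes:
        if d < cutoff:
            flags.append((d, "speeder"))
        elif d > max_minutes:
            flags.append((d, "overtime"))
        else:
            flags.append((d, "ok"))
    return flags

# With a 15-minute median, the speeder cutoff works out to 5 minutes.
results = flag_speeders([15, 14, 16, 4, 700])
```

Computing the cutoff from the median rather than the mean keeps it robust against a few extreme completion times, which matters given that some respondents leave the survey open for hours.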
When looking at quality checks by device, there are four different types of mobile device contradictions. We asked all respondents which device they were using to take the survey and compared that to the digital fingerprinting software we utilize, which tells us the specific device used by each respondent. The first contradiction is when a respondent says “No” to using a mobile device but is in fact using one. The second is when a respondent says “Yes” to using a mobile device but is in fact not using one. The third is a respondent saying “No” to using a mobile device, then later selecting “Android/iOS.” The final contradiction is when a respondent says “Yes” to using a mobile device, then later selects “Other/Desktop.”
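The four contradictions reduce to simple comparisons between the self-reported answers and the fingerprint data. Below is a hedged sketch; the field names and flag labels are invented for illustration, and a real fingerprinting integration would supply `fingerprint_is_mobile`.

```python
# Sketch of the four mobile-device contradictions described above.
# claimed_mobile: the respondent's "Yes"/"No" answer
# fingerprint_is_mobile: bool reported by the fingerprinting software
# device_choice: later self-reported device ("Android/iOS", "Other/Desktop", or None)
def device_contradictions(claimed_mobile, fingerprint_is_mobile, device_choice):
    flags = []
    if claimed_mobile == "No" and fingerprint_is_mobile:
        flags.append("said_no_but_on_mobile")
    if claimed_mobile == "Yes" and not fingerprint_is_mobile:
        flags.append("said_yes_but_not_on_mobile")
    if claimed_mobile == "No" and device_choice == "Android/iOS":
        flags.append("no_then_selected_android_ios")
    if claimed_mobile == "Yes" and device_choice == "Other/Desktop":
        flags.append("yes_then_selected_other_desktop")
    return flags
```

Note that a single respondent can trip more than one contradiction at once, e.g. answering “No” while fingerprinted on mobile and then selecting “Android/iOS.”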
IP Address is also an important factor in making sure that our surveys reach the targeted respondents. The first quality check for IP address is to search for duplicates based on IP address and demographics. We looked across all respondents and identified consistent trends by IP address and demographics.
The second quality check in the IP address category ensures that respondents are not outside the country for which the survey is being conducted, which can be determined from the IP address. Finally, if the IP address cannot be matched to any country at all, the data from that respondent may be fraudulent.
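The country check can be expressed as a small decision rule. In this sketch, a plain dictionary stands in for a real IP geolocation service, and the IPs (drawn from documentation ranges) and country codes are made up.

```python
# Minimal sketch of the IP-country check. ip_to_country stands in for a
# real geolocation lookup; all addresses and codes here are illustrative.
def check_ip_country(ip, target_country, ip_to_country):
    country = ip_to_country.get(ip)   # None means no country match at all
    if country is None:
        return "possible_fraud"       # IP matches no country
    if country != target_country:
        return "out_of_country"
    return "ok"

lookup = {"203.0.113.5": "US", "198.51.100.7": "CA"}
```

In practice the lookup would come from a geolocation database or API, but the three-way outcome (in country, out of country, no match) mirrors the checks described above.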
There are two attention check questions to test for data quality. For these questions, the respondent is asked to “Select 5 for all brands” to determine if they are paying attention. If they do not select 5, they fail the attention check.
The final category for data quality checks is the “Other” category. The first quality check is the smoking contradiction. We ask respondents about smoking once at the beginning of the questionnaire and once at the end. If their answers contradict each other, it may be a sign of poor data quality from that respondent.
The second quality check is income contradiction, which follows the same format as the smoking contradiction.
The third quality check is a trap question within the survey. If the respondent selects “Walk on the moon” in response to the question “Which of the following have you done in the past month?”, we can determine that this respondent may not be paying attention or may not be taking the survey seriously. Two more quality checks flag respondents who select more than 3 ethnicities or whose ethnicity answers contradict each other.
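The trap-question and ethnicity-count checks are straightforward membership and length tests. A minimal sketch, assuming the answers arrive as lists of selected options (the function names are invented):

```python
# Sketch of the trap-question and ethnicity-count checks; data shapes
# (lists of selected options) are assumptions for illustration.
def fails_trap_question(activities):
    """'Walk on the moon' should never be a truthful answer."""
    return "Walk on the moon" in activities

def too_many_ethnicities(ethnicities, limit=3):
    """Selecting more than `limit` ethnicities triggers the flag."""
    return len(ethnicities) > limit
```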
The final quality check is whether the respondent is a straight liner. A straight liner is a respondent whose selections fall in the same vertical line throughout the survey. For example, a respondent who selects answer A for every question may be giving fraudulent responses.
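Detecting a straight liner amounts to asking whether every answer position in a grid is identical. This sketch assumes the answers are recorded as column positions and applies a minimum-row threshold so that short grids do not trigger false positives; both the function name and the threshold are assumptions.

```python
# Sketch of a straight-liner check over a grid question.
# grid_answers: answer positions (e.g. column indexes) across the grid rows.
def is_straight_liner(grid_answers, min_rows=5):
    """Flag respondents whose answers sit in one vertical line.

    Only grids with at least min_rows rows are considered, to avoid
    flagging legitimately uniform answers on short grids.
    """
    return len(grid_answers) >= min_rows and len(set(grid_answers)) == 1
```

A natural extension is to compute the share of identical answers rather than requiring all of them to match, which also catches near-straight-liners.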
These checks allow us to detect a variety of threats to the accuracy and reliability of the data. They measure the speed of respondents, accuracy of the responses, duplicate answers, contradicting answers, and more. They also provide a guideline and standard to hold ourselves accountable when collecting and analyzing data here at EMI.
We can then take that data quality data and analyze it just as we would any other data. After that, we consider all 19 data quality checks and decide whether or not to remove a respondent, taking into consideration both the number of failed quality checks and which ones were failed.
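One way to weigh both the number and the identity of failed checks is a simple weighted score. The weights and threshold below are purely illustrative assumptions, not EMI's actual removal rule; the point is that some failures (like an attention check) carry more evidential weight than others.

```python
# Hypothetical scoring sketch for the removal decision. The weights and
# threshold are illustrative only, not an actual decision rule.
CHECK_WEIGHTS = {
    "attention_check": 3,        # hard failures weigh most
    "speeder": 2,
    "garbage_open_end": 2,
    "device_contradiction": 1,   # softer signals weigh less
    "income_contradiction": 1,
}

def removal_score(failed_checks):
    """Sum the weights of failed checks; unlisted checks count as 1."""
    return sum(CHECK_WEIGHTS.get(c, 1) for c in failed_checks)

def should_review_for_removal(failed_checks, threshold=4):
    """Flag a respondent for manual review once the score hits the threshold."""
    return removal_score(failed_checks) >= threshold
```

Keeping the output as a "review" flag rather than an automatic removal matches the judgment step described next: the score surfaces candidates, and a person makes the final call.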
At this point in the data quality check, we make an educated decision on whether to remove the respondent or not. This is a crucial decision because we don’t want to miss out on a valuable respondent perspective because we mistook the data as incorrect or invalid, but we also don’t want to include biased data in our analyses that could potentially sway the decisions made based on that data. This attention to detail ensures that the data is reliable and trustworthy so that better business decisions can be made.
The results from our data quality measures will be discussed in upcoming blogs. We will look at differences in data quality by panel, demographic, changes in attitudes and behaviors, open-ends, and finish with our recommendations on how to approach data quality so that you get the most out of your data.