Andy Peytchev's Survey Methodology Blog

Friday, March 1, 2019

Alternative sources can encompass almost any type of information, making this topic intractable without specific examples. At a high level, the promise of "administrative data," for example, can seem quite lucrative. But for most statistical purposes these data are not replacements for survey data; rather, they offer potentially useful ways to inform data collection, augment analyses, improve survey estimates, and integrate with survey data to produce superior, and potentially cheaper and faster, data products.
Here is one example. A national survey is conducted every week to track national gasoline and diesel prices. It is a complex endeavor that requires collection and processing of data in a single day.
A popular smartphone app uses crowdsourcing to collect gasoline and diesel prices at gas stations. Individuals post and update prices at each gas station to inform all users of the app in real time.
The two are not independent, as the national survey makes some use of the app data, which is commendable. Nonetheless, when I graphed the survey estimates and overlaid them with the app estimates, I was surprised by how closely the two sources correspond.
The main giveaway for which line represents which source is that the app provides daily estimates instead of weekly.
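For anyone who wants to attempt a similar overlay, here is a minimal sketch in Python (pandas) of aligning a daily crowdsourced series with a weekly survey series. The file and column names are hypothetical, and the actual collection day of the survey may differ.

```python
import pandas as pd

# Hypothetical input files: one daily app series, one weekly survey series,
# each with columns "date" and "price".
app = pd.read_csv("app_prices.csv", parse_dates=["date"])
survey = pd.read_csv("survey_prices.csv", parse_dates=["date"])

# Average the daily app prices within each survey week so the two series
# share a time index (weeks ending on Monday here, which is an assumption).
weekly_app = app.set_index("date")["price"].resample("W-MON").mean()

comparison = survey.set_index("date")["price"].to_frame("survey")
comparison["app"] = weekly_app

# How closely do the two sources track each other?
print(comparison.corr())
```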
Tuesday, July 23, 2013
Is this a cell phone, a landline, or both? The demise of landlines as a technology?
Currently, Random Digit Dial (RDD) telephone surveys can field samples from the landline and the cell phone frames (sets of exchanges assigned to either service). Interviewers can confirm with the respondent whether they have reached a landline or a cell phone, and assign the case to the correct stratum for weighting and estimation.
Apparently motivated by the high cost of maintaining landlines, landline companies have been developing ways to use cell service to replace the copper wires to the home. This is perhaps not surprising, as many of these companies are also cell phone carriers. In this setup, a cell receiver connects to the existing wiring in the house, much like VoIP (Voice over IP, which uses the internet), and people continue to use their regular "landline" handsets.
Thursday, May 2, 2013
Are we choosing the wrong survey design because of goals expressed in number of interviews?
Survey objectives are often expressed as X,XXX interviews. A general survey design is selected that can achieve that number within given budget constraints, balancing cost and credibility in some fashion: for example, a Random-Digit-Dial telephone survey using mostly landline numbers (lower cost) and some, although relatively few, cell phone numbers. This is not unlike a decade or two ago, when an analogous split was made between the listed and list-assisted RDD sampling frames. The design above would produce unbiased estimates only to the extent that adults with only cell phones are represented.
What is not obvious is that this design, while cost-driven, may be very cost-inefficient. We could also be choosing a suboptimal survey design (with respect to bias in estimates, for example) in the pursuit of higher response rates. But that is another topic.
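To make the cost-inefficiency point concrete, here is a toy calculation (all numbers made up) comparing cost per effective interview under two allocations, using Kish's approximation for the design effect from weighting. The simple post-stratification to a cell-only population share is a deliberate oversimplification of the dual-frame problem.

```python
import numpy as np

def cost_per_effective_interview(n_landline, n_cell, cell_only_share=0.35,
                                 cost_landline=40.0, cost_cell=90.0):
    """Toy comparison: weight the cell sample up to the cell-only population
    share, then apply Kish's deff = 1 + CV^2 of the weights."""
    n = n_landline + n_cell
    w_land = (1 - cell_only_share) * n / n_landline
    w_cell = cell_only_share * n / n_cell
    weights = np.r_[np.full(n_landline, w_land), np.full(n_cell, w_cell)]
    deff = 1 + (weights.std() / weights.mean()) ** 2
    cost = n_landline * cost_landline + n_cell * cost_cell
    return cost / (n / deff)

# A landline-heavy (cheap) design versus a more balanced (pricier) one:
for alloc in [(2000, 200), (1400, 600)]:
    print(alloc, "->", round(cost_per_effective_interview(*alloc)),
          "dollars per effective interview")
```

The cheaper-looking design ends up costing substantially more per unit of effective sample, which is exactly the sense in which a cost-driven choice can be cost-inefficient.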
Tuesday, March 26, 2013
Missed Opportunities to Make Use of Large Scale Survey Experiments
Large ongoing national surveys, such as the Consumer Expenditure surveys, the National Crime Victimization Survey, the National Health Interview Survey, the National Survey on Drug Use and Health, and the National Immunization Survey (these are just a few haphazard examples), field major methodological experiments. The data from these experiments are not intended to be used in official estimates but rather to inform changes in the design of the studies.
Unfortunately, the data from these large scale methodological experiments are usually not publicly released. This is not through inadvertent oversight. At a minimum, there is cost associated with preparing and releasing data files - resources that are needed on the particular survey being redesigned.
Wednesday, January 23, 2013
The Three Sampling Distributions of Polls ... and Surveys?
I just listened to a pollster describe polls as relying increasingly on art rather than science: decisions on how to identify likely voters, how to weight the data, and so on. And it made me think: could we not quantify the Art part so that it becomes Science?
Probability-based surveys and polls rely on sampling theory and, ultimately, on a sampling distribution. This is certainly the case when frequentist inferential methods are used. That is, if the survey were repeated again and again, the survey estimate of interest would follow a sampling distribution, and we expect approximately 95% of the replications to produce estimates within two standard deviations of the mean of that distribution.
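This is easy to demonstrate by brute force. A toy simulation in Python, assuming simple random sampling from a synthetic population:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.binomial(1, 0.52, size=1_000_000)  # say, 52% hold some opinion

# "Repeat the survey again and again": draw many samples of n=1,000
# and compute the estimate each time.
estimates = np.array([rng.choice(population, size=1_000, replace=False).mean()
                      for _ in range(2_000)])

sd = estimates.std()
share = np.mean(np.abs(estimates - estimates.mean()) <= 2 * sd)
print(f"Replications within two SDs of the mean: {share:.3f}")  # roughly 0.95
```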
Saturday, December 15, 2012
Are we ready for a single-frame cell phone survey design?
It feels as if, in this rapidly changing social and technological environment, survey methods are merely trying to catch up with the changes. It took a while for RDD telephone surveys to incorporate cell phone numbers. It took more time for researchers to use sample optimization formulas for the allocation of sample across frames.
What if we try to look at least a step ahead? If one plots landline and cell phone service for adults in the US from Blumberg and Luke (2012), based on NHIS data, two clear linear trends emerge: increasing cell phone service and, just as importantly, decreasing landline telephone service. Another key observation from that graph is that we are rapidly approaching almost complete coverage of telephone households through cell phones alone.
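The extrapolation itself is trivial. A sketch with placeholder numbers (these are not the Blumberg and Luke estimates; substitute the published NHIS figures before reading anything into the output):

```python
import numpy as np

years = np.array([2008, 2009, 2010, 2011, 2012], dtype=float)
pct_adults_reachable_by_cell = np.array([75.0, 79.0, 83.0, 86.0, 89.0])  # placeholders

# Fit a straight line to the coverage trend and project it forward.
slope, intercept = np.polyfit(years, pct_adults_reachable_by_cell, deg=1)
for year in (2013, 2014, 2015):
    print(year, round(slope * year + intercept, 1))  # naive linear extrapolation
```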
Tuesday, December 11, 2012
Modeling Paradata
Paradata were in use long before the term, which describes data generated automatically as a by-product of data collection, was coined. For example, interviewer response rates have long been monitored during data collection on many surveys and in various organizations. A recent development, aided largely by responsive (and, more generally, adaptive) designs, has been the graphical presentation of such paradata over the course of the data collection period, which allows us to see changes in performance over time.
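A minimal sketch of such a display in Python (pandas and matplotlib); the call-record layout, one row per call attempt with the column names below, is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed columns: call_date, interviewer_id, outcome ("interview", "refusal", ...).
calls = pd.read_csv("call_records.csv", parse_dates=["call_date"])
calls["interview"] = calls["outcome"].eq("interview")

# Daily share of calls yielding an interview, by interviewer: the kind of
# series one would graph over the field period to spot changes in performance.
daily = (calls.groupby([pd.Grouper(key="call_date", freq="D"), "interviewer_id"])
              ["interview"].mean()
              .unstack("interviewer_id"))

daily.plot(ylabel="share of calls yielding an interview")
plt.show()
```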
Thursday, November 29, 2012
Consequences of Survey Nonresponse
I was asked to write a paper on the consequences of unit nonresponse in surveys (http://ann.sagepub.com/content/645/1/88). Quite a daunting task at first glance... but I found it quite interesting to go through the thought exercise of how nonresponse rates, particularly increasing ones, are affecting how we do surveys and the inferences we make from the collected data. The interesting part is not the obvious--sure, an article may be rejected from JAMA for using data from a survey with a response rate below 60%, because a low rate is equated with bias--but rather all the other ways nonresponse affects surveys and survey inference. Surveys become more costly because of an increasing need to spend resources on the reduction of nonresponse. They also become more complex, incorporating responsive designs, multiphase designs more generally, and multiple modes. And while higher nonresponse rates may not lead to higher nonresponse bias, they do mean greater reliance on auxiliary data and statistical models.
Wednesday, November 9, 2011
Computation of Response Rates in RDD Telephone Surveys
Suppose you are interested in examining the fat-free food items in Target. You can quickly screen out the clothing and furniture by taking a quick look. You may then go to the food department and check the labels on all the items to see whether they are fat-free. Some will be, some will not, and for a third set you will not be able to establish fat-free status (maybe they are out of reach...).
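In AAPOR terms, the out-of-reach items are the cases of unknown eligibility, and the question is what share of them to treat as eligible. A minimal sketch of AAPOR's Response Rate 3, with the eligibility rate e estimated from the cases whose status is known (one common choice among several); the counts are hypothetical:

```python
def rr3(interviews, eligible_nonrespondents, known_ineligible, unknown):
    """AAPOR Response Rate 3: cases of unknown eligibility are discounted
    by an estimated eligibility rate e."""
    known_eligible = interviews + eligible_nonrespondents
    e = known_eligible / (known_eligible + known_ineligible)
    return interviews / (known_eligible + e * unknown)

# E.g., 800 interviews, 1,200 eligible refusals/noncontacts, 3,000 businesses
# and nonworking numbers, and 2,000 numbers never resolved:
print(round(rr3(800, 1_200, 3_000, 2_000), 3))  # -> 0.286
```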
Monday, January 31, 2011
Multiple sources of error in surveys
Declining response rates in surveys have inarguably led to disproportionate attention to nonresponse reduction. Yet other sources of survey error may make a greater contribution to bias in survey estimates. And even if the impact of another error source is not larger than that of nonresponse, it may be less susceptible to adjustment. That is, nonignorable bias in survey estimates may be dominated by undercoverage of the target population, for example.
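As a reminder of why undercoverage is so stubborn, the (deterministic) coverage bias of the covered-population mean can be written as the standard identity below; note that no amount of nonresponse effort shrinks either factor:

```latex
\operatorname{Bias}(\bar{y}_C) \;=\; \bar{Y}_C - \bar{Y} \;=\; \frac{N_U}{N}\,\bigl(\bar{Y}_C - \bar{Y}_U\bigr)
```

where N_U/N is the uncovered proportion of the target population and the parenthesized term is the difference in means between the covered and uncovered groups.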
Tuesday, June 22, 2010
Responsive Design in a Telephone Survey
I have been quite intrigued by a couple of aspects of RDD telephone surveys, a mode that is generally marred by very limited (if any) information on sampled numbers and, relative to face-to-face surveys, limited techniques for tackling nonresponse. First, some information can be merged from other sources, but such data are either quite aggregated, such as census demographic estimates at the tract or block-group level, or have doubtful measurement properties, such as commercial data at the household or person level. Second, response propensity is not a property of a person. What is done on the first call attempt can affect later outcomes: an inexperienced interviewer may, for example, reduce the likelihood that a sample member is ever interviewed, even if the best interviewer calls the case on the next attempt.
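One way to take the second point seriously is to model propensity at the call level, with covariates describing what happened on earlier attempts to the same case. A hedged sketch (the data layout and column names, including interviewer_experience_yrs, are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per call attempt; assumed columns: case_id, attempt_number,
# outcome, interviewer_experience_yrs.
calls = pd.read_csv("call_records.csv").sort_values(["case_id", "attempt_number"])

# History features: what happened on EARLIER attempts to the same case.
calls["prior_refusal"] = (calls.groupby("case_id")["outcome"]
                               .transform(lambda s: s.shift().eq("refusal").cummax()))
calls["prior_attempts"] = calls.groupby("case_id").cumcount()

X = calls[["prior_refusal", "prior_attempts", "interviewer_experience_yrs"]]
y = calls["outcome"].eq("interview")
model = LogisticRegression().fit(X, y)  # call-level, not person-level, propensity
```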
Sunday, May 2, 2010
Nonresponse bias reduction through case prioritization
It is understandable why interviewers and survey organizations target cases with likely higher response propensities in order to maximize response rates. If the goal is really to reduce the potential for nonresponse bias, however, this may no longer be the right approach. Nor may it lead to the lowest variance in adjusted estimates, because it can produce high variability in the nonresponse weight components.
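A toy numeric illustration of the variance point: nonresponse adjustments behave roughly like one over the achieved propensity, so effort that raises the low propensities yields more uniform weights than effort that makes easy cases even easier. All numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.6, size=1_000)  # baseline response propensities
boost = 0.3                            # extra effort raises a case's propensity

p_easy = p.copy(); p_easy[np.argsort(-p)[:300]] += boost  # chase the easy cases
p_hard = p.copy(); p_hard[np.argsort(p)[:300]] += boost   # prioritize the hard cases

for name, q in [("target easy cases", p_easy), ("target hard cases", p_hard)]:
    w = 1 / q  # crude stand-in for a nonresponse adjustment factor
    print(name, "-> weight CV:", round(w.std() / w.mean(), 2))
```

Prioritizing the hard cases evens out the propensities and, with them, the adjustment factors.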
Thursday, March 25, 2010
Nonresponse and Measurement Error
Survey researchers aim to minimize the effects of multiple sources of survey error, but those sources may not be independent; in fact, they may have common causes. In a new article (Peytchev, Peytcheva, and Groves, 2010) we explore the possibility that topic-related social stigma may cause both underreporting of abortion experiences and failure to obtain interviews altogether. In addition to finding support for such a relationship between unit nonresponse and measurement error through a common cause, we also find an indication of possible remedies: survey protocols that weaken the relationship between the error sources.
Friday, February 19, 2010
Mail vs. Telephone; ABS vs. RDD
The Telephone Survey Methodology II conference and the resulting book, Advances in Telephone Survey Methodology, have shown both the problems with and solutions for the use of random-digit dialing (RDD) and the telephone to conduct surveys.
Others have used the growing challenges with RDD, both cost- and coverage-related, as arguments for entirely new approaches to current studies.
Saturday, February 13, 2010
Are We Done Yet?
Commonly, surveys have a target number of interviews, together with a target response rate. The number of interviews can be determined by a desired precision of survey estimates based on variance assumptions. Most surveys have multiple objectives, producing numerous estimates and even types of analysis. But what if a survey has a focal objective, such that survey data collection can be geared towards that objective? Clinical trials are one such example, but there are likely many surveys that despite producing many estimates, are really conducted to produce one or few key estimates such as an employment rate or consumer sentiment.
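If there really is one focal estimate, data collection can in principle be driven by a precision-based stopping rule rather than a fixed number of interviews. A minimal sketch, under a simple random sampling assumption:

```python
import math

def keep_collecting(successes: int, n: int, target_se: float) -> bool:
    """Stop once the estimated SE of the focal proportion hits the target.
    SRS formula; a real design would need design-based variance estimates."""
    if n < 30:                        # don't trust tiny early samples
        return True
    p = successes / n
    return math.sqrt(p * (1 - p) / n) > target_se

# Aiming for a margin of error of about +/-2 points (SE of 0.01):
print(keep_collecting(successes=310, n=600, target_se=0.01))  # True -> keep going
```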
Sunday, January 31, 2010
Optimizing dual-frame RDD sample designs
Mike Brick presented an interesting short course through ASA/SRMS last year, covering much-needed work on the optimum allocation of samples from the landline and cell phone frames. His examples used made-up numbers for cost per interview.
Unfortunately, it is the cost estimates themselves that are one of the greatest challenges to optimal allocation. Survey costs are ill-understood and difficult to measure.
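For reference, here is the textbook cost-sensitive allocation (n_h proportional to N_h * S_h / sqrt(c_h)), which gives the flavor of that work, applied to made-up numbers. The full dual-frame problem adds an overlap domain that this sketch ignores, and in practice the cost inputs c_h are exactly the hard part, as noted above.

```python
import math

strata = {
    # frame: (population share N_h, outcome SD S_h, cost per interview c_h)
    "landline": (0.60, 0.48, 40.0),
    "cell":     (0.40, 0.50, 90.0),
}
total_budget = 100_000.0

# Optimal allocation under a budget constraint: n_h = C * w_h / sum(w_k * c_k),
# where w_h = N_h * S_h / sqrt(c_h).
w = {h: N * S / math.sqrt(c) for h, (N, S, c) in strata.items()}
denom = sum(w[h] * strata[h][2] for h in strata)
for h in strata:
    n_h = total_budget * w[h] / denom
    print(f"{h}: {n_h:.0f} interviews (${n_h * strata[h][2]:,.0f})")
```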
Friday, January 22, 2010
The Role of the Survey Statistician: Beyond Sampling Error
Not surprisingly, the sampling texts devote the majority of their content to sampling error. There are tradeoffs with other sources of error, however. Nonresponse, for example, is quite directly related to sampling: one could start with a much larger sample size and achieve more completed interviews, at the expense of potential nonresponse error.
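A back-of-the-envelope version of that tradeoff, comparing the total error (root mean squared error) of an estimated proportion under two hypothetical designs:

```python
import math

def rmse(p, n, bias):
    """RMSE of a proportion under SRS: sampling variance plus squared bias."""
    return math.sqrt(p * (1 - p) / n + bias ** 2)

p = 0.30
# Design A: smaller sample, resources spent on nonresponse reduction -> less bias.
# Design B: four times the interviews, less effort per case -> more bias.
print("A:", round(rmse(p, n=1_000, bias=0.005), 4))  # ~0.0153
print("B:", round(rmse(p, n=4_000, bias=0.020), 4))  # ~0.0213
```

Despite quadrupling the number of interviews, design B comes out worse once the plausible bias is counted, which is the sense in which the statistician's job extends beyond sampling error.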