The SampBal14x.lng Model

The sample weighting problem

Suppose you have a sample of 1000 people, classified along two dimensions:

  Gender: Female or Male
  Income: Low, MediumLow, Medium, or High

In our target population, the "rim" fractions to match are:

  Gender: Female 50%, Male 50%
  Income: Low 40%, MediumLow 30%, Medium 20%, High 10%

Our sample of 1000 has the following counts in the various cells:

             Low  MediumLow  Medium  High
  Female:    220        147     106    55
  Male:      193        149      94    36

We want to predict a dependent variable, e.g., how the population will vote on a particular issue. We say the sample is representative if the fraction of the sample at each income level is the same as in the target population, the fraction of the sample that is female is the same as in the target population, etc. Our sample above is not representative of the target population in this sense. Notice that (220 + 193)/1000 = 0.413 > 0.4, so the first income level is over-represented. Similarly, (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5, so females are over-represented.

One thing we could do is discard "unrepresentative" observations from the sample so that the remaining observations accurately match the target population. This leaves a smaller sample, so we want to drop the smallest number of observations that achieves a close match to the population. A slightly more general approach is not to drop an observation completely, but rather to reduce the weights given to "unrepresentative" observations. Which weights should be adjusted, and by how much?

If the sample is not representative, we may want to choose a weight other than 1.0 for each observation so that the weighted fractions of the sample match the population fractions. For our example there are 2 + 4 = 6 target fractions to be matched. There are 1000 different weights to be chosen (or 8 weights if we apply the same weight to all observations in the same cell), so there are many different weight combinations that match the target population fractions.
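The representativeness check described above can be sketched in a few lines of Python. This is only an illustration built from the cell counts and rim targets given in the text; the variable names and the `report` helper are made up for this sketch and are not part of the SampBal14x model.

```python
# Cell counts copied from the sample table (rows: gender, columns: income).
income_levels = ["Low", "MediumLow", "Medium", "High"]
counts = {
    "Female": [220, 147, 106, 55],
    "Male": [193, 149, 94, 36],
}

# Rim targets for the target population, as fractions.
gender_target = {"Female": 0.50, "Male": 0.50}
income_target = dict(zip(income_levels, [0.40, 0.30, 0.20, 0.10]))

n = sum(sum(row) for row in counts.values())  # total sample size

# Unweighted sample fractions along each dimension.
gender_frac = {g: sum(row) / n for g, row in counts.items()}
income_frac = {
    level: sum(row[j] for row in counts.values()) / n
    for j, level in enumerate(income_levels)
}

def report(name, sample, target):
    """Print sample vs. target fraction and flag over/under-representation."""
    for key in target:
        gap = sample[key] - target[key]
        status = "over" if gap > 0 else "under" if gap < 0 else "matched"
        print(f"{name:7s} {key:10s} sample={sample[key]:.3f} "
              f"target={target[key]:.3f} ({status})")

report("Gender", gender_frac, gender_target)
report("Income", income_frac, income_target)
```

Any level flagged "over" or "under" is one whose observations would need a weight different from 1.0 for the weighted sample to match the rim targets.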
Which weight combination should we choose? The "Max effective sample Size" (MS) approach chooses the weights that a) match the population targets, and b) minimize the variance of the resulting estimate, or equivalently, maximize the size of an equivalent representative sample that has the same variance.

Here are some cell weights that perfectly match the rim targets:

                  Low  MediumLow    Medium      High
  Female:    0.31       1.065170  1.700566  1.726545
  Male:      1.719171   0.96255   0.21      0.14

Notice that (0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4, so the first income level is matched. Are these the best possible weights? It can be shown that these weights have an effective sample size of 729, considerably lower than the unweighted sample size of 1000. If, on the other hand, you choose the following weights:

               Low  MediumLow  Medium   High
  Female:    0.731      1.182   1.444  0.228
  Male:      1.240      0.847   0.499  2.430

the effective sample size is 843, considerably better than 729.

There is a spectrum of intermediate choices, ranging from placing great importance on closely matching the population rim targets to keeping the weights close to 1 at the cost of not exactly matching the population target fractions. In either case, we want to minimize the variance of our estimator. If you are willing to allow a very slight violation of the rim targets, then the following weights:

               Low  MediumLow  Medium   High
  Female:    0.940      0.971   0.96   1.059
  Male:      1.015      1.045   1.038  1.134

have an effective sample size of 997, almost as good as 1000. They do a fairly good job of matching the rim targets, as follows:

  Achieved rim target percent
  Variable  Level  WgtPercent
  GENDER        1        51.0
  GENDER        2        49.0
  INCOME        1        40.3
  INCOME        2        29.9
  INCOME        3        20.0
  INCOME        4         9.9

For example: (0.940*220 + 1.015*193)/1000 = 402.695/1000 = 0.403, i.e., about 40.3%.

Ref: Madansky, A. and L. Schrage (2017), "Maximizing effective sample size while sample balancing," Quirk's Marketing Research Review.
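The effective sample sizes quoted for the three weight tables can be checked with a short Python sketch. The text does not state the formula it uses; the sketch assumes the common Kish definition, (sum of weights)^2 / (sum of squared weights), taken over all 1000 observations with every observation in a cell sharing that cell's weight. Under that assumption the results agree with the quoted 729, 843, and 997 to within rounding. The function and variable names are illustrative and are not taken from the SampBal14x model.

```python
# Cell counts from the sample (rows: Female, Male; columns: income levels).
counts = [[220, 147, 106, 55],
          [193, 149, 94, 36]]

# The three cell-weight tables quoted in the text.
exact_a = [[0.31, 1.065170, 1.700566, 1.726545],
           [1.719171, 0.96255, 0.21, 0.14]]
exact_b = [[0.731, 1.182, 1.444, 0.228],
           [1.240, 0.847, 0.499, 2.430]]
relaxed = [[0.940, 0.971, 0.96, 1.059],
           [1.015, 1.045, 1.038, 1.134]]

def effective_sample_size(weights, counts):
    """Kish formula (an assumption here): (sum w)^2 / (sum w^2),
    summed over observations, not over cells."""
    pairs = [(w, n) for wr, nr in zip(weights, counts) for w, n in zip(wr, nr)]
    sum_w = sum(w * n for w, n in pairs)
    sum_w2 = sum(w * w * n for w, n in pairs)
    return sum_w ** 2 / sum_w2

def weighted_percents(weights, counts):
    """Return (gender percents, income percents) of the weighted sample."""
    cells = [[w * n for w, n in zip(wr, nr)] for wr, nr in zip(weights, counts)]
    total = sum(sum(row) for row in cells)
    gender = [100 * sum(row) / total for row in cells]
    income = [100 * sum(col) / total for col in zip(*cells)]
    return gender, income

for name, w in [("exact_a", exact_a), ("exact_b", exact_b), ("relaxed", relaxed)]:
    gender, income = weighted_percents(w, counts)
    print(name, round(effective_sample_size(w, counts), 1),
          [round(p, 1) for p in gender], [round(p, 1) for p in income])
```

Because the published weights are rounded to a few decimals, the recomputed percentages can differ from the quoted ones in the last digit.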

Keywords:

Marketing | Sampling | Statistics | Quadratic