The SampBal14x.lng Model

The sample weighting problem

 Suppose you have a sample of 1000 people, classified according to
the two dimensions:
  Gender: female or male, and 
  Income:  Low, MediumLow, Medium, or High
In our target population, the following "rim" (marginal) fractions apply:
Female: 50%,  Male: 50%,
LowIncome: 40%, MediumLow: 30%, Medium: 20%, High: 10%.
Our sample of 1000 has the following number in the various cells:
            Low    MediumLow   Medium   High
  Female:   220       147        106      55
    Male:   193       149         94      36

We want to predict a dependent variable, e.g., how the population will vote
on a particular issue.
   We say the sample is representative if the fraction of the sample that is
of a given income level is the same as in the target population, the
fraction of the sample that is female is the same as in the target population, etc.
   Our sample above is not representative of the target population in this sense.
Notice that (220 + 193)/1000 = 0.413 > 0.4 so the first income level is over-represented.
Similarly, (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5
so females are over-represented.
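
The comparison above can be repeated for every rim level. The following is a minimal Python sketch, not part of the LINGO model; the variable names are chosen here for illustration only.

```python
# Cell counts from the sample table (rows: gender, columns: income level).
counts = {
    "Female": {"Low": 220, "MediumLow": 147, "Medium": 106, "High": 55},
    "Male":   {"Low": 193, "MediumLow": 149, "Medium": 94,  "High": 36},
}
# Rim (marginal) target fractions of the population.
gender_target = {"Female": 0.50, "Male": 0.50}
income_target = {"Low": 0.40, "MediumLow": 0.30, "Medium": 0.20, "High": 0.10}

total = sum(sum(row.values()) for row in counts.values())

# Unweighted fraction of the sample in each gender and each income level.
gender_share = {g: sum(row.values()) / total for g, row in counts.items()}
income_share = {
    level: sum(counts[g][level] for g in counts) / total
    for level in income_target
}

# Print sample share next to the target for all six rim levels.
for name, share, target in (
    ("gender", gender_share, gender_target),
    ("income", income_share, income_target),
):
    for level, frac in share.items():
        print(f"{name:6s} {level:10s} sample={frac:.3f} target={target[level]:.3f}")
```

Any level whose sample share differs from its target is over- or under-represented in the unweighted sample.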
    One thing we could do is to discard "unrepresentative" observations
from the sample so that the remaining observations accurately match the target population.
This leaves a smaller sample, so we want to drop the smallest number
of observations needed to achieve a close match to the population.
A slightly more general approach is not to drop an observation completely,
but rather to reduce the weights given to "unrepresentative" observations.
Which weights should be adjusted, and by how much?
   If the sample is not representative, we may want to choose a weight other than 1.0 
for each observation so that the weighted fractions of the sample match the
population fractions. For our example there are 2 + 4 = 6 target fractions to
be matched.  There are 1000 different weights to be chosen (or 8 weights if we
apply the same weight to all observations in the same cell), so there are lots of
different weight combinations that match the target population fractions.
Which weight combination should we choose?
The "Max effective sample Size" (MS) approach chooses the weights that
   a) match the population targets, and
   b) minimize the variance of the resulting estimate, or equivalently,
      maximize the size of an equivalent representative sample that has
      the same variance.
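
As a rough illustration of (a) and (b), the sketch below is an assumption-laden Python stand-in, not the LINGO formulation. It assumes that the effective sample size is the Kish measure (sum of weights)^2 / (sum of squared weights), that one weight is used per cell, and that nonnegativity of the weights can be ignored. With the rim totals fixed, maximizing that measure is the same as minimizing the sum of n*w^2 over the cells, and the stationarity conditions give weights of the additive form w[i][j] = a[i] + b[j]. The names a, b, row_targets, and col_targets are hypothetical.

```python
# Cell counts (rows: Female, Male; columns: Low, MediumLow, Medium, High).
counts = [[220, 147, 106, 55], [193, 149, 94, 36]]
row_targets = [500.0, 500.0]                 # gender rims as weighted counts
col_targets = [400.0, 300.0, 200.0, 100.0]   # income rims as weighted counts

row_n = [sum(r) for r in counts]
col_n = [sum(c) for c in zip(*counts)]

# Assumed additive form of the minimum-variance weights: w[i][j] = a[i] + b[j].
a = [0.0, 0.0]
b = [0.0, 0.0, 0.0, 0.0]
for _ in range(200):
    # Pick a[i] so that weighted row i hits its target, holding b fixed.
    for i in range(2):
        a[i] = (row_targets[i] - sum(counts[i][j] * b[j] for j in range(4))) / row_n[i]
    # Pick b[j] so that weighted column j hits its target, holding a fixed.
    for j in range(4):
        b[j] = (col_targets[j] - sum(counts[i][j] * a[i] for i in range(2))) / col_n[j]

weights = [[a[i] + b[j] for j in range(4)] for i in range(2)]

# Check the rim totals and compute the (assumed) Kish effective sample size.
row_totals = [sum(counts[i][j] * weights[i][j] for j in range(4)) for i in range(2)]
col_totals = [sum(counts[i][j] * weights[i][j] for i in range(2)) for j in range(4)]
sum_w = sum(row_totals)
sum_w2 = sum(counts[i][j] * weights[i][j] ** 2 for i in range(2) for j in range(4))
ess = sum_w ** 2 / sum_w2

print(weights)
print(row_totals, col_totals, ess)
```

The actual model may include bounds or other constraints not described on this page, so its weights need not coincide with the output of this sketch.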

 Here are some cell weights that perfectly match the rim targets.
            Low       MediumLow   Medium     High
  Female:  0.31       1.065170   1.700566   1.726545
    Male:  1.719171   0.96255    0.21       0.14
Notice that
(0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4
so the first income level is matched.
Are these the best possible weights?  It can be shown that these
weights have an effective sample size of 729, considerably lower than
the unweighted sample size of 1000.
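
The page does not spell out the effective sample size formula. The sketch below assumes the common Kish definition, (sum of weights)^2 / (sum of squared weights), taken over all 1000 observations with one weight per cell. The function names are hypothetical.

```python
# Cell counts and the first set of cell weights
# (rows: Female, Male; columns: Low, MediumLow, Medium, High).
counts = [[220, 147, 106, 55], [193, 149, 94, 36]]
weights = [
    [0.31, 1.065170, 1.700566, 1.726545],
    [1.719171, 0.96255, 0.21, 0.14],
]

def effective_sample_size(counts, weights):
    """Assumed Kish measure: (sum of weights)^2 / (sum of squared weights)."""
    pairs = [(n, w) for nr, wr in zip(counts, weights) for n, w in zip(nr, wr)]
    sum_w = sum(n * w for n, w in pairs)
    sum_w2 = sum(n * w * w for n, w in pairs)
    return sum_w ** 2 / sum_w2

def weighted_margins(counts, weights):
    """Weighted totals per row (gender) and per column (income level)."""
    cells = [[n * w for n, w in zip(nr, wr)] for nr, wr in zip(counts, weights)]
    return [sum(r) for r in cells], [sum(c) for c in zip(*cells)]

ess = effective_sample_size(counts, weights)
row_totals, col_totals = weighted_margins(counts, weights)
print(round(ess), row_totals, col_totals)
```

Under this assumed definition the result is about 729, and the weighted totals come out at the rim targets of 500/500 and 400/300/200/100 up to rounding of the listed weights.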
   If, on the other hand, you choose the following weights:
            Low     MediumLow   Medium     High
  Female:  0.731    1.182      1.444     0.228
    Male:  1.240    0.847      0.499     2.430

they have an effective sample size of 843, considerably better than 729.
   There is a spectrum of intermediate choices, ranging from
  placing great importance on closely matching the population rim targets, to
  keeping the weights close to 1 at the cost of
      not exactly matching the population target fractions.
In either case, we want to minimize the variance of our estimator.
If you are willing to allow a very slight violation of the rim targets,
then the following weights:
           Low     MediumLow   Medium     High
  Female:  0.940    0.971      0.96      1.059
    Male:  1.015    1.045      1.038     1.134
have an effective sample size of 997, almost as good as 1000.
They do a fairly good job of matching the rim targets, as follows:
 Achieved rim target percent
 Variable  Level  WgtPercent
   GENDER    1    51.0
   GENDER    2    49.0
   INCOME    1    40.3
   INCOME    2    29.9
   INCOME    3    20.0
   INCOME    4     9.9
        
For example:  (0.940*220 + 1.015*193)/1000 = 402.695/1000 = 0.403, i.e., 40.3%.
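
The achieved percentages can be recomputed for all six rim levels. This Python sketch assumes each percentage is the weighted count at that level divided by the total weighted count. Because the weights shown are rounded to three decimals, recomputed figures may differ from the table in the last digit.

```python
# Cell counts and the near-1 cell weights
# (rows: Female, Male; columns: Low, MediumLow, Medium, High).
counts = [[220, 147, 106, 55], [193, 149, 94, 36]]
weights = [
    [0.940, 0.971, 0.96, 1.059],
    [1.015, 1.045, 1.038, 1.134],
]

# Weighted count in every cell.
cells = [[n * w for n, w in zip(nr, wr)] for nr, wr in zip(counts, weights)]
total_w = sum(sum(r) for r in cells)

# Achieved rim percentages (assumed denominator: total weighted count).
gender_pct = [100.0 * sum(r) / total_w for r in cells]
income_pct = [100.0 * sum(c) / total_w for c in zip(*cells)]

# Assumed Kish effective sample size for the same weights.
sum_w2 = sum(n * w * w for nr, wr in zip(counts, weights) for n, w in zip(nr, wr))
ess = total_w ** 2 / sum_w2

for level, pct in enumerate(gender_pct, start=1):
    print(f"GENDER {level} {pct:5.1f}")
for level, pct in enumerate(income_pct, start=1):
    print(f"INCOME {level} {pct:5.1f}")
print(f"effective sample size: {ess:.1f}")
```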

 Ref: Madansky, A. and L. Schrage (2017), "Maximizing effective sample size
 while sample balancing," Quirk's Marketing Research Review.

Keywords:

Marketing | Sampling | Statistics | Quadratic