The SampBal14x.lng Model

The sample weighting problem

Suppose you have a sample of 1000 people, classified along
two dimensions:
   Gender: Female or Male, and
   Income: Low, MediumLow, Medium, or High.
In our target population, the following "rim" (marginal) fractions apply:
   Gender: Female 50%, Male 50%,
   Income: Low 40%, MediumLow 30%, Medium 20%, High 10%.
Our sample of 1000 has the following number of observations in each cell:
            Low   MediumLow   Medium   High
Female:     220         147      106     55
Male:       193         149       94     36

We want to predict a dependent variable, e.g., how the population will vote
on a particular issue.
We say the sample is representative if the fraction of the sample at
each income level is the same as in the target population, the fraction
of the sample that is female is the same as in the target population, etc.
Our sample above is not representative of the target population in this sense.
Notice that (220 + 193)/1000 = 0.413 > 0.4 so the first income level is over-represented.
Similarly, (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5
so females are over-represented.
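
The same check can be made for every gender and income level at once. The following is a minimal Python sketch; the variable names are illustrative and are not taken from the SampBal14x.lng model.

```python
# Cell counts from the table above: rows = gender, columns = income level.
counts = {
    "Female": [220, 147, 106, 55],
    "Male":   [193, 149, 94, 36],
}
income_levels = ["Low", "MediumLow", "Medium", "High"]
gender_target = {"Female": 0.50, "Male": 0.50}
income_target = [0.40, 0.30, 0.20, 0.10]

n = sum(sum(row) for row in counts.values())  # total sample size

# Sample fraction of each gender (row sums) and income level (column sums).
gender_share = {g: sum(row) / n for g, row in counts.items()}
income_share = [sum(row[j] for row in counts.values()) / n
                for j in range(len(income_levels))]

for g, share in gender_share.items():
    print(f"{g:10s} sample {share:.3f}  target {gender_target[g]:.3f}")
for name, share, target in zip(income_levels, income_share, income_target):
    print(f"{name:10s} sample {share:.3f}  target {target:.3f}")
```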
One thing we could do is to discard "unrepresentative" observations
from the sample so that the remaining observations accurately match the target population.
The price is a smaller sample, so we want to drop as few observations
as possible while still achieving a close match to the population.
A slightly more general approach is to not completely drop an observation,
but rather, reduce the weights given to "unrepresentative" observations.
Which weights should be adjusted, and by how much?
If the sample is not representative, we may want to choose a weight other than 1.0
for each observation so that the weighted fractions of the sample match the
population fractions. For our example there are 2 + 4 = 6 target fractions to
be matched. There are 1000 different weights to be chosen (or 8 weights if we
apply the same weight to all observations in the same cell), so there are lots of
different weight combinations that match the target population fractions.
Which weight combination should we choose?
The "Max Effective Sample Size" (MS) approach chooses the weights that
a) match the population targets, and
b) minimize the variance of the resulting estimate, or equivalently,
maximize the size of an equivalent representative sample that has
the same variance.
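
To make the idea concrete, here is a rough Python sketch. It is not the LINGO model itself, and it rests on several assumptions: one common weight per cell, the usual Kish definition of effective sample size, (sum of weights)^2 / (sum of squared weights), and no bounds on the weights. Because the rim targets fix the weighted total, maximizing the effective sample size then amounts to minimizing the sum of squared weights, and the first-order conditions of that problem give weights of the additive form w[i][j] = a[i] + b[j]. The sketch finds a and b by alternately solving the gender and income equations; all names are illustrative.

```python
# Sketch only: equality-constrained version of the problem, one weight per cell.
counts = [[220, 147, 106, 55],   # Female: Low, MediumLow, Medium, High
          [193, 149, 94, 36]]    # Male
n = sum(map(sum, counts))
row_target = [0.5 * n, 0.5 * n]                     # gender targets, as weighted counts
col_target = [0.4 * n, 0.3 * n, 0.2 * n, 0.1 * n]   # income targets, as weighted counts

rows, cols = range(2), range(4)
row_n = [sum(counts[i]) for i in rows]
col_n = [sum(counts[i][j] for i in rows) for j in cols]

# Assumed form of the minimum-variance weights: w[i][j] = a[i] + b[j].
# Alternately solve the gender equations for a and the income equations for b.
a, b = [0.0, 0.0], [1.0] * 4
for _ in range(200):
    a = [(row_target[i] - sum(counts[i][j] * b[j] for j in cols)) / row_n[i]
         for i in rows]
    b = [(col_target[j] - sum(counts[i][j] * a[i] for i in rows)) / col_n[j]
         for j in cols]

weights = [[a[i] + b[j] for j in cols] for i in rows]

def effective_sample_size(counts, weights):
    """Kish effective sample size with one weight per cell."""
    pairs = [(c, w) for cr, wr in zip(counts, weights) for c, w in zip(cr, wr)]
    total = sum(c * w for c, w in pairs)
    return total ** 2 / sum(c * w * w for c, w in pairs)

# Weighted counts achieved on each rim, for comparison with the targets.
row_fit = [sum(counts[i][j] * weights[i][j] for j in cols) for i in rows]
col_fit = [sum(counts[i][j] * weights[i][j] for i in rows) for j in cols]
ess = effective_sample_size(counts, weights)

print([[round(w, 3) for w in r] for r in weights])
print([round(x, 3) for x in row_fit], [round(x, 3) for x in col_fit])
print(round(ess, 1))
```

The sketch does not enforce nonnegative weights. With other data a weight could come out negative, in which case a solver that handles bounds would be needed.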

Here are some cell weights that perfectly match the rim targets.
            Low        MediumLow   Medium     High
Female:     0.31       1.065170    1.700566   1.726545
Male:       1.719171   0.96255     0.21       0.14
Notice that
(0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4
so the first income level is matched.
Are these the best possible weights? It can be shown that these
weights have an effective sample size of 729, considerably lower than
the unweighted sample size of 1000.
If, on the other hand, you choose the following weights:
            Low     MediumLow   Medium   High
Female:     0.731   1.182       1.444    0.228
Male:       1.240   0.847       0.499    2.430

they have an effective sample size of 843, considerably better than 729.
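
Both sets of weights can be checked with a few lines of Python. The sketch below assumes the usual Kish formula for effective sample size, (sum of weights)^2 / (sum of squared weights), with each cell weight applied to every observation in the cell; the function names are illustrative.

```python
# Cell counts: rows = (Female, Male), columns = (Low, MediumLow, Medium, High).
counts = [[220, 147, 106, 55],
          [193, 149, 94, 36]]

# The two sets of cell weights quoted in the text.
weight_sets = {
    "first":  [[0.31, 1.065170, 1.700566, 1.726545],
               [1.719171, 0.96255, 0.21, 0.14]],
    "second": [[0.731, 1.182, 1.444, 0.228],
               [1.240, 0.847, 0.499, 2.430]],
}

def kish_ess(counts, weights):
    """Kish effective sample size with one weight per cell (assumed definition)."""
    pairs = [(c, w) for cr, wr in zip(counts, weights) for c, w in zip(cr, wr)]
    total = sum(c * w for c, w in pairs)
    return total ** 2 / sum(c * w * w for c, w in pairs)

def weighted_margins(counts, weights):
    """Weighted counts per gender (rows) and per income level (columns)."""
    cells = [[c * w for c, w in zip(cr, wr)] for cr, wr in zip(counts, weights)]
    gender = [sum(row) for row in cells]
    income = [sum(col) for col in zip(*cells)]
    return gender, income

results = {}
for name, weights in weight_sets.items():
    gender, income = weighted_margins(counts, weights)
    results[name] = (gender, income, kish_ess(counts, weights))
    print(name,
          [round(g, 1) for g in gender],
          [round(x, 1) for x in income],
          round(results[name][2]))
```

Under that assumption the two sets come out at about 729 and 843 respectively, which agrees with the figures quoted above.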
There is a spectrum of intermediate choices, ranging from
   placing great importance on closely matching the population rim targets, to
   keeping the weights close to 1 at the cost of
   not exactly matching the population target fractions.
In either case, we want to minimize the variance of our estimator.
If you are willing to allow a very slight violation of the rim targets,
then the following weights:
            Low     MediumLow   Medium   High
Female:     0.940   0.971       0.96     1.059
Male:       1.015   1.045       1.038    1.134
have an effective sample size of 997, almost as good as 1000.
These weights match the rim targets fairly well:
   Achieved rim target percent
   Variable   Level   WgtPercent
   GENDER       1        51.0
   GENDER       2        49.0
   INCOME       1        40.3
   INCOME       2        29.9
   INCOME       3        20.0
   INCOME       4         9.9

For example: (0.940*220 + 1.015*193)/1000 = 402.695/1000 = 0.403, i.e., 40.3%.
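
The whole table of achieved percentages can be recomputed in the same way. The Python sketch below assumes that each percentage is the weighted count divided by the total weighted count (about 999.5 here, rather than exactly 1000). Because the printed weights are rounded to two or three decimals, the recomputed values can differ from the table in the last digit.

```python
# Cell counts and the slightly relaxed cell weights from the text.
counts = [[220, 147, 106, 55],
          [193, 149, 94, 36]]
weights = [[0.940, 0.971, 0.96, 1.059],
           [1.015, 1.045, 1.038, 1.134]]

# Weighted count in each cell and the overall weighted total.
cells = [[c * w for c, w in zip(cr, wr)] for cr, wr in zip(counts, weights)]
total = sum(map(sum, cells))

# Assumed definition: weighted share of each level, in percent.
gender_pct = [100 * sum(row) / total for row in cells]
income_pct = [100 * sum(col) / total for col in zip(*cells)]

for level, pct in enumerate(gender_pct, start=1):
    print(f"GENDER {level} {pct:.2f}")
for level, pct in enumerate(income_pct, start=1):
    print(f"INCOME {level} {pct:.2f}")
```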

Ref: Madansky, A. and L. Schrage (2017), "Maximizing effective sample size
while sample balancing," Quirks Marketing Research Review.

Keywords:

Marketing | Sampling | Statistics | Quadratic