SETS:
! The sample weighting problem. (SampBal.lng)
Suppose you have a sample of 1000 people, classified according to
the two dimensions:
Gender: female or male, and
Income: Low, MediumLow, Medium, or High
In our target population, the "rim" fractions to apply:
Female: 50%, Male: 50%,
LowIncome: 40%, MediumLow: 30%, Medium: 20%, High: 10%.
Our sample of 1000 has the following number in the various cells:
Low MediumLow Medium High
Female: 220 147 106 55
Male: 193 149 94 36
We want to predict a dependent variable, e.g., how the population will vote
on a particular issue.
We say the sample is representative if the fraction of the sample that is
of a given income level is the same as in the general population, the
fraction of the sample that are female is the same as in the target population, etc.
Our sample above is not representative of the target population in this sense.
Notice that (220 + 193)/1000 = 0.413 > 0.4 so the first income level is over-represented.
Similarly, (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5
so females are over-represented.
One thing we could do is to discard "unrepresentative" observations
from the sample so that the remaining observations accurately match the target population.
Thus, we have a smaller sample size. We want to drop the smallest number
of observations to achieve a close match to the population.
A slightly more general approach is to not completely drop an observation,
but rather, reduce the weights given to "unrepresentative" observations.
Which weights should be adjusted, and by how much?
If the sample is not representative, we may want to choose a weight other than 1.0
for each observation so that the weighted fractions of the sample match the
population fractions. For our example there are 2 + 4 = 6 target fractions to
be matched. There are 1000 different weights to be chosen (or 8 weights if we
apply the same weight to all observations in the same cell), so there are lots of
different weight combinations that match the target population fractions.
Which weight combination should we choose?
The "Max effective sample Size" (MS) approach chooses the weights that
a) match the population targets, and
b) minimize the variance of the resulting estimate, or equivalently,
maximize the size of an equivalent representative sample that has
the same variance.
Here are some cell weights that perfectly match the rim targets.
Low MediumLow Medium High
Female: 0.31 1.065170 1.700566 1.726545
Male: 1.719171 0.96255 0.21 0.14
Notice that
(0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4
so the first income level is matched.
Are these the best possible weights? It can be shown that these
weights have an effective sample size of 729, - considerably lower than
the unweighted sample size of 1000.
If on the other hand, you choose the following weights:
Low MediumLow Medium High
Female: 0.731 1.182 1.444 0.228
Male: 1.240 0.847 0.499 2.430
it has an effective sample size of 843. Considerably better than 729.
There are various intermediate things we could do in the sense we
could consider the spectrum ranging from
placing great importance on closely matching the population rim targets vs.
not letting the weights vary too much from 1 but
not exactly matching the population target fractions.
In either case, we want to minimize the variance of our estimator.
If you are willing to allow a very slight violation of the rim targets,
then the following weights:
Low MediumLow Medium High
Female: 0.940 0.971 0.96 1.059
Male: 1.015 1.045 1.038 1.134
have an effective sample size of 997, almost as good as 1000.
It does a fairly good job of matching the rim targets as follows:
Achieved rim target percent
Variable Level WgtPercent
GENDER 1 51.0
GENDER 2 49.0
INCOME 1 40.3
INCOME 2 29.9
INCOME 3 20.0
INCOME 4 9.9
For example: (0.940*220 + 1.015*193) = 402.695/1000 = 40.3.
! Ref: Madansky, A. and L. Schrage (2017), "Maximizing effective sample size
while sample balancing," Quirks Marketing Research Review;
! Keywords: Marketing, Quadratic optimization, Sampling, Statistics;
CELL :
count, HWgt, ! HWgt is just for comparison;
weightcell,
income, age, region, achvcountc;
variable;
level;
VXL( variable, level): targpcent, achvcount, sampno,
achvcounth;
CXV( CELL, variable): Lnum, VarLvl;
! Keywords: Sample balancing, Survey analysis, Re-weighting, Raking;
ENDSETS DATA:
!Case 2x4: We sampled 1000 people. There are two explanatory variabes,
Gender and Income level. In the target population we expect the percentages:
Female: 50%
Male: 50%
Income1: 40%
Income2: 30%
Income3: 20%
Income4: 10%
;
! Minimizing rim error is all that matters;
!Case2x4; alpha = 1
!Case2x4; level = 1 2 3 4
!Case2x4; variable = Gender Income
! The population rim target percentages for
each variable and combination of level;
!Case2x4;
VXL Targpcent =
Gender 1 50
Gender 2 50
Income 1 40
Income 2 30
Income 3 20
Income 4 10;
! Cell number and combination of levels sampled;
!Case2x4;
CELL VarLvl=
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
;
! Count in each cell & heuristic weights.
The actual numbers in our sample for each of
the 2 genders and 4 income levels are below.
Notice that (220 + 193)/1000 = 0.413 > 0.4 so
the first income level is over-represented.
Similarly, (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5
so the first gender is over-represented.;
!Case2x4;
count =
220 147 106 55
193 149 94 36
;
! Here are some cell weights that do perfectly
match the rim targets. Note that
(0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4
so the first income level is matched;
!Case2x4;
HWgt =
0.31 1.065170 1.700566 1.726545
1.719171 0.96255 0.21 0.14
;
! Various interesting settings of alpha;
!Case01 alpha = 1;
!Case01 level = 1 2 3 4 5 6 7 8 9 10;
!Case01 variable = income_ age_ region_;
! The population rim target percentages;
!Case01
VXL, targpcent =
income_ 1 17.95
income_ 2 23.20
income_ 3 27.28
income_ 4 14.34
income_ 5 17.23
age_ 1 3.87
age_ 2 9.07
age_ 3 19.09
age_ 4 21.60
age_ 5 18.02
age_ 6 6.44
age_ 7 5.17
age_ 8 8.80
age_ 9 5.91
age_ 10 2.03
region_ 1 5.14
region_ 2 14.21
region_ 3 16.44
region_ 4 7.14
region_ 5 19.05
region_ 6 6.30
region_ 7 10.91
region_ 8 6.40
region_ 9 14.41
;
! Sample data. This data set has
5 possible income levels,
10 age levels,
9 regions, giving
450 potential cells.
;
!Case01
CELL income age region count HWgt =
1 1 1 5 1 7.03868
2 1 2 3 5 2.64602
3 1 2 5 1 2.89112
4 1 2 6 1 2.83734
5 1 2 7 1 2.80156
6 1 3 1 1 1.44292
7 1 3 3 3 1.29764
8 1 3 4 4 0.84916
9 1 3 5 1 1.54273
10 1 3 6 2 1.48895
11 1 3 7 3 1.45317
12 1 3 8 2 1.27067
13 1 3 9 1 1.46797
14 1 4 1 2 1.1352
15 1 4 2 2 1.87272
16 1 4 3 3 0.98992
17 1 4 4 7 0.54144
18 1 4 5 11 1.23501
19 1 4 6 2 1.18123
20 1 4 7 4 1.14545
21 1 4 8 4 0.96296
22 1 4 9 5 1.16026
23 1 5 1 2 1.10644
24 1 5 2 3 1.84395
25 1 5 3 4 0.96116
26 1 5 4 2 0.51268
27 1 5 5 5 1.20625
28 1 5 6 1 1.15247
29 1 5 8 2 0.9342
30 1 5 9 1 1.13149
31 1 6 1 1 1.05367
32 1 6 2 1 1.79119
33 1 6 3 1 0.90839
34 1 6 4 2 0.45992
35 1 6 5 2 1.15349
36 1 6 6 1 1.09971
37 1 6 8 1 0.88143
38 1 6 9 1 1.07873
39 1 7 1 1 1.20926
40 1 7 2 1 1.94677
41 1 7 5 2 1.30907
42 1 7 6 1 1.25529
43 1 7 7 1 1.21951
44 1 7 9 1 1.23431
45 1 8 2 4 2.38434
46 1 8 3 2 1.50154
47 1 8 4 3 1.05307
48 1 8 5 1 1.74664
49 1 8 9 1 1.67188
50 1 9 3 3 2.45789
51 1 9 4 2 2.00941
52 1 9 5 2 2.70298
53 1 9 9 3 2.62822
54 2 2 3 1 2.84731
55 2 2 4 2 2.39883
56 2 2 5 3 3.0924
57 2 2 7 4 3.00284
58 2 2 8 1 2.82035
59 2 3 3 4 1.49892
60 2 3 4 4 1.05045
61 2 3 5 4 1.74402
62 2 3 6 4 1.69024
63 2 3 7 3 1.65446
64 2 3 8 4 1.47196
65 2 3 9 5 1.66926
66 2 4 1 3 1.33649
67 2 4 2 6 2.07401
68 2 4 3 8 1.19121
69 2 4 4 6 0.74273
70 2 4 5 6 1.4363
71 2 4 6 2 1.38252
72 2 4 7 4 1.34674
73 2 4 8 2 1.16425
74 2 4 9 3 1.36154
75 2 5 1 1 1.30773
76 2 5 2 1 2.04524
77 2 5 3 6 1.16245
78 2 5 4 5 0.71397
79 2 5 5 6 1.40754
80 2 5 6 2 1.35376
81 2 5 7 3 1.31798
82 2 5 8 2 1.13548
83 2 5 9 6 1.33278
84 2 6 1 1 1.25496
85 2 6 3 3 1.10968
86 2 6 4 1 0.6612
87 2 6 5 2 1.3547
88 2 6 6 2 1.30099
89 2 6 7 2 1.26521
90 2 6 8 2 1.08272
91 2 6 9 2 1.28002
92 2 7 8 1 1.2383
93 2 7 9 5 1.4356
94 2 8 2 1 2.58563
95 2 8 3 4 1.70283
96 2 8 4 1 1.25435
97 2 8 5 1 1.94792
98 2 8 6 2 1.89414
99 2 8 7 2 1.85836
100 2 8 8 1 1.67587
101 2 8 9 2 1.87317
102 2 9 3 1 2.65918
103 2 9 4 2 2.2107
104 2 10 3 1 5.3832
105 3 1 4 1 5.92843
106 3 2 1 1 2.37462
107 3 2 2 3 3.11213
108 3 2 5 2 2.47443
109 3 2 6 1 2.42065
110 3 2 8 1 2.20238
111 3 2 9 1 2.39967
112 3 3 1 3 1.02623
113 3 3 2 5 1.76375
114 3 3 3 16 0.88095
115 3 3 4 7 0.43247
116 3 3 5 9 1.12604
117 3 3 6 5 1.07226
118 3 3 7 7 1.03648
119 3 3 8 6 0.85399
120 3 3 9 11 1.05128
121 3 4 1 3 0.71852
122 3 4 2 5 1.45603
123 3 4 3 15 0.57324
124 3 4 4 13 0.12476
125 3 4 5 14 0.81833
126 3 4 6 3 0.76455
127 3 4 7 4 0.72877
128 3 4 8 8 0.54628
129 3 4 9 5 0.74357
130 3 5 1 2 0.68975
131 3 5 2 12 1.42727
132 3 5 3 16 0.54447
133 3 5 4 12 0.096
134 3 5 5 14 0.78956
135 3 5 6 3 0.73578
136 3 5 7 9 0.70001
137 3 5 8 3 0.51751
138 3 5 9 8 0.71481
139 3 6 1 1 0.63699
140 3 6 2 2 1.3745
141 3 6 3 5 0.49171
142 3 6 4 4 0.04323
143 3 6 5 7 0.7368
144 3 6 6 2 0.68302
145 3 6 7 2 0.64724
146 3 6 8 2 0.46475
147 3 6 9 2 0.66204
148 3 7 1 2 0.79257
149 3 7 2 3 1.53009
150 3 7 3 9 0.64729
151 3 7 4 5 0.19881
152 3 7 5 2 0.89238
153 3 7 6 2 0.8386
154 3 7 8 4 0.62033
155 3 7 9 2 0.81763
156 3 8 1 1 1.23014
157 3 8 2 1 1.96765
158 3 8 3 2 1.08486
159 3 8 4 3 0.63638
160 3 8 5 1 1.32995
161 3 8 8 2 1.0579
162 3 8 9 2 1.25519
163 3 9 2 1 2.924
164 3 9 3 1 2.0412
165 3 9 5 1 2.2863
166 3 9 8 1 2.01424
167 3 9 9 3 2.21154
168 3 10 4 1 4.31675
169 3 10 5 1 5.01032
170 4 1 2 1 7.0697
171 4 2 3 1 2.03933
172 4 2 5 1 2.28442
173 4 2 7 1 2.19486
174 4 3 1 5 0.83622
175 4 3 2 2 1.57374
176 4 3 3 4 0.69094
177 4 3 4 7 0.24247
178 4 3 5 7 0.93604
179 4 3 6 1 0.88226
180 4 3 7 3 0.84648
181 4 3 8 2 0.66398
182 4 3 9 10 0.86128
183 4 4 1 3 0.52851
184 4 4 2 8 1.26603
185 4 4 3 14 0.38323
186 4 4 4 5 0.001
187 4 4 5 8 0.62832
188 4 4 6 2 0.57454
189 4 4 7 7 0.53876
190 4 4 8 8 0.35627
191 4 4 9 9 0.55357
192 4 5 1 3 0.49975
193 4 5 2 4 1.23726
194 4 5 3 17 0.35447
195 4 5 4 7 0.001
196 4 5 5 11 0.59956
197 4 5 6 2 0.54578
198 4 5 7 6 0.51
199 4 5 8 3 0.32751
200 4 5 9 6 0.5248
201 4 6 1 2 0.44698
202 4 6 3 7 0.3017
203 4 6 5 3 0.54679
204 4 6 7 4 0.45723
205 4 6 8 4 0.27474
206 4 6 9 4 0.47204
207 4 7 3 2 0.45729
208 4 7 4 2 0.00881
209 4 7 5 1 0.70238
210 4 7 9 1 0.62762
211 4 8 1 1 1.04013
212 4 8 2 1 1.77765
213 4 8 3 2 0.89485
214 4 8 4 2 0.44637
215 4 8 5 8 1.13994
216 4 8 6 2 1.08616
217 4 8 7 1 1.05038
218 4 8 8 2 0.86789
219 4 8 9 2 1.06519
220 4 9 4 1 1.40272
221 4 9 6 1 2.04251
222 5 1 3 1 6.31875
223 5 1 4 1 5.87028
224 5 1 7 1 6.47429
225 5 2 1 1 2.31647
226 5 2 7 1 2.32672
227 5 2 9 1 2.34152
228 5 3 1 2 0.96808
229 5 3 3 5 0.8228
230 5 3 4 4 0.37432
231 5 3 5 3 1.06789
232 5 3 6 1 1.01411
233 5 3 7 6 0.97833
234 5 3 8 3 0.79584
235 5 3 9 6 0.99313
236 5 4 1 3 0.66037
237 5 4 2 4 1.39788
238 5 4 3 5 0.51509
239 5 4 4 6 0.06661
240 5 4 5 11 0.76018
241 5 4 6 4 0.7064
242 5 4 7 10 0.67062
243 5 4 8 5 0.48812
244 5 4 9 14 0.68542
245 5 5 1 4 0.6316
246 5 5 2 4 1.36912
247 5 5 3 15 0.48632
248 5 5 4 4 0.03784
249 5 5 5 11 0.73141
250 5 5 6 5 0.67763
251 5 5 7 7 0.64185
252 5 5 8 2 0.45936
253 5 5 9 6 0.65666
254 5 6 2 3 1.31635
255 5 6 3 1 0.43356
256 5 6 4 1 0.001
257 5 6 5 5 0.67865
258 5 6 6 1 0.62487
259 5 6 7 1 0.58909
260 5 6 8 2 0.4066
261 5 6 9 2 0.60389
262 5 7 1 4 0.73442
263 5 7 4 1 0.1406
264 5 7 7 5 0.74467
265 5 7 8 2 0.56218
266 5 7 9 4 0.75948
267 5 8 1 1 1.17199
268 5 8 5 1 1.2718
269 5 8 7 1 1.18224
270 5 8 9 4 1.19704
271 5 9 1 1 2.12833
272 5 9 4 1 1.53458
273 5 9 6 1 2.17437
274 5 9 9 1 2.15339
275 5 10 2 1 5.58987
;
! Sample data. This data set has
5 possible income levels,
10 age levels,
9 regions, giving
450 potential cells.
;
!CaseOK alpha = 1;
!CaseOK level = 1 2 3 4 5 6 7 8 9 10;
!CaseOK variable = income_ age_ region_;
! The population rim target percentages;
!CaseOK
VXL, targpcent =
income_ 1 17.95
income_ 2 23.20
income_ 3 27.28
income_ 4 14.34
income_ 5 17.23
age_ 1 3.87
age_ 2 9.07
age_ 3 19.09
age_ 4 21.60
age_ 5 18.02
age_ 6 6.44
age_ 7 5.17
age_ 8 8.80
age_ 9 5.91
age_ 10 2.03
region_ 1 5.14
region_ 2 14.21
region_ 3 16.44
region_ 4 7.14
region_ 5 19.05
region_ 6 6.30
region_ 7 10.91
region_ 8 6.40
region_ 9 14.41
;
! Cell number and combination of levels sampled;
!CaseOK
CELL VarLvl=
1 1 1 5
2 1 2 3
3 1 2 5
4 1 2 6
5 1 2 7
6 1 3 1
7 1 3 3
8 1 3 4
9 1 3 5
10 1 3 6
11 1 3 7
12 1 3 8
13 1 3 9
14 1 4 1
15 1 4 2
16 1 4 3
17 1 4 4
18 1 4 5
19 1 4 6
20 1 4 7
21 1 4 8
22 1 4 9
23 1 5 1
24 1 5 2
25 1 5 3
26 1 5 4
27 1 5 5
28 1 5 6
29 1 5 8
30 1 5 9
31 1 6 1
32 1 6 2
33 1 6 3
34 1 6 4
35 1 6 5
36 1 6 6
37 1 6 8
38 1 6 9
39 1 7 1
40 1 7 2
41 1 7 5
42 1 7 6
43 1 7 7
44 1 7 9
45 1 8 2
46 1 8 3
47 1 8 4
48 1 8 5
49 1 8 9
50 1 9 3
51 1 9 4
52 1 9 5
53 1 9 9
54 2 2 3
55 2 2 4
56 2 2 5
57 2 2 7
58 2 2 8
59 2 3 3
60 2 3 4
61 2 3 5
62 2 3 6
63 2 3 7
64 2 3 8
65 2 3 9
66 2 4 1
67 2 4 2
68 2 4 3
69 2 4 4
70 2 4 5
71 2 4 6
72 2 4 7
73 2 4 8
74 2 4 9
75 2 5 1
76 2 5 2
77 2 5 3
78 2 5 4
79 2 5 5
80 2 5 6
81 2 5 7
82 2 5 8
83 2 5 9
84 2 6 1
85 2 6 3
86 2 6 4
87 2 6 5
88 2 6 6
89 2 6 7
90 2 6 8
91 2 6 9
92 2 7 8
93 2 7 9
94 2 8 2
95 2 8 3
96 2 8 4
97 2 8 5
98 2 8 6
99 2 8 7
100 2 8 8
101 2 8 9
102 2 9 3
103 2 9 4
104 2 10 3
105 3 1 4
106 3 2 1
107 3 2 2
108 3 2 5
109 3 2 6
110 3 2 8
111 3 2 9
112 3 3 1
113 3 3 2
114 3 3 3
115 3 3 4
116 3 3 5
117 3 3 6
118 3 3 7
119 3 3 8
120 3 3 9
121 3 4 1
122 3 4 2
123 3 4 3
124 3 4 4
125 3 4 5
126 3 4 6
127 3 4 7
128 3 4 8
129 3 4 9
130 3 5 1
131 3 5 2
132 3 5 3
133 3 5 4
134 3 5 5
135 3 5 6
136 3 5 7
137 3 5 8
138 3 5 9
139 3 6 1
140 3 6 2
141 3 6 3
142 3 6 4
143 3 6 5
144 3 6 6
145 3 6 7
146 3 6 8
147 3 6 9
148 3 7 1
149 3 7 2
150 3 7 3
151 3 7 4
152 3 7 5
153 3 7 6
154 3 7 8
155 3 7 9
156 3 8 1
157 3 8 2
158 3 8 3
159 3 8 4
160 3 8 5
161 3 8 8
162 3 8 9
163 3 9 2
164 3 9 3
165 3 9 5
166 3 9 8
167 3 9 9
168 3 10 4
169 3 10 5
170 4 1 2
171 4 2 3
172 4 2 5
173 4 2 7
174 4 3 1
175 4 3 2
176 4 3 3
177 4 3 4
178 4 3 5
179 4 3 6
180 4 3 7
181 4 3 8
182 4 3 9
183 4 4 1
184 4 4 2
185 4 4 3
186 4 4 4
187 4 4 5
188 4 4 6
189 4 4 7
190 4 4 8
191 4 4 9
192 4 5 1
193 4 5 2
194 4 5 3
195 4 5 4
196 4 5 5
197 4 5 6
198 4 5 7
199 4 5 8
200 4 5 9
201 4 6 1
202 4 6 3
203 4 6 5
204 4 6 7
205 4 6 8
206 4 6 9
207 4 7 3
208 4 7 4
209 4 7 5
210 4 7 9
211 4 8 1
212 4 8 2
213 4 8 3
214 4 8 4
215 4 8 5
216 4 8 6
217 4 8 7
218 4 8 8
219 4 8 9
220 4 9 4
221 4 9 6
222 5 1 3
223 5 1 4
224 5 1 7
225 5 2 1
226 5 2 7
227 5 2 9
228 5 3 1
229 5 3 3
230 5 3 4
231 5 3 5
232 5 3 6
233 5 3 7
234 5 3 8
235 5 3 9
236 5 4 1
237 5 4 2
238 5 4 3
239 5 4 4
240 5 4 5
241 5 4 6
242 5 4 7
243 5 4 8
244 5 4 9
245 5 5 1
246 5 5 2
247 5 5 3
248 5 5 4
249 5 5 5
250 5 5 6
251 5 5 7
252 5 5 8
253 5 5 9
254 5 6 2
255 5 6 3
256 5 6 4
257 5 6 5
258 5 6 6
259 5 6 7
260 5 6 8
261 5 6 9
262 5 7 1
263 5 7 4
264 5 7 7
265 5 7 8
266 5 7 9
267 5 8 1
268 5 8 5
269 5 8 7
270 5 8 9
271 5 9 1
272 5 9 4
273 5 9 6
274 5 9 9
275 5 10 2 ;
!CaseOK
count HWgt =
1 7.031564
5 2.643345
1 2.888197
1 2.834471
1 2.798728
1 1.441461
3 1.296328
4 0.848302
1 1.54117
2 1.487445
3 1.451701
2 1.269385
1 1.466486
2 1.134052
2 1.870827
3 0.988919
7 0.540893
11 1.233761
2 1.180036
4 1.144292
4 0.961986
5 1.159087
2 1.105321
3 1.842086
4 0.960188
2 0.512162
5 1.20503
1 1.151305
2 0.933256
1 1.130346
1 1.052605
1 1.789379
1 0.907472
2 0.459455
2 1.152324
1 1.098598
1 0.880539
1 1.077639
1 1.208037
1 1.944802
2 1.307747
1 1.254021
1 1.218277
1 1.233062
4 2.381929
2 1.500022
3 1.052005
1 1.744874
1 1.67019
3 2.455405
2 2.007379
2 2.700247
3 2.625563
1 2.844431
2 2.396405
3 3.089274
4 2.999804
1 2.817499
4 1.497405
4 1.049388
4 1.742257
4 1.688531
3 1.652787
4 1.470472
5 1.667572
3 1.335139
6 2.071913
8 1.190006
6 0.741979
6 1.434848
2 1.381122
4 1.345378
2 1.163073
3 1.360164
1 1.306408
1 2.043172
6 1.161275
5 0.713248
6 1.406117
2 1.352391
3 1.316648
2 1.134332
6 1.331433
1 1.253691
3 1.108558
1 0.660532
2 1.35333
2 1.299675
2 1.263931
2 1.081625
2 1.278726
1 1.237048
5 1.434149
1 2.583016
4 1.701108
1 1.253082
1 1.945951
2 1.892225
2 1.856481
1 1.674176
2 1.871276
1 2.656492
2 2.208465
1 5.377758
1 5.922436
1 2.372219
3 3.108984
2 2.471928
1 2.418203
1 2.200153
1 2.397244
3 1.025192
5 1.761967
16 0.880059
7 0.432033
9 1.124902
5 1.071176
7 1.035432
6 0.853127
11 1.050217
3 0.717794
5 1.454558
15 0.57266
13 0.124634
14 0.817503
3 0.763777
4 0.728033
8 0.545728
5 0.742818
2 0.689053
12 1.425827
16 0.54392
12 0.095903
14 0.788762
3 0.735036
9 0.699302
3 0.516987
8 0.714087
1 0.636346
2 1.37311
5 0.491213
4 0.043186
7 0.736055
2 0.682329
2 0.646586
2 0.46428
2 0.661371
2 0.791769
3 1.528543
9 0.646636
5 0.198609
2 0.891478
2 0.837752
4 0.619703
2 0.816803
1 1.228896
1 1.965661
2 1.083763
3 0.635737
1 1.328605
2 1.05683
2 1.253921
1 2.921044
1 2.039136
1 2.283989
1 2.012204
3 2.209304
1 4.312386
1 5.005255
1 7.062553
1 2.037268
1 2.28211
1 2.192641
5 0.835375
2 1.572149
4 0.690241
7 0.242225
7 0.935094
1 0.881368
3 0.845624
2 0.663309
10 0.860409
3 0.527976
8 1.26475
14 0.382843
5 0.000999
8 0.627685
2 0.573959
7 0.538215
8 0.35591
9 0.55301
3 0.499245
4 1.236009
17 0.354112
7 0.000999
11 0.598954
2 0.545228
6 0.509484
3 0.327179
6 0.524269
2 0.446528
7 0.301395
3 0.546237
4 0.456768
4 0.274462
4 0.471563
2 0.456828
2 0.008801
1 0.70167
1 0.626985
1 1.039078
1 1.775853
2 0.893945
2 0.445919
8 1.138788
2 1.085062
1 1.049318
2 0.867013
2 1.064113
1 1.401302
1 2.040445
1 6.312362
1 5.864345
1 6.467745
1 2.314128
1 2.324368
1 2.339153
2 0.967101
5 0.821968
4 0.373942
3 1.06681
1 1.013085
6 0.977341
3 0.795035
6 0.992126
3 0.659702
4 1.396467
5 0.514569
6 0.066543
11 0.759411
4 0.705686
10 0.669942
5 0.487627
14 0.684727
4 0.630961
4 1.367736
15 0.485828
4 0.037802
11 0.730671
5 0.676945
7 0.641201
2 0.458896
6 0.655996
3 1.315019
1 0.433122
1 0.000999
5 0.677964
1 0.624238
1 0.588494
2 0.406189
2 0.603279
4 0.733678
1 0.140458
5 0.743917
2 0.561612
4 0.758712
1 1.170805
1 1.270514
1 1.181045
4 1.19583
1 2.126178
1 1.533029
1 2.172172
1 2.151213
1 5.584219
;
! Data set Dorofeev and Grant, suggests negative weights;
!CaseDG alpha = 1;
!CaseDG level = 1 2 3 ;
!CaseDG variable = EXV1 EXV2;
! The population rim target percentages for
each variable and combination of level;
!CaseDG
VXL Targpcent =
EXV1 1 55.5556
EXV1 2 33.3333
EXV1 3 11.1111
EXV2 1 22.2222
EXV2 2 33.3333
EXV2 3 44.4444 ;
! Cell number and combination of levels sampled;
!CaseDG
CELL VarLvl=
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3 ;
! Count in each cell & heuristic weights;
!CaseDG
count HWgt =
5 0
7 1.06881
10 1.76551
3 4.91677
0 0
0 0
9 0
10 0.417
1 0.94301 ;
ENDDATA
SUBMODEL samplebal:
! If alpha = 0, we get straight sampling, alpha = 1: weighted sampling with perfect match
if we allow weightcell < 0;
min = alphat* rimerr +(1- alphat)* cellerr;
rimerr <= rimerrUL; ! In case we want to constrain rimerr;
cellerr <= cellerrUL; ! In case we want to constrain rimerr;
! For each variable i and level j, compute the achieved count
when weightcell( c), is applied to each observation in cell c;
! For all explanatory variables i and their levels j
the achieved count over all cells sampled is... ;
@FOR( VXL( i, j) :
[ACNT] achvcount( i, j) = @SUM( CELL( c) | VarLvl( c, i) #EQ# j : weightcell( c) * count( c))
);
! Special case. If a cell is empty, its weight = 0;
@FOR( CELL( c) | count( c) #EQ# 0: weightcell( c) = 0);
! Add this if we want the estimator to be unbiased;
[SIZE] @SUM( CELL( c): weightcell( c) * count( c)) = sampsize;
! Compute rim error measure. Should watch out for sampno( i, j) = 0;
[CRIMER] rimerr = @SUM( VXL( i, j) :
(( achvcount( i, j) - targpcent( i, j)* sampsize/100) / sampno( i, j))^2 );
! Compute cost of increased variance;
[CCELLR] cellerr = @SUM( CELL( c): count( c)*(1 - weightcell( c))^2);
! Standard formula for rim error;
! rimerrstd = (rimerr/m)^0.5;
! @free(rimerrstd);
ENDSUBMODEL
CALC:
@SET( 'TERSEO',2); ! Output level (0:verb, 1:terse, 2:only errors, 3:none);
m = @size( VXL); ! number of rim targets;
sampsize = @SUM( CELL( c): count( c));
! Compute number in sample for each variable i that have level j;
@FOR( VXL( i, j):
sampno( i, j) = @SUM( CELL( c) | VarLvl( c, i) #eq# j: count( c));
);
! Compute achieved counts for the heuristic weights;
! Add this to check the heuristic weights estimator to be unbiased;
sampsizeh = @SUM( CELL( c): HWgt( c) * count( c));
alphat = alpha; ! Set temporary alpha = alpha;
@SOLVE( samplebal);
! In case alpha = 1, minimize other error as secondary objective;
@IFC( alpha #EQ# 1:
alphat = 0;
rimerrUL = rimerr;
@SOLVE( samplebal);
);
! In case alpha = 0, minimize other error as secondary objective;
@IFC( alpha #EQ# 0:
alphat = 1;
cellerrUL = cellerr;
@SOLVE( samplebal);
);
! Compute standard error measure: (Sum of squared rim errors/NumberRimTargets)^0.5;
rimerrstd = ( rimerr/ m)^0.5;
! Compute cell error measure for existing heuristic method;
cellHeur = @SUM( CELL( c): count( c)*(1 - HWgt( c))^2);
@WRITE(' Input DATA:', @NEWLINE( 1));
@FOR( VXL( i, j) :
achvcounth( i, j) = @SUM( CELL( c) | VarLvl( c, i) #EQ# j : HWgt( c) * count( c))
);
! Compute rim error measure. Should watch out for sampno( i, j) = 0;
rimerrh = @SUM( VXL( i, j) : (( achvcounth( i, j) - targpcent( i, j)* sampsize/100) / sampno( i, j))^2 );
rimerrstdh = ( rimerrh/ m)^0.5;
rimerrstdh = ( rimerrh/ m)^0.5;
@WRITE( @FORMAT( @SIZE( variable),'12.0f'),' = number explanatory variables', @NEWLINE(1));
@WRITE( @FORMAT( @SIZE( CELL),'12.0f'),' = number cells with count > 0', @NEWLINE(1));
@WRITE( @FORMAT( m,'12.0f'),' = number brackets over all variables', @NEWLINE(1));
@WRITE( @FORMAT( sampsize,'12.0f'),' = sample size(sum of the counts)', @NEWLINE(1));
EffSmpSzH = ( sampsize^2)/( sampsize + cellHeur);
@WRITE( @FORMAT( EffSmpSzH,'12.2f'), ' = Effective sample size of heuristic wgts', @NEWLINE(1));
@WRITE( @FORMAT( rimerrstdh,'12.5f'), ' = root mean standardized rim error of heuristic weights', @NEWLINE(1));
@WRITE( @NEWLINE( 1));
@WRITE(' Sample Balancing Results', @NEWLINE(1));
@WRITE( @FORMAT( alpha,'12.4f'),' = alpha ( weight on rim errors, 1-alpha= weight on max sample size) ',
@NEWLINE(1));
! Report effective sample size;
EffSmpSz = ( sampsize^2)/( sampsize + cellerr);
@WRITE( @FORMAT( EffSmpSz,'12.2f'), ' = Effective sample size (Optimal for given alpha) ', @NEWLINE(1));
@WRITE( @FORMAT( rimerrstd,'12.5f'), ' = root mean standardized rim error', @NEWLINE(1));
@WRITE( @FORMAT( cellerr, '12.3f'), ' = individual weight squared deviation from 1.0 ', @NEWLINE(1));
@WRITE(' CELL ');
@FOR( variable( v):
@WRITE( @FORMAT( variable( v),'9s'));
);
@WRITE( ' Count Weight ', @NEWLINE( 1));
@FOR( CELL( c) :
@WRITE( @FORMAT( c,'5.0f'),' ');
@FOR( variable( v):
@WRITE( @FORMAT( VarLvl( c, v),'9.0f'))
);
@WRITE( @FORMAT( count( c),'8.0f'), ' ', @FORMAT( weightcell( c),'10.3f'),@NEWLINE(1));
);
@WRITE(@NEWLINE( 1), ' Achieved rim target percent', @NEWLINE( 1));
@WRITE(' Variable Level WgtPercent', @NEWLINE( 1));
@FOR( VXL( i, j) :
@WRITE( @FORMAT( Variable( i),'9s'), ' ', @FORMAT( j,'4.0f'),
' ', @FORMAT( 100* achvcount( i, j)/Sampsize,'7.1f'), @NEWLINE( 1));
);
ENDCALC
|