```
SETS:
! The sample weighting problem. (SampBal.lng)
  Suppose you have a sample of 1000 people, classified
  along two dimensions:
    Gender: female or male, and
    Income: Low, MediumLow, Medium, or High.
  In our target population, the "rim" fractions to match are:
    Female: 50%, Male: 50%,
    LowIncome: 40%, MediumLow: 30%, Medium: 20%, High: 10%.
  Our sample of 1000 has the following counts in the various cells:
               Low  MediumLow  Medium  High
    Female:    220        147     106    55
    Male:      193        149      94    36
  We want to predict a dependent variable, e.g., how
  the population will vote on a particular issue.
  We say the sample is representative if the fraction of the
  sample at a given income level is the same as in the target
  population, the fraction of the sample that is female is the
  same as in the target population, etc.
  Our sample above is not representative of the target
  population in this sense. Notice that
    (220 + 193)/1000 = 0.413 > 0.4
  so the first income level is over-represented. Similarly,
    (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5
  so females are over-represented.
  One thing we could do is discard "unrepresentative"
  observations so that the remaining observations accurately
  match the target population. This leaves a smaller sample,
  so we want to drop as few observations as possible
  while achieving a close match to the population.
  A slightly more general approach is not to drop an
  observation completely but rather to reduce the weight
  given to "unrepresentative" observations.
  Which weights should be adjusted, and by how much?
  If the sample is not representative, we may want to choose
  a weight other than 1.0 for each observation so that the
  weighted fractions of the sample match the population fractions.
  For our example there are 2 + 4 = 6 target fractions to be matched.
  There are 1000 different weights to be chosen (or 8 weights
  if we apply the same weight to all observations in the same
  cell), so there are many different weight combinations that
  match the target population fractions.
  Which weight combination should we choose?
  The "Max effective sample Size" (MS) approach chooses
  the weights that
    a) match the population targets, and
    b) minimize the variance of the resulting estimate, or
       equivalently, maximize the size of an equivalent
       representative sample that has the same variance.
  Here are some cell weights that perfectly match the rim targets:
               Low       MediumLow  Medium    High
    Female:    0.31      1.065170   1.700566  1.726545
    Male:      1.719171  0.96255    0.21      0.14
  Notice that
    (0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4
  so the first income level is matched.
  Are these the best possible weights?
  It can be shown that these weights have an effective sample
  size of 729, considerably lower than the unweighted sample
  size of 1000.
  If, on the other hand, you choose the following weights:
               Low    MediumLow  Medium  High
    Female:    0.731  1.182      1.444   0.228
    Male:      1.240  0.847      0.499   2.430
  they have an effective sample size of 843,
  considerably better than 729.
  There is a spectrum of intermediate choices, ranging from
  placing great importance on closely matching the population
  rim targets to keeping the weights close to 1 at the cost of
  not exactly matching the population target fractions.
  In either case, we want to minimize the variance of our estimator.
  If you are willing to allow a very slight violation of the
  rim targets, then the following weights:
               Low    MediumLow  Medium  High
    Female:    0.940  0.971      0.96    1.059
    Male:      1.015  1.045      1.038   1.134
  have an effective sample size of 997, almost as good as 1000.
  These weights do a fairly good job of matching the rim targets:
    Achieved rim target percent
    Variable   Level  WgtPercent
    GENDER       1      51.0
    GENDER       2      49.0
    INCOME       1      40.3
    INCOME       2      29.9
    INCOME       3      20.0
    INCOME       4       9.9
  For example:
    (0.940*220 + 1.015*193)/1000 = 402.695/1000 = 0.4027, i.e., 40.3%.
! Ref: Madansky, A. and L. Schrage (2017), "Maximizing effective
  sample size while sample balancing,"
  Quirks Marketing Research Review;
! Keywords: Marketing, Quadratic optimization, Sampling, Statistics;
 CELL : count, HWgt, ! HWgt is just for comparison;
        weightcell, income, age, region, achvcountc;
 variable;
 level;
 VXL( variable, level): targpcent, achvcount, sampno, achvcounth;
 CXV( CELL, variable): Lnum, VarLvl;
! Keywords: Sample balancing, Survey analysis, Re-weighting, Raking;
ENDSETS

DATA:
!Case 2x4: We sampled 1000 people. There are two explanatory
  variables, Gender and Income level.
  In the target population we expect the percentages:
    Female: 50%   Male: 50%
    Income1: 40%  Income2: 30%  Income3: 20%  Income4: 10% ;
! Minimizing rim error is all that matters;
!Case2x4; alpha = 1;
!Case2x4; level = 1 2 3 4;
!Case2x4; variable = Gender Income;
! The population rim target percentages for each
  variable and combination of level;
!Case2x4; VXL Targpcent =
   Gender 1 50
   Gender 2 50
   Income 1 40
   Income 2 30
   Income 3 20
   Income 4 10;
! Cell number and combination of levels sampled;
!Case2x4; CELL VarLvl=
   1 1 1
   2 1 2
   3 1 3
   4 1 4
   5 2 1
   6 2 2
   7 2 3
   8 2 4 ;
! Count in each cell & heuristic weights.
  The actual numbers in our sample for each of the 2 genders
  and 4 income levels are below. Notice that
    (220 + 193)/1000 = 0.413 > 0.4
  so the first income level is over-represented. Similarly,
    (220 + 147 + 106 + 55)/1000 = 0.528 > 0.5
  so the first gender is over-represented.;
!Case2x4; count = 220 147 106 55 193 149 94 36 ;
! Here are some cell weights that do perfectly match the rim targets.
Note that (0.31*220 + 1.719171*193)/1000 = 400/1000 = 0.4 so the first income level is matched; !Case2x4; HWgt = 0.31 1.065170 1.700566 1.726545 1.719171 0.96255 0.21 0.14 ; ! Various interesting settings of alpha; !Case01 alpha = 1; !Case01 level = 1 2 3 4 5 6 7 8 9 10; !Case01 variable = income_ age_ region_; ! The population rim target percentages; !Case01 VXL, targpcent = income_ 1 17.95 income_ 2 23.20 income_ 3 27.28 income_ 4 14.34 income_ 5 17.23 age_ 1 3.87 age_ 2 9.07 age_ 3 19.09 age_ 4 21.60 age_ 5 18.02 age_ 6 6.44 age_ 7 5.17 age_ 8 8.80 age_ 9 5.91 age_ 10 2.03 region_ 1 5.14 region_ 2 14.21 region_ 3 16.44 region_ 4 7.14 region_ 5 19.05 region_ 6 6.30 region_ 7 10.91 region_ 8 6.40 region_ 9 14.41 ; ! Sample data. This data set has 5 possible income levels, 10 age levels, 9 regions, giving 450 potential cells. ; !Case01 CELL income age region count HWgt = 1 1 1 5 1 7.03868 2 1 2 3 5 2.64602 3 1 2 5 1 2.89112 4 1 2 6 1 2.83734 5 1 2 7 1 2.80156 6 1 3 1 1 1.44292 7 1 3 3 3 1.29764 8 1 3 4 4 0.84916 9 1 3 5 1 1.54273 10 1 3 6 2 1.48895 11 1 3 7 3 1.45317 12 1 3 8 2 1.27067 13 1 3 9 1 1.46797 14 1 4 1 2 1.1352 15 1 4 2 2 1.87272 16 1 4 3 3 0.98992 17 1 4 4 7 0.54144 18 1 4 5 11 1.23501 19 1 4 6 2 1.18123 20 1 4 7 4 1.14545 21 1 4 8 4 0.96296 22 1 4 9 5 1.16026 23 1 5 1 2 1.10644 24 1 5 2 3 1.84395 25 1 5 3 4 0.96116 26 1 5 4 2 0.51268 27 1 5 5 5 1.20625 28 1 5 6 1 1.15247 29 1 5 8 2 0.9342 30 1 5 9 1 1.13149 31 1 6 1 1 1.05367 32 1 6 2 1 1.79119 33 1 6 3 1 0.90839 34 1 6 4 2 0.45992 35 1 6 5 2 1.15349 36 1 6 6 1 1.09971 37 1 6 8 1 0.88143 38 1 6 9 1 1.07873 39 1 7 1 1 1.20926 40 1 7 2 1 1.94677 41 1 7 5 2 1.30907 42 1 7 6 1 1.25529 43 1 7 7 1 1.21951 44 1 7 9 1 1.23431 45 1 8 2 4 2.38434 46 1 8 3 2 1.50154 47 1 8 4 3 1.05307 48 1 8 5 1 1.74664 49 1 8 9 1 1.67188 50 1 9 3 3 2.45789 51 1 9 4 2 2.00941 52 1 9 5 2 2.70298 53 1 9 9 3 2.62822 54 2 2 3 1 2.84731 55 2 2 4 2 2.39883 56 2 2 5 3 3.0924 57 2 2 7 4 3.00284 58 2 2 8 1 2.82035 59 2 3 3 4 1.49892 60 2 
3 4 4 1.05045 61 2 3 5 4 1.74402 62 2 3 6 4 1.69024 63 2 3 7 3 1.65446 64 2 3 8 4 1.47196 65 2 3 9 5 1.66926 66 2 4 1 3 1.33649 67 2 4 2 6 2.07401 68 2 4 3 8 1.19121 69 2 4 4 6 0.74273 70 2 4 5 6 1.4363 71 2 4 6 2 1.38252 72 2 4 7 4 1.34674 73 2 4 8 2 1.16425 74 2 4 9 3 1.36154 75 2 5 1 1 1.30773 76 2 5 2 1 2.04524 77 2 5 3 6 1.16245 78 2 5 4 5 0.71397 79 2 5 5 6 1.40754 80 2 5 6 2 1.35376 81 2 5 7 3 1.31798 82 2 5 8 2 1.13548 83 2 5 9 6 1.33278 84 2 6 1 1 1.25496 85 2 6 3 3 1.10968 86 2 6 4 1 0.6612 87 2 6 5 2 1.3547 88 2 6 6 2 1.30099 89 2 6 7 2 1.26521 90 2 6 8 2 1.08272 91 2 6 9 2 1.28002 92 2 7 8 1 1.2383 93 2 7 9 5 1.4356 94 2 8 2 1 2.58563 95 2 8 3 4 1.70283 96 2 8 4 1 1.25435 97 2 8 5 1 1.94792 98 2 8 6 2 1.89414 99 2 8 7 2 1.85836 100 2 8 8 1 1.67587 101 2 8 9 2 1.87317 102 2 9 3 1 2.65918 103 2 9 4 2 2.2107 104 2 10 3 1 5.3832 105 3 1 4 1 5.92843 106 3 2 1 1 2.37462 107 3 2 2 3 3.11213 108 3 2 5 2 2.47443 109 3 2 6 1 2.42065 110 3 2 8 1 2.20238 111 3 2 9 1 2.39967 112 3 3 1 3 1.02623 113 3 3 2 5 1.76375 114 3 3 3 16 0.88095 115 3 3 4 7 0.43247 116 3 3 5 9 1.12604 117 3 3 6 5 1.07226 118 3 3 7 7 1.03648 119 3 3 8 6 0.85399 120 3 3 9 11 1.05128 121 3 4 1 3 0.71852 122 3 4 2 5 1.45603 123 3 4 3 15 0.57324 124 3 4 4 13 0.12476 125 3 4 5 14 0.81833 126 3 4 6 3 0.76455 127 3 4 7 4 0.72877 128 3 4 8 8 0.54628 129 3 4 9 5 0.74357 130 3 5 1 2 0.68975 131 3 5 2 12 1.42727 132 3 5 3 16 0.54447 133 3 5 4 12 0.096 134 3 5 5 14 0.78956 135 3 5 6 3 0.73578 136 3 5 7 9 0.70001 137 3 5 8 3 0.51751 138 3 5 9 8 0.71481 139 3 6 1 1 0.63699 140 3 6 2 2 1.3745 141 3 6 3 5 0.49171 142 3 6 4 4 0.04323 143 3 6 5 7 0.7368 144 3 6 6 2 0.68302 145 3 6 7 2 0.64724 146 3 6 8 2 0.46475 147 3 6 9 2 0.66204 148 3 7 1 2 0.79257 149 3 7 2 3 1.53009 150 3 7 3 9 0.64729 151 3 7 4 5 0.19881 152 3 7 5 2 0.89238 153 3 7 6 2 0.8386 154 3 7 8 4 0.62033 155 3 7 9 2 0.81763 156 3 8 1 1 1.23014 157 3 8 2 1 1.96765 158 3 8 3 2 1.08486 159 3 8 4 3 0.63638 160 3 8 5 1 1.32995 161 3 8 8 2 1.0579 162 3 8 
9 2 1.25519 163 3 9 2 1 2.924 164 3 9 3 1 2.0412 165 3 9 5 1 2.2863 166 3 9 8 1 2.01424 167 3 9 9 3 2.21154 168 3 10 4 1 4.31675 169 3 10 5 1 5.01032 170 4 1 2 1 7.0697 171 4 2 3 1 2.03933 172 4 2 5 1 2.28442 173 4 2 7 1 2.19486 174 4 3 1 5 0.83622 175 4 3 2 2 1.57374 176 4 3 3 4 0.69094 177 4 3 4 7 0.24247 178 4 3 5 7 0.93604 179 4 3 6 1 0.88226 180 4 3 7 3 0.84648 181 4 3 8 2 0.66398 182 4 3 9 10 0.86128 183 4 4 1 3 0.52851 184 4 4 2 8 1.26603 185 4 4 3 14 0.38323 186 4 4 4 5 0.001 187 4 4 5 8 0.62832 188 4 4 6 2 0.57454 189 4 4 7 7 0.53876 190 4 4 8 8 0.35627 191 4 4 9 9 0.55357 192 4 5 1 3 0.49975 193 4 5 2 4 1.23726 194 4 5 3 17 0.35447 195 4 5 4 7 0.001 196 4 5 5 11 0.59956 197 4 5 6 2 0.54578 198 4 5 7 6 0.51 199 4 5 8 3 0.32751 200 4 5 9 6 0.5248 201 4 6 1 2 0.44698 202 4 6 3 7 0.3017 203 4 6 5 3 0.54679 204 4 6 7 4 0.45723 205 4 6 8 4 0.27474 206 4 6 9 4 0.47204 207 4 7 3 2 0.45729 208 4 7 4 2 0.00881 209 4 7 5 1 0.70238 210 4 7 9 1 0.62762 211 4 8 1 1 1.04013 212 4 8 2 1 1.77765 213 4 8 3 2 0.89485 214 4 8 4 2 0.44637 215 4 8 5 8 1.13994 216 4 8 6 2 1.08616 217 4 8 7 1 1.05038 218 4 8 8 2 0.86789 219 4 8 9 2 1.06519 220 4 9 4 1 1.40272 221 4 9 6 1 2.04251 222 5 1 3 1 6.31875 223 5 1 4 1 5.87028 224 5 1 7 1 6.47429 225 5 2 1 1 2.31647 226 5 2 7 1 2.32672 227 5 2 9 1 2.34152 228 5 3 1 2 0.96808 229 5 3 3 5 0.8228 230 5 3 4 4 0.37432 231 5 3 5 3 1.06789 232 5 3 6 1 1.01411 233 5 3 7 6 0.97833 234 5 3 8 3 0.79584 235 5 3 9 6 0.99313 236 5 4 1 3 0.66037 237 5 4 2 4 1.39788 238 5 4 3 5 0.51509 239 5 4 4 6 0.06661 240 5 4 5 11 0.76018 241 5 4 6 4 0.7064 242 5 4 7 10 0.67062 243 5 4 8 5 0.48812 244 5 4 9 14 0.68542 245 5 5 1 4 0.6316 246 5 5 2 4 1.36912 247 5 5 3 15 0.48632 248 5 5 4 4 0.03784 249 5 5 5 11 0.73141 250 5 5 6 5 0.67763 251 5 5 7 7 0.64185 252 5 5 8 2 0.45936 253 5 5 9 6 0.65666 254 5 6 2 3 1.31635 255 5 6 3 1 0.43356 256 5 6 4 1 0.001 257 5 6 5 5 0.67865 258 5 6 6 1 0.62487 259 5 6 7 1 0.58909 260 5 6 8 2 0.4066 261 5 6 9 2 0.60389 262 5 7 1 4 
0.73442 263 5 7 4 1 0.1406 264 5 7 7 5 0.74467 265 5 7 8 2 0.56218 266 5 7 9 4 0.75948 267 5 8 1 1 1.17199 268 5 8 5 1 1.2718 269 5 8 7 1 1.18224 270 5 8 9 4 1.19704 271 5 9 1 1 2.12833 272 5 9 4 1 1.53458 273 5 9 6 1 2.17437 274 5 9 9 1 2.15339 275 5 10 2 1 5.58987 ; ! Sample data. This data set has 5 possible income levels, 10 age levels, 9 regions, giving 450 potential cells. ; !CaseOK alpha = 1; !CaseOK level = 1 2 3 4 5 6 7 8 9 10; !CaseOK variable = income_ age_ region_; ! The population rim target percentages; !CaseOK VXL, targpcent = income_ 1 17.95 income_ 2 23.20 income_ 3 27.28 income_ 4 14.34 income_ 5 17.23 age_ 1 3.87 age_ 2 9.07 age_ 3 19.09 age_ 4 21.60 age_ 5 18.02 age_ 6 6.44 age_ 7 5.17 age_ 8 8.80 age_ 9 5.91 age_ 10 2.03 region_ 1 5.14 region_ 2 14.21 region_ 3 16.44 region_ 4 7.14 region_ 5 19.05 region_ 6 6.30 region_ 7 10.91 region_ 8 6.40 region_ 9 14.41 ; ! Cell number and combination of levels sampled; !CaseOK CELL VarLvl= 1 1 1 5 2 1 2 3 3 1 2 5 4 1 2 6 5 1 2 7 6 1 3 1 7 1 3 3 8 1 3 4 9 1 3 5 10 1 3 6 11 1 3 7 12 1 3 8 13 1 3 9 14 1 4 1 15 1 4 2 16 1 4 3 17 1 4 4 18 1 4 5 19 1 4 6 20 1 4 7 21 1 4 8 22 1 4 9 23 1 5 1 24 1 5 2 25 1 5 3 26 1 5 4 27 1 5 5 28 1 5 6 29 1 5 8 30 1 5 9 31 1 6 1 32 1 6 2 33 1 6 3 34 1 6 4 35 1 6 5 36 1 6 6 37 1 6 8 38 1 6 9 39 1 7 1 40 1 7 2 41 1 7 5 42 1 7 6 43 1 7 7 44 1 7 9 45 1 8 2 46 1 8 3 47 1 8 4 48 1 8 5 49 1 8 9 50 1 9 3 51 1 9 4 52 1 9 5 53 1 9 9 54 2 2 3 55 2 2 4 56 2 2 5 57 2 2 7 58 2 2 8 59 2 3 3 60 2 3 4 61 2 3 5 62 2 3 6 63 2 3 7 64 2 3 8 65 2 3 9 66 2 4 1 67 2 4 2 68 2 4 3 69 2 4 4 70 2 4 5 71 2 4 6 72 2 4 7 73 2 4 8 74 2 4 9 75 2 5 1 76 2 5 2 77 2 5 3 78 2 5 4 79 2 5 5 80 2 5 6 81 2 5 7 82 2 5 8 83 2 5 9 84 2 6 1 85 2 6 3 86 2 6 4 87 2 6 5 88 2 6 6 89 2 6 7 90 2 6 8 91 2 6 9 92 2 7 8 93 2 7 9 94 2 8 2 95 2 8 3 96 2 8 4 97 2 8 5 98 2 8 6 99 2 8 7 100 2 8 8 101 2 8 9 102 2 9 3 103 2 9 4 104 2 10 3 105 3 1 4 106 3 2 1 107 3 2 2 108 3 2 5 109 3 2 6 110 3 2 8 111 3 2 9 112 3 3 1 113 3 3 2 114 3 3 3 
115 3 3 4 116 3 3 5 117 3 3 6 118 3 3 7 119 3 3 8 120 3 3 9 121 3 4 1 122 3 4 2 123 3 4 3 124 3 4 4 125 3 4 5 126 3 4 6 127 3 4 7 128 3 4 8 129 3 4 9 130 3 5 1 131 3 5 2 132 3 5 3 133 3 5 4 134 3 5 5 135 3 5 6 136 3 5 7 137 3 5 8 138 3 5 9 139 3 6 1 140 3 6 2 141 3 6 3 142 3 6 4 143 3 6 5 144 3 6 6 145 3 6 7 146 3 6 8 147 3 6 9 148 3 7 1 149 3 7 2 150 3 7 3 151 3 7 4 152 3 7 5 153 3 7 6 154 3 7 8 155 3 7 9 156 3 8 1 157 3 8 2 158 3 8 3 159 3 8 4 160 3 8 5 161 3 8 8 162 3 8 9 163 3 9 2 164 3 9 3 165 3 9 5 166 3 9 8 167 3 9 9 168 3 10 4 169 3 10 5 170 4 1 2 171 4 2 3 172 4 2 5 173 4 2 7 174 4 3 1 175 4 3 2 176 4 3 3 177 4 3 4 178 4 3 5 179 4 3 6 180 4 3 7 181 4 3 8 182 4 3 9 183 4 4 1 184 4 4 2 185 4 4 3 186 4 4 4 187 4 4 5 188 4 4 6 189 4 4 7 190 4 4 8 191 4 4 9 192 4 5 1 193 4 5 2 194 4 5 3 195 4 5 4 196 4 5 5 197 4 5 6 198 4 5 7 199 4 5 8 200 4 5 9 201 4 6 1 202 4 6 3 203 4 6 5 204 4 6 7 205 4 6 8 206 4 6 9 207 4 7 3 208 4 7 4 209 4 7 5 210 4 7 9 211 4 8 1 212 4 8 2 213 4 8 3 214 4 8 4 215 4 8 5 216 4 8 6 217 4 8 7 218 4 8 8 219 4 8 9 220 4 9 4 221 4 9 6 222 5 1 3 223 5 1 4 224 5 1 7 225 5 2 1 226 5 2 7 227 5 2 9 228 5 3 1 229 5 3 3 230 5 3 4 231 5 3 5 232 5 3 6 233 5 3 7 234 5 3 8 235 5 3 9 236 5 4 1 237 5 4 2 238 5 4 3 239 5 4 4 240 5 4 5 241 5 4 6 242 5 4 7 243 5 4 8 244 5 4 9 245 5 5 1 246 5 5 2 247 5 5 3 248 5 5 4 249 5 5 5 250 5 5 6 251 5 5 7 252 5 5 8 253 5 5 9 254 5 6 2 255 5 6 3 256 5 6 4 257 5 6 5 258 5 6 6 259 5 6 7 260 5 6 8 261 5 6 9 262 5 7 1 263 5 7 4 264 5 7 7 265 5 7 8 266 5 7 9 267 5 8 1 268 5 8 5 269 5 8 7 270 5 8 9 271 5 9 1 272 5 9 4 273 5 9 6 274 5 9 9 275 5 10 2 ; !CaseOK count HWgt = 1 7.031564 5 2.643345 1 2.888197 1 2.834471 1 2.798728 1 1.441461 3 1.296328 4 0.848302 1 1.54117 2 1.487445 3 1.451701 2 1.269385 1 1.466486 2 1.134052 2 1.870827 3 0.988919 7 0.540893 11 1.233761 2 1.180036 4 1.144292 4 0.961986 5 1.159087 2 1.105321 3 1.842086 4 0.960188 2 0.512162 5 1.20503 1 1.151305 2 0.933256 1 1.130346 1 1.052605 1 1.789379 1 0.907472 2 
0.459455 2 1.152324 1 1.098598 1 0.880539 1 1.077639 1 1.208037 1 1.944802 2 1.307747 1 1.254021 1 1.218277 1 1.233062 4 2.381929 2 1.500022 3 1.052005 1 1.744874 1 1.67019 3 2.455405 2 2.007379 2 2.700247 3 2.625563 1 2.844431 2 2.396405 3 3.089274 4 2.999804 1 2.817499 4 1.497405 4 1.049388 4 1.742257 4 1.688531 3 1.652787 4 1.470472 5 1.667572 3 1.335139 6 2.071913 8 1.190006 6 0.741979 6 1.434848 2 1.381122 4 1.345378 2 1.163073 3 1.360164 1 1.306408 1 2.043172 6 1.161275 5 0.713248 6 1.406117 2 1.352391 3 1.316648 2 1.134332 6 1.331433 1 1.253691 3 1.108558 1 0.660532 2 1.35333 2 1.299675 2 1.263931 2 1.081625 2 1.278726 1 1.237048 5 1.434149 1 2.583016 4 1.701108 1 1.253082 1 1.945951 2 1.892225 2 1.856481 1 1.674176 2 1.871276 1 2.656492 2 2.208465 1 5.377758 1 5.922436 1 2.372219 3 3.108984 2 2.471928 1 2.418203 1 2.200153 1 2.397244 3 1.025192 5 1.761967 16 0.880059 7 0.432033 9 1.124902 5 1.071176 7 1.035432 6 0.853127 11 1.050217 3 0.717794 5 1.454558 15 0.57266 13 0.124634 14 0.817503 3 0.763777 4 0.728033 8 0.545728 5 0.742818 2 0.689053 12 1.425827 16 0.54392 12 0.095903 14 0.788762 3 0.735036 9 0.699302 3 0.516987 8 0.714087 1 0.636346 2 1.37311 5 0.491213 4 0.043186 7 0.736055 2 0.682329 2 0.646586 2 0.46428 2 0.661371 2 0.791769 3 1.528543 9 0.646636 5 0.198609 2 0.891478 2 0.837752 4 0.619703 2 0.816803 1 1.228896 1 1.965661 2 1.083763 3 0.635737 1 1.328605 2 1.05683 2 1.253921 1 2.921044 1 2.039136 1 2.283989 1 2.012204 3 2.209304 1 4.312386 1 5.005255 1 7.062553 1 2.037268 1 2.28211 1 2.192641 5 0.835375 2 1.572149 4 0.690241 7 0.242225 7 0.935094 1 0.881368 3 0.845624 2 0.663309 10 0.860409 3 0.527976 8 1.26475 14 0.382843 5 0.000999 8 0.627685 2 0.573959 7 0.538215 8 0.35591 9 0.55301 3 0.499245 4 1.236009 17 0.354112 7 0.000999 11 0.598954 2 0.545228 6 0.509484 3 0.327179 6 0.524269 2 0.446528 7 0.301395 3 0.546237 4 0.456768 4 0.274462 4 0.471563 2 0.456828 2 0.008801 1 0.70167 1 0.626985 1 1.039078 1 1.775853 2 0.893945 2 0.445919 8 
1.138788 2 1.085062 1 1.049318 2 0.867013 2 1.064113 1 1.401302 1 2.040445 1 6.312362 1 5.864345 1 6.467745 1 2.314128 1 2.324368 1 2.339153 2 0.967101 5 0.821968 4 0.373942 3 1.06681 1 1.013085 6 0.977341 3 0.795035 6 0.992126 3 0.659702 4 1.396467 5 0.514569 6 0.066543 11 0.759411 4 0.705686 10 0.669942 5 0.487627 14 0.684727 4 0.630961 4 1.367736 15 0.485828 4 0.037802 11 0.730671 5 0.676945 7 0.641201 2 0.458896 6 0.655996 3 1.315019 1 0.433122 1 0.000999 5 0.677964 1 0.624238 1 0.588494 2 0.406189 2 0.603279 4 0.733678 1 0.140458 5 0.743917 2 0.561612 4 0.758712 1 1.170805 1 1.270514 1 1.181045 4 1.19583 1 2.126178 1 1.533029 1 2.172172 1 2.151213 1 5.584219 ;

! Data set Dorofeev and Grant, suggests negative weights;
!CaseDG alpha = 1;
!CaseDG level = 1 2 3 ;
!CaseDG variable = EXV1 EXV2;
! The population rim target percentages for each
  variable and combination of level;
!CaseDG VXL Targpcent =
   EXV1 1 55.5556
   EXV1 2 33.3333
   EXV1 3 11.1111
   EXV2 1 22.2222
   EXV2 2 33.3333
   EXV2 3 44.4444 ;
! Cell number and combination of levels sampled;
!CaseDG CELL VarLvl=
   1 1 1
   2 1 2
   3 1 3
   4 2 1
   5 2 2
   6 2 3
   7 3 1
   8 3 2
   9 3 3 ;
! Count in each cell & heuristic weights;
!CaseDG count HWgt =
    5 0
    7 1.06881
   10 1.76551
    3 4.91677
    0 0
    0 0
    9 0
   10 0.417
    1 0.94301 ;
ENDDATA

SUBMODEL samplebal:
! If alpha = 0, we get straight sampling,
     alpha = 1: weighted sampling with perfect match
                if we allow weightcell < 0;
  min = alphat* rimerr +(1- alphat)* cellerr;
  rimerr <= rimerrUL;   ! In case we want to constrain rimerr;
  cellerr <= cellerrUL; ! In case we want to constrain cellerr;

! For each variable i and level j, compute the achieved count
  when weightcell( c) is applied to each observation in cell c;
! For all explanatory variables i and their levels j
  the achieved count over all cells sampled is... ;
  @FOR( VXL( i, j) :
    [ACNT] achvcount( i, j) =
      @SUM( CELL( c) | VarLvl( c, i) #EQ# j : weightcell( c) * count( c))
      );
! Special case.
  If a cell is empty, its weight = 0;
  @FOR( CELL( c) | count( c) #EQ# 0:
    weightcell( c) = 0);

! Add this if we want the estimator to be unbiased;
  [SIZE] @SUM( CELL( c): weightcell( c) * count( c)) = sampsize;

! Compute rim error measure. Should watch out for sampno( i, j) = 0;
  [CRIMER] rimerr = @SUM( VXL( i, j) :
     (( achvcount( i, j) - targpcent( i, j)* sampsize/100) / sampno( i, j))^2 );

! Compute cost of increased variance;
  [CCELLR] cellerr = @SUM( CELL( c): count( c)*(1 - weightcell( c))^2);

! Standard formula for rim error;
! rimerrstd = (rimerr/m)^0.5;
! @free(rimerrstd);
ENDSUBMODEL

CALC:
  @SET( 'TERSEO',2); ! Output level (0:verb, 1:terse, 2:only errors, 3:none);
  m = @size( VXL);   ! number of rim targets;
  sampsize = @SUM( CELL( c): count( c));
! Compute number in sample for each variable i that have level j;
  @FOR( VXL( i, j):
    sampno( i, j) = @SUM( CELL( c) | VarLvl( c, i) #eq# j: count( c));
      );

! Compute achieved counts for the heuristic weights;
! Add this to check the heuristic weights estimator to be unbiased;
  sampsizeh = @SUM( CELL( c): HWgt( c) * count( c));

  alphat = alpha; ! Set temporary alpha = alpha;
  @SOLVE( samplebal);

! In case alpha = 1, minimize other error as secondary objective;
  @IFC( alpha #EQ# 1:
    alphat = 0;
    rimerrUL = rimerr;
    @SOLVE( samplebal);
      );
! In case alpha = 0, minimize other error as secondary objective;
  @IFC( alpha #EQ# 0:
    alphat = 1;
    cellerrUL = cellerr;
    @SOLVE( samplebal);
      );

! Compute standard error measure:
   (Sum of squared rim errors/NumberRimTargets)^0.5;
  rimerrstd = ( rimerr/ m)^0.5;

! Compute cell error measure for existing heuristic method;
  cellHeur = @SUM( CELL( c): count( c)*(1 - HWgt( c))^2);

  @WRITE(' Input DATA:', @NEWLINE( 1));
  @FOR( VXL( i, j) :
    achvcounth( i, j) =
      @SUM( CELL( c) | VarLvl( c, i) #EQ# j : HWgt( c) * count( c))
      );
! Compute rim error measure.
  Should watch out for sampno( i, j) = 0;
  rimerrh = @SUM( VXL( i, j) :
     (( achvcounth( i, j) - targpcent( i, j)* sampsize/100) / sampno( i, j))^2 );
  rimerrstdh = ( rimerrh/ m)^0.5;

  @WRITE( @FORMAT( @SIZE( variable),'12.0f'),' = number explanatory variables', @NEWLINE(1));
  @WRITE( @FORMAT( @SIZE( CELL),'12.0f'),' = number cells with count > 0', @NEWLINE(1));
  @WRITE( @FORMAT( m,'12.0f'),' = number brackets over all variables', @NEWLINE(1));
  @WRITE( @FORMAT( sampsize,'12.0f'),' = sample size(sum of the counts)', @NEWLINE(1));
  EffSmpSzH = ( sampsize^2)/( sampsize + cellHeur);
  @WRITE( @FORMAT( EffSmpSzH,'12.2f'), ' = Effective sample size of heuristic wgts', @NEWLINE(1));
  @WRITE( @FORMAT( rimerrstdh,'12.5f'), ' = root mean standardized rim error of heuristic weights', @NEWLINE(1));
  @WRITE( @NEWLINE( 1));

  @WRITE(' Sample Balancing Results', @NEWLINE(1));
  @WRITE( @FORMAT( alpha,'12.4f'),' = alpha ( weight on rim errors, 1-alpha= weight on max sample size) ', @NEWLINE(1));
! Report effective sample size;
  EffSmpSz = ( sampsize^2)/( sampsize + cellerr);
  @WRITE( @FORMAT( EffSmpSz,'12.2f'), ' = Effective sample size (Optimal for given alpha) ', @NEWLINE(1));
  @WRITE( @FORMAT( rimerrstd,'12.5f'), ' = root mean standardized rim error', @NEWLINE(1));
  @WRITE( @FORMAT( cellerr, '12.3f'), ' = individual weight squared deviation from 1.0 ', @NEWLINE(1));

! One row per cell: cell number, level of each variable, count, weight;
  @WRITE(' CELL ');
  @FOR( variable( v):
    @WRITE( @FORMAT( variable( v),'9s'));
      );
  @WRITE( ' Count Weight ', @NEWLINE( 1));
  @FOR( CELL( c) :
    @WRITE( @FORMAT( c,'5.0f'),' ');
    @FOR( variable( v):
      @WRITE( @FORMAT( VarLvl( c, v),'9.0f'))
        );
    @WRITE( @FORMAT( count( c),'8.0f'), ' ', @FORMAT( weightcell( c),'10.3f'),@NEWLINE(1));
      );

  @WRITE(@NEWLINE( 1), ' Achieved rim target percent', @NEWLINE( 1));
  @WRITE(' Variable Level WgtPercent', @NEWLINE( 1));
  @FOR( VXL( i, j) :
    @WRITE( @FORMAT( Variable( i),'9s'), ' ', @FORMAT( j,'4.0f'), ' ',
            @FORMAT( 100* achvcount( i, j)/Sampsize,'7.1f'), @NEWLINE( 1));
      );
ENDCALC
```
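
The effective sample size and rim percentages quoted in the header comment can be cross-checked outside LINGO. The sketch below is plain Python, not part of the model. It reuses the model's own formulas (`cellerr` and `EffSmpSz`) on the 2x4 counts and the heuristic weights `HWgt`. The function names `effective_sample_size` and `rim_percents` are hypothetical helpers introduced here for illustration only.

```python
# Cross-check of the 2x4 example, assuming the same formulas as the model:
#   cellerr  = sum over cells of count * (1 - weight)^2
#   EffSmpSz = sampsize^2 / (sampsize + cellerr)
# Rows are gender (female, male); columns are the four income levels.
COUNTS = [[220, 147, 106, 55],
          [193, 149, 94, 36]]
HEUR_WEIGHTS = [[0.31, 1.065170, 1.700566, 1.726545],
                [1.719171, 0.96255, 0.21, 0.14]]
UNIT_WEIGHTS = [[1.0] * 4, [1.0] * 4]


def effective_sample_size(counts, weights):
    """Effective sample size, mirroring EffSmpSz in the CALC section."""
    n = sum(sum(row) for row in counts)
    cellerr = sum(c * (1.0 - w) ** 2
                  for crow, wrow in zip(counts, weights)
                  for c, w in zip(crow, wrow))
    return n ** 2 / (n + cellerr)


def rim_percents(counts, weights):
    """Weighted percentage per gender and per income level,
    relative to the unweighted sample size (as in the model's report)."""
    n = sum(sum(row) for row in counts)
    gender = [100.0 * sum(c * w for c, w in zip(crow, wrow)) / n
              for crow, wrow in zip(counts, weights)]
    n_levels = len(counts[0])
    income = [100.0 * sum(counts[g][k] * weights[g][k]
                          for g in range(len(counts))) / n
              for k in range(n_levels)]
    return gender, income


if __name__ == "__main__":
    print("ESS, unit weights     :", effective_sample_size(COUNTS, UNIT_WEIGHTS))
    print("ESS, heuristic weights:", effective_sample_size(COUNTS, HEUR_WEIGHTS))
    print("Rim %, unit weights     :", rim_percents(COUNTS, UNIT_WEIGHTS))
    print("Rim %, heuristic weights:", rim_percents(COUNTS, HEUR_WEIGHTS))
```

With unit weights the effective sample size is the full 1000 but the rims are off (52.8% female, 41.3% in the first income level). With the heuristic weights the rims are matched and the effective sample size falls to about 729, in line with the figures in the header comment.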