Regression and Other Stories: Height and weight

Predict weight from height. See Chapters 3, 9 and 10 in Regression and Other Stories.

Load packages

library("rprojroot")
root<-has_file(".ROS-Examples-root")$make_fix_file()
library("rstanarm")

Load data

earnings <- read.csv(root("Earnings/data","earnings.csv"))
head(earnings)

  height weight male  earn earnk ethnicity education mother_education
1     74    210    1 50000    50     White        16               16
2     66    125    0 60000    60     White        16               16
3     64    126    0 30000    30     White        16               16
4     65    200    0 25000    25     White        17               17
5     63    110    0 50000    50     Other        16               16
6     68    165    0 62000    62     Black        18               18
  father_education walk exercise smokenow tense angry age
1               16    3        3        2     0     0  45
2               16    6        5        1     0     0  58
3               16    8        1        2     1     1  29
4               NA    8        1        2     0     0  57
5               16    5        6        2     0     0  91
6               18    1        1        2     2     2  54

Simulating uncertainty for linear predictors and predicted values

Predict weight (in pounds) from height (in inches)

The option refresh = 0 supresses the default Stan sampling progress output. This is useful for small data with fast computation. For more complex models and bigger data, it can be useful to see the progress.

fit_1 <- stan_glm(weight ~ height, data=earnings, refresh = 0)
print(fit_1)

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ height
 observations: 1789
 predictors:   2
------
            Median MAD_SD
(Intercept) -173.1   11.8
height         4.9    0.2

Auxiliary parameter(s):
      Median MAD_SD
sigma 29.0    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

**Predict weight for 66 inches person

coefs_1 <- coef(fit_1)
predicted_1 <- coefs_1[1] + coefs_1[2]*66
round(predicted_1, 1)

(Intercept) 
      153.3

new <- data.frame(height=66)
pred <- posterior_predict(fit_1, newdata=new)

cat("Predicted weight for a 66-inch-tall person is", round(mean(pred)),
"pounds with a sd of", round(sd(pred)), "\n")

Predicted weight for a 66-inch-tall person is 152 pounds with a sd of 29

Center heights

earnings$c_height <- earnings$height - 66
fit_2 <- stan_glm(weight ~ c_height, data=earnings, refresh = 0)
print(fit_2)

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ c_height
 observations: 1789
 predictors:   2
------
            Median MAD_SD
(Intercept) 153.4    0.7 
c_height      5.0    0.2 

Auxiliary parameter(s):
      Median MAD_SD
sigma 29.0    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Point prediction

new <- data.frame(c_height=4.0)
point_pred_2 <- predict(fit_2, newdata=new)
round(point_pred_2, 1)

    1 
173.2

Posterior simulations

variation coming from posterior uncertainty in the coefficients

linpred_2 <- posterior_linpred(fit_2, newdata=new)
hist(linpred_2)

Posterior predictive simulations

variation coming from posterior uncertainty in the coefficients and predictive uncertainty

postpred_2 <- posterior_predict(fit_2, newdata=new)
hist(postpred_2)

Indicator variables

Predict weight (in pounds) from height (in inches)

new <- data.frame(height=66)
pred <- posterior_predict(fit_1, newdata=new)
cat("Predicted weight for a 66-inch-tall person is", round(mean(pred)), "pounds with a sd of", round(sd(pred)), "\n")

Predicted weight for a 66-inch-tall person is 154 pounds with a sd of 29

Including a binary variable in a regression

fit_3 <- stan_glm(weight ~ c_height + male, data=earnings, refresh = 0)
print(fit_3)

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ c_height + male
 observations: 1789
 predictors:   3
------
            Median MAD_SD
(Intercept) 149.5    1.0 
c_height      3.9    0.2 
male         11.9    2.0 

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.7    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

new <- data.frame(c_height=4, male=0)
pred <- posterior_predict(fit_3, newdata=new)
cat("Predicted weight for a 70-inch-tall female is", round(mean(pred)), "pounds with a sd of", round(sd(pred)), "\n")

Predicted weight for a 70-inch-tall female is 165 pounds with a sd of 29

Using indicator variables for multiple levels of a categorical predictor

Include ethnicity in the regression as a factor

fit_4 <- stan_glm(weight ~ c_height + male + factor(ethnicity),
                  data=earnings, refresh = 0)
print(fit_4)

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ c_height + male + factor(ethnicity)
 observations: 1789
 predictors:   6
------
                          Median MAD_SD
(Intercept)               154.4    2.2 
c_height                    3.8    0.2 
male                       12.1    2.0 
factor(ethnicity)Hispanic  -6.2    3.5 
factor(ethnicity)Other    -12.3    5.3 
factor(ethnicity)White     -5.1    2.3 

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.6    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Choose the baseline category by setting the levels

earnings$eth <- factor(earnings$ethnicity,
  levels=c("White", "Black", "Hispanic", "Other"))
fit_5 <- stan_glm(weight ~ c_height + male + ethnicity, data=earnings, refresh = 0)
print(fit_5)

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ c_height + male + ethnicity
 observations: 1789
 predictors:   6
------
                  Median MAD_SD
(Intercept)       154.3    2.3 
c_height            3.9    0.3 
male               12.1    2.0 
ethnicityHispanic  -6.2    3.7 
ethnicityOther    -12.2    5.4 
ethnicityWhite     -5.2    2.4 

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.7    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Alternatively create indicators for the four ethnic groups directly:

earnings$eth_white <- ifelse(earnings$ethnicity=="White", 1, 0)
earnings$eth_black <- ifelse(earnings$ethnicity=="Black", 1, 0)
earnings$eth_hispanic <- ifelse(earnings$ethnicity=="Hispanic", 1, 0)
earnings$eth_other <- ifelse(earnings$ethnicity=="Other", 1, 0)
fit_6 <- stan_glm(weight ~ c_height + male + eth_black + eth_hispanic + eth_other,
                  data=earnings, refresh = 0)
print(fit_6)

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ c_height + male + eth_black + eth_hispanic + eth_other
 observations: 1789
 predictors:   6
------
             Median MAD_SD
(Intercept)  149.1    1.0 
c_height       3.9    0.3 
male          12.1    2.1 
eth_black      5.2    2.3 
eth_hispanic  -1.0    2.9 
eth_other     -7.0    5.0 

Auxiliary parameter(s):
      Median MAD_SD
sigma 28.6    0.5  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

LS0tCnRpdGxlOiAiUmVncmVzc2lvbiBhbmQgT3RoZXIgU3RvcmllczogSGVpZ2h0IGFuZCB3ZWlnaHQiCmF1dGhvcjogIkFuZHJldyBHZWxtYW4sIEplbm5pZmVyIEhpbGwsIEFraSBWZWh0YXJpIgpkYXRlOiAiYHIgZm9ybWF0KFN5cy5EYXRlKCkpYCIKb3V0cHV0OgogIGh0bWxfZG9jdW1lbnQ6CiAgICB0aGVtZTogcmVhZGFibGUKICAgIHRvYzogdHJ1ZQogICAgdG9jX2RlcHRoOiAyCiAgICB0b2NfZmxvYXQ6IHRydWUKICAgIGNvZGVfZG93bmxvYWQ6IHRydWUKLS0tClByZWRpY3Qgd2VpZ2h0IGZyb20gaGVpZ2h0LiBTZWUgQ2hhcHRlcnMgMywgOSBhbmQgMTAgaW4gUmVncmVzc2lvbgphbmQgT3RoZXIgU3Rvcmllcy4KCi0tLS0tLS0tLS0tLS0KCgpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0Ka25pdHI6Om9wdHNfY2h1bmskc2V0KG1lc3NhZ2U9RkFMU0UsIGVycm9yPUZBTFNFLCB3YXJuaW5nPUZBTFNFLCBjb21tZW50PU5BKQpgYGAKCiMjIyMgTG9hZCBwYWNrYWdlcwoKYGBge3IgfQpsaWJyYXJ5KCJycHJvanJvb3QiKQpyb290PC1oYXNfZmlsZSgiLlJPUy1FeGFtcGxlcy1yb290IikkbWFrZV9maXhfZmlsZSgpCmxpYnJhcnkoInJzdGFuYXJtIikKYGBgCgojIyMjIExvYWQgZGF0YQoKYGBge3IgfQplYXJuaW5ncyA8LSByZWFkLmNzdihyb290KCJFYXJuaW5ncy9kYXRhIiwiZWFybmluZ3MuY3N2IikpCmhlYWQoZWFybmluZ3MpCmBgYAoKIyMgU2ltdWxhdGluZyB1bmNlcnRhaW50eSBmb3IgbGluZWFyIHByZWRpY3RvcnMgYW5kIHByZWRpY3RlZCB2YWx1ZXMKIyMjIyBQcmVkaWN0IHdlaWdodCAoaW4gcG91bmRzKSBmcm9tIGhlaWdodCAoaW4gaW5jaGVzKQoKVGhlIG9wdGlvbiBgcmVmcmVzaCA9IDBgIHN1cHJlc3NlcyB0aGUgZGVmYXVsdCBTdGFuIHNhbXBsaW5nCnByb2dyZXNzIG91dHB1dC4gVGhpcyBpcyB1c2VmdWwgZm9yIHNtYWxsIGRhdGEgd2l0aCBmYXN0CmNvbXB1dGF0aW9uLiBGb3IgbW9yZSBjb21wbGV4IG1vZGVscyBhbmQgYmlnZ2VyIGRhdGEsIGl0IGNhbiBiZQp1c2VmdWwgdG8gc2VlIHRoZSBwcm9ncmVzcy4KCmBgYHtyIH0KZml0XzEgPC0gc3Rhbl9nbG0od2VpZ2h0IH4gaGVpZ2h0LCBkYXRhPWVhcm5pbmdzLCByZWZyZXNoID0gMCkKcHJpbnQoZml0XzEpCmBgYAoKKipQcmVkaWN0IHdlaWdodCBmb3IgNjYgaW5jaGVzIHBlcnNvbgoKYGBge3IgfQpjb2Vmc18xIDwtIGNvZWYoZml0XzEpCnByZWRpY3RlZF8xIDwtIGNvZWZzXzFbMV0gKyBjb2Vmc18xWzJdKjY2CnJvdW5kKHByZWRpY3RlZF8xLCAxKQpgYGAKCm9yCgpgYGB7ciB9Cm5ldyA8LSBkYXRhLmZyYW1lKGhlaWdodD02NikKcHJlZCA8LSBwb3N0ZXJpb3JfcHJlZGljdChmaXRfMSwgbmV3ZGF0YT1uZXcpCmBgYAoKCgpgYGB7ciB9CmNhdCgiUHJlZGljdGVkIHdlaWdodCBmb3IgYSA2Ni1pbmNoLXRhbGwgcGVyc29uIGlzIiwgcm91bmQobWVhbihwcmVkKSksCiJwb3VuZHMgd2l0aCBhIHNkIG9mIiwgcm91bmQoc2QocHJlZCkpLCAiXG4iKQpgYGAKCiMjIyMgQ2VudGVyIGhlaWdodHMKCmBgYHtyIH0KZWFybmluZ3MkY19oZWlnaHQgPC0gZWFybmluZ3MkaGVpZ2h0IC0gNjYKZml0XzIgPC0gc3Rhbl9nbG0od2VpZ2h0IH4gY19oZWlnaHQsIGRhdGE9ZWFybmluZ3MsIHJlZnJlc2ggPSAwKQpwcmludChmaXRfMikKYGBgCgojIyMjIFBvaW50IHByZWRpY3Rpb24KCmBgYHtyIH0KbmV3IDwtIGRhdGEuZnJhbWUoY19oZWlnaHQ9NC4wKQpwb2ludF9wcmVkXzIgPC0gcHJlZGljdChmaXRfMiwgbmV3ZGF0YT1uZXcpCnJvdW5kKHBvaW50X3ByZWRfMiwgMSkKYGBgCgojIyMjIFBvc3RlcmlvciBzaW11bGF0aW9ucwoKdmFyaWF0aW9uIGNvbWluZyBmcm9tIHBvc3RlcmlvciB1bmNlcnRhaW50eSBpbiB0aGUgY29lZmZpY2llbnRzCgpgYGB7ciB9CmxpbnByZWRfMiA8LSBwb3N0ZXJpb3JfbGlucHJlZChmaXRfMiwgbmV3ZGF0YT1uZXcpCmhpc3QobGlucHJlZF8yKQpgYGAKCiMjIyMgUG9zdGVyaW9yIHByZWRpY3RpdmUgc2ltdWxhdGlvbnMKCnZhcmlhdGlvbiBjb21pbmcgZnJvbSBwb3N0ZXJpb3IgdW5jZXJ0YWludHkgaW4gdGhlIGNvZWZmaWNpZW50cyBhbmQKcHJlZGljdGl2ZSB1bmNlcnRhaW50eQoKYGBge3IgfQpwb3N0cHJlZF8yIDwtIHBvc3Rlcmlvcl9wcmVkaWN0KGZpdF8yLCBuZXdkYXRhPW5ldykKaGlzdChwb3N0cHJlZF8yKQpgYGAKCiMjIEluZGljYXRvciB2YXJpYWJsZXMKIyMjIyBQcmVkaWN0IHdlaWdodCAoaW4gcG91bmRzKSBmcm9tIGhlaWdodCAoaW4gaW5jaGVzKQoKYGBge3IgfQpuZXcgPC0gZGF0YS5mcmFtZShoZWlnaHQ9NjYpCnByZWQgPC0gcG9zdGVyaW9yX3ByZWRpY3QoZml0XzEsIG5ld2RhdGE9bmV3KQpjYXQoIlByZWRpY3RlZCB3ZWlnaHQgZm9yIGEgNjYtaW5jaC10YWxsIHBlcnNvbiBpcyIsIHJvdW5kKG1lYW4ocHJlZCkpLCAicG91bmRzIHdpdGggYSBzZCBvZiIsIHJvdW5kKHNkKHByZWQpKSwgIlxuIikKYGBgCgojIyMjIEluY2x1ZGluZyBhIGJpbmFyeSB2YXJpYWJsZSBpbiBhIHJlZ3Jlc3Npb24KCmBgYHtyIH0KZml0XzMgPC0gc3Rhbl9nbG0od2VpZ2h0IH4gY19oZWlnaHQgKyBtYWxlLCBkYXRhPWVhcm5pbmdzLCByZWZyZXNoID0gMCkKcHJpbnQoZml0XzMpCm5ldyA8LSBkYXRhLmZyYW1lKGNfaGVpZ2h0PTQsIG1hbGU9MCkKcHJlZCA8LSBwb3N0ZXJpb3JfcHJlZGljdChmaXRfMywgbmV3ZGF0YT1uZXcpCmNhdCgiUHJlZGljdGVkIHdlaWdodCBmb3IgYSA3MC1pbmNoLXRhbGwgZmVtYWxlIGlzIiwgcm91bmQobWVhbihwcmVkKSksICJwb3VuZHMgd2l0aCBhIHNkIG9mIiwgcm91bmQoc2QocHJlZCkpLCAiXG4iKQpgYGAKCiMjIyMgVXNpbmcgaW5kaWNhdG9yIHZhcmlhYmxlcyBmb3IgbXVsdGlwbGUgbGV2ZWxzIG9mIGEgY2F0ZWdvcmljYWwgcHJlZGljdG9yPGJyLz4KSW5jbHVkZSBldGhuaWNpdHkgaW4gdGhlIHJlZ3Jlc3Npb24gYXMgYSBmYWN0b3IKCmBgYHtyIH0KZml0XzQgPC0gc3Rhbl9nbG0od2VpZ2h0IH4gY19oZWlnaHQgKyBtYWxlICsgZmFjdG9yKGV0aG5pY2l0eSksCiAgICAgICAgICAgICAgICAgIGRhdGE9ZWFybmluZ3MsIHJlZnJlc2ggPSAwKQpwcmludChmaXRfNCkKYGBgCgojIyMjIENob29zZSB0aGUgYmFzZWxpbmUgY2F0ZWdvcnkgYnkgc2V0dGluZyB0aGUgbGV2ZWxzCgpgYGB7ciB9CmVhcm5pbmdzJGV0aCA8LSBmYWN0b3IoZWFybmluZ3MkZXRobmljaXR5LAogIGxldmVscz1jKCJXaGl0ZSIsICJCbGFjayIsICJIaXNwYW5pYyIsICJPdGhlciIpKQpmaXRfNSA8LSBzdGFuX2dsbSh3ZWlnaHQgfiBjX2hlaWdodCArIG1hbGUgKyBldGhuaWNpdHksIGRhdGE9ZWFybmluZ3MsIHJlZnJlc2ggPSAwKQpwcmludChmaXRfNSkKYGBgCgojIyMjIEFsdGVybmF0aXZlbHkgY3JlYXRlIGluZGljYXRvcnMgZm9yIHRoZSBmb3VyIGV0aG5pYyBncm91cHMgZGlyZWN0bHk6CgpgYGB7ciB9CmVhcm5pbmdzJGV0aF93aGl0ZSA8LSBpZmVsc2UoZWFybmluZ3MkZXRobmljaXR5PT0iV2hpdGUiLCAxLCAwKQplYXJuaW5ncyRldGhfYmxhY2sgPC0gaWZlbHNlKGVhcm5pbmdzJGV0aG5pY2l0eT09IkJsYWNrIiwgMSwgMCkKZWFybmluZ3MkZXRoX2hpc3BhbmljIDwtIGlmZWxzZShlYXJuaW5ncyRldGhuaWNpdHk9PSJIaXNwYW5pYyIsIDEsIDApCmVhcm5pbmdzJGV0aF9vdGhlciA8LSBpZmVsc2UoZWFybmluZ3MkZXRobmljaXR5PT0iT3RoZXIiLCAxLCAwKQpmaXRfNiA8LSBzdGFuX2dsbSh3ZWlnaHQgfiBjX2hlaWdodCArIG1hbGUgKyBldGhfYmxhY2sgKyBldGhfaGlzcGFuaWMgKyBldGhfb3RoZXIsCiAgICAgICAgICAgICAgICAgIGRhdGE9ZWFybmluZ3MsIHJlZnJlc2ggPSAwKQpwcmludChmaXRfNikKYGBgCgo=

Regression and Other Stories: Height and weight

Andrew Gelman, Jennifer Hill, Aki Vehtari

2020-08-22

Load packages

Load data

Simulating uncertainty for linear predictors and predicted values

Predict weight (in pounds) from height (in inches)

Center heights

Point prediction

Posterior simulations

Posterior predictive simulations

Indicator variables

Predict weight (in pounds) from height (in inches)

Including a binary variable in a regression

Using indicator variables for multiple levels of a categorical predictor

Choose the baseline category by setting the levels

Alternatively create indicators for the four ethnic groups directly: