In a project I’m working on, I need to generate predictions from a logistic regression model. That’s typically a really straightforward task—we can just use predict()
to plug a dataset of values into a model, which will spit out predictions either on the (gross, uninterpretable) log odds scale or on the (nice, interpretable) percentage-point scale.1
1 And for pro-level predictions, use predictions()
from {marginaleffects}.
However, in this project I cannot use predict()
—I’m working with a big matrix of posterior coefficient draws from a Bayesian model fit with raw Stan code, so there’s no special predict()
function that will work. Instead, I need to use matrix multiplication and manually multiply a matrix of new data with a vector of slopes from the model.
I haven’t had to matrix multiply coefficients with data since my first PhD stats class back in 2012 and I’ve completely forgotten how.
I created this little guide as a reference for myself, and I figured it’d probably be useful for others, so up on the blog it goes (see the introduction to this post for more about my philosophy of public work).
Example 1: Logistic regression model with an intercept and slopes
Here’s a basic regression model where we predict if a penguin is a Gentoo based on its bill length and body mass.
We can generate predicted values across different three different values of bill length: 40 mm, 44 mm, and 48 mm, holding body mass constant at the average (4207 g):
data_to_plug_in <- expand.grid(
bill_length_mm = c(40, 44, 48),
body_mass_g = mean(penguins$body_mass_g)
)
data_to_plug_in
## bill_length_mm body_mass_g
## 1 40 4207
## 2 44 4207
## 3 48 4207
We can feed this little dataset into the model using predict()
, which can generate predictions as log odds (type = "link"
) or probabilities (type = "response"
):
Yay, nice and easy.
Here’s how to do the same thing with manual matrix multiplication:
# Get all the coefficients
(coefs <- coef(model))
## (Intercept) bill_length_mm body_mass_g
## -32.401514 0.088906 0.006358
# Split intercept and slope coefficients into separate objects
(intercept <- coefs[1])
## (Intercept)
## -32.4
(slopes <- coefs[-1])
## bill_length_mm body_mass_g
## 0.088906 0.006358
# Convert the data frame of new data into a matrix
(data_to_plug_in_mat <- as.matrix(data_to_plug_in))
## bill_length_mm body_mass_g
## [1,] 40 4207
## [2,] 44 4207
## [3,] 48 4207
# Matrix multiply the new data with the slope coefficients, then add the intercept
(log_odds <- as.numeric((data_to_plug_in_mat %*% slopes) + intercept))
## [1] -2.097 -1.741 -1.385
# Convert to probability scale
plogis(log_odds)
## [1] 0.1094 0.1492 0.2002
The results are the same as predict()
:
Example 2: Logistic regression model with a categorical predictor and no intercept
This gets a little more complex when working with categorical predictors, especially if you’ve omitted the intercept term. For instance, in the data I’m working with, we have a model that looks something like this, with 0
added as a term to suppress the intercept and give separate coefficients for each of the levels of sex
:
When using predict()
, we don’t have to do anything special with this intercept-free model. We can plug in a dataset with different variations of predictors:
data_to_plug_in_cat <- expand.grid(
sex = c("female", "male"),
bill_length_mm = c(40, 44, 48),
body_mass_g = mean(penguins$body_mass_g)
)
data_to_plug_in_cat
## sex bill_length_mm body_mass_g
## 1 female 40 4207
## 2 male 40 4207
## 3 female 44 4207
## 4 male 44 4207
## 5 female 48 4207
## 6 male 48 4207
predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
## 1 2 3 4 5 6
## 2.707 -11.493 2.842 -11.357 2.978 -11.222
predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
## 1 2 3 4 5 6
## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
If we want to do this manually, we have to create a matrix version of data_to_plug_in_cat
that has separate columns for sexfemale
and sexmale
. We can’t just use as.matrix(data_to_plug_in_cat)
, since that only has a single column for sex
(and because that column contains text, it forces the rest of the matrix to be text, which makes it so we can’t do math with it anymore):
as.matrix(data_to_plug_in_cat)
## sex bill_length_mm body_mass_g
## [1,] "female" "40" "4207"
## [2,] "male" "40" "4207"
## [3,] "female" "44" "4207"
## [4,] "male" "44" "4207"
## [5,] "female" "48" "4207"
## [6,] "male" "48" "4207"
Instead, we can use model.matrix()
to create a design matrix—also called a dummy-encoded matrix2 or a one-hot encoded matrix—which makes columns of 0s and 1s for each of the levels of sex
2 Though we should probably quit using the word “dummy” because of its ableist connotations—see Google’s developer documentation style guide for alternatives.
data_to_plug_in_cat_mat <- model.matrix(
~ 0 + ., data = data_to_plug_in_cat
)
data_to_plug_in_cat_mat
## sexfemale sexmale bill_length_mm body_mass_g
## 1 1 0 40 4207
## 2 0 1 40 4207
## 3 1 0 44 4207
## 4 0 1 44 4207
## 5 1 0 48 4207
## 6 0 1 48 4207
## attr(,"assign")
## [1] 1 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$sex
## [1] "contr.treatment"
We can now do math with this matrix. Since we don’t have an intercept term, we don’t need to create separate objects for the slopes and intercepts and can matrix multiply the new data matrix with the model coefficients:
# Get all the coefficients
(coefs_cat <- coef(model_categorical))
## sexfemale sexmale bill_length_mm body_mass_g
## -75.68237 -89.88199 0.03387 0.01831
# Matrix multiply the newdata with the slope coefficients
(log_odds_cat <- as.numeric(data_to_plug_in_cat_mat %*% coefs_cat))
## [1] 2.707 -11.493 2.842 -11.357 2.978 -11.222
# Convert to probability scale
plogis(log_odds_cat)
## [1] 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
The results are the same!
predict(model_categorical, newdata = data_to_plug_in_cat, type = "link")
## 1 2 3 4 5 6
## 2.707 -11.493 2.842 -11.357 2.978 -11.222
log_odds_cat
## [1] 2.707 -11.493 2.842 -11.357 2.978 -11.222
predict(model_categorical, newdata = data_to_plug_in_cat, type = "response")
## 1 2 3 4 5 6
## 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
plogis(log_odds_cat)
## [1] 9.374e-01 1.020e-05 9.449e-01 1.169e-05 9.516e-01 1.338e-05
Citation
@online{heiss2023,
author = {Heiss, Andrew},
title = {Manually Generate Predicted Values for Logistic Regression
with Matrix Multiplication in {R}},
date = {2023-08-15},
url = {https://www.andrewheiss.com/blog/2023/08/15/matrix-multiply-logit-predict/},
doi = {10.59350/qba9a-b3561},
langid = {en}
}