
Multilevel Modelling with Variational Inference

There are two reasons for writing this notebook:

  1. To have a port of Multilevel modelling from PyMC3 to PyMC4.
  2. To test the Variational Inference API added this summer.

Radon contamination (Gelman and Hill 2006)

Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household. The EPA did a study of radon levels in 80,000 houses. There are two important predictors:

  • Measurement in basement or first floor (radon higher in basements)

  • Measurement of uranium level, available at the county level

We will focus on modeling radon levels in Minnesota. The hierarchy in this example is households within county.

The model building is inspired by the TFP port of Multilevel Modelling, and the visualizations are borrowed from PyMC3's Multilevel Modelling notebook.

In [ ]:
import arviz as az
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc4 as pm
import tensorflow as tf
import xarray as xr

from tensorflow_probability import bijectors as tfb
logging.getLogger("tensorflow").setLevel(logging.ERROR)
In [ ]:
%config InlineBackend.figure_format = 'retina'
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use('arviz-darkgrid')

Let's fetch the data and start analysing:

In [ ]:
data = pd.read_csv(pm.utils.get_data('radon.csv'))
u = np.log(data.Uppm).unique()               # log uranium level, one value per county
mn_counties = data.county.unique()           # the 85 Minnesota counties
floor = data.floor.values.astype(np.int32)   # 0 = basement, 1 = first floor

counties = len(mn_counties)
county_lookup = dict(zip(mn_counties, range(counties)))   # county name -> index
county_idx = data['county_code'].values.astype(np.int32)  # county index per household
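
As a quick optional check, the precomputed county_code column should agree with our lookup table. This is a hypothetical sanity check, assuming county_code encodes counties in the same first-appearance order as mn_counties:

In [ ]:
# Hypothetical check: county_lookup should reproduce the county_code column
assert (data.county.map(county_lookup) == data.county_code).all()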
In [ ]:
data.head()
Out[ ]:
Unnamed: 0 idnum state state2 stfips zip region typebldg floor room ... pcterr adjwt dupflag zipflag cntyfips county fips Uppm county_code log_radon
0 0 5081.0 MN MN 27.0 55735 5.0 1.0 1.0 3.0 ... 9.7 1146.499190 1.0 0.0 1.0 AITKIN 27001.0 0.502054 0 0.832909
1 1 5082.0 MN MN 27.0 55748 5.0 1.0 0.0 4.0 ... 14.5 471.366223 0.0 0.0 1.0 AITKIN 27001.0 0.502054 0 0.832909
2 2 5083.0 MN MN 27.0 55748 5.0 1.0 0.0 4.0 ... 9.6 433.316718 0.0 0.0 1.0 AITKIN 27001.0 0.502054 0 1.098612
3 3 5084.0 MN MN 27.0 56469 5.0 1.0 0.0 4.0 ... 24.3 461.623670 0.0 0.0 1.0 AITKIN 27001.0 0.502054 0 0.095310
4 4 5085.0 MN MN 27.0 55011 3.0 1.0 0.0 4.0 ... 13.8 433.316718 0.0 0.0 3.0 ANOKA 27003.0 0.428565 1 1.163151

5 rows × 30 columns

Conventional approaches

Before comparing ADVI approximations on hierarchical models, let's model radon exposure with conventional approaches:

Complete pooling:

Treat all counties the same, and estimate a single radon level. $$ y_i = \alpha + \beta x_i + \epsilon_i $$ where $y_i$ is the logarithm of the radon level in house $i$, $x_i$ is the floor of measurement (either basement or first floor) and $\epsilon_i$ is an error term representing measurement error, temporal within-house variation, or variation among houses. The model translates directly to PyMC4:

In [ ]:
@pm.model
def pooled_model():
    # a[0] is the common intercept, a[1] the slope for the floor effect
    a = yield pm.Normal('a', loc=0.0, scale=10.0, batch_stack=2)

    loc = a[0] + a[1]*floor
    scale = yield pm.Exponential("sigma", rate=1.0)

    y = yield pm.Normal('y', loc=loc, scale=scale, observed=data.log_radon.values)

Before running the model, let's do some prior predictive checks; these help us incorporate scientific knowledge into the model.

In [ ]:
prior_checks = pm.sample_prior_predictive(pooled_model())
prior_checks
Out[ ]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:               (chain: 1, draw: 1000, pooled_model/a_dim_0: 2, pooled_model/y_dim_0: 919)
      Coordinates:
        * chain                 (chain) int64 0
        * draw                  (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
        * pooled_model/a_dim_0  (pooled_model/a_dim_0) int64 0 1
        * pooled_model/y_dim_0  (pooled_model/y_dim_0) int64 0 1 2 3 ... 916 917 918
      Data variables:
          pooled_model/a        (chain, draw, pooled_model/a_dim_0) float32 3.73589...
          pooled_model/sigma    (chain, draw) float32 0.124252275 ... 0.1704529
          pooled_model/y        (chain, draw, pooled_model/y_dim_0) float32 -11.425...
      Attributes:
          created_at:     2020-09-06T15:12:11.500262
          arviz_version:  0.9.0

To make our lives easier when plotting and diagnosing with ArviZ, we define a function remove_scope that renames all variables in the InferenceData to their actual distribution names.

In [ ]:
def remove_scope(idata):
    # Strip the model-name scope, e.g. "pooled_model/a" -> "a", in every group
    for group in idata._groups:
        for var in getattr(idata, group).variables:
            if "/" in var:
                idata.rename(name_dict={var: var.split("/")[-1]}, inplace=True)
    idata.rename(name_dict={"y_dim_0": "obs_id"}, inplace=True)
In [ ]:
remove_scope(prior_checks)
prior_checks
Out[ ]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:  (a_dim_0: 2, chain: 1, draw: 1000, obs_id: 919)
      Coordinates:
        * chain    (chain) int64 0
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * a_dim_0  (a_dim_0) int64 0 1
        * obs_id   (obs_id) int64 0 1 2 3 4 5 6 7 ... 911 912 913 914 915 916 917 918
      Data variables:
          a        (chain, draw, a_dim_0) float32 3.7358973 -15.320486 ... -6.4341984
          sigma    (chain, draw) float32 0.124252275 0.21816605 ... 0.1704529
          y        (chain, draw, obs_id) float32 -11.425111 3.614398 ... 6.6331916
      Attributes:
          created_at:     2020-09-06T15:12:11.500262
          arviz_version:  0.9.0

In [ ]:
_, ax = plt.subplots()
prior_checks.assign_coords(coords={"a_dim_0": ["Basement", "First Floor"]}, inplace=True)
prior_checks.prior_predictive.plot.scatter(x="a_dim_0", y="a", color="k", alpha=0.2, ax=ax)
ax.set(xlabel="Level", ylabel="Radon level (Log Scale)");

As PyMC4's ModelTemplate has no coords and dims integration yet, we need a bit of extra manipulation to handle them. Here we assign_coords to the dimension of variable a so that its two entries are labeled Basement and First Floor.

Before seeing the data, these priors seem to allow for quite a wide range of the mean log radon level. Let's fire up the Variational Inference machinery and fit the model:

In [ ]:
pooled_advi = pm.fit(pooled_model(), num_steps=25_000)
In [ ]:
def plot_elbo(loss):
    plt.plot(loss)
    plt.yscale("log")
    plt.xlabel("Number of iterations")
    plt.ylabel("Negative ELBO")

plot_elbo(pooled_advi.losses)

Looks good; the ELBO seems to have converged. As a sanity check, we will plot the ELBO after fitting each new model to check its convergence.
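
Beyond eyeballing the curve, a rough numerical check is to compare the average loss over the last two blocks of iterations; if the two averages are close, the optimization has plateaued. A minimal sketch, assuming (as plotted above) that pooled_advi.losses is a 1-D array of per-step negative ELBO values:

In [ ]:
# Compare the mean negative ELBO over the last two blocks of 1000 steps
losses = np.asarray(pooled_advi.losses)
prev_mean = losses[-2000:-1000].mean()  # steps 23000-24000
last_mean = losses[-1000:].mean()       # steps 24000-25000
print(f"relative change: {abs(last_mean - prev_mean) / abs(prev_mean):.2%}")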

Now we'll draw samples from the posterior distribution and pass them to sample_posterior_predictive to estimate the uncertainty in radon levels at the Basement and First Floor.

In [ ]:
pooled_advi_samples = pooled_advi.approximation.sample(2_000)
pooled_advi_samples
Out[ ]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:                   (chain: 1, draw: 2000, pooled_model/a_dim_0: 2)
      Coordinates:
        * chain                     (chain) int64 0
        * draw                      (draw) int64 0 1 2 3 4 ... 1996 1997 1998 1999
        * pooled_model/a_dim_0      (pooled_model/a_dim_0) int64 0 1
      Data variables:
          pooled_model/a            (chain, draw, pooled_model/a_dim_0) float32 1.4...
          pooled_model/__log_sigma  (chain, draw) float32 -0.262588 ... -0.23297022
          pooled_model/sigma        (chain, draw) float32 0.7690587 ... 0.79217714
      Attributes:
          created_at:     2020-09-06T15:12:24.652480
          arviz_version:  0.9.0

    • <xarray.Dataset>
      Dimensions:               (pooled_model/y_dim_0: 919)
      Coordinates:
        * pooled_model/y_dim_0  (pooled_model/y_dim_0) int64 0 1 2 3 ... 916 917 918
      Data variables:
          pooled_model/y        (pooled_model/y_dim_0) float64 0.8329 0.8329 ... 1.099
      Attributes:
          created_at:     2020-09-06T15:12:24.654109
          arviz_version:  0.9.0

In [ ]:
posterior_predictive = pm.sample_posterior_predictive(pooled_model(), pooled_advi_samples)
remove_scope(posterior_predictive)
posterior_predictive
Out[ ]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:      (a_dim_0: 2, chain: 1, draw: 2000)
      Coordinates:
        * chain        (chain) int64 0
        * draw         (draw) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999
        * a_dim_0      (a_dim_0) int64 0 1
      Data variables:
          a            (chain, draw, a_dim_0) float32 1.4208013 ... -0.6438127
          __log_sigma  (chain, draw) float32 -0.262588 -0.24097891 ... -0.23297022
          sigma        (chain, draw) float32 0.7690587 0.7858582 ... 0.79217714
      Attributes:
          created_at:     2020-09-06T15:12:24.652480
          arviz_version:  0.9.0

    • <xarray.Dataset>
      Dimensions:  (chain: 1, draw: 2000, obs_id: 919)
      Coordinates:
        * chain    (chain) int64 0
        * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
        * obs_id   (obs_id) int64 0 1 2 3 4 5 6 7 ... 911 912 913 914 915 916 917 918
      Data variables:
          y        (chain, draw, obs_id) float32 0.8419045 0.9926412 ... 0.36609662
      Attributes:
          created_at:     2020-09-06T15:12:25.247406
          arviz_version:  0.9.0

    • <xarray.Dataset>
      Dimensions:  (obs_id: 919)
      Coordinates:
        * obs_id   (obs_id) int64 0 1 2 3 4 5 6 7 ... 911 912 913 914 915 916 917 918
      Data variables:
          y        (obs_id) float64 0.8329 0.8329 1.099 0.09531 ... 1.629 1.335 1.099
      Attributes:
          created_at:     2020-09-06T15:12:24.654109
          arviz_version:  0.9.0

We now want to calculate the highest density interval given by the posterior predictive on radon levels. However, we are not interested in the HDI of each observation but in the HDI of each level (either Basement or First Floor). We therefore group the posterior_predictive samples by floor level and pass the dimensions ("chain", "draw", "obs_id") to az.hdi so that the interval is computed over all of them.

In [ ]:
floor = xr.DataArray(floor, dims=("obs_id"))  # wrap as DataArray for xarray grouping/indexing
hdi_helper = lambda ds: az.hdi(ds, input_core_dims=[["chain", "draw", "obs_id"]])
hdi_ppc = posterior_predictive.posterior_predictive["y"].groupby(floor).apply(hdi_helper)["y"]
hdi_ppc
Out[ ]:
<xarray.DataArray 'y' (group: 2, hdi: 2)>
array([[-0.12597895,  2.84736681],
       [-0.7139045 ,  2.26243448]])
Coordinates:
  * hdi      (hdi) <U6 'lower' 'higher'
  * group    (group) int64 0 1

In addition, ArviZ stores the hdi_prob used as an attribute of the hdi coordinate.
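
In a static notebook there is no interactive view to expand, but we can inspect that attribute directly. A minimal check, assuming this ArviZ version attaches hdi_prob to the coordinate's attrs as described:

In [ ]:
# Show the hdi_prob attribute attached to the hdi coordinate
print(hdi_ppc["hdi"].attrs)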

We will now add one extra coordinate to the observed_data group: the Level labels (not indices). This allows xarray to automatically generate the correct xlabel and xticklabels, so we don't have to worry much about labeling. In this particular case we will only make one plot, which makes adding a coordinate a bit of overkill; in many cases, however, we will have several plots, and this approach automates labeling for all of them. Finally, we sort by the Level coordinate to make sure Basement is the first value and appears at the left of the plot.

In [ ]:
posterior_predictive.rename(name_dict={"a_dim_0": "Level"}, inplace=True)
posterior_predictive.assign_coords({"Level": ["Basement", "First Floor"]}, inplace=True)
level_labels = posterior_predictive.posterior.Level[floor]
posterior_predictive.observed_data = posterior_predictive.observed_data.assign_coords(Level=level_labels).sortby("Level")

Plot the point estimates of the slope and intercept for the complete pooling model.

In [ ]:
xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "First Floor"]})
posterior_predictive.posterior["a"] = posterior_predictive.posterior.a[:, :, 0] + posterior_predictive.posterior.a[:, :, 1] * xvals
pooled_means = posterior_predictive.posterior.mean(dim=("chain", "draw"))

_, ax = plt.subplots()
posterior_predictive.observed_data.plot.scatter(x="Level", y="y", label="Observations", alpha=0.4, ax=ax)

az.plot_hdi(
    [0, 1], hdi_data=hdi_ppc, fill_kwargs={"alpha": 0.2, "label": "Exp. distrib. of Radon levels"}, ax=ax
)

az.plot_hdi(
    [0, 1], posterior_predictive.posterior.a, fill_kwargs={"alpha": 0.5, "label": "Exp. mean HPD"}, ax=ax
)
ax.plot([0, 1], pooled_means.a, label="Exp. mean")

ax.set_ylabel("Log radon level")
ax.legend(ncol=2, fontsize=9, frameon=True);

The 94% interval of the expected value is very narrow, and even narrower for basement measurements, meaning that the model is slightly more confident about those observations. The sampling distribution of individual radon levels is much wider. We can infer that floor level does account for some of the variation in radon levels. However, the model underestimates the dispersion in radon levels across households: lots of them lie outside the light orange prediction envelope. The residual errors are also large, indicating high bias. So this model is a good start, but we can't stop there.
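
To put a number on "lots of them lie outside the envelope", we can count the households whose observed log radon level falls outside the 94% posterior-predictive interval for their floor level. This is a rough sketch reusing hdi_ppc and the floor DataArray defined above:

In [ ]:
# Fraction of observations outside the 94% posterior-predictive envelope
obs = xr.DataArray(data.log_radon.values, dims="obs_id")
lower = hdi_ppc.sel(hdi="lower")[floor]    # per-household lower bound
upper = hdi_ppc.sel(hdi="higher")[floor]   # per-household upper bound
outside = ((obs < lower) | (obs > upper)).mean().item()
print(f"{outside:.1%} of households fall outside the envelope")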

No pooling:

Here we do not pool the estimates of the intercepts, but completely pool the slope estimates, assuming the variance is the same within each county. $$ y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i $$ where $j = 1, \ldots, 85$ indexes the counties.

In [ ]:
@pm.model
def unpooled_model():
    # One intercept per county; a single common slope for the floor effect
    a_county = yield pm.Normal('a_county', loc=0., scale=10., batch_stack=counties)
    beta = yield pm.Normal('beta', loc=0, scale=10.)
    loc = tf.gather(a_county, county_idx) + beta*floor
    scale = yield pm.Exponential("sigma", rate=1.)

    y = yield pm.Normal('y', loc=loc, scale=scale, observed=data.log_radon.values)
In [ ]:
unpooled_advi = pm.fit(unpooled_model(), num_steps=25_000)
plot_elbo(unpooled_advi.losses)
In [ ]:
unpooled_advi_samples = unpooled_advi.approximation.sample(2_000)
remove_scope(unpooled_advi_samples)
unpooled_advi_samples
Out[ ]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:         (a_county_dim_0: 85, chain: 1, draw: 2000)
      Coordinates:
        * chain           (chain) int64 0
        * draw            (draw) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999
        * a_county_dim_0  (a_county_dim_0) int64 0 1 2 3 4 5 6 ... 79 80 81 82 83 84
      Data variables:
          a_county        (chain, draw, a_county_dim_0) float32 0.36999083 ... 1.03...
          beta            (chain, draw) float32 -0.56419456 -0.76161397 ... -0.6635491
          __log_sigma     (chain, draw) float32 -0.28598142 -0.28851208 ... -0.3346977
          sigma           (chain, draw) float32 0.7512766 0.7493777 ... 0.71555436
      Attributes:
          created_at:     2020-09-06T15:12:43.221131
          arviz_version:  0.9.0

    • <xarray.Dataset>
      Dimensions:  (obs_id: 919)
      Coordinates:
        * obs_id   (obs_id) int64 0 1 2 3 4 5 6 7 ... 911 912 913 914 915 916 917 918
      Data variables:
          y        (obs_id) float64 0.8329 0.8329 1.099 0.09531 ... 1.629 1.335 1.099
      Attributes:
          created_at:     2020-09-06T15:12:43.223073
          arviz_version:  0.9.0

Let's plot each county's expected value along with its 94% credible interval.

In [ ]:
unpooled_advi_samples.assign_coords(coords={"a_county_dim_0": mn_counties}, inplace=True)
az.plot_forest(
    unpooled_advi_samples, var_names="a_county", figsize=(6, 16), combined=True, textsize=8
);