{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Gentle Introduction to Partial Pooled models\n", "\n", "In this blog, we'll learn about the partial pooled models and where they are useful. In addition, we'll use PyMC framework to implement a simple partial pooled model.\n", "\n", "I'll also derive the equations for parameters involved in these models (which will hopefully be fun to read). So, lets get started!\n", "\n", "## A practical example\n", "\n", "Lets take a practical example to understand partial pooled models and related variants.\n", "\n", "Imagine you are a data scientist tasked with analyzing the academic performance of students across various schools in a city. Each school caters to a unique demographic, with varying resources and teaching methods. Your goal is to estimate the average exam scores of students in different schools. However, you face a challenge – some schools have a large number of students with comprehensive data, while other schools are smaller and have limited data available.\n", "\n", "Now lets create some representative data for this task. I'll refer school as a group in some contexts below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Necessary imports\n", "import pymc as pm\n", "import pandas as pd\n", "import numpy as np\n", "import arviz as az\n", "\n", "from typing import List\n", "import matplotlib.pyplot as plt\n", "\n", "%config InlineBackend.figure_format = 'retina'\n", "RANDOM_SEED = 42\n", "az.style.use('arviz-darkgrid')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# The total number of schools in the city\n", "num_of_schools = 50\n", "\n", "# The number of data samples (students) for each school. \n", "data_samples_per_school = np.random.randint(2, 100, size=num_of_schools)\n", "\n", "# The average performance of students in the city. 
Let's say the performance is measured out of 100\n", "true_global_mean = 63\n", "\n", "# The variability of school means around the global mean: each school's average performance follows a normal distribution with mean 63 and standard deviation 2\n", "true_global_variability = 2\n", "\n", "# The variability in the exam scores within each school\n", "within_school_variability = 10" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | School ID | \n", "score | \n", "
---|---|---|
Student ID | \n", "\n", " | \n", " |
1697 | \n", "33 | \n", "64.510645 | \n", "
314 | \n", "6 | \n", "62.005591 | \n", "
1786 | \n", "35 | \n", "67.453794 | \n", "
1730 | \n", "35 | \n", "73.349892 | \n", "
1224 | \n", "23 | \n", "56.454934 | \n", "
\n", " \n", "% Assuming a normal model for the data $y_j$ for group j with a normal prior for the mean $\\alpha_j$.\n", "\n", "% Let $y_j = N(\\alpha_j, \\sigma^2_y)$ and $\\alpha_j = N(\\mu, \\sigma^2_{\\alpha})$\n", "\n", "% The likelihood is given by the normal distribution:\n", "$$\n", "p(y_j | \\alpha_j) = \\frac{1}{\\sqrt{2\\pi\\sigma^2_y}} \\exp\\left(-\\frac{(y_j - \\alpha_j)^2}{2\\sigma^2_y}\\right)\n", "$$\n", "\n", "% The log likelihood is the logarithm of the above:\n", "$$\n", "\\log p(y_j | \\alpha_j) = \\log\\left(\\frac{1}{\\sqrt{2\\pi\\sigma^2_y}}\\right) -\\frac{(y_j - \\alpha_j)^2}{2\\sigma^2_y}\n", "$$\n", "\n", "% Simplifying the constant and assuming a sample size of $n_j$ for group j, i.e. to add log prob terms of all samples as they are i.i.d:\n", "$$\n", "\\log p(y_j | \\alpha_j) = -\\frac{n_j}{2}\\log(2\\pi\\sigma^2_y) -\\frac{1}{2\\sigma^2_y}\\sum_{i=1}^{n_j}(y_{ji} - \\alpha_j)^2\n", "$$\n", "\n", "% For the prior:\n", "$$\n", "p(\\alpha_j) = \\frac{1}{\\sqrt{2\\pi\\sigma^2_\\alpha}} \\exp\\left(-\\frac{(\\alpha_j - \\mu)^2}{2\\sigma^2_\\alpha}\\right)\n", "$$\n", "\n", "% The log prior is the logarithm of the above:\n", "$$\n", "\\log p(\\alpha_j) = \\log\\left(\\frac{1}{\\sqrt{2\\pi\\sigma^2_\\alpha}}\\right) -\\frac{(\\alpha_j - \\mu)^2}{2\\sigma^2_\\alpha}\n", "$$\n", "\n", "% Simplifying the constant:\n", "$$\n", "\\log p(\\alpha_j) = -\\frac{1}{2}\\log(2\\pi\\sigma^2_\\alpha) -\\frac{1}{2\\sigma^2_\\alpha}(\\alpha_j - \\mu)^2\n", "$$\n", "\n", "% Now, combining the log likelihood and the log prior for the log posterior:\n", "$$\n", "\\log p(\\alpha_j | y_j) \\propto -\\frac{n_j}{2}\\log(2\\pi\\sigma^2_y) -\\frac{1}{2\\sigma^2_y}\\sum_{i=1}^{n_j}(y_{ji} - \\alpha_j)^2 -\\frac{1}{2}\\log(2\\pi\\sigma^2_\\alpha) -\\frac{1}{2\\sigma^2_\\alpha}(\\alpha_j - \\mu)^2\n", "$$\n", "\n", "% Taking the derivative of the log posterior with respect to $\\alpha_j$, setting it to zero, and solving for $\\alpha_j$ will yield the MAP estimate.\n", 
"\n", "% Continuing from the log posterior, we have:\n", "$$\n", "\\log p(\\alpha_j | y_j) \\propto -\\frac{1}{2\\sigma^2_y}\\sum_{i=1}^{n_j}(y_{ji} - \\alpha_j)^2 -\\frac{1}{2\\sigma^2_\\alpha}(\\alpha_j - \\mu)^2\n", "$$\n", "\n", "% The derivative of the log posterior with respect to alpha_j is:\n", "$$\n", "\\frac{d}{d\\alpha_j}\\log p(\\alpha_j | y_j) = \\frac{1}{\\sigma^2_y}\\sum_{i=1}^{n_j}(y_{ji} - \\alpha_j) - \\frac{1}{\\sigma^2_\\alpha}(\\alpha_j - \\mu)\n", "$$\n", "\n", "% Setting the derivative to zero for the MAP estimate:\n", "$$\n", "\\frac{1}{\\sigma^2_y}\\sum_{i=1}^{n_j}(y_{ji} - \\alpha_j) - \\frac{1}{\\sigma^2_\\alpha}(\\alpha_j - \\mu) = 0\n", "$$\n", "\n", "% Solving for alpha_j gives us:\n", "$$\n", "\\frac{1}{\\sigma^2_y}\\sum_{i=1}^{n_j}y_{ji} - \\frac{n_j}{\\sigma^2_y}\\alpha_j - \\frac{1}{\\sigma^2_\\alpha}\\alpha_j + \\frac{\\mu}{\\sigma^2_\\alpha} = 0\n", "$$\n", "\n", "% Collect terms involving alpha_j:\n", "$$\n", "\\left(\\frac{n_j}{\\sigma^2_y} + \\frac{1}{\\sigma^2_\\alpha}\\right)\\alpha_j = \\frac{1}{\\sigma^2_y}\\sum_{i=1}^{n_j}y_{ji} + \\frac{\\mu}{\\sigma^2_\\alpha}\n", "$$\n", "\n", "% Since $\\sum_{i=1}^{n_j}y_{ji}$ is just $n_j$ times the sample mean $\\bar{y}_j$:\n", "$$\n", "\\left(\\frac{n_j}{\\sigma^2_y} + \\frac{1}{\\sigma^2_\\alpha}\\right)\\alpha_j = \\frac{n_j}{\\sigma^2_y}\\bar{y}_j + \\frac{\\mu}{\\sigma^2_\\alpha}\n", "$$\n", "\n", "% Divide through by the coefficient of $\\alpha_j$ to solve for it:\n", "$$\n", "\\alpha_j = \\frac{\\frac{n_j}{\\sigma^2_y}\\bar{y}_j + \\frac{\\mu}{\\sigma^2_\\alpha}}{\\frac{n_j}{\\sigma^2_y} + \\frac{1}{\\sigma^2_\\alpha}}\n", "$$\n", "\n", "% If we let $\\mu = \\bar{y}_{\\text{all}}$, we get the original equation provided:\n", "$$\n", "\\hat{\\alpha}_{j}^{\\text{partial-pooled}} \\approx \\frac{\\frac{n_j}{\\sigma^2_y}\\bar{y}_j + \\frac{1}{\\sigma^2_\\alpha}\\bar{y}_{\\text{all}}}{\\frac{n_j}{\\sigma^2_y} + \\frac{1}{\\sigma^2_\\alpha}}\n", "$$\n", "\n", "
\n", "