<h1 id="from-conditional-probability-to-conditional-distribution-to-conditional-expectation-and-back">From conditional probability to conditional distribution to conditional expectation, and back</h1>
<p>Alexej Gossmann, 2018-08-12, <a href="https://0foldcv.com/2018/08/12/conditional_distributions">0-fold Cross-Validation</a></p>
<p>I can’t count how many times I have looked up the formal (measure theoretic) definitions of conditional probability distribution or conditional expectation (even though it’s not that hard :weary:). Another such occasion was yesterday. This time I took some notes.</p>
<h2 id="from-conditional-probability--to-conditional-distribution--to-conditional-expectation">From conditional probability → to conditional distribution → to conditional expectation</h2>
<p>Let <script type="math/tex">X</script> and <script type="math/tex">Y</script> be two real-valued random variables.</p>
<h3 id="conditional-probability">Conditional probability</h3>
<p>For a fixed set <script type="math/tex">B</script>, <a href="#FellerVol2">(Feller, 1966, p. 157)</a> defines the conditional probability of the event <script type="math/tex">\{Y \in B\}</script> for given <script type="math/tex">X</script> as follows.</p>
<blockquote>
<p>By <script type="math/tex">\prob(Y \in B \vert X)</script> (in words, “a conditional probability of the event <script type="math/tex">\{Y \in B\}</script> for given <script type="math/tex">X</script>”) is meant a function <script type="math/tex">q(X, B)</script> such that for every set <script type="math/tex">A \subseteq \mathbb{R}</script></p>
<script type="math/tex; mode=display">\prob(X \in A, Y \in B) = \int_A q(x, B) \mu(dx)</script>
<p>where <script type="math/tex">\mu</script> is the marginal distribution of <script type="math/tex">X</script>.</p>
</blockquote>
<p>(where <script type="math/tex">A</script> and <script type="math/tex">B</script> are both <a href="https://en.wikipedia.org/wiki/Borel_set">Borel sets</a> on <script type="math/tex">\R</script>.)</p>
<p>That is, the conditional probability can be defined as something that, when integrated with respect to the marginal distribution of <script type="math/tex">X</script>, results in the joint probability of <script type="math/tex">X</script> and <script type="math/tex">Y</script>.</p>
<p>Moreover, note that if <script type="math/tex">A = \R</script> then the above formula yields <script type="math/tex">\prob(Y \in B)</script>, the marginal probability of the event <script type="math/tex">\{ Y \in B \}</script>.</p>
<h4 id="example">Example</h4>
<p>For example, if the joint distribution of two random variables <script type="math/tex">X</script> and <script type="math/tex">Y</script> is the following <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Bivariate_case">bivariate normal</a> distribution</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
X \\
Y
\end{pmatrix}
\sim \mathcal{N} \left(
\begin{pmatrix}
\mu_X \\
\mu_Y
\end{pmatrix},
\begin{pmatrix}
\sigma^2_X & \rho \sigma_X \sigma_Y \\
\rho \sigma_X \sigma_Y & \sigma^2_Y
\end{pmatrix}
\right), %]]></script>
<p>then by sitting down with a pen and paper for some amount of time, it is not hard to verify that the function</p>
<script type="math/tex; mode=display">q(x, B) = \int_B \frac{1}{\sqrt{2\pi(1-\rho^2)}\sigma_Y} \exp\left(-\frac{\left(y - \mu_Y - \frac{\sigma_Y}{\sigma_X}\rho (x - \mu_X)\right)^2}{2(1-\rho^2)\sigma_Y^2}\right) \mathrm{d}y</script>
<p>in this case satisfies the above definition of <script type="math/tex">\prob(Y \in B \vert X)</script>.</p>
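<p>To make the pen-and-paper verification concrete, here is a small Monte Carlo sketch (the parameter values and the intervals <script type="math/tex">A</script> and <script type="math/tex">B</script> below are made up for illustration): it estimates the left-hand side <script type="math/tex">\prob(X \in A, Y \in B)</script> from samples of the bivariate normal, and computes the right-hand side by integrating <script type="math/tex">q(x, B)</script> against the marginal density of <script type="math/tex">X</script>.</p>

```python
# Monte Carlo check of P(X in A, Y in B) = \int_A q(x, B) mu(dx) for the
# bivariate normal example; all parameter values below are made up.
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(0)
mu_x, mu_y, s_x, s_y, rho = 1.0, -1.0, 2.0, 1.5, 0.6
cov = [[s_x**2, rho * s_x * s_y], [rho * s_x * s_y, s_y**2]]
A, B = (0.0, 2.0), (-2.0, 0.0)   # two arbitrary intervals

def q(x, B):
    # Y | X = x is normal with mean mu_y + rho*(s_y/s_x)*(x - mu_x)
    # and standard deviation s_y*sqrt(1 - rho^2)
    m = mu_y + rho * (s_y / s_x) * (x - mu_x)
    s = s_y * np.sqrt(1.0 - rho**2)
    return stats.norm.cdf(B[1], m, s) - stats.norm.cdf(B[0], m, s)

# left-hand side, estimated by simulation
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=500_000)
lhs = np.mean((A[0] < xy[:, 0]) & (xy[:, 0] < A[1])
              & (B[0] < xy[:, 1]) & (xy[:, 1] < B[1]))

# right-hand side: integrate q(x, B) against the marginal density of X
rhs, _ = quad(lambda x: q(x, B) * stats.norm.pdf(x, mu_x, s_x), *A)
print(lhs, rhs)  # the two numbers agree up to Monte Carlo error
```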
<h3 id="conditional-distribution">Conditional distribution</h3>
<p>Later on <a href="#FellerVol2">(Feller, 1966, p. 159)</a> follows up with the notion of conditional probability distribution:</p>
<blockquote>
<p>By a conditional probability distribution of <script type="math/tex">Y</script> for given <script type="math/tex">X</script> is meant a function <script type="math/tex">q</script> of two variables, a point <script type="math/tex">x</script> and a set <script type="math/tex">B</script>, such that</p>
<ol>
<li>
<p>for a fixed set <script type="math/tex">B</script></p>
<script type="math/tex; mode=display">q(X, B) = \prob(Y \in B \vert X )</script>
<p>is a conditional probability of the event <script type="math/tex">\{Y \in B\}</script> for given <script type="math/tex">X</script>.</p>
</li>
<li>
<p><script type="math/tex">q</script> is for each <script type="math/tex">x</script> a probability distribution.</p>
</li>
</ol>
</blockquote>
<p>It is also pointed out that</p>
<blockquote><p>In effect a conditional probability distribution is a family of ordinary probability distributions and so the whole theory carries over without change.</p><cite><a href="#FellerVol2">(Feller, 1966)</a></cite></blockquote>
<p>When I first came across this viewpoint, I found it incredibly enlightening to regard the conditional probability distribution as a <em>family</em> of ordinary probability distributions. :smile:</p>
<h4 id="example-1">Example</h4>
<p>For example, assume that <script type="math/tex">X</script> is an integer-valued and non-negative random variable, and that the conditional probability distribution of <script type="math/tex">Y</script> for given <script type="math/tex">X</script> is an <a href="https://en.wikipedia.org/wiki/F-distribution">F-distribution</a> (denoted <script type="math/tex">\mathrm{F}(d_1, d_2)</script>) with <script type="math/tex">d_1 = e^X</script> and <script type="math/tex">d_2 = 2^X</script> degrees of freedom.
Then the conditional probability distribution of <script type="math/tex">(Y \vert X)</script> can be regarded as a family of probability distributions <script type="math/tex">\mathrm{F}(e^x, 2^x)</script> for <script type="math/tex">x = 0, 1, 2, \dots</script>, whose probability density functions look like this:</p>
<p><img src="https://0foldcv.com/assets/img/2018-08-12-conditional_distributions/conditional_densities.png" alt="Probability density functions of (Y|X=x) for different values x" /></p>
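<p>A quick sketch of this family in Python (using scipy’s <code>stats.f</code>, whose <code>dfn</code> and <code>dfd</code> parameters correspond to <script type="math/tex">d_1</script> and <script type="math/tex">d_2</script>): evaluate the densities of <script type="math/tex">\mathrm{F}(e^x, 2^x)</script> on a grid, one member of the family per value of <script type="math/tex">x</script>.</p>

```python
# the family of conditional densities q(x, .) from the example:
# Y | X = x follows F(e^x, 2^x)
import numpy as np
from scipy import stats

y = np.linspace(0.01, 4, 400)
densities = {x: stats.f.pdf(y, dfn=np.exp(x), dfd=2.0**x) for x in range(6)}

# each member of the family is an ordinary probability distribution:
# its cdf tends to 1 (slowly for small x, since F-distributions
# with few degrees of freedom are heavy-tailed)
for x in range(6):
    print(x, stats.f.cdf(1e8, dfn=np.exp(x), dfd=2.0**x))
```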
<p>In addition, as pointed out above, if we know the marginal distribution of <script type="math/tex">X</script>, then the conditional probability distribution of <script type="math/tex">(Y \vert X)</script> can be used to obtain the marginal probability distribution of <script type="math/tex">Y</script>, or to randomly sample from the marginal distribution.
Practically it means that if we randomly generate a value of <script type="math/tex">X</script> according to its probability distribution, and use this value to randomly generate a value of <script type="math/tex">Y</script> according to the conditional distribution of <script type="math/tex">Y</script> for the given <script type="math/tex">X</script>, then the observations resulting from this procedure follow the marginal distribution of <script type="math/tex">Y</script>.
Continuing the previous example, assume that <script type="math/tex">X</script> follows a <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a> with parameters <script type="math/tex">n = 5</script> and <script type="math/tex">p = 0.5</script>. Then the described simulation procedure estimates the following shape for the probability density function of the marginal distribution of <script type="math/tex">Y</script>:</p>
<p><img src="https://0foldcv.com/assets/img/2018-08-12-conditional_distributions/marginal_density.png" alt="Probability density function of Y" /></p>
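<p>The two-step sampling procedure described above can be sketched as follows (a minimal simulation under the example’s assumptions):</p>

```python
# two-step sampling: draw X ~ Binomial(5, 0.5), then Y | X = x ~ F(e^x, 2^x);
# the resulting draws of Y follow the marginal distribution of Y
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000
x = rng.binomial(n=5, p=0.5, size=n)                                  # step 1
y = stats.f.rvs(dfn=np.exp(x), dfd=2.0**x, size=n, random_state=rng)  # step 2

# e.g., the fraction of draws with Y <= 1 estimates P(Y <= 1)
print(np.mean(y <= 1.0))
```

<p>A histogram or kernel density estimate of <code>y</code> yields the shape shown in the figure above.</p>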
<h3 id="conditional-expectation">Conditional expectation</h3>
<p>Finally, <a href="#FellerVol2">(Feller, 1966, p. 159)</a> introduces the notion of conditional expectation.
By the above, for a given value <script type="math/tex">x</script> we have that</p>
<script type="math/tex; mode=display">q(x, B) = \prob(Y \in B \vert X = x), \quad\forall B\in\mathcal{B}</script>
<p>(here <script type="math/tex">\mathcal{B}</script> denotes the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel <script type="math/tex">\sigma</script>-algebra</a> on <script type="math/tex">\R</script>), and therefore, a conditional probability distribution can be viewed as a family of ordinary probability distributions (represented by <script type="math/tex">q</script> for different <script type="math/tex">x</script>s).
Thus, as <a href="#FellerVol2">(Feller, 1966, p. 159)</a> points out, if <script type="math/tex">q</script> is given then the conditional expectation <em>“introduces a new notation rather than a new concept.”</em></p>
<blockquote>
<p>A conditional expectation <script type="math/tex">E(Y \vert X)</script> is a function of <script type="math/tex">X</script> assuming at <script type="math/tex">x</script> the value</p>
<script type="math/tex; mode=display">\E(Y \vert X = x) = \int_{-\infty}^{\infty} y q(x, dy)</script>
<p>provided the integral converges.</p>
</blockquote>
<p>Note that, because <script type="math/tex">\E(Y \vert X)</script> is a function of <script type="math/tex">X</script>, it is a random variable, whose value at an individual point <script type="math/tex">x</script> is given by the above definition.
Moreover, from the above definitions of conditional probability and conditional expectation it follows that</p>
<script type="math/tex; mode=display">\E(Y) = \E(\E(Y \vert X)).</script>
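<p>This identity is easy to check numerically. For instance, returning to the bivariate normal example (again with made-up parameter values), the conditional expectation is the random variable <script type="math/tex">\E(Y \vert X) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(X - \mu_X)</script>, and its expectation should recover <script type="math/tex">\mu_Y = \E(Y)</script>:</p>

```python
# Monte Carlo illustration of E(Y) = E(E(Y | X)) for the bivariate
# normal example; parameter values are made up.
import numpy as np

rng = np.random.default_rng(2)
mu_x, mu_y, s_x, s_y, rho = 1.0, -1.0, 2.0, 1.5, 0.6
cov = [[s_x**2, rho * s_x * s_y], [rho * s_x * s_y, s_y**2]]
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)

e_y = xy[:, 1].mean()                                    # E(Y)
cond_exp = mu_y + rho * (s_y / s_x) * (xy[:, 0] - mu_x)  # E(Y | X), a r.v.
print(e_y, cond_exp.mean())  # both approach mu_y = -1
```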
<h4 id="example-cont">Example [cont.]</h4>
<p>We continue with the last example.
From the properties of the <a href="https://en.wikipedia.org/wiki/F-distribution">F-distribution</a> we know that under this example’s assumptions on the conditional distribution, it holds that</p>
<script type="math/tex; mode=display">\E(Y \vert X = x) =
\begin{cases}
\frac{d_2}{d_2 - 2} = \frac{2^x}{2^x - 2}, \quad x > 1,\\
\infty, \quad x \leq 1.
\end{cases}</script>
<p>This is a rather boring, strictly decreasing function of <script type="math/tex">x</script>, converging to <script type="math/tex">1</script> as <script type="math/tex">x\to\infty</script>.</p>
<p>Thus, under the example’s assumption on the distribution of <script type="math/tex">X</script>, the conditional expectation <script type="math/tex">\E(Y \vert X)</script> is a discrete random variable, which has non-zero probability mass at the values <script type="math/tex">2, 4/3, 8/7, 16/15,</script> and <script type="math/tex">\infty</script>.</p>
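<p>The full distribution of this discrete random variable is easy to tabulate: its values are <script type="math/tex">2^x / (2^x - 2)</script> for <script type="math/tex">x > 1</script> (and <script type="math/tex">\infty</script> for <script type="math/tex">x \leq 1</script>), weighted by the binomial probabilities of <script type="math/tex">X</script>:</p>

```python
# the distribution of the discrete random variable E(Y | X) in the example
import numpy as np
from scipy import stats

values = {x: 2.0**x / (2.0**x - 2.0) if x > 1 else np.inf for x in range(6)}
probs = {x: stats.binom.pmf(x, n=5, p=0.5) for x in range(6)}

for x in range(6):
    print(x, values[x], round(float(probs[x]), 4))

# probability mass at infinity: P(X <= 1) = 6/32
print(probs[0] + probs[1])
```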
<h2 id="from-conditional-expectation--to-conditional-probability">From conditional expectation → to conditional probability</h2>
<p>An alternative approach is to define the conditional expectation first, and then to define conditional probability as the conditional expectation of <a href="https://en.wikipedia.org/wiki/Indicator_function">the indicator function</a>.
This approach seems less intuitive to me. However, it is more flexible and more general, as we see below.</p>
<h3 id="conditional-expectation-1">Conditional expectation</h3>
<h4 id="a-definition-in-2d">A definition in 2D</h4>
<p>Let <script type="math/tex">X</script> and <script type="math/tex">Y</script> be two real-valued random variables, and let <script type="math/tex">\mathcal{B}</script> denote the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel <script type="math/tex">\sigma</script>-algebra</a> on <script type="math/tex">\R</script>.
Recall that <script type="math/tex">X</script> and <script type="math/tex">Y</script> can be represented as mappings <script type="math/tex">X: \Omega \to \R</script> and <script type="math/tex">Y: \Omega \to \R</script> over some <a href="https://en.wikipedia.org/wiki/Probability_space">probability space</a> <script type="math/tex">(\Omega, \mathcal{A}, \prob)</script>.
We can define <script type="math/tex">\mathrm{E}(Y \vert X=x)</script>, the conditional expectation of <script type="math/tex">Y</script> given <script type="math/tex">X=x</script>, as follows.</p>
<p>A <script type="math/tex">\mathcal{B}</script>-measurable function <script type="math/tex">g(x)</script> is the conditional expectation of <script type="math/tex">Y</script> for given <script type="math/tex">x</script>, i.e.,</p>
<script type="math/tex; mode=display">\mathrm{E}(Y \vert X=x) = g(x),</script>
<p>if for all sets <script type="math/tex">B\in\mathcal{B}</script> it holds that</p>
<script type="math/tex; mode=display">\int_{X^{-1}(B)} Y(\omega) d\prob(\omega) = \int_{B} g(x) d\prob^X(x),</script>
<p>where <script type="math/tex">\prob^X</script> is the marginal probability distribution of <script type="math/tex">X</script>.</p>
<h4 id="interpretation-in-2d">Interpretation in 2D</h4>
<p>If <script type="math/tex">X</script> and <script type="math/tex">Y</script> are real-valued and one-dimensional, then the pair <script type="math/tex">(X,Y)</script> can be viewed as a random vector in the plane.
Each set <script type="math/tex">\{X \in A\}</script>, where <script type="math/tex">A</script> is a Borel set on the line, consists of lines parallel to the <script type="math/tex">y</script>-axis.
The collection of all such sets forms a <script type="math/tex">\sigma</script>-algebra <script type="math/tex">\mathcal{A}</script> on the plane, which is contained in the <script type="math/tex">\sigma</script>-algebra of all Borel sets in <script type="math/tex">\R^2</script>.
<script type="math/tex">\mathcal{A}</script> is called the <script type="math/tex">\sigma</script>-algebra generated by the random variable <script type="math/tex">X</script>.</p>
<p>Then <script type="math/tex">\mathrm{E}(Y \vert X)</script> can be equivalently defined as a random variable such that</p>
<script type="math/tex; mode=display">\mathrm{E}(Y\cdot I_{A}) = \mathrm{E}(\mathrm{E}(Y \vert X) \cdot I_{A}), \quad \forall A\in\mathcal{A},</script>
<p>where <script type="math/tex">I_{A}</script> denotes the indicator function of the set <script type="math/tex">A</script>.</p>
<h4 id="a-more-general-definition-of-conditional-expectation">A more general definition of conditional expectation</h4>
<p>The last paragraph illustrates that one could generalize the definition of the conditional expectation of <script type="math/tex">Y</script> given <script type="math/tex">X</script> to the conditional expectation of <script type="math/tex">Y</script> given an arbitrary <script type="math/tex">\sigma</script>-algebra <script type="math/tex">\mathcal{B}</script> (not necessarily the <script type="math/tex">\sigma</script>-algebra generated by <script type="math/tex">X</script>).
This leads to the following general definition, which is stated in <a href="#FellerVol2">(Feller, 1966, pp. 160-161)</a> in a slightly different notation.</p>
<p>Let <script type="math/tex">Y</script> be a random variable, and let <script type="math/tex">\mathcal{B}</script> be a <script type="math/tex">\sigma</script>-algebra of sets.</p>
<ol>
<li>
<p>A random variable <script type="math/tex">U</script> is called a conditional expectation of <script type="math/tex">Y</script> relative to <script type="math/tex">\mathcal{B}</script>, or <script type="math/tex">U = \E(Y \vert \mathcal{B})</script>, if it is <script type="math/tex">\mathcal{B}</script>-measurable and</p>
<script type="math/tex; mode=display">\E(Y\cdot I_{B}) = \E(U \cdot I_{B}), \quad \forall B\in\mathcal{B}.</script>
</li>
<li>
<p>If <script type="math/tex">\mathcal{B}</script> is the <script type="math/tex">\sigma</script>-algebra generated by a random variable <script type="math/tex">X</script>, then <script type="math/tex">\E(Y \vert X) = \E(Y \vert \mathcal{B})</script>.</p>
</li>
</ol>
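<p>The defining property in part 1 can be illustrated with a toy simulation. Here <script type="math/tex">X</script> is discrete (the variables and distributions are made up for illustration), and <script type="math/tex">\E(Y \vert X)</script> is computed by replacing each observation of <script type="math/tex">Y</script> with the mean of <script type="math/tex">Y</script> over its <script type="math/tex">X</script>-group; with in-sample group means, the property holds exactly, up to floating-point rounding.</p>

```python
# toy check of E(Y * I_B) = E(U * I_B) for U = E(Y | X),
# with E(Y | X) computed as group means over a discrete X
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 3, size=300_000)   # X takes the values 0, 1, 2
y = x + rng.normal(size=x.size)        # some Y that depends on X

# E(Y | X): replace each observation by the mean of Y over its X-group
group_mean = {k: y[x == k].mean() for k in range(3)}
u = np.array([group_mean[k] for k in x])

ind = (x <= 1).astype(float)           # B = {X in {0, 1}}, an event in sigma(X)
print(np.mean(y * ind), np.mean(u * ind))  # equal up to float rounding
```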
<h3 id="back-to-conditional-probability-and-conditional-distributions">Back to conditional probability and conditional distributions</h3>
<p>Let <script type="math/tex">I_{\{Y \in A\}}</script> be a random variable that is equal to one if and only if <script type="math/tex">Y\in A</script>, and to zero otherwise. The conditional probability of <script type="math/tex">\{Y \in A\}</script> given <script type="math/tex">X = x</script> can be defined in terms of a conditional expectation as</p>
<script type="math/tex; mode=display">\prob(Y \in A \vert X = x) = \E(I_{\{Y \in A\}} \vert X = x).</script>
<p>Under certain regularity conditions the above defines the conditional probability distribution of <script type="math/tex">(Y \vert X)</script>.</p>
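<p>As a final sanity check, this identity can be verified on the F-distribution example by conditioning on a single value of <script type="math/tex">X</script> (here <script type="math/tex">x = 3</script> and <script type="math/tex">A = (-\infty, 1]</script>, both chosen arbitrarily): the sample mean of the indicator should match the exact conditional probability.</p>

```python
# Monte Carlo sketch of P(Y in A | X = x) = E(I_{Y in A} | X = x),
# using the example where Y | X = x follows F(e^x, 2^x)
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = 3                                  # condition on X = 3
y = stats.f.rvs(dfn=np.exp(x), dfd=2.0**x, size=200_000, random_state=rng)

# sample mean of the indicator vs. the exact conditional probability
print(np.mean(y <= 1.0), stats.f.cdf(1.0, dfn=np.exp(x), dfd=2.0**x))
```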
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="FellerVol2">Feller, W. (1966). <i>An introduction to probability theory and its applications</i> (Vol. 2). John Wiley &amp; Sons.</span></li></ol>