Guozhen AI: Global AI field notes and model intelligence

Maximum A Posteriori Estimation (MAP)

Category: Bayesian Learning

Lesson #7

Structure Diagram of Maximum A Posteriori Estimation (MAP)

Bayesian learning centers on combining prior beliefs with new evidence while explicitly quantifying uncertainty. As you read, organize your understanding in this order:
Bayes’ Theorem and the Posterior Distribution → Definition of Maximum A Posteriori Estimation (MAP) → MAP Application Example: Coin Tossing → Choosing a Prior Distribution.
Then return to the code, examples, or metrics in the main text to verify each step.

MAP Verification Checklist

After reading, test your understanding using a small real-world task:

  • What are the inputs?
  • Where does processing occur?
  • Is the output verifiable and acceptable?

If the task fails, first revisit “Bayes’ Theorem and the Posterior Distribution”, then check “Definition of Maximum A Posteriori Estimation (MAP)”.

This tutorial takes a closer look at Maximum A Posteriori Estimation (MAP). The previous part covered the fundamentals of Bayes’ theorem and its updating rule. Here we apply Bayes’ theorem to parameter estimation, using MAP to infer unknown parameters.

Bayes’ Theorem and the Posterior Distribution

MAP Decision Card

When learning MAP, first compare what information the likelihood and the prior each contribute. When the sample size is small or the noise is high, the prior strongly influences the final estimate.

The core idea behind Bayes’ theorem is to update our belief about a parameter based on observed data. The posterior distribution represents our updated belief about the parameter after observing the data, expressed mathematically as:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
$$

where:

  • $p(\theta \mid D)$ is the posterior distribution of the parameter $\theta$ given data $D$;
  • $p(D \mid \theta)$ is the likelihood function: the probability of observing data $D$ under parameter $\theta$;
  • $p(\theta)$ is the prior distribution: our belief about $\theta$ before seeing the data;
  • $p(D)$ is the marginal likelihood (or evidence), which is constant with respect to $\theta$ and thus plays no role in the optimization.
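As a rough illustration of how the four terms fit together, the posterior can be approximated on a grid of candidate $\theta$ values. The sketch below assumes a coin-flip setting (7 heads and 3 tails) with a flat prior; the grid resolution and all variable names are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

# Assumed setup for illustration: theta is a coin's heads probability,
# the data D are 7 heads and 3 tails, and the prior is flat on (0, 1).
theta = np.linspace(0.001, 0.999, 999)     # grid of candidate parameter values
dtheta = theta[1] - theta[0]               # grid spacing

prior = np.ones_like(theta)                # p(theta): flat prior
likelihood = theta**7 * (1 - theta)**3     # p(D | theta)

# p(D): the evidence, approximated by a Riemann sum over the grid.
evidence = np.sum(likelihood * prior) * dtheta

# Bayes' theorem: p(theta | D) = p(D | theta) p(theta) / p(D)
posterior = likelihood * prior / evidence

print("posterior integrates to:", np.sum(posterior) * dtheta)
print("posterior peaks at theta =", theta[np.argmax(posterior)])
```

Dividing by the evidence only rescales the curve so that it integrates to one; the location of the peak is already determined by the product of likelihood and prior.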

Definition of Maximum A Posteriori Estimation (MAP)

Bayesian Learning Reading Map Card

You don’t need to absorb all details of “Maximum A Posteriori Estimation (MAP)” at once. Start with a small, hands-on problem you can verify yourself, then use the diagrams and main text to fill in conceptual gaps.

Maximum A Posteriori Estimation (MAP) estimates the parameter value by maximizing the posterior distribution. Specifically, we seek:

$$
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \, p(\theta \mid D)
$$

Using Bayes’ theorem, this is equivalent to:

$$
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \, p(D \mid \theta)\, p(\theta)
$$

since $p(D)$ is independent of $\theta$ and can therefore be omitted during maximization.
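The claim that a constant factor does not move the maximizer can be checked numerically. The sketch below assumes, purely for illustration, 7 heads and 3 tails with a Beta(2, 2)-shaped prior. It also shows the common practice of working with the logarithm of the objective, which preserves the argmax because the logarithm is monotonic. Variable names are hypothetical.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # candidate values strictly inside (0, 1)

# Illustrative assumption: likelihood theta^7 (1 - theta)^3 times a
# Beta(2, 2)-shaped prior theta (1 - theta), both without constants.
unnormalized = theta**7 * (1 - theta)**3 * theta * (1 - theta)

# Dividing by a constant (a stand-in for p(D)) rescales the curve
# but leaves the location of its maximum unchanged.
normalized = unnormalized / unnormalized.sum()

# The logarithm is monotonic, so it also preserves the argmax.
log_unnormalized = np.log(unnormalized)

i_raw = int(np.argmax(unnormalized))
i_norm = int(np.argmax(normalized))
i_log = int(np.argmax(log_unnormalized))

print("argmax indices:", i_raw, i_norm, i_log)
print("maximizing theta on the grid:", theta[i_raw])
```

All three indices coincide, which is why the derivation can drop $p(D)$ without affecting the estimate.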

MAP Application Example: Coin Tossing

Suppose we have a biased coin and wish to estimate the probability $\theta$ that it lands heads up. We toss it 10 times and observe 7 heads (H) and 3 tails (T). We’ll use MAP to estimate $\theta$.

1. Choosing a Prior Distribution

We adopt a Beta distribution as the prior:

$$
p(\theta) = \text{Beta}(\alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}
$$

We select $\alpha = 2$ and $\beta = 2$. This prior is symmetric around $\theta = 0.5$, reflecting a mild belief, held before any tosses, that heads and tails are equally likely.
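One quick way to get a feel for the prior chosen above is to evaluate its density at a few points. The sketch below uses `scipy.stats.beta`; the evaluation points are arbitrary and only meant to show the symmetry around 0.5.

```python
from scipy.stats import beta

# Beta(alpha=2, beta=2) prior; its density is 6 * theta * (1 - theta).
prior = beta(2, 2)

# The density is symmetric around 0.5 and falls off toward 0 and 1.
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p(theta={t}) = {prior.pdf(t):.3f}")

print("prior mean:", prior.mean())
```

Larger values of $\alpha = \beta$ would concentrate the density more tightly around 0.5, which corresponds to a stronger prior belief in a fair coin.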

2. Likelihood Function

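Given 7 heads and 3 tails in 10 tosses, the likelihood of the observed sequence is:

$$
p(D \mid \theta) = \theta^7 (1 - \theta)^3
$$

Counting every ordering of the tosses would multiply this by the binomial coefficient $\binom{10}{7}$, which does not depend on $\theta$ and so does not change the maximizer.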

3. Computing the Posterior Distribution

Maximizing the posterior is equivalent to maximizing the unnormalized posterior:

$$
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \theta^7 (1 - \theta)^3 \cdot \theta^{1} (1 - \theta)^{1} = \theta^{8} (1 - \theta)^{4}
$$
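The expression $\theta^{8}(1 - \theta)^{4}$ has the same shape as a Beta(9, 5) density, that is, Beta($\alpha$ + heads, $\beta$ + tails) with $\alpha = \beta = 2$. The following sketch checks that reading numerically by confirming that the ratio between the `scipy.stats.beta` density and the unnormalized expression is the same constant at every grid point. The grid itself is an arbitrary choice.

```python
import numpy as np
from scipy.stats import beta

theta = np.linspace(0.05, 0.95, 19)

# Unnormalized posterior from the text.
kernel = theta**8 * (1 - theta)**4

# Reading to verify: the normalized posterior is Beta(2 + 7, 2 + 3) = Beta(9, 5).
posterior = beta(9, 5)

# If the two agree up to normalization, their ratio is constant in theta.
ratio = posterior.pdf(theta) / kernel
print("ratio min:", ratio.min())
print("ratio max:", ratio.max())
```

The constant ratio is the normalizing factor $1 / B(9, 5)$, which is exactly the term that MAP estimation is allowed to ignore.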

4. Solving for the MAP Estimate

To find the value of $\theta$ that maximizes $\theta^{8}(1 - \theta)^{4}$, we differentiate and solve for critical points:

$$
\frac{d}{d\theta} \left( \theta^{8} (1 - \theta)^{4} \right) = 0
$$
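For reference, the derivative can be worked out by hand with the product rule. A short sketch of the steps:

$$
\frac{d}{d\theta} \left( \theta^{8} (1 - \theta)^{4} \right) = 8\theta^{7}(1 - \theta)^{4} - 4\theta^{8}(1 - \theta)^{3} = 4\theta^{7}(1 - \theta)^{3}\,(2 - 3\theta)
$$

The factors $\theta^{7}$ and $(1 - \theta)^{3}$ vanish only at the endpoints $\theta = 0$ and $\theta = 1$, where the objective itself is zero. The interior critical point comes from $2 - 3\theta = 0$, which gives

$$
\hat{\theta}_{\text{MAP}} = \frac{2}{3} \approx 0.667
$$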

Alternatively, we can use numerical optimization:

from scipy.optimize import minimize_scalar

# Objective: the unnormalized posterior theta^8 * (1 - theta)^4,
# negated because minimize_scalar minimizes and we want the maximum.
def objective(theta):
    return -(theta**8 * (1 - theta)**4)

# Restrict the search to (0, 1), the valid range for a probability.
result = minimize_scalar(objective, bounds=(0, 1), method='bounded')
theta_map = result.x
print("MAP estimate:", theta_map)  # approximately 0.667

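This code prints the estimated probability $\theta$ of heads, approximately 0.667. That agrees with the analytical solution $\hat{\theta}_{\text{MAP}} = 8/12 = 2/3$ obtained by setting the derivative above to zero. The estimate sits slightly below the observed frequency of 7/10 because the prior pulls it toward 0.5.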
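To see how the prior's influence fades as more data arrive, the sketch below compares the MAP estimate with the plain observed frequency of heads for increasing sample sizes, keeping the share of heads fixed at 70%. It uses the closed-form mode of a Beta posterior, $(h + \alpha - 1) / (n + \alpha + \beta - 2)$, which holds when both posterior parameters exceed 1. The function names and sample sizes are illustrative assumptions.

```python
def observed_frequency(heads, n):
    # Fraction of tosses that came up heads (no prior involved).
    return heads / n

def map_beta(heads, n, a=2.0, b=2.0):
    # Mode of the Beta(a + heads, b + n - heads) posterior.
    # Valid when both posterior parameters are greater than 1.
    return (heads + a - 1) / (n + a + b - 2)

for n in (10, 100, 1000, 10000):
    heads = 7 * n // 10   # keep the observed share of heads at 70%
    freq = observed_frequency(heads, n)
    est = map_beta(heads, n)
    print(f"n={n:>5}  frequency={freq:.4f}  MAP={est:.4f}")
```

With 10 tosses the Beta(2, 2) prior moves the estimate noticeably toward 0.5, while with thousands of tosses the MAP estimate and the observed frequency are nearly indistinguishable.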

MAP Application Retrospective Card

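If “Maximum A Posteriori Estimation (MAP)” hasn’t fully clicked yet, walk through the four steps of the coin example again: choose a prior, write down the likelihood, form the unnormalized posterior, and solve for its maximum.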

MAP Application Verification Card

When reviewing “Maximum A Posteriori Estimation (MAP)”, avoid jumping straight into large projects. Instead, first validate the core logic using a simple, concrete example.

Summary

This tutorial introduced the concept and application of Maximum A Posteriori Estimation (MAP). Using the coin-tossing example, we showed how to estimate a parameter by maximizing the posterior distribution, both analytically and numerically. In the next tutorial, we will compare Bayesian estimation with frequentist estimation to deepen our understanding of statistical inference.

Mastering MAP lays a solid foundation for further study in Bayesian learning. If you have questions or would like deeper discussion, feel free to ask!
