Stochastic Multi-Armed Bandits (MABs)#
These notes were completed with the assistance of ChatGPT.
1. Introduction to MABs:
Definition: A decision-maker faces \( k \) options (arms) and must choose one at a time to receive a reward. The goal is to maximize the cumulative reward over time.
Challenge: Balancing exploration (trying out different options) with exploitation (leveraging known information).
2. Bandit Setting vs. Learning with Experts:
Expert Selection: In the expert setting, one has to pick an expert (akin to an arm in MABs).
Observation Limitation: In the bandit setting, the reward is observed only for the chosen arm; in the experts setting, the rewards (or losses) of all experts are revealed each round.
3. Exploration vs. Exploitation:
MABs: Represent the simplest setting for balancing exploration and exploitation.
4. Stochastic MAB Setting:
Actions: Possible actions are represented by arms.
Reward Distribution: Each arm \( i \) has a reward distribution \( P_i \) with mean \( \mu_i \).
Goal: Minimize the cumulative regret, defined as:
\[ R_T = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right) = T\mu^* - \sum_{t=1}^{T} \mu_{a_t} \]
Where \( \mu^* = \max_i \mu_i \) is the maximum mean reward across all arms and \( a_t \) is the arm pulled at round \( t \). A sketch of how this quantity could be computed in a simulation is given below.
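A minimal sketch of computing cumulative regret, assuming the true means and the sequence of pulled arms are available in a simulation (the names `mus` and `pulls` are illustrative, not from the lecture):

```python
import numpy as np

def cumulative_regret(mus, pulls):
    """Expected cumulative regret of a sequence of arm pulls.

    mus   : array of true mean rewards, one per arm
    pulls : list of arm indices chosen at rounds t = 1..T
    """
    mu_star = np.max(mus)                      # best achievable mean reward
    gaps = mu_star - np.asarray(mus)[pulls]    # per-round regret mu* - mu_{a_t}
    return np.cumsum(gaps)                     # regret accumulated after each round

# Example: two arms with means 0.9 and 0.5; each pull of arm 1 adds 0.4 regret
print(cumulative_regret([0.9, 0.5], [1, 1, 0, 1])[-1])  # ≈ 1.2
```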
5. Greedy Approach:
Estimation: For each arm \( i \), estimate its value \( Q_i \) as the average reward observed on the rounds it was pulled:
\[ Q_i = \frac{\sum_{t=1}^{T} r_t \, \mathbb{1}\{a_t = i\}}{\sum_{t=1}^{T} \mathbb{1}\{a_t = i\}} \]
Where \( \mathbb{1} \) is the indicator function and \( r_t \) is the reward received at round \( t \).
Action Selection: Always choose the arm with the highest estimated value \( Q_i \); the ε-greedy variant instead pulls a uniformly random arm with probability ε to keep exploring (see the sketch below).
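A minimal sketch of the (ε-)greedy selection rule; the per-arm running statistics `sums` and `counts` and the function name are illustrative, not the lecture's notation:

```python
import numpy as np

def epsilon_greedy_step(sums, counts, epsilon=0.1, rng=None):
    """Pick an arm: explore uniformly with probability epsilon, otherwise exploit.

    sums   : per-arm sum of observed rewards (numerator of Q_i)
    counts : per-arm number of pulls (denominator of Q_i)
    """
    rng = rng or np.random.default_rng()
    sums, counts = np.asarray(sums, float), np.asarray(counts, float)
    if rng.random() < epsilon:
        return int(rng.integers(len(counts)))   # exploration: uniformly random arm
    # Untried arms get +inf so each arm is pulled at least once
    q = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
    return int(np.argmax(q))                    # exploitation: arm with highest Q_i
```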
6. Upper Confidence Bound (UCB):
Estimation: For each arm \( i \), compute its upper confidence bound as:
\[ UCB_i = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{N_i}} \]
Where \( \hat{\mu}_i \) is the observed average reward for arm \( i \), \( N_i \) is the number of times arm \( i \) has been pulled, and \( t \) is the current round.
Action Selection: Choose the arm with the highest UCB.
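A minimal sketch of this UCB-style rule, assuming the bonus \( \sqrt{2 \ln t / N_i} \) above (names are illustrative):

```python
import numpy as np

def ucb_step(sums, counts, t):
    """Pick the arm with the highest upper confidence bound at round t.

    sums   : per-arm sum of observed rewards
    counts : per-arm pull counts N_i
    t      : current round (total number of pulls so far)
    """
    sums, counts = np.asarray(sums, float), np.asarray(counts, float)
    if np.any(counts == 0):                     # pull every arm once before using the bound
        return int(np.argmin(counts))
    mu_hat = sums / counts                      # empirical mean reward of each arm
    bonus = np.sqrt(2.0 * np.log(t) / counts)   # exploration term shrinks as N_i grows
    return int(np.argmax(mu_hat + bonus))
```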
7. Contextual Bandits:
Reward Estimation: The reward is estimated from the observed context, turning the problem into a regression task, e.g.
\[ \hat{r}_t = f(X_t, a_t) \]
Where \( X_t \) is the context vector observed at round \( t \) and \( a_t \) is the chosen arm.
Action Selection: A UCB is computed for each arm from its regression prediction given the context vector, and the arm with the highest contextual UCB is chosen (see the sketch below).
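A minimal LinUCB-style sketch under the assumption of a linear reward model per arm, \( \hat{r}_i = \theta_i^\top X_t \), fit by ridge regression; this particular model and the class/function names are illustrative, not necessarily the lecture's exact formulation:

```python
import numpy as np

class LinUCBArm:
    """Ridge-regression estimate of one arm's reward, plus a confidence bonus."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # regularized design matrix X^T X + I
        self.b = np.zeros(d)      # feature-weighted reward sums X^T r
        self.alpha = alpha        # width of the confidence interval

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                          # ridge-regression coefficients
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def linucb_step(arms, x):
    """Choose the arm with the highest contextual UCB for context vector x."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))
```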
8. MABs vs. Reinforcement Learning (RL):
State Transitions: In RL, the chosen action in a state determines or influences the next state.
9. Summary & Looking Ahead:
Key Takeaways: The lecture covered the foundational concept of MABs, strategies like ε-greedy and UCB, and extensions like contextual bandits.
Points of Clarification:
Difference between MABs and Learning with Experts: In MABs, you only observe the reward for the chosen arm.
UCB’s Exploration Term: The exploration term in UCB ensures that arms aren’t prematurely discarded based on limited data.
Contextual Bandits vs. RL: The main distinction is the consideration of state transitions in RL, which are absent in contextual bandits.