# Stochastic Multi-Armed Bandits (MABs)
These notes were completed with the assistance of ChatGPT.
1. Introduction to MABs:
- Definition: A decision-maker faces \( k \) options (arms) and must choose one at a time to receive a reward. The goal is to maximize the cumulative reward over time. 
- Challenge: Balancing exploration (trying out different options) with exploitation (leveraging known information). 
2. Bandit Setting vs. Learning with Experts:
- Expert Selection: In the learning-with-experts setting, one picks an expert each round (akin to choosing an arm in MABs). 
- Observation Limitation: In the bandit setting, the reward is observed only for the chosen arm; in the expert setting, the rewards of all experts are revealed each round. 
3. Exploration vs. Exploitation:
- MABs: Represent the simplest setting for balancing exploration and exploitation. 
4. Stochastic MAB Setting:
- Actions: Possible actions are represented by arms. 
- Reward Distribution: Each arm \( i \) has a reward distribution \( P_i \) with mean \( \mu_i \). 
- Goal: Minimize the cumulative regret, defined as 
\( R_T = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right), \)
where \( \mu^* = \max_i \mu_i \) is the maximum mean reward across all arms and \( a_t \) is the arm chosen at round \( t \).
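As a quick illustration, here is a minimal simulation sketch for computing the (pseudo-)regret of a fixed policy. The Bernoulli arm means, the horizon \( T \), and the uniformly random policy are assumptions chosen only for this example.

```python
import numpy as np

# Hypothetical example: 3 Bernoulli arms with the given means.
arm_means = np.array([0.2, 0.5, 0.7])
mu_star = arm_means.max()                    # best achievable mean reward

rng = np.random.default_rng(0)
T = 1000
chosen_arms = rng.integers(0, len(arm_means), size=T)   # e.g. a uniformly random policy

# (Pseudo-)regret after T rounds: sum of per-round gaps mu* - mu_{a_t}.
regret = np.sum(mu_star - arm_means[chosen_arms])
print(f"cumulative regret after {T} rounds: {regret:.1f}")
```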
5. Greedy Approach:
- Estimation: For each arm \( i \), estimate its value \( Q_i \) as the average reward observed when pulling it: 
\( Q_i = \frac{\sum_{t=1}^{T} r_t \, \mathbb{1}[a_t = i]}{\sum_{t=1}^{T} \mathbb{1}[a_t = i]}, \)
where \( \mathbb{1} \) is the indicator function, \( a_t \) the arm chosen at round \( t \), and \( r_t \) the observed reward.
- Action Selection: Always choose the arm with the highest estimated value; the ε-greedy variant instead explores a uniformly random arm with probability ε (see the sketch below). 
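Below is a minimal sketch of this estimate combined with ε-greedy exploration. The ε value and the Bernoulli arms are illustrative assumptions; `Q` holds the running average reward per arm and is updated incrementally.

```python
import numpy as np

def epsilon_greedy(arm_means, T=1000, eps=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return the value estimates Q and pull counts N."""
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    Q = np.zeros(k)   # running average reward per arm (the estimate Q_i)
    N = np.zeros(k)   # number of pulls per arm
    for t in range(T):
        if rng.random() < eps:
            a = int(rng.integers(k))     # explore: pick a random arm
        else:
            a = int(np.argmax(Q))        # exploit: arm with highest estimate
        r = float(rng.random() < arm_means[a])   # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental average update
    return Q, N

Q, N = epsilon_greedy([0.2, 0.5, 0.7])
print("estimated values:", np.round(Q, 2))
```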
6. Upper Confidence Bound (UCB):
- Estimation: For each arm \( i \), compute its UCB as 
\( \mathrm{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{N_i}}, \)
where \( \hat{\mu}_i \) is the observed average reward for arm \( i \), \( N_i \) is the number of times arm \( i \) has been pulled, and \( t \) is the current round.
- Action Selection: Choose the arm with the highest UCB. 
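A minimal UCB1-style sketch under the same Bernoulli-arm assumption; each arm is pulled once for initialization so that every \( N_i > 0 \) before the bonus term is computed.

```python
import numpy as np

def ucb1(arm_means, T=1000, seed=0):
    """UCB1 on Bernoulli arms: pull each arm once, then always pick the highest UCB."""
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    mu_hat = np.zeros(k)   # observed average reward per arm
    N = np.zeros(k)        # pull counts
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1                         # initialization: pull every arm once
        else:
            ucb = mu_hat + np.sqrt(2 * np.log(t) / N)
            a = int(np.argmax(ucb))           # arm with the highest upper confidence bound
        r = float(rng.random() < arm_means[a])
        N[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / N[a]   # incremental average update
    return mu_hat, N

mu_hat, N = ucb1([0.2, 0.5, 0.7])
print("pull counts:", N)   # most pulls should concentrate on the best arm
```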
7. Contextual Bandits:
- Reward Estimation: The reward is estimated from the context, turning the problem into a regression task, e.g. with a linear model 
\( \hat{r}_{t,i} = \theta_i^{\top} X_t, \)
where \( X_t \) is the context vector at round \( t \) and \( \theta_i \) are the learned parameters for arm \( i \).
- Action Selection: UCB is computed based on the regression prediction given the context vector. 
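One possible concrete instantiation is a LinUCB-style algorithm, which fits one ridge-regression model per arm. The sketch below assumes a linear reward model, a fixed exploration weight `alpha`, and a synthetic context generator, all chosen only for illustration.

```python
import numpy as np

def linucb_score(A, b, x, alpha=1.0):
    """UCB score for one arm from its ridge-regression statistics.

    A: d x d matrix (X^T X + I), b: d-vector (X^T r), x: current context."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                          # ridge-regression estimate of the arm's weights
    return float(theta @ x + alpha * np.sqrt(x @ A_inv @ x))   # prediction + confidence width

# Hypothetical setup: 3 arms, 5-dimensional contexts, unknown true weights per arm.
rng = np.random.default_rng(0)
k, d, T = 3, 5, 2000
true_theta = rng.normal(size=(k, d))
A = np.stack([np.eye(d) for _ in range(k)])    # per-arm X^T X + I
b = np.zeros((k, d))                           # per-arm X^T r

for t in range(T):
    x = rng.normal(size=d)                                     # context vector X_t
    a = int(np.argmax([linucb_score(A[i], b[i], x) for i in range(k)]))
    r = true_theta[a] @ x + rng.normal(scale=0.1)              # noisy linear reward
    A[a] += np.outer(x, x)                                     # update chosen arm's statistics
    b[a] += r * x
```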
8. MABs vs. Reinforcement Learning (RL):
- State Transitions: In RL, the chosen action in a state determines or influences the next state. 
9. Summary & Looking Ahead:
- Key Takeaways: The lecture covered the foundational concept of MABs, strategies like ε-greedy and UCB, and extensions like contextual bandits. 
Points of Clarification:
- Difference between MABs and Learning with Experts: In MABs, you only observe the reward for the chosen arm, whereas in learning with experts the rewards (or losses) of all experts are revealed each round. 
- UCB’s Exploration Term: The exploration term in UCB ensures that arms aren’t prematurely discarded based on limited data. 
- Contextual Bandits vs. RL: The main distinction is the consideration of state transitions in RL, which are absent in contextual bandits. 
