# Reinforcement Learning

## 📚 Overview

Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns to take optimal actions in an environment through trial and error. The agent receives rewards or penalties based on the actions it takes, and aims to maximize cumulative reward over the long run.

## 🎯 Key Concepts

### 1. **Agent**

The entity that learns and makes decisions within the environment.

### 2. **Environment**

The world the agent interacts with and acts upon.

### 3. **State (S)**

A representation of the environment's condition at a given time.

### 4. **Action (A)**

A decision the agent can make in a given state.

### 5. **Reward (R)**

The feedback the agent receives after taking an action.

### 6. **Policy (π)**

The strategy that determines which action to take in each state.

### 7. **Value Function (V)**

The expected cumulative reward starting from a given state.

### 8. **Q-Function (Q)**

The expected cumulative reward of taking a given action in a given state.
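The concepts above fit together in the standard agent-environment interaction loop: at each timestep the agent observes a state, its policy picks an action, and the environment returns a reward and the next state. A minimal sketch, using a hypothetical two-state toy environment and a trivial policy (all names here are illustrative):

```python
def run_episode(policy, step_fn, initial_state, max_steps=100):
    """Generic agent-environment loop: observe state, apply policy,
    receive reward and next state, accumulate cumulative reward."""
    state = initial_state
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                    # policy pi: state -> action
        state, reward, done = step_fn(state, action)
        total_reward += reward                    # cumulative reward
        if done:
            break
    return total_reward

# Hypothetical two-state environment: reaching state 1 ends the episode
def toy_step(state, action):
    next_state = action      # the action directly selects the next state
    reward = 1 if next_state == 1 else 0
    return next_state, reward, next_state == 1

always_right = lambda s: 1   # a trivial deterministic policy
print(run_episode(always_right, toy_step, initial_state=0))  # -> 1
```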

## 🚀 Types of Reinforcement Learning

### 1. **Model-Based RL**

The agent has a model of the environment and can simulate outcomes.

**Examples:**

* Dynamic Programming
* Model Predictive Control
* Monte Carlo Tree Search

### 2. **Model-Free RL**

The agent has no model of the environment and learns directly from experience.

**Examples:**

* Q-Learning
* SARSA
* Deep Q-Networks (DQN)
* Policy Gradient Methods

### 3. **On-Policy vs Off-Policy**

* **On-Policy**: Learn about the policy that is currently being used to select actions (e.g. SARSA)
* **Off-Policy**: Learn about a policy different from the one generating the behavior (e.g. Q-Learning)
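The distinction shows up directly in the update targets of SARSA (on-policy) and Q-learning (off-policy). A single-transition sketch with illustrative values:

```python
import numpy as np

alpha, gamma = 0.1, 0.95
Q = np.zeros((2, 2))             # tabular Q for 2 states x 2 actions
s, a, r, s_next = 0, 1, 1.0, 1   # one observed transition

# Off-policy (Q-learning): bootstrap from the GREEDY action in s_next,
# regardless of what the behavior policy actually does next
q_learning_target = r + gamma * np.max(Q[s_next])

# On-policy (SARSA): bootstrap from the action a_next actually chosen
# by the current (e.g. epsilon-greedy) policy
a_next = 0                       # suppose the behavior policy picked action 0
sarsa_target = r + gamma * Q[s_next, a_next]

Q[s, a] += alpha * (q_learning_target - Q[s, a])
print(Q[s, a])  # 0.1 when Q starts at zero
```

With an all-zero table both targets coincide; they diverge as soon as the greedy action differs from the action the behavior policy takes.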

## 🧠 Popular Algorithms

### Value-Based Methods

#### 1. **Q-Learning**

An off-policy algorithm that learns the optimal action-value function.

```python
import numpy as np
import random

class QLearningAgent:
    def __init__(self, state_size, action_size, learning_rate=0.1, discount_factor=0.95, epsilon=0.1):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = epsilon
        
        # Initialize Q-table
        self.q_table = np.zeros((state_size, action_size))
    
    def get_action(self, state):
        """Choose action using epsilon-greedy policy"""
        if random.random() < self.epsilon:
            # Exploration: random action
            return random.randint(0, self.action_size - 1)
        else:
            # Exploitation: best action
            return np.argmax(self.q_table[state])
    
    def update(self, state, action, reward, next_state, done):
        """Update Q-value using Q-learning update rule"""
        if done:
            target = reward
        else:
            target = reward + self.discount_factor * np.max(self.q_table[next_state])
        
        # Q-learning update rule
        self.q_table[state, action] += self.learning_rate * (target - self.q_table[state, action])
    
    def train(self, environment, episodes=1000):
        """Train the agent"""
        episode_rewards = []
        
        for episode in range(episodes):
            state = environment.reset()
            total_reward = 0
            done = False
            
            while not done:
                action = self.get_action(state)
                next_state, reward, done, _ = environment.step(action)
                
                self.update(state, action, reward, next_state, done)
                state = next_state
                total_reward += reward
            
            episode_rewards.append(total_reward)
            
            if episode % 100 == 0:
                avg_reward = np.mean(episode_rewards[-100:])
                print(f"Episode {episode}, Average Reward: {avg_reward:.2f}")
        
        return episode_rewards

# Example usage with simple environment
class SimpleEnvironment:
    def __init__(self, size=5):
        self.size = size
        self.state = 0
        self.goal = size - 1
    
    def reset(self):
        self.state = 0
        return self.state
    
    def step(self, action):
        if action == 0:  # Move left
            self.state = max(0, self.state - 1)
        elif action == 1:  # Move right
            self.state = min(self.size - 1, self.state + 1)
        
        # Reward: +10 for reaching goal, -1 for each step
        if self.state == self.goal:
            reward = 10
            done = True
        else:
            reward = -1
            done = False
        
        return self.state, reward, done, {}
    
    def get_state_size(self):
        return self.size
    
    def get_action_size(self):
        return 2

# Train Q-learning agent
env = SimpleEnvironment(size=5)
agent = QLearningAgent(
    state_size=env.get_state_size(),
    action_size=env.get_action_size(),
    learning_rate=0.1,
    discount_factor=0.95,
    epsilon=0.1
)

rewards = agent.train(env, episodes=500)

# Plot training progress
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Q-Learning Training Progress')
plt.grid(True)
plt.show()

# Show learned Q-table
print("Learned Q-Table:")
print(agent.q_table)
```

**Pros:**

* Simple and effective
* Converges to the optimal policy (given sufficient exploration and a suitably decaying learning rate)
* Off-policy learning
* Good for discrete state/action spaces

**Cons:**

* Requires discrete state/action spaces
* Memory intensive for large spaces
* May converge slowly
* Doesn't handle continuous spaces well

**Use Cases:**

* Game AI
* Robot navigation
* Resource allocation
* Trading systems

#### 2. **Deep Q-Networks (DQN)**

Q-learning with neural networks as function approximators, enabling continuous state spaces.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

class DQNAgent:
    def __init__(self, state_size, action_size, learning_rate=0.001, discount_factor=0.95, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        
        # Neural networks
        self.q_network = DQN(state_size, action_size)
        self.target_network = DQN(state_size, action_size)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        
        # Experience replay
        self.memory = deque(maxlen=2000)
        self.batch_size = 32
        
        # Update target network
        self.update_target_network()
    
    def update_target_network(self):
        """Update target network with current Q-network weights"""
        self.target_network.load_state_dict(self.q_network.state_dict())
    
    def remember(self, state, action, reward, next_state, done):
        """Store experience in memory"""
        self.memory.append((state, action, reward, next_state, done))
    
    def get_action(self, state):
        """Choose action using epsilon-greedy policy"""
        if random.random() < self.epsilon:
            return random.randint(0, self.action_size - 1)
        
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            q_values = self.q_network(state_tensor)
        return q_values.argmax().item()
    
    def replay(self):
        """Train on batch of experiences"""
        if len(self.memory) < self.batch_size:
            return
        
        batch = random.sample(self.memory, self.batch_size)
        states = torch.FloatTensor(np.array([e[0] for e in batch]))
        actions = torch.LongTensor([e[1] for e in batch])
        rewards = torch.FloatTensor([e[2] for e in batch])
        next_states = torch.FloatTensor(np.array([e[3] for e in batch]))
        dones = torch.FloatTensor([e[4] for e in batch])
        
        # Current Q-values
        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        # Next Q-values from the target network (no gradient)
        next_q_values = self.target_network(next_states).max(1)[0].detach()
        target_q_values = rewards + self.discount_factor * next_q_values * (1 - dones)
        
        # Loss and optimization
        loss = F.mse_loss(current_q_values.squeeze(), target_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Decay epsilon
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
    
    def train(self, environment, episodes=1000):
        """Train the agent"""
        episode_rewards = []
        
        for episode in range(episodes):
            state = environment.reset()
            total_reward = 0
            done = False
            
            while not done:
                action = self.get_action(state)
                next_state, reward, done, _ = environment.step(action)
                
                self.remember(state, action, reward, next_state, done)
                state = next_state
                total_reward += reward
                
                # Train on batch of experiences
                self.replay()
            
            episode_rewards.append(total_reward)
            
            # Periodically sync the target network and report progress
            if episode % 100 == 0:
                self.update_target_network()
                avg_reward = np.mean(episode_rewards[-100:])
                print(f"Episode {episode}, Average Reward: {avg_reward:.2f}, Epsilon: {self.epsilon:.3f}")
        
        return episode_rewards
```

**Pros:**

* Handles continuous state spaces
* Can learn complex patterns
* Good for high-dimensional inputs
* Experience replay for stability

**Cons:**

* Computationally expensive
* Requires careful hyperparameter tuning
* Can be unstable during training
* Needs large amounts of data

**Use Cases:**

* Game playing (e.g. Atari)
* Robot control
* Autonomous vehicles
* Resource management

### Policy-Based Methods

#### **Policy Gradient Methods**

Directly optimize policy parameters using gradient ascent.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=-1)

class PolicyGradientAgent:
    def __init__(self, state_size, action_size, learning_rate=0.001):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        
        self.policy_network = PolicyNetwork(state_size, action_size)
        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=learning_rate)
        
        # Memory for episode
        self.states = []
        self.actions = []
        self.rewards = []
    
    def get_action(self, state):
        """Sample action from policy"""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action_probs = self.policy_network(state_tensor)
        action_dist = torch.distributions.Categorical(action_probs)
        action = action_dist.sample()
        
        return action.item(), action_dist.log_prob(action)
    
    def remember(self, state, action, reward):
        """Store experience"""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
    
    def update_policy(self):
        """Update policy using REINFORCE algorithm"""
        if not self.states:
            return
        
        # Convert to tensors (stacking via np.array avoids the slow
        # list-of-arrays conversion path)
        states = torch.FloatTensor(np.array(self.states))
        actions = torch.LongTensor(self.actions)
        rewards = torch.FloatTensor(self.rewards)
        
        # Calculate returns
        returns = []
        R = 0
        for r in reversed(self.rewards):
            R = r + 0.95 * R  # discount factor
            returns.insert(0, R)
        returns = torch.FloatTensor(returns)
        
        # Normalize returns
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # Get log probabilities
        action_probs = self.policy_network(states)
        dist = torch.distributions.Categorical(action_probs)
        log_probs = dist.log_prob(actions)
        
        # Policy gradient loss
        loss = -(log_probs * returns).mean()
        
        # Optimization
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear memory
        self.states = []
        self.actions = []
        self.rewards = []
    
    def train(self, environment, episodes=1000):
        """Train the agent"""
        episode_rewards = []
        
        for episode in range(episodes):
            state = environment.reset()
            total_reward = 0
            done = False
            
            while not done:
                action, log_prob = self.get_action(state)
                next_state, reward, done, _ = environment.step(action)
                
                self.remember(state, action, reward)
                state = next_state
                total_reward += reward
            
            # Update policy after episode
            self.update_policy()
            episode_rewards.append(total_reward)
            
            if episode % 100 == 0:
                avg_reward = np.mean(episode_rewards[-100:])
                print(f"Episode {episode}, Average Reward: {avg_reward:.2f}")
        
        return episode_rewards
```

**Pros:**

* Can handle continuous action spaces
* Direct policy optimization
* Can learn stochastic policies
* Natural exploration

**Cons:**

* High variance in gradient estimates
* May converge to local optima
* Requires careful learning rate tuning
* Slower convergence than value-based methods

**Use Cases:**

* Robot control
* Game playing
* Continuous control tasks
* Natural language processing

## 🔧 Advanced Techniques

### 1. **Actor-Critic Methods**

Combine policy gradient with value function estimation.

```python
class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size):
        super(ActorCritic, self).__init__()
        
        # Actor (Policy) network
        self.actor = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_size),
            nn.Softmax(dim=-1)
        )
        
        # Critic (Value) network
        self.critic = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
    
    def forward(self, state):
        action_probs = self.actor(state)
        state_value = self.critic(state)
        return action_probs, state_value
```
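A single actor-critic update can be sketched as follows: the critic's TD error serves as an advantage estimate that weights the actor's policy-gradient step. This is a self-contained sketch with illustrative dimensions and dummy transition data, not tied to any particular environment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative dimensions and a dummy one-step transition
state_size, action_size = 4, 2
actor = nn.Sequential(nn.Linear(state_size, 16), nn.ReLU(),
                      nn.Linear(16, action_size))
critic = nn.Sequential(nn.Linear(state_size, 16), nn.ReLU(),
                       nn.Linear(16, 1))
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

state = torch.randn(1, state_size)
next_state = torch.randn(1, state_size)
reward, gamma, done = 1.0, 0.95, False

# Critic estimates V(s); the TD error acts as the advantage estimate
value = critic(state)
next_value = critic(next_state).detach()
td_target = reward + gamma * next_value * (0.0 if done else 1.0)
advantage = (td_target - value).detach()

# Actor: policy-gradient step weighted by the advantage
dist = torch.distributions.Categorical(logits=actor(state))
action = dist.sample()
actor_loss = -(dist.log_prob(action) * advantage.squeeze()).mean()

# Critic: regress V(s) toward the TD target
critic_loss = F.mse_loss(value, td_target)

loss = actor_loss + critic_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Detaching the advantage is important: the actor's gradient should not flow into the critic through the baseline.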

### 2. **Proximal Policy Optimization (PPO)**

Policy gradient method with importance sampling and clipping.
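The core of PPO is its clipped surrogate objective: the importance ratio between new and old policies is clipped, bounding how far a single update can move the policy. A minimal sketch of the per-sample objective (function name is illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: the minimum of the unclipped and
    clipped importance-weighted advantage bounds the policy update."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A ratio far above 1 + eps earns no extra credit for positive advantage
print(ppo_clip_objective(ratio=1.5, advantage=2.0))  # (1 + 0.2) * 2.0 = 2.4
```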

### 3. **Soft Actor-Critic (SAC)**

Off-policy actor-critic method with entropy maximization.
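SAC maximizes expected reward plus an entropy bonus weighted by a temperature `alpha`. For a discrete toy case the soft state value `V(s) = E_a[Q(s,a)] + alpha * H(pi)` can be computed directly; the helper below is illustrative, using the softmax-of-`Q/alpha` policy (the optimal policy for a fixed Q):

```python
import numpy as np

def soft_value(q_values, alpha=0.2):
    """SAC's soft state value for a discrete action set:
    expected Q under the policy plus an entropy bonus."""
    logits = q_values / alpha
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    entropy = -(probs * np.log(probs)).sum()
    return (probs * q_values).sum() + alpha * entropy

# With equal Q-values the policy is uniform and entropy is maximal
print(soft_value(np.array([1.0, 1.0])))  # 1.0 + 0.2 * ln 2
```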

## 📊 Evaluation Metrics

### 1. **Episode Rewards**

Total reward accumulated per episode.

### 2. **Average Reward**

Mean reward over multiple episodes.

### 3. **Success Rate**

Percentage of successful episodes.

### 4. **Convergence**

Stability of learning over time.

```python
def evaluate_agent(agent, environment, episodes=100):
    """Evaluate trained agent"""
    episode_rewards = []
    success_count = 0
    
    for episode in range(episodes):
        state = environment.reset()
        total_reward = 0
        done = False
        
        while not done:
            # For a fair evaluation, exploration (e.g. epsilon) should be
            # disabled so the agent acts greedily
            action = agent.get_action(state)
            state, reward, done, _ = environment.step(action)
            total_reward += reward
        
        episode_rewards.append(total_reward)
        
        # Define success criteria (e.g., reward > threshold)
        if total_reward > 0:
            success_count += 1
    
    avg_reward = np.mean(episode_rewards)
    success_rate = success_count / episodes
    
    print(f"Evaluation Results:")
    print(f"Average Reward: {avg_reward:.2f}")
    print(f"Success Rate: {success_rate:.2f}")
    
    return avg_reward, success_rate
```

## 🚀 Best Practices

### 1. **Environment Design**

* Clear reward structure
* Appropriate state representation
* Reasonable action space
* Good exploration opportunities

### 2. **Hyperparameter Tuning**

* Learning rate
* Discount factor
* Exploration rate (epsilon)
* Network architecture
* Batch size

### 3. **Training Stability**

* Experience replay
* Target networks
* Gradient clipping
* Reward normalization
* Proper exploration
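Two of the stabilizers above, reward normalization and gradient clipping, are essentially one-liners in PyTorch. A sketch with illustrative values and an arbitrary toy loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 2)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Reward normalization: scale returns to zero mean / unit variance
returns = torch.tensor([3.0, -1.0, 5.0, 0.0])
returns = (returns - returns.mean()) / (returns.std() + 1e-8)

# Gradient clipping: bound the global gradient norm before stepping
loss = net(torch.randn(4, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
optimizer.step()
```

`clip_grad_norm_` returns the pre-clip global norm, which is worth logging as a training-health signal.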

### 4. **Evaluation Strategy**

* Multiple evaluation runs
* Different random seeds
* Performance metrics
* Comparison baselines

## 📚 References & Resources

### 📖 Books

* [**"Reinforcement Learning: An Introduction"**](https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf) by Richard S. Sutton and Andrew G. Barto
* [**"Deep Learning"**](https://www.deeplearningbook.org/) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
* [**"Algorithms for Reinforcement Learning"**](https://sites.ualberta.ca/~szepesva/RLBook.html) by Csaba Szepesvári

### 🎓 Courses

* [**David Silver's RL Course**](https://www.davidsilver.uk/teaching/)
* [**Berkeley CS285**](https://rail.eecs.berkeley.edu/deeprlcourse/) - Deep Reinforcement Learning
* [**Stanford CS234**](http://web.stanford.edu/class/cs234/) - Reinforcement Learning

### 📰 Research Papers

* [**"Playing Atari with Deep Reinforcement Learning"**](https://arxiv.org/abs/1312.5602) by Mnih et al.
* [**"Human-level control through deep reinforcement learning"**](https://www.nature.com/articles/nature14236) by Mnih et al.
* [**"Proximal Policy Optimization Algorithms"**](https://arxiv.org/abs/1707.06347) by Schulman et al.

### 🐙 GitHub Repositories

* [**OpenAI Baselines**](https://github.com/openai/baselines) - High-quality RL implementations
* [**Stable Baselines3**](https://github.com/DLR-RM/stable-baselines3) - Modern RL implementations
* [**RLlib**](https://github.com/ray-project/ray) - Scalable RL library

### 🎮 Environments

* [**OpenAI Gym**](https://gym.openai.com/) - Classic RL environments
* [**Atari Learning Environment**](https://github.com/mgbellemare/Arcade-Learning-Environment)
* [**MuJoCo**](https://mujoco.org/) - Physics simulation for robotics

## 🔗 Related Topics

* [🧠 ML Fundamentals](https://mahbubzulkarnain.gitbook.io/catatan-seekor-the-series/machine-learning/fundamentals)
* [🔢 Supervised Learning](https://mahbubzulkarnain.gitbook.io/catatan-seekor-the-series/machine-learning/fundamentals/supervised-learning)
* [🎯 Unsupervised Learning](https://mahbubzulkarnain.gitbook.io/catatan-seekor-the-series/machine-learning/fundamentals/unsupervised-learning)
* [🐍 Python ML Tools](https://mahbubzulkarnain.gitbook.io/catatan-seekor-the-series/machine-learning/python-ml)

***

*Last updated: December 2024* *Contributors: \[Your Name]*
