mirror of https://github.com/gsi-upm/sitc (synced 2024-12-22 11:48:12 +00:00)

Updated the lab assignment to Gymnasium and extended it

parent 380340d66d
commit 542ce2708d
@@ -48,7 +48,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"1. [Q-Learning](2_6_1_Q-Learning.ipynb)"
+"1. [Q-Learning](2_6_1_Q-Learning_Basic.ipynb)\n",
+"1. [Visualization](2_6_1_Q-Learning_Visualization.ipynb)\n",
+"1. [Exercises](2_6_1_Q-Learning_Exercises.ipynb)"
 ]
 },
 {
@@ -64,7 +66,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -78,7 +80,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.10.10"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,
ml5/2_6_1_Q-Learning_Basic.ipynb (new file, 1384 lines)
File diff suppressed because it is too large.
ml5/2_6_1_Q-Learning_Exercises.ipynb (new file, 138 lines)
@@ -0,0 +1,138 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos Á. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
"\n",
"\n",
"## Taxi\n",
"Analyze the [Taxi problem](https://gymnasium.farama.org/environments/toy_text/taxi/) and solve it by applying Q-Learning. You can find a solution similar to the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym), and its notebook is [here](https://github.com/wagonhelm/Reinforcement-Learning-Introduction/blob/master/Reinforcement%20Learning%20Introduction.ipynb). Take into account that Gymnasium has changed, so you will have to adapt the code.\n",
"\n",
"Analyze the impact of not changing the learning rate, or of changing it in a different way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optional exercises\n",
"Select one of the following exercises.\n",
"\n",
"## Blackjack\n",
"Analyze how to apply Q-Learning to solve Blackjack.\n",
"You can find information in this [article](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/).\n",
"\n",
"## Doom\n",
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Gymnasium documentation](https://gymnasium.farama.org/).\n",
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos Á. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
|
||||||
|
"datacleaner": {
|
||||||
|
"position": {
|
||||||
|
"top": "50px"
|
||||||
|
},
|
||||||
|
"python": {
|
||||||
|
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||||
|
},
|
||||||
|
"window_display": false
|
||||||
|
},
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.10.10"
|
||||||
|
},
|
||||||
|
"latex_envs": {
|
||||||
|
"LaTeX_envs_menu_present": true,
|
||||||
|
"autocomplete": true,
|
||||||
|
"bibliofile": "biblio.bib",
|
||||||
|
"cite_by": "apalike",
|
||||||
|
"current_citInitial": 1,
|
||||||
|
"eqLabelWithNumbers": true,
|
||||||
|
"eqNumInitial": 1,
|
||||||
|
"hotkeys": {
|
||||||
|
"equation": "Ctrl-E",
|
||||||
|
"itemize": "Ctrl-I"
|
||||||
|
},
|
||||||
|
"labels_anchors": false,
|
||||||
|
"latex_user_defs": false,
|
||||||
|
"report_style_numbering": false,
|
||||||
|
"user_envs_cfg": false
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 1
|
||||||
|
}
|
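The Taxi exercise above relies on Gymnasium's updated API (reset() returns (observation, info); step() returns five values). The following minimal tabular Q-learning sketch for Taxi-v3 is provided only for orientation; it is not part of this commit, and the hyperparameters alpha, gamma and epsilon and the episode count are illustrative assumptions.

import gymnasium as gym
import numpy as np

# Minimal tabular Q-learning on Taxi-v3 (illustrative hyperparameters)
env = gym.make("Taxi-v3")
n_states = env.observation_space.n    # 500 discrete states
n_actions = env.action_space.n        # 6 discrete actions
qtable = np.zeros((n_states, n_actions))

alpha, gamma, epsilon = 0.7, 0.95, 0.1    # assumed values, tune them for the exercise
rng = np.random.default_rng(42)

for episode in range(2000):
    state, info = env.reset()             # Gymnasium: reset() returns (obs, info)
    done = False
    while not done:
        # Epsilon-greedy action selection
        if rng.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(qtable[state]))
        # Gymnasium: step() returns (obs, reward, terminated, truncated, info)
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        qtable[state, action] += alpha * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )
        state = new_state

env.close()

For the second part of the exercise, one simple variant is to decay alpha (and epsilon) per episode and compare the resulting reward curves against the constant-rate run.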
ml5/2_6_1_Q-Learning_Visualization.ipynb (new file, 368 lines)
File diff suppressed because one or more lines are too long.

ml5/qlearning.py (new file, 274 lines)
@@ -0,0 +1,274 @@
# Class definition of QLearning

from pathlib import Path
from typing import NamedTuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map


# Params

class Params(NamedTuple):
    total_episodes: int  # Total episodes
    learning_rate: float  # Learning rate
    gamma: float  # Discounting rate
    epsilon: float  # Exploration probability
    map_size: int  # Number of tiles of one side of the squared environment
    seed: int  # Define a seed so that we get reproducible results
    is_slippery: bool  # If true, the player moves in the intended direction with probability 1/3, or in either perpendicular direction with probability 1/3 each
    n_runs: int  # Number of runs
    action_size: int  # Number of possible actions
    state_size: int  # Number of possible states
    proba_frozen: float  # Probability that a tile is frozen
    savefig_folder: Path  # Root folder where plots are saved


class Qlearning:
    def __init__(self, learning_rate, gamma, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.reset_qtable()

    def update(self, state, action, reward, new_state):
        """Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]"""
        delta = (
            reward
            + self.gamma * np.max(self.qtable[new_state][:])
            - self.qtable[state][action]
        )
        q_update = self.qtable[state][action] + self.learning_rate * delta
        return q_update

    def reset_qtable(self):
        """Reset the Q-table."""
        self.qtable = np.zeros((self.state_size, self.action_size))


class EpsilonGreedy:
    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng

    def choose_action(self, action_space, state, qtable):
        """Choose an action `a` in the current world state (s)."""
        # First we randomize a number
        explor_exploit_tradeoff = self.rng.uniform(0, 1)

        # Exploration
        if explor_exploit_tradeoff < self.epsilon:
            action = action_space.sample()

        # Exploitation (taking the biggest Q-value for this state)
        else:
            # Break ties randomly:
            # if all actions have the same Q-value for this state we choose a random one
            # (otherwise `np.argmax()` would always take the first one)
            if np.all(qtable[state][:] == qtable[state][0]):
                action = action_space.sample()
            else:
                action = np.argmax(qtable[state][:])
        return action


def run_frozen_maps(maps, params, rng):
    """Run FrozenLake in maps and plot results"""
    map_sizes = maps
    res_all = pd.DataFrame()
    st_all = pd.DataFrame()

    for map_size in map_sizes:
        env = gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(
                size=map_size, p=params.proba_frozen, seed=params.seed
            ),
        )

        params = params._replace(action_size=env.action_space.n)
        params = params._replace(state_size=env.observation_space.n)
        env.action_space.seed(
            params.seed
        )  # Set the seed to get reproducible results when sampling the action space
        learner = Qlearning(
            learning_rate=params.learning_rate,
            gamma=params.gamma,
            state_size=params.state_size,
            action_size=params.action_size,
        )
        explorer = EpsilonGreedy(
            epsilon=params.epsilon,
            rng=rng
        )
        print(f"Map size: {map_size}x{map_size}")
        rewards, steps, episodes, qtables, all_states, all_actions = run_env(env, params, learner, explorer)

        # Save the results in dataframes
        res, st = postprocess(episodes, params, rewards, steps, map_size)
        res_all = pd.concat([res_all, res])
        st_all = pd.concat([st_all, st])
        qtable = qtables.mean(axis=0)  # Average the Q-table between runs

        plot_states_actions_distribution(
            states=all_states, actions=all_actions, map_size=map_size, params=params
        )  # Sanity check
        plot_q_values_map(qtable, env, map_size, params)

        env.close()
    return res_all, st_all


def run_env(env, params, learner, explorer):
    rewards = np.zeros((params.total_episodes, params.n_runs))
    steps = np.zeros((params.total_episodes, params.n_runs))
    episodes = np.arange(params.total_episodes)
    qtables = np.zeros((params.n_runs, params.state_size, params.action_size))
    all_states = []
    all_actions = []

    for run in range(params.n_runs):  # Run several times to account for stochasticity
        learner.reset_qtable()  # Reset the Q-table between runs

        for episode in tqdm(
            episodes, desc=f"Run {run}/{params.n_runs} - Episodes", leave=False
        ):
            state = env.reset(seed=params.seed)[0]  # Reset the environment
            step = 0
            done = False
            total_rewards = 0

            while not done:
                action = explorer.choose_action(
                    action_space=env.action_space, state=state, qtable=learner.qtable
                )

                # Log all states and actions
                all_states.append(state)
                all_actions.append(action)

                # Take the action (a) and observe the outcome state(s') and reward (r)
                new_state, reward, terminated, truncated, info = env.step(action)

                done = terminated or truncated

                learner.qtable[state, action] = learner.update(
                    state, action, reward, new_state
                )

                total_rewards += reward
                step += 1

                # Our new state is state
                state = new_state

            # Log all rewards and steps
            rewards[episode, run] = total_rewards
            steps[episode, run] = step
        qtables[run, :, :] = learner.qtable

    return rewards, steps, episodes, qtables, all_states, all_actions


def postprocess(episodes, params, rewards, steps, map_size):
    """Convert the results of the simulation into dataframes."""
    res = pd.DataFrame(
        data={
            "Episodes": np.tile(episodes, reps=params.n_runs),
            "Rewards": rewards.flatten(),
            "Steps": steps.flatten(),
        }
    )
    res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
    res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])

    st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
    st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
    return res, st


def qtable_directions_map(qtable, map_size):
    """Get the best learned action & map it to arrows."""
    qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
    qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
    directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}
    qtable_directions = np.empty(qtable_best_action.flatten().shape, dtype=str)
    eps = np.finfo(float).eps  # Minimum float number on the machine
    for idx, val in enumerate(qtable_best_action.flatten()):
        if qtable_val_max.flatten()[idx] > eps:
            # Assign an arrow only if a minimal Q-value has been learned as best action
            # otherwise since 0 is a direction, it also gets mapped on the tiles where
            # it didn't actually learn anything
            qtable_directions[idx] = directions[val]
    qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions


def plot_q_values_map(qtable, env, map_size, params):
    """Plot the last frame of the simulation and the policy learned."""
    qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)

    # Plot the last frame
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    ax[0].imshow(env.render())
    ax[0].axis("off")
    ax[0].set_title("Last frame")

    # Plot the policy
    sns.heatmap(
        qtable_val_max,
        annot=qtable_directions,
        fmt="",
        ax=ax[1],
        cmap=sns.color_palette("Blues", as_cmap=True),
        linewidths=0.7,
        linecolor="black",
        xticklabels=[],
        yticklabels=[],
        annot_kws={"fontsize": "xx-large"},
    ).set(title="Learned Q-values\nArrows represent best action")
    for _, spine in ax[1].spines.items():
        spine.set_visible(True)
        spine.set_linewidth(0.7)
        spine.set_color("black")
    img_title = f"frozenlake_q_values_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_states_actions_distribution(states, actions, map_size, params):
    """Plot the distributions of states and actions."""
    labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.histplot(data=states, ax=ax[0], kde=True)
    ax[0].set_title("States")
    sns.histplot(data=actions, ax=ax[1])
    ax[1].set_xticks(list(labels.values()), labels=labels.keys())
    ax[1].set_title("Actions")
    fig.tight_layout()
    img_title = f"frozenlake_states_actions_distrib_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_steps_and_rewards(rewards_df, steps_df, params):
    """Plot the steps and rewards from dataframes."""
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.lineplot(
        data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", ax=ax[0]
    )
    ax[0].set(ylabel="Cumulated rewards")

    sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", ax=ax[1])
    ax[1].set(ylabel="Averaged steps number")

    for axi in ax:
        axi.legend(title="map size")
    fig.tight_layout()
    img_title = "frozenlake_steps_and_rewards.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()
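To show how the pieces of ml5/qlearning.py fit together, here is a minimal driver sketch; it is not part of the commit, and the parameter values, map sizes and the images output folder are illustrative assumptions only.

from pathlib import Path

import numpy as np

from qlearning import Params, run_frozen_maps, plot_steps_and_rewards

# Illustrative parameter values (assumptions, not taken from the commit)
params = Params(
    total_episodes=2000,
    learning_rate=0.8,
    gamma=0.95,
    epsilon=0.1,
    map_size=5,                      # not used by run_frozen_maps, which iterates over `maps`
    seed=123,
    is_slippery=False,
    n_runs=20,
    action_size=None,                # filled in from env.action_space by run_frozen_maps
    state_size=None,                 # filled in from env.observation_space
    proba_frozen=0.9,
    savefig_folder=Path("images"),   # plots are saved here
)
params.savefig_folder.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(params.seed)   # shared RNG for the EpsilonGreedy explorer

# Train on several random FrozenLake maps and plot the aggregated rewards and steps
res_all, st_all = run_frozen_maps(maps=[4, 7, 9], params=params, rng=rng)
plot_steps_and_rewards(res_all, st_all, params)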