Mirror of https://github.com/gsi-upm/sitc (synced 2025-01-05 02:41:29 +00:00)

Compare commits: 380340d66d ... 3363c953f4 (2 commits)

Author | SHA1 | Date
---|---|---
| 3363c953f4 |
| 542ce2708d |
@@ -48,7 +48,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. [Q-Learning](2_6_1_Q-Learning.ipynb)"
|
||||
"1. [Q-Learning](2_6_1_Q-Learning_Basic.ipynb)\n",
|
||||
"1. [Visualization](2_6_1_Q-Learning_Visualization.ipynb)\n",
|
||||
"1. [Exercises](2_6_1_Q-Learning_Exercises.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -64,7 +66,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@@ -78,7 +80,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.1"
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
|
@@ -1,455 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2018 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"\n",
|
||||
"* [Introduction](#Introduction)\n",
|
||||
"* [Getting started with OpenAI Gym](#Getting-started-with-OpenAI-Gym)\n",
|
||||
"* [The Frozen Lake scenario](#The-Frozen-Lake-scenario)\n",
|
||||
"* [Q-Learning with the Frozen Lake scenario](#Q-Learning-with-the-Frozen-Lake-scenario)\n",
|
||||
"* [Exercises](#Exercises)\n",
|
||||
"* [Optional exercises](#Optional-exercises)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction\n",
|
||||
"The purpose of this practice is to understand better Reinforcement Learning (RL) and, in particular, Q-Learning.\n",
|
||||
"\n",
|
||||
"We are going to use [OpenAI Gym](https://gym.openai.com/). OpenAI is a toolkit for developing and comparing RL algorithms.Take a loot at ther [website](https://gym.openai.com/).\n",
|
||||
"\n",
|
||||
"It implements [algorithm imitation](http://gym.openai.com/envs/#algorithmic), [classic control problems](http://gym.openai.com/envs/#classic_control), [Atari games](http://gym.openai.com/envs/#atari), [Box2D continuous control](http://gym.openai.com/envs/#box2d), [robotics with MuJoCo, Multi-Joint dynamics with Contact](http://gym.openai.com/envs/#mujoco), and [simple text based environments](http://gym.openai.com/envs/#toy_text).\n",
|
||||
"\n",
|
||||
"This notebook is based on * [Diving deeper into Reinforcement Learning with Q-Learning](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
|
||||
"\n",
|
||||
"First of all, install the OpenAI Gym library:\n",
|
||||
"\n",
|
||||
"```console\n",
|
||||
"foo@bar:~$ pip install gym\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"If you get the error message 'NotImplementedError: abstract', [execute](https://github.com/openai/gym/issues/775) \n",
|
||||
"```console\n",
|
||||
"foo@bar:~$ pip install pyglet==1.2.4\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"If you want to try the Atari environment, it is better that you opt for the full installation from the source. Follow the instructions at [https://github.com/openai/gym#id15](OpenGym).\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Getting started with OpenAI Gym\n",
|
||||
"\n",
|
||||
"First of all, read the [introduction](http://gym.openai.com/docs/#getting-started-with-gym) of OpenAI Gym."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Environments\n",
|
||||
"OpenGym provides a number of problems called *environments*. \n",
|
||||
"\n",
|
||||
"Try the 'CartPole-v0' (or 'MountainCar)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import gym\n",
|
||||
"\n",
|
||||
"env = gym.make(\"CartPole-v1\")\n",
|
||||
"#env = gym.make('MountainCar-v0')\n",
|
||||
"#env = gym.make('Taxi-v2')\n",
|
||||
"\n",
|
||||
"observation = env.reset()\n",
|
||||
"for _ in range(1000):\n",
|
||||
" env.render()\n",
|
||||
" action = env.action_space.sample() # your agent here (this takes random actions)\n",
|
||||
" observation, reward, done, info = env.step(action)\n",
|
||||
"\n",
|
||||
" if done:\n",
|
||||
" observation = env.reset()\n",
|
||||
"env.close()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This will launch an external window with the game. If you cannot close that window, just execute in a code cell:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"env.close()\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"The full list of available environments can be found printing the environment registry as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gym import envs\n",
|
||||
"print(envs.registry.all())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The environment’s **step** function returns four values. These are:\n",
|
||||
"\n",
|
||||
"* **observation (object):** an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.\n",
|
||||
"* **reward (float):** amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.\n",
|
||||
"* **done (boolean):** whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.).\n",
|
||||
"* **info (dict):** diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.\n",
|
||||
"\n",
|
||||
"The typical agent loop consists in first calling the method *reset* which provides an initial observation. Then the agent executes an action, and receives the reward, the new observation, and if the episode has finished (done is true). \n",
|
||||
"\n",
|
||||
"For example, analyze this sample of agent loop for 100 ms. The details of the previous variables for this game as described [here](https://github.com/openai/gym/wiki/CartPole-v0) are:\n",
|
||||
"* **observation**: Cart Position, Cart Velocity, Pole Angle, Pole Velocity.\n",
|
||||
"* **action**: 0\t(Push cart to the left), 1\t(Push cart to the right).\n",
|
||||
"* **reward**: 1 for every step taken, including the termination step."
|
||||
]
|
||||
},
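The `step` interface described above is the classic Gym API. The newer Gymnasium API, used by the files added later in this diff (e.g. `ml5/qlearning.py`), returns five values from `step` and an `(observation, info)` pair from `reset`. A minimal sketch of the same single-step interaction, assuming Gymnasium is installed:

```python
# Minimal sketch of a single interaction step with the newer Gymnasium API.
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)  # reset now returns (observation, info)
action = env.action_space.sample()      # random action, as above
observation, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated          # the old single `done` flag is split in two
env.close()
```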
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import gym\n",
|
||||
"env = gym.make('CartPole-v0')\n",
|
||||
"for i_episode in range(20):\n",
|
||||
" observation = env.reset()\n",
|
||||
" for t in range(100):\n",
|
||||
" env.render()\n",
|
||||
" print(observation)\n",
|
||||
" action = env.action_space.sample()\n",
|
||||
" print(\"Action \", action)\n",
|
||||
" observation, reward, done, info = env.step(action)\n",
|
||||
" print(\"Observation \", observation, \", reward \", reward, \", done \", done, \", info \" , info)\n",
|
||||
" if done:\n",
|
||||
" print(\"Episode finished after {} timesteps\".format(t+1))\n",
|
||||
" break"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# The Frozen Lake scenario\n",
|
||||
"We are going to play to the [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/) game.\n",
|
||||
"\n",
|
||||
"The problem is a grid where you should go from the 'start' (S) position to the 'goal position (G) (the pizza!). You can only walk through the 'frozen tiles' (F). Unfortunately, you can fall in a 'hole' (H).\n",
|
||||
"![](images/frozenlake-problem.png \"Frozen lake problem\")\n",
|
||||
"\n",
|
||||
"The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise. The possible actions are going left, right, up or down. However, the ice is slippery, so you won't always move in the direction you intend.\n",
|
||||
"\n",
|
||||
"![](images/frozenlake-world.png \"Frozen lake world\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Here you can see several episodes. A full recording is available at [Frozen World](http://gym.openai.com/envs/FrozenLake-v0/).\n",
|
||||
"\n",
|
||||
"![](images/recording.gif \"Example running\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Q-Learning with the Frozen Lake scenario\n",
|
||||
"We are now going to apply Q-Learning for the Frozen Lake scenario. This part of the notebook is taken from [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb).\n",
|
||||
"\n",
|
||||
"First we create the environment and a Q-table inizializated with zeros to store the value of each action in a given state. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import gym\n",
|
||||
"import random\n",
|
||||
"\n",
|
||||
"env = gym.make(\"FrozenLake-v0\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"action_size = env.action_space.n\n",
|
||||
"state_size = env.observation_space.n\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"qtable = np.zeros((state_size, action_size))\n",
|
||||
"print(qtable)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we define the hyperparameters."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Q-Learning hyperparameters\n",
|
||||
"total_episodes = 10000 # Total episodes\n",
|
||||
"learning_rate = 0.8 # Learning rate\n",
|
||||
"max_steps = 99 # Max steps per episode\n",
|
||||
"gamma = 0.95 # Discounting rate\n",
|
||||
"\n",
|
||||
"# Exploration hyperparameters\n",
|
||||
"epsilon = 1.0 # Exploration rate\n",
|
||||
"max_epsilon = 1.0 # Exploration probability at start\n",
|
||||
"min_epsilon = 0.01 # Minimum exploration probability \n",
|
||||
"decay_rate = 0.01 # Exponential decay rate for exploration prob"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"And now we implement the Q-Learning algorithm.\n",
|
||||
"\n",
|
||||
"![](images/qlearning-algo.png \"Q-Learning algorithm\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# List of rewards\n",
|
||||
"rewards = []\n",
|
||||
"\n",
|
||||
"# 2 For life or until learning is stopped\n",
|
||||
"for episode in range(total_episodes):\n",
|
||||
" # Reset the environment\n",
|
||||
" state = env.reset()\n",
|
||||
" step = 0\n",
|
||||
" done = False\n",
|
||||
" total_rewards = 0\n",
|
||||
" \n",
|
||||
" for step in range(max_steps):\n",
|
||||
" # 3. Choose an action a in the current world state (s)\n",
|
||||
" ## First we randomize a number\n",
|
||||
" exp_exp_tradeoff = random.uniform(0, 1)\n",
|
||||
" \n",
|
||||
" ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
|
||||
" if exp_exp_tradeoff > epsilon:\n",
|
||||
" action = np.argmax(qtable[state,:])\n",
|
||||
"\n",
|
||||
" # Else doing a random choice --> exploration\n",
|
||||
" else:\n",
|
||||
" action = env.action_space.sample()\n",
|
||||
"\n",
|
||||
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
|
||||
" new_state, reward, done, info = env.step(action)\n",
|
||||
"\n",
|
||||
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
|
||||
" # qtable[new_state,:] : all the actions we can take from new state\n",
|
||||
" qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
|
||||
" \n",
|
||||
" total_rewards += reward\n",
|
||||
" \n",
|
||||
" # Our new state is state\n",
|
||||
" state = new_state\n",
|
||||
" \n",
|
||||
" # If done (if we're dead) : finish episode\n",
|
||||
" if done == True: \n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" episode += 1\n",
|
||||
" # Reduce epsilon (because we need less and less exploration)\n",
|
||||
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) \n",
|
||||
" rewards.append(total_rewards)\n",
|
||||
"\n",
|
||||
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
|
||||
"print(qtable)"
|
||||
]
|
||||
},
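To see how fast exploration fades with `decay_rate = 0.01`, here is a small check of the decay formula used above (a sketch added for illustration, not part of the original notebook):

```python
# Evaluate epsilon = min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)
# for a few episodes, using the hyperparameters defined above.
import numpy as np

min_epsilon, max_epsilon, decay_rate = 0.01, 1.0, 0.01
for episode in (0, 100, 1000, 10000):
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    print(episode, round(epsilon, 3))
# Roughly 1.0 at episode 0, ~0.37 after 100 episodes, and ~0.01 from episode 1000 onwards.
```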
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Finally, we use the learnt Q-table for playing the Frozen World game."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"env.reset()\n",
|
||||
"\n",
|
||||
"for episode in range(5):\n",
|
||||
" state = env.reset()\n",
|
||||
" step = 0\n",
|
||||
" done = False\n",
|
||||
" print(\"****************************************************\")\n",
|
||||
" print(\"EPISODE \", episode)\n",
|
||||
"\n",
|
||||
" for step in range(max_steps):\n",
|
||||
" env.render()\n",
|
||||
" # Take the action (index) that have the maximum expected future reward given that state\n",
|
||||
" action = np.argmax(qtable[state,:])\n",
|
||||
" \n",
|
||||
" new_state, reward, done, info = env.step(action)\n",
|
||||
" \n",
|
||||
" if done:\n",
|
||||
" break\n",
|
||||
" state = new_state\n",
|
||||
"env.close()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exercises\n",
|
||||
"\n",
|
||||
"## Taxi\n",
|
||||
"Analyze the [Taxi problem](http://gym.openai.com/envs/Taxi-v2/) and solve it applying Q-Learning. You can find a solution as the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym).\n",
|
||||
"\n",
|
||||
"Analyze the impact of not changing the learning rate (alfa or epsilon, depending on the book) or changing it in a different way."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Optional exercises\n",
|
||||
"\n",
|
||||
"## Doom\n",
|
||||
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
|
||||
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
|
||||
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
|
||||
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
|
||||
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
|
||||
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2018 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.9"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
ml5/2_6_1_Q-Learning_Basic.ipynb (new file, 1384 lines)
File diff suppressed because it is too large
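Since the diff of `2_6_1_Q-Learning_Basic.ipynb` is suppressed, here is a minimal sketch of how the deleted gym-based training loop above maps onto the Gymnasium API used by `ml5/qlearning.py` below; this is an assumption about the notebook's content, not its actual code:

```python
# Q-Learning training loop on Frozen Lake with the Gymnasium API (illustrative sketch).
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")  # FrozenLake-v0 has been retired; v1 is the current id
qtable = np.zeros((env.observation_space.n, env.action_space.n))
learning_rate, gamma, epsilon = 0.8, 0.95, 1.0
rng = np.random.default_rng(42)

for episode in range(10000):
    state, info = env.reset()  # reset returns (observation, info)
    done = False
    while not done:
        # Epsilon-greedy action selection
        if rng.uniform(0, 1) > epsilon:
            action = int(np.argmax(qtable[state, :]))
        else:
            action = env.action_space.sample()
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # `done` is now split into two flags
        # Q(s,a) := Q(s,a) + lr * [R + gamma * max Q(s',a') - Q(s,a)]
        qtable[state, action] += learning_rate * (
            reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action]
        )
        state = new_state
    epsilon = 0.01 + 0.99 * np.exp(-0.01 * episode)  # same exponential decay as above
env.close()

print(qtable)
```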
ml5/2_6_1_Q-Learning_Exercises.ipynb (new file, 138 lines)
@@ -0,0 +1,138 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos Á. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exercises\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## Taxi\n",
|
||||
"Analyze the [Taxi problem](https://gymnasium.farama.org/environments/toy_text/taxi/) and solve it applying Q-Learning. You can find a solution as the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym), and the notebook is [here](https://github.com/wagonhelm/Reinforcement-Learning-Introduction/blob/master/Reinforcement%20Learning%20Introduction.ipynb). Take into account that Gymnasium has changed, so you will have to adapt the code.\n",
|
||||
"\n",
|
||||
"Analyze the impact of not changing the learning rate or changing it in a different way. "
|
||||
]
|
||||
},
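As a hint for the adaptation the exercise asks for, a minimal sketch of setting up the Taxi environment with the current Gymnasium API (an illustrative assumption, not part of the notebook); the Frozen Lake training loop can be reused unchanged:

```python
# Setting up Taxi with Gymnasium (illustrative sketch); plug it into the Q-Learning loop.
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")             # Taxi-v2 no longer exists; v3 is the current id
state_size = env.observation_space.n  # 500 discrete states
action_size = env.action_space.n      # 6 actions: 4 moves, pickup, drop-off
qtable = np.zeros((state_size, action_size))

state, info = env.reset(seed=0)
new_state, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```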
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Optional exercises\n",
|
||||
"Select one of the following exercises.\n",
|
||||
"\n",
|
||||
"## Blackjack\n",
|
||||
"Analyze how to appy Q-Learning for solving Blackjack.\n",
|
||||
"You can find information in this [article](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/).\n",
|
||||
"\n",
|
||||
"## Doom\n",
|
||||
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN.\n",
|
||||
"\n"
|
||||
]
|
||||
},
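For the Blackjack option, a minimal sketch of the environment setup (an illustrative assumption, not part of the notebook): observations are tuples rather than integers, so a dictionary-based Q-table is more convenient than a 2-D array.

```python
# Blackjack with Gymnasium (illustrative sketch): tuple observations, dict-based Q-table.
from collections import defaultdict
import numpy as np
import gymnasium as gym

env = gym.make("Blackjack-v1")
# Q-table keyed by (player_sum, dealer_card, usable_ace); one value per action (stick/hit).
qtable = defaultdict(lambda: np.zeros(env.action_space.n))

obs, info = env.reset(seed=0)
action = int(np.argmax(qtable[obs]))  # greedy action for this observation
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```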
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"* [Gymnasium documentation](https://gymnasium.farama.org/).\n",
|
||||
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
|
||||
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
|
||||
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
|
||||
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
|
||||
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
|
||||
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos Á. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
ml5/2_6_1_Q-Learning_Visualization.ipynb (new file, 368 lines)
File diff suppressed because one or more lines are too long

ml5/qlearning.py (new file, 274 lines)
@@ -0,0 +1,274 @@
|
||||
# Class definition of QLearning
|
||||
|
||||
from pathlib import Path
|
||||
from typing import NamedTuple
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import seaborn as sns
|
||||
from tqdm import tqdm
|
||||
|
||||
import gymnasium as gym
|
||||
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
|
||||
|
||||
# Params
|
||||
|
||||
class Params(NamedTuple):
|
||||
total_episodes: int # Total episodes
|
||||
learning_rate: float # Learning rate
|
||||
gamma: float # Discounting rate
|
||||
epsilon: float # Exploration probability
|
||||
map_size: int # Number of tiles of one side of the squared environment
|
||||
seed: int # Define a seed so that we get reproducible results
|
||||
is_slippery: bool # If true, the player moves in the intended direction with probability 1/3, and in either perpendicular direction with probability 1/3 each
|
||||
n_runs: int # Number of runs
|
||||
action_size: int # Number of possible actions
|
||||
state_size: int # Number of possible states
|
||||
proba_frozen: float # Probability that a tile is frozen
|
||||
savefig_folder: Path # Root folder where plots are saved
|
||||
|
||||
|
||||
class Qlearning:
|
||||
def __init__(self, learning_rate, gamma, state_size, action_size):
|
||||
self.state_size = state_size
|
||||
self.action_size = action_size
|
||||
self.learning_rate = learning_rate
|
||||
self.gamma = gamma
|
||||
self.reset_qtable()
|
||||
|
||||
def update(self, state, action, reward, new_state):
|
||||
"""Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]"""
|
||||
delta = (
|
||||
reward
|
||||
+ self.gamma * np.max(self.qtable[new_state][:])
|
||||
- self.qtable[state][action]
|
||||
)
|
||||
q_update = self.qtable[state][action] + self.learning_rate * delta
|
||||
return q_update
|
||||
|
||||
def reset_qtable(self):
|
||||
"""Reset the Q-table."""
|
||||
self.qtable = np.zeros((self.state_size, self.action_size))
|
||||
|
||||
|
||||
class EpsilonGreedy:
|
||||
def __init__(self, epsilon, rng):
|
||||
self.epsilon = epsilon
|
||||
self.rng = rng
|
||||
|
||||
def choose_action(self, action_space, state, qtable):
|
||||
"""Choose an action `a` in the current world state (s)."""
|
||||
# First we randomize a number
|
||||
explor_exploit_tradeoff = self.rng.uniform(0, 1)
|
||||
|
||||
# Exploration
|
||||
if explor_exploit_tradeoff < self.epsilon:
|
||||
action = action_space.sample()
|
||||
|
||||
# Exploitation (taking the biggest Q-value for this state)
|
||||
else:
|
||||
# Break ties randomly
|
||||
# If all actions are the same for this state we choose a random one
|
||||
# (otherwise `np.argmax()` would always take the first one)
|
||||
if np.all(qtable[state][:] == qtable[state][0]):
|
||||
action = action_space.sample()
|
||||
else:
|
||||
action = np.argmax(qtable[state][:])
|
||||
return action
|
||||
|
||||
|
||||
def run_frozen_maps(maps, params, rng):
|
||||
"""Run FrozenLake in maps and plot results"""
|
||||
map_sizes = maps
|
||||
res_all = pd.DataFrame()
|
||||
st_all = pd.DataFrame()
|
||||
|
||||
for map_size in map_sizes:
|
||||
env = gym.make(
|
||||
"FrozenLake-v1",
|
||||
is_slippery=params.is_slippery,
|
||||
render_mode="rgb_array",
|
||||
desc=generate_random_map(
|
||||
size=map_size, p=params.proba_frozen, seed=params.seed
|
||||
),
|
||||
)
|
||||
|
||||
params = params._replace(action_size=env.action_space.n)
|
||||
params = params._replace(state_size=env.observation_space.n)
|
||||
env.action_space.seed(
|
||||
params.seed
|
||||
) # Set the seed to get reproducible results when sampling the action space
|
||||
learner = Qlearning(
|
||||
learning_rate=params.learning_rate,
|
||||
gamma=params.gamma,
|
||||
state_size=params.state_size,
|
||||
action_size=params.action_size,
|
||||
)
|
||||
explorer = EpsilonGreedy(
|
||||
epsilon=params.epsilon,
|
||||
rng=rng
|
||||
)
|
||||
print(f"Map size: {map_size}x{map_size}")
|
||||
rewards, steps, episodes, qtables, all_states, all_actions = run_env(env, params, learner, explorer)
|
||||
|
||||
# Save the results in dataframes
|
||||
res, st = postprocess(episodes, params, rewards, steps, map_size)
|
||||
res_all = pd.concat([res_all, res])
|
||||
st_all = pd.concat([st_all, st])
|
||||
qtable = qtables.mean(axis=0) # Average the Q-table between runs
|
||||
|
||||
plot_states_actions_distribution(
|
||||
states=all_states, actions=all_actions, map_size=map_size, params=params
|
||||
) # Sanity check
|
||||
plot_q_values_map(qtable, env, map_size, params)
|
||||
|
||||
env.close()
|
||||
return res_all, st_all
|
||||
|
||||
def run_env(env, params, learner, explorer):
|
||||
rewards = np.zeros((params.total_episodes, params.n_runs))
|
||||
steps = np.zeros((params.total_episodes, params.n_runs))
|
||||
episodes = np.arange(params.total_episodes)
|
||||
qtables = np.zeros((params.n_runs, params.state_size, params.action_size))
|
||||
all_states = []
|
||||
all_actions = []
|
||||
|
||||
for run in range(params.n_runs): # Run several times to account for stochasticity
|
||||
learner.reset_qtable() # Reset the Q-table between runs
|
||||
|
||||
for episode in tqdm(
|
||||
episodes, desc=f"Run {run}/{params.n_runs} - Episodes", leave=False
|
||||
):
|
||||
state = env.reset(seed=params.seed)[0] # Reset the environment
|
||||
step = 0
|
||||
done = False
|
||||
total_rewards = 0
|
||||
|
||||
while not done:
|
||||
action = explorer.choose_action(
|
||||
action_space=env.action_space, state=state, qtable=learner.qtable
|
||||
)
|
||||
|
||||
# Log all states and actions
|
||||
all_states.append(state)
|
||||
all_actions.append(action)
|
||||
|
||||
# Take the action (a) and observe the outcome state(s') and reward (r)
|
||||
new_state, reward, terminated, truncated, info = env.step(action)
|
||||
|
||||
done = terminated or truncated
|
||||
|
||||
learner.qtable[state, action] = learner.update(
|
||||
state, action, reward, new_state
|
||||
)
|
||||
|
||||
total_rewards += reward
|
||||
step += 1
|
||||
|
||||
# Our new state is state
|
||||
state = new_state
|
||||
|
||||
# Log all rewards and steps
|
||||
rewards[episode, run] = total_rewards
|
||||
steps[episode, run] = step
|
||||
qtables[run, :, :] = learner.qtable
|
||||
|
||||
return rewards, steps, episodes, qtables, all_states, all_actions
|
||||
|
||||
def postprocess(episodes, params, rewards, steps, map_size):
|
||||
"""Convert the results of the simulation in dataframes."""
|
||||
res = pd.DataFrame(
|
||||
data={
|
||||
"Episodes": np.tile(episodes, reps=params.n_runs),
|
||||
"Rewards": rewards.flatten(),
|
||||
"Steps": steps.flatten(),
|
||||
}
|
||||
)
|
||||
res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
|
||||
res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])
|
||||
|
||||
st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
|
||||
st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
|
||||
return res, st
|
||||
|
||||
def qtable_directions_map(qtable, map_size):
|
||||
"""Get the best learned action & map it to arrows."""
|
||||
qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
|
||||
qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
|
||||
directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}
|
||||
qtable_directions = np.empty(qtable_best_action.flatten().shape, dtype=str)
|
||||
eps = np.finfo(float).eps # Minimum float number on the machine
|
||||
for idx, val in enumerate(qtable_best_action.flatten()):
|
||||
if qtable_val_max.flatten()[idx] > eps:
|
||||
# Assign an arrow only if a minimal Q-value has been learned as best action
|
||||
# otherwise since 0 is a direction, it also gets mapped on the tiles where
|
||||
# it didn't actually learn anything
|
||||
qtable_directions[idx] = directions[val]
|
||||
qtable_directions = qtable_directions.reshape(map_size, map_size)
|
||||
return qtable_val_max, qtable_directions
|
||||
|
||||
def plot_q_values_map(qtable, env, map_size, params):
|
||||
"""Plot the last frame of the simulation and the policy learned."""
|
||||
qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)
|
||||
|
||||
# Plot the last frame
|
||||
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
|
||||
ax[0].imshow(env.render())
|
||||
ax[0].axis("off")
|
||||
ax[0].set_title("Last frame")
|
||||
|
||||
# Plot the policy
|
||||
sns.heatmap(
|
||||
qtable_val_max,
|
||||
annot=qtable_directions,
|
||||
fmt="",
|
||||
ax=ax[1],
|
||||
cmap=sns.color_palette("Blues", as_cmap=True),
|
||||
linewidths=0.7,
|
||||
linecolor="black",
|
||||
xticklabels=[],
|
||||
yticklabels=[],
|
||||
annot_kws={"fontsize": "xx-large"},
|
||||
).set(title="Learned Q-values\nArrows represent best action")
|
||||
for _, spine in ax[1].spines.items():
|
||||
spine.set_visible(True)
|
||||
spine.set_linewidth(0.7)
|
||||
spine.set_color("black")
|
||||
img_title = f"frozenlake_q_values_{map_size}x{map_size}.png"
|
||||
fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
|
||||
plt.show()
|
||||
|
||||
def plot_states_actions_distribution(states, actions, map_size, params):
|
||||
"""Plot the distributions of states and actions."""
|
||||
labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}
|
||||
|
||||
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
|
||||
sns.histplot(data=states, ax=ax[0], kde=True)
|
||||
ax[0].set_title("States")
|
||||
sns.histplot(data=actions, ax=ax[1])
|
||||
ax[1].set_xticks(list(labels.values()), labels=labels.keys())
|
||||
ax[1].set_title("Actions")
|
||||
fig.tight_layout()
|
||||
img_title = f"frozenlake_states_actions_distrib_{map_size}x{map_size}.png"
|
||||
fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
|
||||
plt.show()
|
||||
|
||||
def plot_steps_and_rewards(rewards_df, steps_df, params):
|
||||
"""Plot the steps and rewards from dataframes."""
|
||||
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
|
||||
sns.lineplot(
|
||||
data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", ax=ax[0]
|
||||
)
|
||||
ax[0].set(ylabel="Cumulated rewards")
|
||||
|
||||
sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", ax=ax[1])
|
||||
ax[1].set(ylabel="Averaged steps number")
|
||||
|
||||
for axi in ax:
|
||||
axi.legend(title="map size")
|
||||
fig.tight_layout()
|
||||
img_title = "frozenlake_steps_and_rewards.png"
|
||||
fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
|
||||
plt.show()
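For reference, a minimal usage sketch of this module, assuming its names are imported; the values below are illustrative, and the actual driver lives in the Visualization notebook, whose diff is suppressed above.

```python
# Illustrative driver for qlearning.py; assumes `from qlearning import *` or execution
# inside the module. Parameter values here are examples, not the notebook's settings.
from pathlib import Path
import numpy as np

params = Params(
    total_episodes=2000,
    learning_rate=0.8,
    gamma=0.95,
    epsilon=0.1,
    map_size=4,
    seed=123,
    is_slippery=False,
    n_runs=10,
    action_size=None,   # filled in by run_frozen_maps from the environment
    state_size=None,
    proba_frozen=0.9,
    savefig_folder=Path("img"),
)
params.savefig_folder.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(params.seed)
res_all, st_all = run_frozen_maps([4, 7], params, rng)  # train on 4x4 and 7x7 maps
plot_steps_and_rewards(res_all, st_all, params)
```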
|
||||
|