mirror of https://github.com/gsi-upm/sitc
synced 2024-12-22 03:38:13 +00:00
Updated the exercise to Gymnasium and extended it
parent 380340d66d
commit 542ce2708d
@@ -48,7 +48,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "1. [Q-Learning](2_6_1_Q-Learning.ipynb)"
+    "1. [Q-Learning](2_6_1_Q-Learning_Basic.ipynb)\n",
+    "1. [Visualization](2_6_1_Q-Learning_Visualization.ipynb)\n",
+    "1. [Exercises](2_6_1_Q-Learning_Exercises.ipynb)"
    ]
   },
   {
@@ -64,7 +66,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -78,7 +80,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.1"
+   "version": "3.10.10"
   },
   "latex_envs": {
    "LaTeX_envs_menu_present": true,
1384
ml5/2_6_1_Q-Learning_Basic.ipynb
Normal file
File diff suppressed because it is too large
138
ml5/2_6_1_Q-Learning_Exercises.ipynb
Normal file
@@ -0,0 +1,138 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos Á. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "\n",
    "## Taxi\n",
    "Analyze the [Taxi problem](https://gymnasium.farama.org/environments/toy_text/taxi/) and solve it by applying Q-Learning. You can find a solution similar to the one presented previously [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym), and the notebook is [here](https://github.com/wagonhelm/Reinforcement-Learning-Introduction/blob/master/Reinforcement%20Learning%20Introduction.ipynb). Keep in mind that the Gymnasium API has changed with respect to OpenAI Gym, so you will have to adapt the code.\n",
    "\n",
    "Analyze the impact of keeping the learning rate constant or of changing it in a different way."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Optional exercises\n",
    "Select one of the following exercises.\n",
    "\n",
    "## Blackjack\n",
    "Analyze how to apply Q-Learning to solve Blackjack.\n",
    "You can find information in this [article](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/).\n",
    "\n",
    "## Doom\n",
    "Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "* [Gymnasium documentation](https://gymnasium.farama.org/).\n",
    "* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
    "* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
    "* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
    "* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
    "* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
    "* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
    "\n",
    "© Carlos Á. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
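For the Taxi exercise in the notebook above, the two points it raises (adapting to the Gymnasium call signatures and varying the learning rate) can be sketched as follows. This is only a minimal sketch under stated assumptions: `CorridorEnv` is a hypothetical, dependency-free stand-in that imitates the Gymnasium call shapes (`reset()` returns `(observation, info)`, `step()` returns `(observation, reward, terminated, truncated, info)`), and the schedule `lr_at` and every hyperparameter value are illustrative choices, not values from the course material.

```python
import random


class CorridorEnv:
    """Hypothetical stand-in that mimics the Gymnasium call signatures.

    States 0..n-1 lie on a line; action 0 moves left, action 1 moves right.
    Reaching the last state gives reward 1 and terminates the episode.
    """

    def __init__(self, n_states=6, max_steps=50):
        self.n_states = n_states
        self.n_actions = 2
        self.max_steps = max_steps

    def reset(self, seed=None):
        # Gymnasium-style: reset() returns (observation, info)
        self.state = 0
        self.steps = 0
        return self.state, {}

    def step(self, action):
        # Gymnasium-style: step() returns a 5-tuple
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        self.steps += 1
        terminated = self.state == self.n_states - 1
        truncated = self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, truncated, {}


def lr_at(episode, lr0=0.8, decay=0.01, lr_min=0.05):
    """Illustrative schedule (an assumption): inverse-time decay with a floor."""
    return max(lr_min, lr0 / (1.0 + decay * episode))


def train(env, episodes=300, gamma=0.9, epsilon=0.2, decay_lr=True, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for episode in range(episodes):
        lr = lr_at(episode) if decay_lr else 0.8
        state, _info = env.reset()
        done = False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(env.n_actions)  # explore
            else:
                best = max(q[state])
                ties = [a for a, v in enumerate(q[state]) if v == best]
                action = rng.choice(ties)  # exploit, breaking ties randomly
            new_state, reward, terminated, truncated, _info = env.step(action)
            # Q(s,a) := Q(s,a) + lr * [r + gamma * max Q(s',a') - Q(s,a)]
            target = reward + gamma * max(q[new_state])
            q[state][action] += lr * (target - q[state][action])
            state = new_state
            done = terminated or truncated
    return q


q_decay = train(CorridorEnv(), decay_lr=True)
q_const = train(CorridorEnv(), decay_lr=False)
# Greedy action for every non-terminal state (1 means "move right")
greedy_policy = [row.index(max(row)) for row in q_decay[:-1]]
print(greedy_policy)
```

With Gymnasium installed, `gymnasium.make("Taxi-v3")` would take the place of `CorridorEnv`, with the table sized from `env.observation_space.n` and `env.action_space.n`. Comparing `q_decay` and `q_const`, or the per-episode rewards, is one way to start the learning-rate analysis the exercise asks for.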
368
ml5/2_6_1_Q-Learning_Visualization.ipynb
Normal file
File diff suppressed because one or more lines are too long
274
ml5/qlearning.py
Normal file
@@ -0,0 +1,274 @@
# Class definition of QLearning

from pathlib import Path
from typing import NamedTuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

# Params

class Params(NamedTuple):
    total_episodes: int  # Total episodes
    learning_rate: float  # Learning rate
    gamma: float  # Discounting rate
    epsilon: float  # Exploration probability
    map_size: int  # Number of tiles of one side of the squared environment
    seed: int  # Define a seed so that we get reproducible results
    is_slippery: bool  # If True, the player moves in the intended direction with probability 1/3 and in each perpendicular direction with probability 1/3
    n_runs: int  # Number of runs
    action_size: int  # Number of possible actions
    state_size: int  # Number of possible states
    proba_frozen: float  # Probability that a tile is frozen
    savefig_folder: Path  # Root folder where plots are saved


class Qlearning:
    def __init__(self, learning_rate, gamma, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.reset_qtable()

    def update(self, state, action, reward, new_state):
        """Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]"""
        delta = (
            reward
            + self.gamma * np.max(self.qtable[new_state][:])
            - self.qtable[state][action]
        )
        q_update = self.qtable[state][action] + self.learning_rate * delta
        return q_update

    def reset_qtable(self):
        """Reset the Q-table."""
        self.qtable = np.zeros((self.state_size, self.action_size))


class EpsilonGreedy:
    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng

    def choose_action(self, action_space, state, qtable):
        """Choose an action `a` in the current world state (s)."""
        # First we draw a random number in [0, 1)
        explor_exploit_tradeoff = self.rng.uniform(0, 1)

        # Exploration
        if explor_exploit_tradeoff < self.epsilon:
            action = action_space.sample()

        # Exploitation (taking the biggest Q-value for this state)
        else:
            # Break ties randomly
            # If all Q-values are equal for this state we choose a random action
            # (otherwise `np.argmax()` would always take the first one)
            if np.all(qtable[state][:] == qtable[state][0]):
                action = action_space.sample()
            else:
                action = np.argmax(qtable[state][:])
        return action


def run_frozen_maps(maps, params, rng):
    """Run FrozenLake on each map size and plot the results."""
    map_sizes = maps
    res_all = pd.DataFrame()
    st_all = pd.DataFrame()

    for map_size in map_sizes:
        env = gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(
                size=map_size, p=params.proba_frozen, seed=params.seed
            ),
        )

        params = params._replace(action_size=env.action_space.n)
        params = params._replace(state_size=env.observation_space.n)
        env.action_space.seed(
            params.seed
        )  # Set the seed to get reproducible results when sampling the action space
        learner = Qlearning(
            learning_rate=params.learning_rate,
            gamma=params.gamma,
            state_size=params.state_size,
            action_size=params.action_size,
        )
        explorer = EpsilonGreedy(
            epsilon=params.epsilon,
            rng=rng
        )
        print(f"Map size: {map_size}x{map_size}")
        rewards, steps, episodes, qtables, all_states, all_actions = run_env(env, params, learner, explorer)

        # Save the results in dataframes
        res, st = postprocess(episodes, params, rewards, steps, map_size)
        res_all = pd.concat([res_all, res])
        st_all = pd.concat([st_all, st])
        qtable = qtables.mean(axis=0)  # Average the Q-table between runs

        plot_states_actions_distribution(
            states=all_states, actions=all_actions, map_size=map_size, params=params
        )  # Sanity check
        plot_q_values_map(qtable, env, map_size, params)

        env.close()
    return res_all, st_all


def run_env(env, params, learner, explorer):
    rewards = np.zeros((params.total_episodes, params.n_runs))
    steps = np.zeros((params.total_episodes, params.n_runs))
    episodes = np.arange(params.total_episodes)
    qtables = np.zeros((params.n_runs, params.state_size, params.action_size))
    all_states = []
    all_actions = []

    for run in range(params.n_runs):  # Run several times to account for stochasticity
        learner.reset_qtable()  # Reset the Q-table between runs

        for episode in tqdm(
            episodes, desc=f"Run {run}/{params.n_runs} - Episodes", leave=False
        ):
            state = env.reset(seed=params.seed)[0]  # Reset the environment
            step = 0
            done = False
            total_rewards = 0

            while not done:
                action = explorer.choose_action(
                    action_space=env.action_space, state=state, qtable=learner.qtable
                )

                # Log all states and actions
                all_states.append(state)
                all_actions.append(action)

                # Take the action (a) and observe the outcome state (s') and reward (r)
                new_state, reward, terminated, truncated, info = env.step(action)

                done = terminated or truncated

                learner.qtable[state, action] = learner.update(
                    state, action, reward, new_state
                )

                total_rewards += reward
                step += 1

                # The new state becomes the current state
                state = new_state

            # Log all rewards and steps
            rewards[episode, run] = total_rewards
            steps[episode, run] = step
        qtables[run, :, :] = learner.qtable

    return rewards, steps, episodes, qtables, all_states, all_actions


def postprocess(episodes, params, rewards, steps, map_size):
    """Convert the results of the simulation into dataframes."""
    # `rewards` and `steps` have shape (total_episodes, n_runs); flatten them
    # column by column so they line up with the tiled "Episodes" column
    res = pd.DataFrame(
        data={
            "Episodes": np.tile(episodes, reps=params.n_runs),
            "Rewards": rewards.flatten(order="F"),
            "Steps": steps.flatten(order="F"),
        }
    )
    res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
    res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])

    st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
    st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
    return res, st


def qtable_directions_map(qtable, map_size):
    """Get the best learned action & map it to arrows."""
    qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
    qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
    directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}
    qtable_directions = np.empty(qtable_best_action.flatten().shape, dtype=str)
    eps = np.finfo(float).eps  # Machine epsilon (gap between 1.0 and the next float)
    for idx, val in enumerate(qtable_best_action.flatten()):
        if qtable_val_max.flatten()[idx] > eps:
            # Assign an arrow only if a minimal Q-value has been learned as best action
            # otherwise since 0 is a direction, it also gets mapped on the tiles where
            # it didn't actually learn anything
            qtable_directions[idx] = directions[val]
    qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions


def plot_q_values_map(qtable, env, map_size, params):
    """Plot the last frame of the simulation and the policy learned."""
    qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)

    # Plot the last frame
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    ax[0].imshow(env.render())
    ax[0].axis("off")
    ax[0].set_title("Last frame")

    # Plot the policy
    sns.heatmap(
        qtable_val_max,
        annot=qtable_directions,
        fmt="",
        ax=ax[1],
        cmap=sns.color_palette("Blues", as_cmap=True),
        linewidths=0.7,
        linecolor="black",
        xticklabels=[],
        yticklabels=[],
        annot_kws={"fontsize": "xx-large"},
    ).set(title="Learned Q-values\nArrows represent best action")
    for _, spine in ax[1].spines.items():
        spine.set_visible(True)
        spine.set_linewidth(0.7)
        spine.set_color("black")
    img_title = f"frozenlake_q_values_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_states_actions_distribution(states, actions, map_size, params):
    """Plot the distributions of states and actions."""
    labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.histplot(data=states, ax=ax[0], kde=True)
    ax[0].set_title("States")
    sns.histplot(data=actions, ax=ax[1])
    ax[1].set_xticks(list(labels.values()), labels=labels.keys())
    ax[1].set_title("Actions")
    fig.tight_layout()
    img_title = f"frozenlake_states_actions_distrib_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_steps_and_rewards(rewards_df, steps_df, params):
    """Plot the steps and rewards from dataframes."""
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.lineplot(
        data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", ax=ax[0]
    )
    ax[0].set(ylabel="Cumulated rewards")

    sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", ax=ax[1])
    ax[1].set(ylabel="Averaged steps number")

    for axi in ax:
        axi.legend(title="map size")
    fig.tight_layout()
    img_title = "frozenlake_steps_and_rewards.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()
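To make the rule in the `Qlearning.update` docstring and the tie-breaking idea in `EpsilonGreedy.choose_action` concrete, here is a small worked sketch in plain Python. The helper names `q_update` and `greedy_with_random_ties` and all the numbers are illustrative assumptions, not part of `qlearning.py`. The tie-breaking helper is also a slightly different variant: it picks at random among all actions that share the maximum, whereas the module only samples at random when every Q-value in the row is equal.

```python
import random


def q_update(q_sa, reward, max_q_next, learning_rate, gamma):
    """Return Q(s,a) + lr * [reward + gamma * max_a' Q(s',a') - Q(s,a)]."""
    delta = reward + gamma * max_q_next - q_sa
    return q_sa + learning_rate * delta


def greedy_with_random_ties(q_row, rng):
    """Pick uniformly among the actions that share the largest Q-value."""
    best = max(q_row)
    candidates = [a for a, v in enumerate(q_row) if v == best]
    return rng.choice(candidates)


# Worked example with illustrative numbers:
# Q(s,a) = 0.5, reward = 1.0, max Q(s',.) = 2.0, lr = 0.5, gamma = 0.5
# delta = 1.0 + 0.5 * 2.0 - 0.5 = 1.5, so the new value is 0.5 + 0.5 * 1.5 = 1.25
new_q = q_update(q_sa=0.5, reward=1.0, max_q_next=2.0, learning_rate=0.5, gamma=0.5)
print(new_q)

rng = random.Random(123)
untrained_row = [0.0, 0.0, 0.0, 0.0]  # every action ties
trained_row = [0.1, 0.7, 0.2, 0.7]  # actions 1 and 3 tie for the maximum
picks_untrained = {greedy_with_random_ties(untrained_row, rng) for _ in range(200)}
picks_trained = {greedy_with_random_ties(trained_row, rng) for _ in range(200)}
print(sorted(picks_untrained), sorted(picks_trained))
```

Breaking ties at random matters most early in training, when the Q-table is still all zeros: always taking the first index would bias the agent toward action 0 (LEFT in FrozenLake).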