Online Tutorials (Video)
A collection of videos on reinforcement learning, covering not only the basic concepts but related topics as well.
- Prof. 김성훈's 모두를 위한 RL강좌 (RL for Everyone)
- 팡요랩's 강화학습의 기초 이론 (Fundamentals of Reinforcement Learning)
- 김태훈's 알아두면 쓸데있는 신기한 강화학습 (Curiously Useful Reinforcement Learning)
- 이웅원's RLCode와 A3C 쉽고 깊게 이해하기 (Understanding RLCode and A3C, Easily and Deeply)
- 곽동현's Introduction of Deep Reinforcement Learning
- David Silver's RL Course
- David Silver's UCL Course on RL
- Deep RL Bootcamp (Berkeley, CA)
- UC Berkeley's CS294-112
- CS 8803 - Reinforcement Learning (Georgia Tech)
- CS885 Reinforcement Learning - Spring 2018 - University of Waterloo
Online Tutorials (Text)
- Prof. Sutton's Reinforcement Learning: An Introduction
- An Introduction to Deep Reinforcement Learning. It is recent and covers papers up to late 2018, so it is recommended as the next read after finishing Prof. Sutton's book.
- OpenAI's Spinning Up
- 이웅원's Fundamental of Reinforcement Learning
- Arthur Juliani's tutorials
- Learning free kicks in FIFA 2018 with reinforcement learning (in English)
- A Free course in Deep Reinforcement Learning from beginner to expert
Books
- 아서 줄리아니 (Arthur Juliani)'s 강화학습 첫걸음 (First Steps with Reinforcement Learning); you can learn a fair amount from his blog posts alone as well
- 이웅원's 파이썬과 케라스로 배우는 강화학습 (Reinforcement Learning with Python and Keras)
- 파이썬과 케라스를 이용한 딥러닝/강화학습 주식투자 (Deep Learning / Reinforcement Learning Stock Trading with Python and Keras)
- Maxim Lapan's Deep Reinforcement Learning Hands-On
- 텐서플로로 구현하는 딥러닝과 강화학습 (Deep Learning and Reinforcement Learning Implemented with TensorFlow)
- PyTorch를 활용한 강화학습/심층강화학습 실전 입문 (A Practical Introduction to Reinforcement Learning and Deep Reinforcement Learning with PyTorch)
- 따라 하면서 배우는 유니티 ML-Agents (Learning Unity ML-Agents by Following Along)
GitHub
The script below is an implementation of the Actor-Critic method on the CartPole-v0 environment.
Actor Critic Method
As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to two possible outputs:
- Recommended action: A probability value for each action in the action space. The part of the agent responsible for this output is called the actor.
- Estimated rewards in the future: Sum of all rewards it expects to receive in the future. The part of the agent responsible for this output is the critic.
The actor and the critic learn to perform their tasks jointly, so that the actions recommended by the actor maximize the rewards.
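Concretely, for each timestep t the training loop later in this post optimizes the two heads with the objectives below, where G_t is the discounted return observed from timestep t onward, V(s_t) is the critic's value estimate, and pi(a_t | s_t) is the probability the actor assigned to the chosen action. The notation is mine, but the quantities map directly onto the variables `ret`, `value`, and `log_prob` in the training code:

$$L_{\text{actor}} = -\log \pi(a_t \mid s_t)\,\big(G_t - V(s_t)\big), \qquad L_{\text{critic}} = \text{Huber}\big(V(s_t),\, G_t\big)$$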
CartPole-v0
A pole is attached to a cart placed on a frictionless track. The agent has to apply force to move the cart. It is rewarded for every time step the pole remains upright. The agent, therefore, must learn to keep the pole from falling over.
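If you want to poke at the environment before training, a quick inspection like the following (my own snippet, using the same classic gym API as the code below) shows what the agent actually observes and which actions it can take:

import gym

env = gym.make("CartPole-v0")
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push cart left, 1 = push cart right

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())  # take one random action
print(reward, done)           # reward is +1.0 for every step the pole stays upright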
Setup
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Configuration parameters for the whole setup
seed = 42
gamma = 0.99 # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0") # Create the environment
env.seed(seed)
eps = np.finfo(np.float32).eps.item() # Smallest number such that 1.0 + eps != 1.0
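Note that the block above seeds only the environment. If you want runs that are closer to reproducible, you can additionally seed NumPy and TensorFlow (optional, not part of the original setup):

np.random.seed(seed)      # affects the action sampling done with np.random.choice below
tf.random.set_seed(seed)  # affects the weight initialization of the Keras layers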
Implement Actor Critic network
This network learns two functions:
- Actor: This takes as input the state of our environment and returns a probability value for each action in its action space.
- Critic: This takes as input the state of our environment and returns an estimate of total rewards in the future.
In our implementation, they share the initial layer.
num_inputs = 4
num_actions = 2
num_hidden = 128
inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)
model = keras.Model(inputs=inputs, outputs=[action, critic])
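As a quick sanity check (not in the original code) that both heads hang off the same trunk, you can print the model summary; the comments describe the layers defined above:

model.summary()
# A single shared Dense(128) layer feeds two output heads:
#   action: Dense(2) with softmax -> probability of each action (the actor)
#   critic: Dense(1)              -> estimated future reward (the critic)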
Train
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0
while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards in the past are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
Visualizations
In early stages of training: (animation of the agent, not shown here)
In later stages of training: (animation of the agent, not shown here)
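To watch the trained agent yourself, a short greedy evaluation loop along these lines works (my own sketch, reusing the `env` and `model` defined above; it assumes the training loop has already run):

state = env.reset()
done = False
total_reward = 0
while not done:
    env.render()  # pops up a window showing the cart and pole
    state_tensor = tf.expand_dims(tf.convert_to_tensor(state), 0)
    action_probs, _ = model(state_tensor)
    action = int(np.argmax(np.squeeze(action_probs)))  # act greedily at evaluation time
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("Evaluation episode reward:", total_reward)
env.close()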