cartpole in keras (a2c)

Data Science/With Keras | 2020. 6. 28. 23:12 | Posted by youGom
 
 
Introduction

This script shows an implementation of the Actor Critic method on the CartPole-V0 environment.

Actor Critic Method

As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to two possible outputs:

  1. Recommended action: A probability value for each action in the action space. The part of the agent responsible for this output is called the actor.
  2. Estimated rewards in the future: Sum of all rewards it expects to receive in the future. The part of the agent responsible for this output is the critic.

The actor and critic learn to perform their tasks jointly, such that the actions recommended by the actor maximize the rewards.
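To make the two loss terms concrete before the full implementation below, here is a condensed, standalone sketch of the update that the training loop performs per timestep (the numbers are made up; only the formulas mirror the code later in this post):

import tensorflow as tf

log_prob = tf.math.log(tf.constant(0.6))  # log-probability of the action that was sampled
value = tf.constant(0.8)                  # critic's estimate of the future return
ret = tf.constant(1.2)                    # actual (discounted, normalized) return

advantage = ret - value                   # how much better we did than the critic expected
actor_loss = -log_prob * advantage        # raise the probability of better-than-expected actions
critic_loss = tf.keras.losses.Huber()(
    tf.expand_dims(value, 0), tf.expand_dims(ret, 0)
)                                         # pull the value estimate toward the actual return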

CartPole-V0

A pole is attached to a cart placed on a frictionless track. The agent has to apply force to move the cart. It is rewarded for every time step the pole remains upright. The agent, therefore, must learn to keep the pole from falling over.
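As a quick sanity check (not part of the original script), you can inspect the environment's spaces; in CartPole-v0 the observation is a 4-dimensional vector (cart position, cart velocity, pole angle, pole angular velocity) and there are two discrete actions (push the cart left or right):

import gym

env = gym.make("CartPole-v0")
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right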


Setup

import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Create the environment
env.seed(seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0
 

Implement Actor Critic network

This network learns two functions:

  1. Actor: This takes as input the state of our environment and returns a probability value for each action in its action space.
  2. Critic: This takes as input the state of our environment and returns an estimate of total rewards in the future.

In our implementation, they share the initial layer.

num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])
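As a quick check (not in the original script), calling the model on a dummy state should produce the two heads' outputs: a length-2 probability vector from the actor and a single value estimate from the critic:

dummy_state = tf.zeros((1, num_inputs))
action_probs, critic_value = model(dummy_state)
print(action_probs.shape)  # (1, 2): softmax over the two actions
print(critic_value.shape)  # (1, 1): scalar value estimate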
 

Train

optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards in the past are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
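As a side note (not part of the script above), the discounted-return loop in the training code is easy to verify by hand on a toy reward list; with gamma = 0.99, three rewards of 1.0 give returns of roughly 2.97, 1.99 and 1.0:

gamma = 0.99
rewards_history = [1.0, 1.0, 1.0]
returns, discounted_sum = [], 0.0
for r in rewards_history[::-1]:
    discounted_sum = r + gamma * discounted_sum
    returns.insert(0, discounted_sum)
print(returns)  # approximately [2.9701, 1.99, 1.0]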
 

Visualizations

In early stages of training:

[animation not preserved]

In later stages of training:

[animation not preserved]
 

jupyter themes 주피터 노트북 테마 바꾸기

Server/Python | 2020. 6. 24. 15:57 | Posted by youGom

It's very simple.

> pip install jupyterthemes

> jt -l

> jt -t {theme name}

> jt -t chesterish

List of theme names:
   chesterish
   grade3
   gruvboxd
   gruvboxl
   monokai
   oceans16
   onedork
   solarizedd
   solarizedl
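If you ever want to go back to the default look, the jupyterthemes README also documents a reset option (you may need to refresh the notebook page after switching themes):

> jt -r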

 

-----------------------------------------------

The default was too bright and I just wanted to reduce eye strain; if I had to recommend one, monokai seems to be the easiest on the eyes.

> jt -t monokai

-----------------------------------------------

If you are using conda, activate your conda environment first and then run the installation steps above.

If even that feels like a hassle (and you're on Windows),

press the Windows key, search for Anaconda Prompt, and run it; a command window opens with the (base) environment already active.
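The conda route looks like this, for example ({env name} is just a placeholder for your own environment):

> conda activate {env name}

> pip install jupyterthemes

> jt -t monokai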

------------------------------------------------

It probably won't come up, but the jt command may fail to run, for example if its location was never added to the PATH environment variable.

The jt executable sits in a folder called Scripts inside the Anaconda folder under ProgramData.

It is usually installed on the C drive; if you are on Linux or installed to another drive, look there as well, and if you still can't track it down,

searching for jt with find (or F3 in the file explorer) is probably the fastest way.
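From an Anaconda Prompt (which sets up the PATH for you), you can also just ask where the command lives:

> where jt

(on Linux/macOS, which jt does the same thing)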

-------------------------------------------------

It would be nice if everything worked on the first try, but errors do come up.

If you are behind a proxy or need a certificate, something like this should work:

pip --cert="path/authfile" --proxy 0.0.0.0:8080 install jupyterthemes

-------------------------------------------------------

 

In my case, two errors occurred, so I reinstalled the modules involved:

error : no module named pywin32_bootstrap

> pip install --ignore-installed pywin32 --user

error : no module named 'numpy.core._ufunc_config'

> pip install --upgrade numpy

---------------------------------------------------

If you want more detail, such as the rest of the command-line options

or how to use it from within Python code,

the link below covers everything (it's in English):

https://github.com/dunovank/jupyter-themes/blob/master/README.md

https://towardsdatascience.com/plotting-in-pandas-just-got-prettier-289d0e0fe5c0 : you can call plot directly on a pandas DataFrame, and bokeh or plotly can be used as the plotting backend.
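A minimal sketch of switching the backend (assumes pandas 0.25+ with plotly installed, or the pandas-bokeh package for the bokeh case; the DataFrame is just dummy data):

import numpy as np
import pandas as pd

pd.options.plotting.backend = "plotly"  # or "pandas_bokeh" after installing pandas-bokeh

df = pd.DataFrame({"x": np.arange(10), "y": np.random.randn(10).cumsum()})
fig = df.plot(x="x", y="y")  # now returns a plotly figure instead of a matplotlib Axes
fig.show()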