RL Value Iteration Method
The Value Iteration method is a fundamental technique used in value-based reinforcement learning. In the previous chapter, we discussed the concept of value-based learning, which forms the basis for the Value Iteration method.
In value-based learning, the goal is to calculate the value (V) of each state in the environment. The value represents the expected return or utility associated with being in a particular state. It quantifies how desirable a state is for an agent.
First, we calculate the action value Q(s, a) for every state and every action the agent can take.
Then we update the value of each state to the maximum expected return over all of its actions, i.e. V(s) = max over a of Q(s, a).
We repeat these value-iteration sweeps until the values stop changing (or the change falls below a small threshold).
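In equation form, one value-iteration sweep applies the Bellman optimality update to every state s:
Q(s, a) = Σ_s′ p(s′|s, a) ∗ (r(s, a, s′) + γ ∗ V(s′)) for every action a
V(s) ← max_a Q(s, a)
where p(s′|s, a) is the probability of landing in state s′ after taking action a in state s, r(s, a, s′) is the reward of that transition and γ is the discount factor.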
That was a brief recap of the Value Iteration method.
Implementation
Let's initialise an agent that records values, rewards and transitions.
Value: stores the best value found so far for each state.
Reward: stores the reward observed for each (state, action, next_state) triple.
Transition: counts how often each action moved the agent from one state to another.
import gym
import collections

ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9  # discount factor

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME, render_mode="human", is_slippery=True)
        self.state, _ = self.env.reset()
        self.reward_bank = collections.defaultdict(float)               # (state, action, next_state) -> reward
        self.transition = collections.defaultdict(collections.Counter)  # (state, action) -> Counter of next states
        self.value = collections.defaultdict(float)                     # state -> state value
This function gathers random experience from the environment and updates the reward and transition tables.
    def random_play(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            n_state, reward, terminated, truncated, info = self.env.step(action)
            self.reward_bank[(self.state, action, n_state)] = reward
            self.transition[(self.state, action)][n_state] += 1
            if terminated or truncated:
                self.state, _ = self.env.reset()
            else:
                self.state = n_state
Note: we don't need to wait until the end of an episode to start learning. We simply perform N random steps and remember their outcomes. This is a common difference from policy-based methods, which typically need complete episodes, whereas value-based methods can learn from individual transitions.
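As a quick sanity check (the counts in the comment below are illustrative, not real output), we can gather some random experience and look at the transition counter for one (state, action) pair:
agent = Agent()
agent.random_play(100)
# Counter of next states observed after taking action 1 in state 0,
# e.g. Counter({0: 12, 4: 9, 1: 11}) -- the exact counts vary from run to run
print(agent.transition[(0, 1)])
Next comes the value-iteration sweep itself.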
    def value_iteration(self):
        # for each state s
        for s in range(self.env.observation_space.n):
            Vs = []
            # for each action a available in state s
            for a in range(self.env.action_space.n):
                # calculate the Bellman approximation of the action value
                P = self.env.unwrapped.P[s][a]
                Vs.append(self.Q(P))
            # the state value is the maximum action value over all actions
            self.value[s] = max(Vs)
The above function updates the value V(s) of every state with respect to all possible actions the agent can take in that state S. For each action we need the probability of each next state and the associated reward, which the FrozenLake environment exposes through self.env.unwrapped.P[state][action].
P = env.unwrapped.P[0][1]
"""list of (probability, next_state, reward, is_done) tuples:
[(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False)]"""
We then calculate the action value Q(s, a) for every action we can perform from state S, weighting each possible next state by its transition probability. The Bellman equation gives us the action value, and the state value of S is simply the maximum action value over the whole action space.
    def Q(self, transitions):
        action_value = 0.0
        # expected value over all possible next states s' for this (state, action) pair
        for prob, s_, r, _ in transitions:
            # Bellman approximation: immediate reward plus the discounted value of the next state
            bellman_value = r + GAMMA * self.value[s_]
            action_value += prob * bellman_value
        return action_value
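As a quick check of this helper (assuming the agent created earlier, before any value-iteration sweep), we can compute the action value of taking action 1 in state 0 directly from the environment's transition model:
transitions = agent.env.unwrapped.P[0][1]
# prints 0.0 here, because the value table is still empty and these transitions carry zero reward
print(agent.Q(transitions))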
We can run value-iteration sweeps in a loop and watch the state values evolve; in the full version below we stop once the average reward over a batch of test episodes crosses a chosen threshold.
agent = Agent()
i = 0
while True:
    agent.random_play(100)
    agent.value_iteration()
    i += 1
    print(f"Values after sweep {i} ...........................\n{agent.value}")
Once you start running the code, you will notice that the value of each state updates at every iteration. The output will look something like this.
defaultdict(<class 'float'>, {0.0: 0.0, 4.0: 0.0, 1.0: 0.0, 5.0: 0.0, 2.0: 0.0, 6.0: 0.09, 3.0: 0.0, 7.0: 0.0, 8.0: 0.0, 10.0: 0.435, 12.0: 0.0, 9.0: 0.18, 13.0: 0.49799999999999994, 14.0: 1.2798999999999998, 11.0: 0.0, 15.0: 0.0})
Values after sweep 2 ...........................
defaultdict(<class 'float'>, {0.0: 0.0, 4.0: 0.0, 1.0: 0.0, 5.0: 0.0, 2.0: 0.027, 6.0: 0.1386, 3.0: 0.0081, 7.0: 0.0, 8.0: 0.054, 10.0: 0.51438, 12.0: 0.0, 9.0: 0.2961, 13.0: 0.6221999999999999, 14.0: 1.3409739999999999, 11.0: 0.0, 15.0: 0.0})
Values after sweep 3 ...........................
defaultdict(<class 'float'>, {0.0: 0.0, 4.0: 0.0162, 1.0: 0.0081, 5.0: 0.0, 2.0: 0.052110000000000004, 6.0: 0.169947, 3.0: 0.020493, 7.0: 0.0, 8.0: 0.10988999999999999, 10.0: 0.5654585999999999, 12.0: 0.0, 9.0: 0.37394099999999997, 13.0: 0.7011344999999999, 14.0: 1.37997793, 11.0: 0.0, 15.0: 0.0})
We can also update the state values while playing the game. To do that, we need a couple more functions that help us select the best action before each step.
    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            transitions = self.env.unwrapped.P[state][action]
            action_value = self.Q(transitions)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        # keep the value table up to date with the best action value found
        self.value[state] = best_value
        return best_action
This function iterates through all possible actions for a given state S, calculates the action value of each, and returns the action a with the maximum action value.
    def play_episode(self, env):
        total_reward = 0.0
        state, _ = env.reset()
        while True:
            action = self.select_action(state)
            n_state, reward, terminated, truncated, info = env.step(action)
            self.reward_bank[(state, action, n_state)] = reward
            self.transition[(state, action)][n_state] += 1
            total_reward += reward
            if terminated or truncated:
                break
            state = n_state
        return total_reward
This function plays one episode in the given environment and returns the total reward collected over that episode.
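For example, assuming we create a separate test environment without human rendering (so that evaluation is not slowed down by the on-screen animation):
t_env = gym.make(ENV_NAME, is_slippery=True)
total = agent.play_episode(t_env)
print(f"Episode reward: {total}")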
Putting It All Together
import gym
import collections

ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9
TEST_EPISODE = 20

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME, render_mode="human", is_slippery=True)
        self.state, _ = self.env.reset()
        self.reward_bank = collections.defaultdict(float)
        self.transition = collections.defaultdict(collections.Counter)
        self.value = collections.defaultdict(float)
    def Q(self, transitions):
        action_value = 0.0
        # expected value over all possible next states s' for this (state, action) pair
        for prob, s_, r, _ in transitions:
            bellman_value = r + GAMMA * self.value[s_]
            action_value += prob * bellman_value
        return action_value
    def play_episode(self, env):
        total_reward = 0.0
        state, _ = env.reset()
        while True:
            action = self.select_action(state)
            n_state, reward, terminated, truncated, info = env.step(action)
            self.reward_bank[(state, action, n_state)] = reward
            self.transition[(state, action)][n_state] += 1
            total_reward += reward
            if terminated or truncated:
                break
            state = n_state
        return total_reward
    def value_iteration(self):
        for s in range(self.env.observation_space.n):
            Vs = []
            for a in range(self.env.action_space.n):
                P = self.env.unwrapped.P[s][a]
                Vs.append(self.Q(P))
            self.value[s] = max(Vs)
    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            transitions = self.env.unwrapped.P[state][action]
            action_value = self.Q(transitions)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        self.value[state] = best_value
        return best_action
    def random_play(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            n_state, reward, terminated, truncated, info = self.env.step(action)
            self.transition[(self.state, action)][n_state] += 1
            self.reward_bank[(self.state, action, n_state)] = reward
            if terminated or truncated:
                self.state, _ = self.env.reset()
            else:
                self.state = n_state
agent = Agent()
# a separate test environment without human rendering keeps evaluation fast
t_env = gym.make(ENV_NAME, is_slippery=True)
i = 0
while True:
    agent.random_play(100)
    agent.value_iteration()
    i += 1
    print(f"Values after sweep {i} ...........................\n{agent.value}")
    reward = 0.0
    for _ in range(TEST_EPISODE):
        reward += agent.play_episode(t_env)
    reward /= TEST_EPISODE
    if reward > 0.8:
        print(f"Done with final reward {reward}")
        break
Constraints
Value iteration simply overwrites the old state value with the new one, so the old estimate is lost. Tabular Q-learning handles this differently.
In tabular Q-learning we don't want the new value to completely replace the old Q-value of our action on each update. What if we just got lucky and our opponent made a mistake, i.e. we won, but the action wasn't actually that good? To account for this, we set the new Q-value to a value somewhere between the old Q-value and the potential new Q-value, a technique known as blending. The potential new Q-value (the target) is the immediate reward plus the discounted best Q-value of the next state:
Q_new(S, A) = r + γ ∗ max_a Q(S′, a)
We use an additional parameter, the learning rate α, to set how much this new value changes the old one. By repeatedly playing game after game and applying the blended update after each step, the Q-values converge to the true Q-values for the agent:
Q(S, A) ← (1 − α) ∗ Q(S, A) + α ∗ (r + γ ∗ max_a Q(S′, a))
Where α is the learning rate.
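As a minimal sketch of that blended update in code (the q_table dictionary, the ALPHA value and the update function below are hypothetical, purely to illustrate the formula):
import collections

GAMMA = 0.9
ALPHA = 0.2  # learning rate, a hypothetical choice

q_table = collections.defaultdict(float)  # (state, action) -> Q-value

def q_learning_update(state, action, reward, next_state, n_actions):
    # target: immediate reward plus the discounted best Q-value of the next state
    target = reward + GAMMA * max(q_table[(next_state, a)] for a in range(n_actions))
    # blend the old estimate with the new target using the learning rate alpha
    q_table[(state, action)] = (1 - ALPHA) * q_table[(state, action)] + ALPHA * target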
We will discuss all of this in an upcoming post. Clap if this post helped you understand the basic concepts of RL algorithms.