RL Value Iteration Method
The Value Iteration method is a fundamental technique used in value-based reinforcement learning. In the previous chapter, we discussed the concept of value-based learning, which forms the basis for the Value Iteration method.
In value-based learning, the goal is to calculate the value (V) of each state in the environment. The value represents the expected return or utility associated with being in a particular state. It quantifies how desirable a state is for an agent.
First, we calculate the action value Q(s, a) for every state and every action the agent can take.
Then we update the value of each state to the maximum expected return over all of its actions, i.e. V(s) = max over a of Q(s, a).
We repeat these value-iteration sweeps until the values stop changing (or the change falls below a small threshold).
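In equation form, one value-iteration sweep applies the Bellman optimality update to every state s:
Q(s, a) = Σ_s′ p(s′|s, a) ∗ (r(s, a, s′) + γ ∗ V(s′)) for every action a
V(s) ← max_a Q(s, a)
where p(s′|s, a) is the probability of landing in state s′ after taking action a in state s, r(s, a, s′) is the reward of that transition and γ is the discount factor.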
That was a brief recap of the Value Iteration method.
Implementation
Let's initialise an agent that records values, rewards and transitions.
Value: stores the best value found so far for each state.
Reward: stores the reward observed for each (state, action, next_state) triple.
Transition: counts how often each action moved the agent from one state to another.
import gym
import collections

ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9  # discount factor

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME, render_mode="human", is_slippery=True)
        self.state, _ = self.env.reset()
        self.reward_bank = collections.defaultdict(float)               # (state, action, next_state) -> reward
        self.transition = collections.defaultdict(collections.Counter)  # (state, action) -> Counter of next states
        self.value = collections.defaultdict(float)                     # state -> state value
This function gathers random experience from the environment and updates the reward and transition tables.
    def random_play(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            n_state, reward, terminated, truncated, info = self.env.step(action)
            self.reward_bank[(self.state, action, n_state)] = reward
            self.transition[(self.state, action)][n_state] += 1
            if terminated or truncated:
                self.state, _ = self.env.reset()
            else:
                self.state = n_state
Note: we don't need to wait until the end of an episode to start learning. We simply perform N random steps and remember their outcomes. This is a common difference from policy-based methods, which typically need complete episodes, whereas value-based methods can learn from individual transitions.
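As a quick sanity check (the counts in the comment below are illustrative, not real output), we can gather some random experience and look at the transition counter for one (state, action) pair:
agent = Agent()
agent.random_play(100)
# Counter of next states observed after taking action 1 in state 0,
# e.g. Counter({0: 12, 4: 9, 1: 11}) -- the exact counts vary from run to run
print(agent.transition[(0, 1)])
Next comes the value-iteration sweep itself.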
    def value_iteration(self):
        # for each state s
        for s in range(self.env.observation_space.n):
            Vs = []
            # for each action a available in state s
            for a in range(self.env.action_space.n):
                # calculate the Bellman approximation of the action value
                P = self.env.unwrapped.P[s][a]
                Vs.append(self.Q(P))
            # the state value is the maximum action value over all actions
            self.value[s] = max(Vs)
The above function updates the value V(s) of every state with respect to all possible actions the agent can take in that state S. For each action we need the probability of each next state and the associated reward, which the FrozenLake environment exposes through self.env.unwrapped.P[state][action].
P = env.unwrapped.P[0][1]
"""list of (probability, next_state, reward, is_done) tuples:
[(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False)]"""
We then calculate the action value Q(s, a) for every action we can perform from state S, weighting each possible next state by its transition probability. The Bellman equation gives us the action value, and the state value of S is simply the maximum action value over the whole action space.
    def Q(self, transitions):
        action_value = 0.0
        # expected value over all possible next states s' for this (state, action) pair
        for prob, s_, r, _ in transitions:
            # Bellman approximation: immediate reward plus the discounted value of the next state
            bellman_value = r + GAMMA * self.value[s_]
            action_value += prob * bellman_value
        return action_value
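As a quick check of this helper (assuming the agent created earlier, before any value-iteration sweep), we can compute the action value of taking action 1 in state 0 directly from the environment's transition model:
transitions = agent.env.unwrapped.P[0][1]
# prints 0.0 here, because the value table is still empty and these transitions carry zero reward
print(agent.Q(transitions))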
We can run value-iteration sweeps in a loop and watch the state values evolve; in the full version below we stop once the average reward over a batch of test episodes crosses a chosen threshold.
agent = Agent()
i = 0
while True:
    agent.random_play(100)
    agent.value_iteration()
    i += 1
    print(f"Values after sweep {i} ...........................\n{agent.value}")
Once you start running the code, you will notice that the value of each state updates at every iteration. The output will look something like this.
defaultdict(<class 'float'>, {0.0: 0.0, 4.0: 0.0, 1.0: 0.0, 5.0: 0.0, 2.0: 0.0, 6.0: 0.09, 3.0: 0.0, 7.0: 0.0, 8.0: 0.0, 10.0: 0.435, 12.0: 0.0, 9.0: 0.18, 13.0: 0.49799999999999994, 14.0: 1.2798999999999998, 11.0: 0.0, 15.0: 0.0})
Values after sweep 2 ...........................
defaultdict(<class 'float'>, {0.0: 0.0, 4.0: 0.0, 1.0: 0.0, 5.0: 0.0, 2.0: 0.027, 6.0: 0.1386, 3.0: 0.0081, 7.0: 0.0, 8.0: 0.054, 10.0: 0.51438, 12.0: 0.0, 9.0: 0.2961, 13.0: 0.6221999999999999, 14.0: 1.3409739999999999, 11.0: 0.0, 15.0: 0.0})
Values after sweep 3 ...........................
defaultdict(<class 'float'>, {0.0: 0.0, 4.0: 0.0162, 1.0: 0.0081, 5.0: 0.0, 2.0: 0.052110000000000004, 6.0: 0.169947, 3.0: 0.020493, 7.0: 0.0, 8.0: 0.10988999999999999, 10.0: 0.5654585999999999, 12.0: 0.0, 9.0: 0.37394099999999997, 13.0: 0.7011344999999999, 14.0: 1.37997793, 11.0: 0.0, 15.0: 0.0})
We can also update the state values while playing the game. To do that, we need a couple more functions that help us select the best action before each step.
    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            transitions = self.env.unwrapped.P[state][action]
            action_value = self.Q(transitions)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        # keep the value table up to date with the best action value found
        self.value[state] = best_value
        return best_action
This function iterates through all possible actions for a given state S, calculates the action value of each, and returns the action a with the maximum action value.
    def play_episode(self, env):
        total_reward = 0.0
        state, _ = env.reset()
        while True:
            action = self.select_action(state)
            n_state, reward, terminated, truncated, info = env.step(action)
            self.reward_bank[(state, action, n_state)] = reward
            self.transition[(state, action)][n_state] += 1
            total_reward += reward
            if terminated or truncated:
                break
            state = n_state
        return total_reward
This function plays one episode in the given environment and returns the total reward collected over that episode.
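For example, assuming we create a separate test environment without human rendering (so that evaluation is not slowed down by the on-screen animation):
t_env = gym.make(ENV_NAME, is_slippery=True)
total = agent.play_episode(t_env)
print(f"Episode reward: {total}")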
Putting It All Together
import gym
import collections

ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9
TEST_EPISODE = 20

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME, render_mode="human", is_slippery=True)
        self.state, _ = self.env.reset()
        self.reward_bank = collections.defaultdict(float)
        self.transition = collections.defaultdict(collections.Counter)
        self.value = collections.defaultdict(float)
    def Q(self, transitions):
        action_value = 0.0
        # expected value over all possible next states s' for this (state, action) pair
        for prob, s_, r, _ in transitions:
            bellman_value = r + GAMMA * self.value[s_]
            action_value += prob * bellman_value
        return action_value
    def play_episode(self, env):
        total_reward = 0.0
        state, _ = env.reset()
        while True:
            action = self.select_action(state)
            n_state, reward, terminated, truncated, info = env.step(action)
            self.reward_bank[(state, action, n_state)] = reward
            self.transition[(state, action)][n_state] += 1
            total_reward += reward
            if terminated or truncated:
                break
            state = n_state
        return total_reward
    def value_iteration(self):
        for s in range(self.env.observation_space.n):
            Vs = []
            for a in range(self.env.action_space.n):
                P = self.env.unwrapped.P[s][a]
                Vs.append(self.Q(P))
            self.value[s] = max(Vs)
    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            transitions = self.env.unwrapped.P[state][action]
            action_value = self.Q(transitions)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        self.value[state] = best_value
        return best_action
    def random_play(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            n_state, reward, terminated, truncated, info = self.env.step(action)
            self.transition[(self.state, action)][n_state] += 1
            self.reward_bank[(self.state, action, n_state)] = reward
            if terminated or truncated:
                self.state, _ = self.env.reset()
            else:
                self.state = n_state
agent = Agent()
# a separate test environment without human rendering keeps evaluation fast
t_env = gym.make(ENV_NAME, is_slippery=True)
i = 0
while True:
    agent.random_play(100)
    agent.value_iteration()
    i += 1
    print(f"Values after sweep {i} ...........................\n{agent.value}")
    reward = 0.0
    for _ in range(TEST_EPISODE):
        reward += agent.play_episode(t_env)
    reward /= TEST_EPISODE
    if reward > 0.8:
        print(f"Done with final reward {reward}")
        break
Constraints
Value iteration simply overwrites the old state value with the new one, so the old estimate is lost. Tabular Q-learning handles this differently.
In tabular Q-learning we don't want the new value to completely replace the old Q-value of our action on each update. What if we just got lucky and our opponent made a mistake, i.e. we won, but the action wasn't actually that good? To account for this, we set the new Q-value to a value somewhere between the old Q-value and the potential new Q-value, a technique known as blending. The potential new Q-value (the target) is the immediate reward plus the discounted best Q-value of the next state:
Q_new(S, A) = r + γ ∗ max_a Q(S′, a)
We use an additional parameter, the learning rate α, to set how much this new value changes the old one. By repeatedly playing game after game and applying the blended update after each step, the Q-values converge to the true Q-values for the agent:
Q(S, A) ← (1 − α) ∗ Q(S, A) + α ∗ (r + γ ∗ max_a Q(S′, a))
Where α is the learning rate.
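As a minimal sketch of that blended update in code (the q_table dictionary, the ALPHA value and the update function below are hypothetical, purely to illustrate the formula):
import collections

GAMMA = 0.9
ALPHA = 0.2  # learning rate, a hypothetical choice

q_table = collections.defaultdict(float)  # (state, action) -> Q-value

def q_learning_update(state, action, reward, next_state, n_actions):
    # target: immediate reward plus the discounted best Q-value of the next state
    target = reward + GAMMA * max(q_table[(next_state, a)] for a in range(n_actions))
    # blend the old estimate with the new target using the learning rate alpha
    q_table[(state, action)] = (1 - ALPHA) * q_table[(state, action)] + ALPHA * target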
We will discuss all of this in an upcoming post. Clap if this post helped you understand the basic concepts of RL algorithms.