Lunar Lander Reinforcement Learning


Lunar Lander Deep Q-Learning (Reinforcement Learning)

DQN is an improvement over Q-Learning, a reinforcement learning (RL) algorithm used to train an agent to take optimal actions in an environment. Instead of using a Q-table, DQN uses a neural network to approximate the Q-values for each state-action pair.

Note: this project is under development.



\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \]
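For reference, this is the tabular Q-learning update that DQN replaces with a neural network. The sketch below is purely illustrative; the names (q_table, alpha, gamma) are assumptions, and Lunar Lander's continuous state space is exactly why a table like this is not practical here.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: Q(s,a) += alpha * (TD target - Q(s,a))."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table
```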

  1. Mathematical Foundation: the agent tries to learn the optimal action-value function Q(s, a).
  2. Q-Function

    \[ Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right] \]

    Bellman Equation for Q-Function

    \[ Q(s, a) = r + \gamma \max_{a'} Q(s', a') \]

  3. Initialization: the LunarLander environment has an 8-dimensional state space and 4 discrete actions (do nothing, fire the left engine, fire the main engine, fire the right engine); a minimal setup sketch follows this list.
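A minimal environment setup might look like the sketch below. It assumes the Gymnasium package and the "LunarLander-v2" environment id; newer Gymnasium releases register it as "LunarLander-v3".

```python
import gymnasium as gym

# The id may need to be "LunarLander-v3" on newer Gymnasium versions.
env = gym.make("LunarLander-v2")

print(env.observation_space)  # Box with shape (8,) - the 8-dimensional state
print(env.action_space)       # Discrete(4)         - the 4 possible actions

state, info = env.reset(seed=0)
next_state, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```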

4. Deep Q-Network (DQN) Approximation

Since storing and updating a Q-table explicitly is infeasible for large or continuous state spaces, we approximate Q(s, a) with a neural network Q_θ(s, a) parameterized by θ. The network is trained to minimize the mean squared error (MSE) between its predicted Q-values and the target Q-values.
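A small fully connected network is enough for this task. The sketch below assumes PyTorch; the class name QNetwork, the two hidden layers, and the 128 hidden units are illustrative choices rather than the exact architecture used in this project.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an 8-dimensional state to one Q-value per action (4 actions)."""

    def __init__(self, state_dim=8, n_actions=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)
```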



5. Optimal Q-Function

\[ Q^*(s, a) = \mathbb{E} \left[ R(s, a) + \gamma \max_{a'} Q^*(s', a') \right] \]

Loss Function

\[ \mathcal{L}(\theta) = \mathbb{E} \left[ \left( y - Q_{\theta}(s, a) \right)^2 \right], \qquad y = r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') \]

where θ⁻ are the parameters of a periodically updated target network.
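Assuming the QNetwork sketched above, a frozen target-network copy, and a batch of transitions sampled from a replay buffer (all illustrative assumptions), the loss can be computed roughly as follows:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s', done) transitions."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)
```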

6. Exploration vs Exploitation: The Epsilon-Greedy Policy

To balance exploration (trying new actions) and exploitation (using the best action known so far), DQN follows an ε-greedy policy: with probability ε the agent takes a random action, and otherwise it takes the action with the highest predicted Q-value. Because ε starts high and decays over training, the agent explores heavily at first and exploits its learned knowledge later. The agent starts with essentially random behaviour and learns through trial and error; rewards guide it toward soft landings, and over time the Q-network improves until the agent learns an effective landing strategy.
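A typical ε-greedy action selection with exponential decay is sketched below; the constants (eps_start, eps_end, eps_decay) are illustrative assumptions, not the exact schedule used here.

```python
import math
import random

import torch

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, eps_decay=10_000):
    """Anneal epsilon from eps_start toward eps_end as training progresses."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)

def select_action(q_net, state, epsilon, n_actions=4):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q_values.argmax(dim=1).item())
```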


DQN Model
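The pieces above combine into a training loop along the following lines. It reuses the QNetwork, dqn_loss, decayed_epsilon, and select_action sketches from earlier; the replay-buffer size, batch size, learning rate, and target-network sync interval are illustrative assumptions rather than this project's exact configuration.

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch

env = gym.make("LunarLander-v2")
q_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)

step = 0
for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        action = select_action(q_net, state, decayed_epsilon(step))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.append((state, action, reward, next_state, float(done)))
        state = next_state
        step += 1

        # Learn from a random minibatch once the buffer has enough transitions.
        if len(replay) >= 1_000:
            minibatch = random.sample(replay, 64)
            tensors = tuple(torch.as_tensor(np.array(x), dtype=torch.float32)
                            for x in zip(*minibatch))
            loss = dqn_loss(q_net, target_net, tensors)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Periodically copy the online network into the target network.
        if step % 1_000 == 0:
            target_net.load_state_dict(q_net.state_dict())
```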



Other models

  1. Vanilla DQN
  2. Monte Carlo
  3. Rainbow DQN

Vanilla DQN



Rainbow DQN (Landing With Least Impact)

Monte Carlo (Unstable)