Deep Q-Network (DQN) is an improvement over Q-Learning, a reinforcement learning (RL) algorithm used to train an agent to take optimal actions in an environment. Instead of storing values in a Q-table, DQN uses a neural network to approximate the Q-value of each state-action pair.
The action-value function Q^π(s, a) is the expected discounted return obtained by starting in state s, taking action a, and then following policy π:
\[ Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right] \]
Q-Learning estimates these values iteratively using the Bellman equation:
\[ Q(s, a) = r + \gamma \max_{a'} Q(s', a') \]
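For intuition, here is a minimal sketch of the tabular update behind this equation, written in the standard incremental form with a learning rate alpha. The environment size and hyperparameters are assumptions for illustration only, not values from this project.

```python
import numpy as np

# Hypothetical sizes for a small, discrete environment (illustrative only).
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

Q = np.zeros((n_states, n_actions))   # the Q-table: one value per (state, action)

def q_learning_update(s, a, r, s_next):
    """One tabular Q-Learning step: move Q(s, a) toward the Bellman target."""
    target = r + gamma * np.max(Q[s_next])    # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])     # incremental update toward the target
```

For large or continuous state spaces this table becomes impractical, which motivates the neural-network approximation below.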
3. Deep Q-Network (DQN) Approximation
Since storing and updating a Q-table explicitly is infeasible for large state spaces, we approximate Q(s, a) with a neural network Q_θ(s, a) parameterized by θ. The network is trained to minimize the Mean Squared Error (MSE) between its predicted Q-values and the target Q-values.
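As an illustration, such a network can be a small feed-forward model that maps a state vector to one Q-value per action. The layer sizes and two hidden layers in this sketch are assumptions for the example, not the exact architecture used later in this post.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action: Q_theta(s, ·)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```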
4. Optimal Q-Function
\[ Q^*(s, a) = \mathbb{E} \left[ R(s, a) + \gamma \max_{a'} Q^*(s', a') \right] \]
Loss Function
\[ \mathcal{L}(\theta) = \mathbb{E} \left[ \left( y - Q_{\theta}(s, a) \right)^2 \right] \]
where y = r + γ max_{a'} Q_θ(s', a') is the target Q-value given by the Bellman equation above.
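Concretely, the loss can be computed from a batch of transitions (s, a, r, s', done) as in the sketch below, which reuses the QNetwork from the earlier example. For simplicity it bootstraps the target with the same network; many DQN implementations use a separate target network here, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """MSE between predicted Q(s, a) and the Bellman target y.

    actions is a LongTensor of shape (batch,); dones is a float tensor (0. or 1.).
    """
    # Q_theta(s, a) for the actions actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y = r + gamma * max_a' Q(s', a'), with no bootstrapping on terminal states
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1 - dones)

    return F.mse_loss(q_pred, target)
```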
5. Exploration vs Exploitation: The Epsilon-Greedy Policy
To balance exploration (trying new actions) and exploitation (using the best action known so far), DQN follows an ε-greedy policy. This ensures that the agent explores more initially and exploits its learned knowledge later: it starts with mostly random actions and learns through trial and error, with rewards guiding it toward soft landings. Over time, the Q-network improves and the agent learns an optimal landing strategy.
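A typical ε-greedy action selection with a decaying ε might look like the sketch below. The decay schedule (start, end, and decay rate) is an assumption for illustration, not the exact values used in the training code.

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    """With probability epsilon take a random action; otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                  # explore
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q_values.argmax(dim=1).item())           # exploit

# Illustrative decay: explore heavily at first, exploit more as training progresses.
eps_start, eps_end, eps_decay = 1.0, 0.01, 0.995
epsilon = eps_start
# After each episode: epsilon = max(eps_end, epsilon * eps_decay)
```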
DQN - MODEL