3 books on Q-learning [PDF]

Updated: Mar 31, 2026 | Author: Maria Lin

Q-learning is a reinforcement learning method that allows you to train an agent to play a game in real time, without waiting for the game to end.

Imagine, an agent is playing Pong - moving a paddle up and down, hitting a ball. The agent's neural network receives ball's coordinates, its velocity, rival coordinate and its own coordinate as input. And it has three prediction-outputs: Up, Stop and Down.

The essence of Q-learning is that we introduce rewards for certain actions/achievements of the agent (for example, +10 for hitting a ball and -10 for missing it). We train the agent's neural network so that at each output it produces a prediction - how many points (Q) it will earn by the end of the game if it chooses a given action. After making the prediction, the agent undertakes the action that promises most points.

It's important to understand that the Q value at a given step is a prediction of how many points will be earned from the current step until the end of the game. Everything that happened before this (including the reward just received) is forgotten.

Imagine, that at a given step the agent predicts the following Q-rewards: Up=100, Stop=45, Down=15. Okay, of course, we choose the Up step. And let's say we see that if the agent steps Up, it will hit the ball and earn +10 points. Then we (in mind) step Up and find the Q predictions for the next step. Let's say they are Up=76, Stop=98, Down=16. So, Stop=98 wins. Then we tell the agent: "Look, you're currently estimating the total reward Q = 100. But at the next step, you'll get +10 and you'll estimate the remaining reward at 98 (and then you'll have a better view). So, right now, it would be more correct to estimate the remaining reward as Q = 98 + 10 = 108. So, your error now is 108 - 100 = 8."

And we backpropagate the error along the output that produced 100.

Next time in a similar situation, the agent will estimate Q closer to 108 and at the next step - 98, so the error will be almost zero.

Now let's imagine what happens when the agent doesn't hit the ball. Let's say at the pre-last step, the agent predicts Up = 23, Stop = 12, Down = 42. So, Down = 42 wins. We move down. Lose. Penalty -10. We tell the agent: "You're currently predicting that you'll earn 42 points by the end of the game. But at the next step, you'll firstly get a penalty of -10 and secondly, you'll definitely earn nothing further, meaning at the next step, Q = 0." So, it would be more correct for you to now estimate the remaining reward as Q = -10. So, your error is -10 - 42 = -52."

And we backpropagate the error to the output that produced 42.

Here are some good books about Q-learning:

Multi-Agent Reinforcement Learning: Foundations and Modern Approaches
2024 by Stefano V. Albrecht, Filippos Christianos, Lukas Schefer

Download PDF

Computational Theories of Interaction and Agency
1996 by Philip Agre, Stanley J. Rosenschein

Download PDF

Deep Reinforcement Learning: Frontiers of Artificial Intelligence
2019 by Mohit Sewak

Download PDF

See also: Top 10 eBook Organizers

How to download PDF:

1. Install Gooreader

2. Enter Book ID to the search box and press Enter

3. Click "Download Book" icon and select PDF*

* - note that for yellow books only preview pages are downloaded

Author: Maria Lin

Maria Lin, is a seasoned content writer who has contributed to numerous tech portals, including Mashable and bookrunch, as a guest author. She holds a Master's degree in Journalism from the University of California, where her research predominantly concentrated on mobile apps, software, AI and cloud services. With a deep passion for reading, Maria is particularly drawn to the intersection of technology and books, making book tech a subject of great interest to her. During her leisure time, she indulges in her love for cooking and finds solace in a good night's sleep. You can contact Maria Lin via email maria@bookrunch.com