How the AI works
The Pong agents are trained with DQN (Deep Q-Network) and self-play. Two agents compete; each one trains against a frozen checkpoint of its opponent, so the difficulty scales up automatically as both improve.
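The frozen-checkpoint idea can be sketched in a few lines. This is an illustration only, not the project's training code: `Agent`, `train_step`, `self_play`, and `sync_every` are hypothetical names, and the "gradient update" is a placeholder that just nudges the weights.

```python
import copy


class Agent:
    """Stand-in for a DQN agent; `weights` is a placeholder for network parameters."""

    def __init__(self, weights):
        self.weights = list(weights)

    def train_step(self, opponent):
        # Placeholder for a real update: a real agent would play an episode
        # against `opponent` and take a gradient step on its TD loss.
        self.weights = [w + 0.1 for w in self.weights]


def self_play(left, right, steps, sync_every):
    """Train both agents, each against a frozen snapshot of the other."""
    frozen_left = copy.deepcopy(left)
    frozen_right = copy.deepcopy(right)
    for step in range(1, steps + 1):
        left.train_step(opponent=frozen_right)
        right.train_step(opponent=frozen_left)
        if step % sync_every == 0:
            # Refresh the snapshots so each side now faces a stronger opponent.
            frozen_left = copy.deepcopy(left)
            frozen_right = copy.deepcopy(right)
    return frozen_left, frozen_right
```

Between refreshes the opponent does not change, which gives each learner a stationary target; how often the snapshots are refreshed (`sync_every` here) is a tuning choice the text above does not specify.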
State, actions, reward
- State: paddle and ball positions and the ball's velocity.
- Actions: move the paddle up, down, or stay.
- Reward: +1 for scoring, -1 for conceding.
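One possible encoding of the state, actions, and reward listed above is sketched below. The field names, the inclusion of the opponent's paddle, and the reward of 0 on steps where nobody scores are assumptions made for illustration; only the three actions and the +1/-1 rewards come from the description.

```python
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    UP = 0
    DOWN = 1
    STAY = 2


@dataclass
class State:
    paddle_y: float      # own paddle position
    opponent_y: float    # opponent paddle position (assumed to be observed)
    ball_x: float
    ball_y: float
    ball_vx: float       # ball velocity components
    ball_vy: float

    def as_vector(self):
        """Flatten the state into the list a Q-network would take as input."""
        return [self.paddle_y, self.opponent_y,
                self.ball_x, self.ball_y,
                self.ball_vx, self.ball_vy]


def reward(scored: bool, conceded: bool) -> float:
    """+1 for scoring, -1 for conceding; 0 otherwise (assumed)."""
    if scored:
        return 1.0
    if conceded:
        return -1.0
    return 0.0
```

Including the ball's velocity in the state lets the network infer where the ball is heading from a single observation instead of needing a stack of frames.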
Why self-play matters
Against a fixed opponent, an agent can overfit to that opponent's particular habits and weaknesses. Self-play instead creates an ever-improving curriculum: as one side gets better, the other must too, pushing both toward strong, general play.
Staying sharp
The networks are bootstrapped with an analytic intercept policy (move the paddle toward the ball's predicted landing point) and stay anchored to it during training. The anchor acts as a regularizer: the networks can fine-tune beyond the analytic policy without the catastrophic forgetting that can make self-play agents collapse over time.
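A rough sketch of both pieces follows, under stated assumptions. The intercept prediction assumes a constant-velocity ball that reflects perfectly off the top and bottom walls, with y increasing upward. The exact form of the anchor is not given here, so the penalty below (cross-entropy between a softmax over the Q-values and the analytic policy's action, scaled by a hypothetical `anchor_weight`) is one plausible choice, not necessarily the one the project uses. All function names are illustrative.

```python
import math


def predict_intercept_y(ball_x, ball_y, ball_vx, ball_vy, paddle_x, field_height):
    """Predict the ball's y when it reaches paddle_x, or None if it is moving away."""
    if ball_vx == 0:
        return None
    t = (paddle_x - ball_x) / ball_vx
    if t <= 0:
        return None
    # Let the ball travel as if there were no walls, then fold the result
    # back into [0, field_height] to account for the reflections.
    unfolded = ball_y + ball_vy * t
    period = 2 * field_height
    folded = unfolded % period
    return period - folded if folded > field_height else folded


def intercept_action(paddle_y, target_y, deadband=2.0):
    """Analytic policy: step toward the predicted landing point."""
    if target_y is None or abs(target_y - paddle_y) <= deadband:
        return "stay"
    return "up" if target_y > paddle_y else "down"


def anchor_penalty(q_values, analytic_action):
    """Cross-entropy between softmax(q_values) and the analytic action (assumed form)."""
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    return -math.log(exps[analytic_action] / sum(exps))


def anchored_loss(td_loss, q_values, analytic_action, anchor_weight=0.1):
    """Usual TD loss plus a pull toward the analytic policy's choice."""
    return td_loss + anchor_weight * anchor_penalty(q_values, analytic_action)
```

With a small `anchor_weight`, the TD term dominates wherever the network has found something better, while the penalty keeps the policy from drifting far from sensible intercept behavior in states it rarely visits.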
What you see on screen
You watch two learned policies rally against each other. No hand-coded paddle AI is in control during play, just two networks that refined their game through self-play and keep their edge.