Solving river crossing puzzles with reinforcement learning

Juna Salviati
6 min readJan 21, 2023


Photo by Peter Neumann on Unsplash

Let’s use machine learning to solve the famous “wolf, goat, cabbage” river crossing puzzle. Full source code:

People in every century liked to challenge themselves with puzzles: Alcuinus, a scholar, wrote a medieval Latin manuscript “Propositiones ad Acuendos Juvenes”, featuring a number of brain-teasers that we can describe as recreational mathematics problems. Probably the most famous among those puzzles is “the problem of the wolf, goat, and cabbage”.

This river crossing puzzle states the following: on a river bank there are a Wolf, a goat and a cabbage. A farmer should use a boat to get the three objects on the other side, caring about not leaving the wolf and the goat nor the cabbage and the goat alone.

This problem can be solved with a simple search but what about using reinforcement learning instead?

Reinforcement learning

Reinforcement learning is characterized by four main elements:

  • the policy
  • a reward signal
  • a value function
  • (optionally) a model

The policy is just a mapping from a state to a set of actions which can be performed at a given time; the reward is a prize (or a penalty) the environment gives to the agent because a particular action has been performed; the value function is an estimation of the reward in the long term; model allows to predict how the environment will behave.

In reinforcement learning problems, we try to maximize the reward in the long term and the agent can use the experience gained, mixed with exploration, to achieve this task.


The task the agent should achieve is then to learn the optimal policy for the environment, i.e. the optimal sequence of actions to perform to get from a starting state to a goal state:

With very very simple words, this formula just states that the optimal policy is the one which maximizes the immediate reward plus the reward in the long term, discounted by a gamma value.

Gamma here is a parameter with a value spanning between 0 and 1 introducing the ability of the algorithm to be far-sighted: if it’s small, then just the immediate reward is taken into account, if it’s “big” then also the following rewards will be taken into account.

Q-learning is used when the transition function for every action or reward is unknown; as an exercise and just to have fun, we can use it in simpler settings, for example, to calculate an optimal policy for the river crossing problem.

This algorithm relies on two matrices — Q and R — summarizing the computed value of the long term reward and the immediate reward for each (state, action) couple and can be described as follows:

  • initialize Q(s,a) to 0 for each (s,a)
  • from the current state:
    - select and perform an action
    - receive an immediate reward
    - observe the new state, s’
    - update Q(s,a) with the following formula (suitable when the outcome of the action is deterministic) and then go to s’:

At the very beginning, the Q matrix is initialized as “all-zero”. Looking at the formula, we observe that the value of Q for a certain (state, action) is tied to the immediate reward (that could also be 0) and the maximum Q for all the couples (state’, action), where state’ is the state resulting from performing the action a when in state s.

So Q is updated in a “backward” fashion.

Trying all the actions for all the states a lot of times (“episodes”) ensures Q to “propagate” from the goal state to the starting state. We can stop to compute Q when Q converges. Every episode terminates when the goal state is reached.

At the end, we take the action with the maximum Q value and we follow states until we arrive to the goal:

Exploration vs exploitation

With every episode, we can progressively rely on what we know by using the calculated Q and begin to explore less. To do this, we introduce an e-greedy (epsilon greedy) mode to make a selection among available actions.

While we explore the space state, we roll a dice to know whether to select the action with the maximum computed Q or to randomly select one, instead: with each and every episode, we decrease epsilon, so that we will choose to use what we know much more frequently than to randomly explore.

Exploring states leads to a better knowledge of rewards in the long term; exploiting what we already know maximizes current reward (even leading to a suboptimal behaviour).

Why did the goat cross the river?

To tackle river crossing puzzles, we can penalize “bad” moves — i.e. moves which could lead to a loss — and reward “good” moves — i.e. moves which could make us progress toward the goal.

So, for example, we can explicitly penalize the action of leaving the wolf and the goat unsupervised on the river bank; we also positively reward reaching the goal.

Class “WolfGoatCabbageQLearning” is initialized with a number of maximum episodes, epsilon, gamma and a list of actions.

Here, we also define whether we want to actually apply an epsilon-greedy selection or not.

Objects in the boat and on the banks are a set (we do not care about their order, they only have to be there), while banks and the boat themselves are a tuple (because their order is important):

We calculate next states for every action with the following function:

To compute Q, we put in a dictionary with a composite key the current state and the next state, making the action implicit:

Convergence here is calculated by checking values in Q for 20 iterations: if they do not change, Q converges.

At the end, we calculate the optimal policy by taking the step corresponding to the maximum Q value:

train() returns the solution steps, the average cumulative reward and epsilon for every episode.

The class can be used as showed in the following snippet:

and to obtain the solution:

*** SOLUTION ***
({'🐺', '🥦', '🐐', '⛵'}, set(), set())
({'🐺', '🥦'}, {'🐐', '⛵'}, set())
({'🐺', '🥦'}, set(), {'🐐', '⛵'})
({'🐺', '🥦'}, {'⛵'}, {'🐐'})
({'🐺', '🥦', '⛵'}, set(), {'🐐'})
({'🐺'}, {'🥦', '⛵'}, {'🐐'})
({'🐺'}, set(), {'🥦', '🐐', '⛵'})
({'🐺'}, {'🐐', '⛵'}, {'🥦'})
({'🐺', '🐐', '⛵'}, set(), {'🥦'})
({'🐐'}, {'🐺', '⛵'}, {'🥦'})
({'🐐'}, set(), {'🐺', '🥦', '⛵'})
({'🐐'}, {'⛵'}, {'🐺', '🥦'})
({'🐐', '⛵'}, set(), {'🐺', '🥦'})
(set(), {'🐐', '⛵'}, {'🐺', '🥦'})
(set(), set(), {'🐺', '🥦', '🐐', '⛵'})

Observations about discounted cumulative reward per episode

Since we have scores and epsilon for each and every episode, we can plot results on a spreadsheet:

When epsilon is big, next state selection is pretty random, so the algorithm hits very high penalties. Progressing with episodes, the algorithm heavily relies on “learned” Q values and the cumulative reward in the episode increases.

Turning e-greedy off, the plot shows rewards without a trend, pointing out the complete randomness in choosing the next state to explore (which can lead to loops, highly decreasing the reward):

Also, since exploration leads to a better Q approximation, algorithm converges faster.



Juna Salviati

Full-time Human-computer interpreter. Opinions are my own.