Solving NetHack's Mastermind

Note: This article was originally written in 2006. It is published here for posterity.

NetHack's Mastermind is very similar to the real Mastermind, with three changes:

Instead of five colors, there are seven notes (A, B, C, D, E, F, and G).
There are five positions, not four.
You can give partial solutions (such as AB even though the solution is five notes long).

A gear represents a note in the correct position. A tumbler represents a note in an incorrect position. So, for example, if you're trying to find the tune FFAAB and you guess AFAFB, you'll get three gears and two tumblers. If you guess CA you'll get one tumbler. If you guess BBBBB you'll get one gear. If you guess DDDDD you'll get silence (in other words, no gears and no tumblers).

This document describes my attempts at optimizing a NetHack Mastermind solver (with help from Sean Kelly and papers from Don Knuth and Barteld Kooi). The first, naïve algorithm finds the solution on average within 15 turns, possibly taking up to 21. The last, most effective algorithm finds the solution on average within 4 turns, possibly taking up to only 6. This is an important result because in NetHack, the level where you play Mastermind is one of the most dangerous, so you have a strong incentive (your own survival) to find the solution as quickly as possible.

Algorithm: Naïve 1

The first tune we always test is A. If that's a gear, then we're done with the first position. We don't know if any more As are in the tune, so our next stab is AA. Back to the first play though: if A produces a tumbler, then we skip it for now and move on to B. If we get silence, then we know A is not in the tune at all, so we remove it from any further consideration. Repeat this until we have found the tune.

This basic algorithm is straightforward and works well enough. This is roughly what I do when I am playing NetHack, since it can be done without assistance.

So let's use this algorithm to try to find a particular tune:

A: one tumbler (we know there's an A but not in the first position, so skip A for a while)
B: one tumbler (ditto. skip B for a while)
C: one gear (great! first position is C. Try C again)
CC: one gear (now we know there are no more Cs in this tune)
CD: two gears (second position is D. try D again)
CDD: two gears (now we know there are no more Ds in this tune)
CDE: two gears (now we know there are no Es in this tune)
CDF: three gears (third position is F. try F again)
CDFF: three gears (so we know there are no more Fs in this tune)
CDFG: three gears (so we know there are no Gs in this tune)
CDFA: three gears, one tumbler (we know there's an A but not in the fourth position, so skip A for a while)
CDFB: four gears (fourth position is B. try B again)
CDFBB: four gears (so we know there are no more Bs in this tune)
CDFBA: five gears (nailed it!)

This first algorithm looks like this in pseudocode:

  notes = A, B, C, D, E, F, G
  tune  = ""

  while tune.length < 5
    try tune + notes[0]

    if tumblers
      # move this note to the end of the note list
      # we're guaranteed to see the correct note before we see this one again
      note = notes.delete_at[0]
      notes.push(note)
    elsif gears > tune.length
      # correct note
      tune += notes[0]
    else
      notes.delete_at[0]
  end

Here are its results:

Average turns per tune: 14.69 (246880 turns to solve all tunes / 16807 tunes)

Algorithm: Naïve 2

The first optimization we can make is that, if at any time we have exactly one possible note left, the rest of the tune most consist solely of that note.

  notes = A, B, C, D, E, F, G
  tune  = ""

  while tune.length < 5
    if notes.size == 1
      tune += notes[0] while tune.length < 5
      last
    end

    try tune + notes[0]

    if any tumblers
      # move this note to the end of the note list
      # we're guaranteed to see the correct note before we see this one again
      notes.push(notes.delete_at[0])
    elsif gears > tune.length
      # correct note
      tune += notes[0]
    else
      notes.delete_at[0]
  end

Average turns per tune: 13.55 (227729 turns / 16807 tunes)
Average turn savings: 1.14

Algorithm: Naïve 3

The next savings (which will be minor compared to the previous one) is if we've seen five distinct notes, we can rule out any notes not among those five:

  notes = A, B, C, D, E, F, G
  tune  = ""
  seen  = nil

  while tune.length < 5
    if seen.size == 5
      foreach element in notes
        delete it if it's not in seen
      end
    end

    if notes.size == 1
      tune += notes[0] while tune.length < 5
      last
    end

    try tune + notes[0]

    if any tumblers
      seen.add(notes[0]) unless seen.contains(notes[0])
      # move this note to the end of the note list
      # we're guaranteed to see the correct note before we see this one again
      notes.push(notes.delete_at[0])
    elsif gears > tune.length
      # correct note
      tune += notes[0]
      seen.add(notes[0]) unless seen.contains(notes[0])
    else
      notes.delete_at[0]
  end

Average turns per tune: 13.50 (226896 turns / 16807 tunes)
Average turn savings: 0.05

Algorithm: Naïve 4

So much for the obvious optimizations. Let's try tweaking some details of the algorithm a bit just to see if it helps or hurts.

You know how when we get a tumbler we move the first note to the last position? Let's try doing that when we get a gear, too. Turns out this saves us almost .25 turns.

  notes = A, B, C, D, E, F, G
  tune  = ""
  seen  = nil

  while tune.length < 5
    if seen.size == 5
      foreach element in notes
        delete it if it's not in seen
      end
    end

    if notes.size == 1
      tune += notes[0] while tune.length < 5
      last
    end

    try tune + notes[0]

    if any tumblers
      seen.add(notes[0]) unless seen.contains(notes[0])
      # move this note to the end of the note list
      # we're guaranteed to see the correct note before we see this one again
      notes.push(notes.delete_at[0])
    elsif gears > tune.length
      # correct note
      tune += notes[0]
      seen.add(notes[0]) unless seen.contains(notes[0])
      notes.push(notes.delete_at[0])
    else
      notes.delete_at[0]
  end

Average turns per tune: 13.27 (223060 turns / 16807 tunes)
Average turn savings: 0.23

Algorithm: Knuth 1

Due to arcanehl's constant (yet ever friendly) prodding, I picked this code up again. He started implementing Knuth's Mastermind algorithm, which, while slower, is much more effective. The average turns to solve a tune goes down from 13 to just 5!

It's a radical change from the algorithm used above, so allow me to explain. The basic idea is after a guess, we eliminate any possibilities that would not produce the same score. For example, if we try AAABB and get 2 gears, 0 tumblers, then we can eliminate any possibility that does not produce 2 gears and 0 tumblers against AAABB (such as CCCCC which would produce 0 gears and 0 tumblers). The initial guess is hardcoded to be AAABB though it could be something entirely different. The next guess is (currently) taken arbitrarily; we use the first possibility.

I changed from Perl to C during testing for speed: arcanehl's Ruby implementation of this algorithm takes hours, mine takes less than twenty-five seconds. Of course, these are the results of timing every possible tune (there are 16807 of them). Going through the algorithm for just one tune will be fast enough at a hundredth the speed.

    possibilities = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AAABB"
    else
        guess = possibilities[0]
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 5.5 (93103 turns / 16807 tunes)
Average turn savings: 7.7

Algorithm: Knuth 2

About the only possible improvement we can make is to generate a better guess than always using the first possible tune. I figured the median tune (or close enough to it) would be better than the first, and I was right, saving over half a turn.

    possibilities = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AAABB"
    else
        guess = possibilities[possibilities.size/2]
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.92 (82643 turns / 16807 tunes)
Average turn savings: 0.62

Algorithm: Knuth 3

Mysteriously, we find that the best initial tune is AABBC, not AAABB. (that's called foreshadowing)

This is the algorithm that is used in Rodney3. To save memory, all he remembers for each player is the results of each tune. Naturally there's a fair amount of reprocessing, but a 16807-element array would be kind of large to save for each person (and I'm afraid of Perl bitfields).

    possibilities = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
        guess = possibilities[possibilities.size/2]
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.85 (81446 turns / 16807 tunes)
Average turn savings: 0.07

Algorithm: Knuth 4

Enough goofing around. Time to implement the rest of Knuth's algorithm: at each step, guess the possibility that would eliminate the most possibilities in the worst case. This means that no matter what the response is for the guess, we'll eliminate a maximum number of possibilities. The way I implement it has time complexity O(n^2) where n is the number of tunes left; I don't know if there's a more efficient solution. For each guess i we score each possible tune j. The worst case for i is equal to the largest number of remaining tunes after playing it (given all possible responses, such as three tumblers and one gear). Then we find the smallest worst case (ie, smallest remaining number of tunes) over all i and we play it!

This takes hours even in my somewhat tuned C implementation. Using the median element as above may be more appropriate if you're playing all 16807 tunes, since this extra processing is probably not worth the turn savings. Otherwise, if you're just finding one tune as in an actual game of NetHack, you'll of course want to use the best algorithm available.

Note: we found the best first tune (AABBC) by running this code without the hardcoded first guess. The code obviously produces the same results without the hardcoded first guess, but there's no reason to do all that processing (which is a lot for all 16807 tunes) when we know the first best guess is always AABBC.

    possibilities = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
          best_worst_case = possibilities.size + 100
        foreach i in possibilities
        remaining = []

        foreach j in possibilities
            score = score(j, i)
            remaining[score] += 1
        end

        worst_case = remaining.max
        if worst_case < best_worst_case
            best_worst_case = worst_case
            guess = i
        end
        end
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.61 (77422 turns / 16807 tunes)
Average turn savings: 0.24

Algorithm: Breadth 1

I was talking in #nethack and someone brought up Mastermind. So that revived this whole thing. I was idly searching on the arXiv for Mastermind papers, and only found the one that proves Mastermind is NP-complete, which while good to know, doesn't directly help me improve my solving algorithms. However, I then remembered Google Scholar and searched for Knuth's paper. I couldn't find it, but it did return a few dozen papers that referenced Knuth's; including one by a Mr. Barteld Kooi. The paper, "Yet Another Mastermind Strategy," is available here. The gist of the algorithm is that instead of taking the best worst case scenario (as we do in Knuth's algorithm), take the tune that would produce the largest number of unique responses. So a tune that can return six unique responses is valued over a tune that can only return three unique responses. In the author's words, "[In this strategy, o]ne only looks at the 'breadth' of a partition. On the other side of the spectrum one finds Knuth's worst case strategy, which only looks at the 'depth' of a partition." This next algorithm implements exactly that; looking at only the breadth.

Also, I've rerun the algorithm without the hardcoded first guess, but it guessed AABBC anyway, which is a convenient result.

    possibilities = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
        best = 0
        foreach i in possibilities
        responses = []

        foreach j in possibilities
            score = score(j, i)
            responses[score] += 1
        end

        cnt = 0
        foreach response in responses
            if response > 0
            cnt += 1
            end
        end

        if cnt > best
            best = cnt
            guess = i
        end
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.5 (76038 turns / 16807 tunes)
Average turn savings: 0.08

Algorithm: Breadth 2

A consistent guess is one that has a chance of being correct; that is, it's consistent with the tunes we've played so far. When figuring out the next tune to play, we discard any inconsistent guesses. Well, let's see what happens when we keep them!

    possibilities = all_tunes = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
        best = 0
        foreach i in all_tunes
        responses = []

        foreach j in possibilities
            score = score(j, i)
            responses[score] += 1
        end

        cnt = 0
        foreach response in responses
            if response > 0
            cnt += 1
            end
        end

        if cnt > best
            best = cnt
            guess = i
        end
        end
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.4 (73803 turns / 16807 tunes)
Average turn savings: 0.13

Algorithm: Knuth 5

Let's apply the same change to Knuth's algorithm, just in case it ends up being better than Breadth 2. Turns out it isn't. The "average turn savings" below is in relation to the Knuth 4 algorithm, not Breadth 2.

    possibilities = all_tunes = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
        best_worst_case = possibilities.size + 100
        foreach i in all_tunes
        remaining = []

        foreach j in possibilities
            score = score(j, i)
            remaining[score] += 1
        end

        worst_case = remaining.max
        if worst_case < best_worst_case
            best_worst_case = worst_case
            guess = i
        end
        end
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.58 (76911 turns / 16807 tunes)
Average turn savings: 0.03

Algorithm: Breadth 3

Let's prefer to guess consistently over inconsistently when able. As of Breadth 2 we take the first tune that produces the maximum partition set size, even if that guess is inconsistent. If two tunes produce the same (maximum) partition set size, and one is consistent and the other is not, we should definitely pick the consistent one, because we might stumble upon the answer. One possible area of improvement is listing every tune that produces the maximum partition set size and picking the median one. Also, say an inconsistent tune produces a partition set size of 9 while a consistent guess produces a partition set size of 8. Should we play the consistent one?

    possibilities = all_tunes = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
        best = 0
        best_is_consistent = false

        foreach i in all_tunes
        responses = []

        foreach j in possibilities
            score = score(j, i)
            responses[score] += 1
        end

        cnt = 0
        foreach response in responses
            if response > 0
            cnt += 1
            end
        end

        if cnt > best or (!best_is_consistent and i_is_consistent and cnt == best)
            best = cnt
            guess = i
            best_is_consistent = i_is_consistent
        end
        end
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average turns per tune: 4.38 (73622 turns / 16807 tunes)
Average turn savings: 0.01

Algorithm: Knuth 6

Let's again apply the same change to Knuth's algorithm, just on the off-chance it ends up being better than Breadth 3. Turns out it isn't. The "average turn savings" below is in relation to the Knuth 5 algorithm, not Breadth 3.

    possibilities = all_tunes = ... # each possible tune
    while possibilities.size > 1
    if possibilities.size == 16807  # first guess
        guess = "AABBC"
    else
        best_worst_case = possibilities.size + 100
        best_is_consistent = false

        foreach i in all_tunes
        remaining = []

        foreach j in possibilities
            score = score(j, i)
            remaining[score] += 1
        end

        worst_case = remaining.max
        if worst_case < best_worst_case or (!best_is_consistent and i_is_consistent and worst_case == best_worst_case)
            best_worst_case = worst_case
            guess = i
            best_is_consistent = i_is_consistent
        end
        end
    end

    real_score = try guess

    possibilities.delete_if {|possibility| score(possibility, guess) != real_score}
    end

Average tunes per turn: 4.51 (75864 turns / 16807 tunes)
Average turn savings: 0.06

Results

Here are the exact results of each algorithm.

Naïve 1: Guess notes in turn, remove note if silence, keep track of tune so far
Naïve 2: If one note left, fill rest of tune with it
Naïve 3: If five notes seen, eliminate others
Naïve 4: Rotate notes on gear
Knuth 1: Eliminate impossible tunes, guess first possibility
Knuth 2: Guess median possibility, not first
Knuth 3: Start with AABBC not AAABB
Knuth 4: Greedily guess best possibility
Knuth 5: Allow inconsistent guesses
Knuth 6: Prefer consistent guesses
Breadth 1: Guess tune that would give the most unique answers
Breadth 2: Allow inconsistent guesses
Breadth 3: Prefer consistent guesses

Naïve

turns	Naïve 1	Naïve 2	Naïve 3	Naïve 4
1
2
3
4
5	1	1	1	1
6	5	6	6	7
7	15	21	21	28
8	35	77	77	85
9	70	238	242	275
10	126	721	750	825
11	210	1596	1632	1934
12	1519	2576	2617	2903
13	2709	3164	3196	3364
14	3395	3066	3076	2972
15	3241	2422	2405	2127
16	2527	1575	1539	1287
17	1610	840	799	638
18	840	364	331	261
19	364	119	100	85
20	119	21	15
21	21
total	246880	227729	226896	223060
average	14.69	13.55	13.50	13.28

Knuth

turns	Knuth 1	Knuth 2	Knuth 3	Knuth 4	Knuth 5	Knuth 6
1	1	1	1	1	1	1
2	19	17	29	29	17	19
3	260	420	463	491	387	423
4	1337	4053	4656	6149	6429	7324
5	6975	8935	8740	9534	9839	8980
6	5812	3208	2752	597	134	60
7	2052	171	166	6
8	336	2
9	13
10	2
total	93103	82643	81446	77422	76911	75864
average	5.54	4.92	4.85	4.61	4.58	4.51

Breadth

turns	Breadth 1	Breadth 2	Breadth 3
1	1	1	1
2	33	41	40
3	653	838	846
4	7215	8608	8770
5	8288	7140	6977
6	607	179	173
7	10
total	76038	73803	73622
average	4.52	4.39	4.38