10.8  Beam Search

In Section 10.7, we introduced the encoder–decoder architecture and the standard techniques for training such models end-to-end. However, when it came to test-time prediction, we mentioned only the greedy strategy, where at each time step we select the token with the highest predicted probability of coming next, until we predict the special end-of-sequence “<eos>” token. In this section, we will begin by formalizing this greedy search strategy and identifying some problems that practitioners tend to run into. We then compare this strategy with two alternatives: exhaustive search (illustrative but not practical) and beam search (the standard method in practice).
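
The greedy strategy can be sketched as follows. Here `next_token_probs` is a hypothetical stand-in for a trained decoder: given the tokens generated so far (and, implicitly, the encoder's context variable), it returns a probability distribution over the vocabulary. The names `toy_model`, `eos_id`, and `max_len` are illustrative, not part of any library API.

```python
def greedy_search(next_token_probs, eos_id, max_len):
    """Greedy decoding: at each step, append the single most probable
    next token, stopping once the <eos> token is emitted."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        # Index of the highest-probability token in the vocabulary.
        token = max(range(len(probs)), key=probs.__getitem__)
        prefix.append(token)
        if token == eos_id:
            break
    return prefix

# Toy decoder over a 3-token vocabulary (token 0 is <eos>): it favors
# token 1 for two steps, then favors <eos>.
def toy_model(prefix):
    return [0.1, 0.8, 0.1] if len(prefix) < 2 else [0.9, 0.05, 0.05]

greedy_search(toy_model, eos_id=0, max_len=10)  # → [1, 1, 0]
```

Note that greedy search is fast (one forward pass per output token) but maximizes each conditional probability in isolation, which need not maximize the probability of the full sequence.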

Let’s begin by setting up our mathematical notation, borrowing conventions from Section 10.7. At any time step \(t'\), the decoder outputs predictions representing the probability of each token in the vocabulary coming next in the sequence (the likely value of \(y_{t'+1}\)), conditioned on the previous tokens \(y_1, \ldots, y_{t'}\) and the context variable \(\mathbf{c}\), produced by the encoder to represent the input sequence. To quantify computational cost, denote by \(\mathcal{Y}\) the output vocabulary (including the special end-of-sequence token “<eos>”). Let’s also specify the maximum number of tokens in an output sequence as \(T'\). Our goal is to search for an ideal output among all \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) possible output sequences. Note that this slightly overestimates the number of distinct outputs, because no subsequent tokens follow the “<eos>” token. However, for our purposes, this number roughly captures the size of the search space.
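
To see why exhaustive enumeration is hopeless, it helps to plug in plausible numbers. The values below are illustrative assumptions, not taken from Section 10.7: a subword vocabulary of 10,000 tokens and a maximum output length of 10 already yield an astronomically large search space.

```python
vocab_size = 10_000  # |Y|: an assumed, typical subword vocabulary size
max_len = 10         # T': an assumed maximum output length

# Upper bound on the number of candidate output sequences, |Y|^{T'}.
print(format(vocab_size ** max_len, ".2e"))  # → 1.00e+40
```

Even at a billion sequence evaluations per second, scanning \(10^{40}\) candidates would take vastly longer than the age of the universe, which is why we need search heuristics.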

10.8.4 Summary

Sequence searching strategies include greedy search, exhaustive search, and beam search. Beam search provides a trade-off between accuracy and computational cost via the flexible choice of the beam size.
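
The trade-off the summary describes can be made concrete with a minimal beam search sketch. As with the assumptions above, `next_token_probs` is a hypothetical decoder callback returning a distribution over the vocabulary given a prefix; candidates are scored by summed log-probabilities. Practical implementations additionally normalize scores by a power of the sequence length to avoid penalizing longer outputs, which this sketch omits for brevity.

```python
import math

def beam_search(next_token_probs, eos_id, max_len, beam_size):
    """Keep the `beam_size` highest-scoring prefixes at each step;
    with beam_size=1 this reduces to greedy search."""
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in enumerate(next_token_probs(prefix)):
                candidates.append((prefix + [token], score + math.log(p)))
        # Retain only the top-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Prefixes ending in <eos> are complete; set them aside.
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # include any unfinished prefixes
    return max(finished, key=lambda c: c[1])[0]

# Same toy decoder as before: token 0 is <eos>.
def toy_model(prefix):
    return [0.1, 0.8, 0.1] if len(prefix) < 2 else [0.9, 0.05, 0.05]

beam_search(toy_model, eos_id=0, max_len=10, beam_size=2)  # → [1, 1, 0]
```

Each step evaluates at most `beam_size` prefixes, so the cost is \(\mathcal{O}(k\left|\mathcal{Y}\right|T')\) for beam size \(k\), interpolating between greedy search (\(k=1\)) and exhaustive search (\(k=\left|\mathcal{Y}\right|^{T'}\)).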

10.8.5 Exercises

  1. Can we treat exhaustive search as a special type of beam search? Why or why not?
  2. Apply beam search in the machine translation problem in Section 10.7. How does the beam size affect the translation results and the prediction speed?
  3. We used language modeling for generating text following user-provided prefixes in Section 9.5. Which kind of search strategy does it use? Can you improve it?
