\documentclass[11pt]{article}
\usepackage{hyperref}
\usepackage[margin=0.75in]{geometry}
\usepackage{amsmath}
\newcommand{\bos}{\textless s\textgreater\:}
\newcommand{\eos}{\textless /s\textgreater\:}
\newcommand{\pad}{PAD\:}
\begin{document}
\title{LING 575K HW7}
\date{\vspace{-0.2in}Due 11PM on May 19, 2022}
\maketitle
\noindent In this assignment, you will
\begin{itemize}
\item Develop an understanding of recurrent neural networks, especially as used for language modeling
\item Implement components of data processing
\item Implement masking of losses for an RNN language model
\end{itemize}
All files referenced herein may be found in \texttt{/dropbox/21-22/575k/hw7/} on patas.
\section{Recurrent Neural Network Decoders/Taggers [35 pts]}
\noindent {\bf Q1: Understanding Masking [15 pts]} Suppose that we want to train a (word-level) language model on the following two sentences:
\begin{center}
\bos the cat sits \eos \\
\bos the model reads the sentence \eos
\end{center}
We saw in HW6 that padding is needed to bring these sentences to the same length so that they can be batched together:
\begin{center}
\bos the cat sits \eos \pad \pad \\
\bos the model reads the sentence \eos
\end{center}
Please answer the following questions about these sequences:
\begin{itemize}
\item In a recurrent language model, what would the input batch be? What would the target labels be? \hfill [4 pts]
\item Recurrent language models use a \emph{mask} of ones and zeros to `eliminate' the loss for \pad tokens. What would the mask be for this batch? \hfill [3 pts]
\item Suppose that we have the following per-token losses:
\[ \begin{bmatrix} 0.1 & 0.3 & 0.2 & 0.4 & 0.7 & 0.5 \\ 0.2 & 0.6 & 0.1 & 0.8 & 0.9 & 0.4 \end{bmatrix} \]
What is the \emph{masked} loss matrix? \hfill [3 pts]
\item Why is it important to mask losses in this way? What might a model learn to do if the loss is not masked? \hfill [5 pts]
\end{itemize}
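\noindent As a generic illustration of the masking operation (a sketch on a made-up single sequence, not the batch above; the symbols $\ell$, $m$, and $\odot$ are notation assumed here rather than taken from the assignment), suppose one sequence has three per-token losses $\ell$ and its last position is padding. Masking is then an element-wise product:
\[ \ell \odot m = \begin{bmatrix} 0.5 & 0.2 & 0.9 \end{bmatrix} \odot \begin{bmatrix} 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0.5 & 0.2 & 0 \end{bmatrix} \]
One common choice (again an assumption, not necessarily what the provided code does) is to average only over the unmasked positions, which here gives $(0.5 + 0.2)/2 = 0.35$.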
\vspace{2em}
\noindent {\bf Q2: Evaluating Language Models [20 pts]} Given a corpus $W = w_1 w_2 \dots w_N$ (so $N$ is the number of tokens in the corpus), a common (intrinsic) evaluation metric for language models is \emph{perplexity}, defined as
\[ PP(W) = P(w_1 \dots w_N)^{-\frac{1}{N}} \]
Intuitively, this is the inverse of the probability that the model assigns to the corpus, normalized for corpus length by taking the $N$-th root.
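For a concrete feel for the scale (a toy calculation with made-up numbers, not drawn from any model in this assignment): if a model assigns probability $\frac{1}{16}$ to a corpus of $N = 4$ tokens, then
\[ PP(W) = \left(\tfrac{1}{16}\right)^{-\frac{1}{4}} = 16^{\frac{1}{4}} = 2 \]
which is the same perplexity as a model that assigns probability $\frac{1}{2}$ to each of the four tokens.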
\begin{itemize}
\item Is a lower or higher perplexity better? \hfill [2 pts]
\item For a recurrent language model, write an expression for $P(w_1 \dots w_N)$ using the chain rule of probability. How is this different from the expression for a feed-forward language model? \hfill [5 pts]
\item Show that
\[ PP(W) = e^{-\frac{1}{N} \sum_{i=1}^N \log P(w_i \mid w_{