This is Part Two of the Exploring LSTMs Tutorial. If you haven’t read Part One on LSTM basics, head on over now. Don’t worry, this post will still be here when you get back.
Investigating LSTM Internals
Let’s dig a little deeper. We looked in the last section at examples of hidden states, but I wanted to play with LSTM cell states and their other memory mechanisms too. Do they fire when we expect, or are there surprising patterns?
To investigate, let’s start by teaching an LSTM to count. (Remember how the Java and Python LSTMs were able to generate proper indentation!) So I generated sequences of the form
(N “a” characters, followed by a delimiter X, followed by N “b” characters, where 1 <= N <= 10), and trained a single-layer LSTM with 10 hidden neurons.
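A minimal sketch of how training data of this shape could be generated. The function names, the seed, and the uniform sampling of N are assumptions for illustration, not the code actually used for the experiment.

```python
import random

def make_counting_example(n):
    """Return n a's, the delimiter X, then n b's."""
    return "a" * n + "X" + "b" * n

def make_counting_dataset(num_examples, max_n=10, seed=0):
    """Sample N uniformly from 1..max_n for each example (an assumption)."""
    rng = random.Random(seed)
    return [make_counting_example(rng.randint(1, max_n))
            for _ in range(num_examples)]

print(make_counting_example(3))
print(make_counting_dataset(5))
```

Testing generalization then just means calling `make_counting_example` with an N larger than `max_n`.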
As expected, the LSTM learns perfectly within its training range – and can even generalize a few steps beyond it. (Although it starts to fail once we try to get it to count to 19.)
Here it begins to fail: the model is given 19 “a”s, but outputs only 18 “b”s.
We expect to find a hidden state neuron that counts the number of a’s if we look at its internals. And we do:
I built a small web app to play around with LSTMs, and Neuron #2 seems to be counting both the number of a’s it’s seen and the number of b’s. (Remember that cells are shaded according to the neuron’s activation, from dark red [-1] to dark blue [+1].)
What about the cell state? It behaves similarly:
One interesting thing is that the working memory looks like a “sharpened” version of the long-term memory. Does this hold true in general?
It does. (This is exactly as we would expect, since the long-term memory gets squashed by the tanh activation function and the output gate limits what gets passed on.) For example, here is an overview of all 10 cell state nodes at once. We see plenty of light-colored cells, representing values close to 0.
In contrast, the 10 working memory neurons look much more focused. Neurons 1, 3, 5, and 7 are even zeroed out entirely over the first half of the sequence.
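The relationship between the two memories can be sketched with the standard LSTM output equation, h = o * tanh(c). The cell states and gate values below are made up for illustration and are not taken from the trained network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def working_memory(cell_state, output_gate_logit):
    """h = o * tanh(c): squash the long-term memory, then gate it."""
    return sigmoid(output_gate_logit) * math.tanh(cell_state)

# Hypothetical cell states and output-gate pre-activations for four neurons.
cell_states = [0.3, 2.5, -4.0, 7.0]
gate_logits = [-10.0, 6.0, 6.0, -10.0]

hidden = [working_memory(c, g) for c, g in zip(cell_states, gate_logits)]
print([round(h, 3) for h in hidden])
```

With an open gate, tanh pushes large cell values toward -1 or +1; with a closed gate, the hidden value is pushed to roughly zero no matter how large the cell state is. That is one way to read the "focused" look of the working memory.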
Let’s go back to Neuron #2. Here are the candidate memory and input gate. They’re relatively constant over each half of the sequence – as if the neuron is calculating a += 1 or b += 1 at each step.
Finally, here’s an overview of all of Neuron 2’s internals:
If you want to investigate the different counting neurons yourself, you can play around with the visualizer here.
(Note: this is far from the only way an LSTM can learn to count, and I’m anthropomorphizing quite a bit here. But I think viewing the network’s behavior is interesting and can help build better models – after all, many of the ideas in neural networks come from analogies to the human brain, and if we see unexpected behavior, we may be able to design more efficient learning mechanisms.)
Count von Count
Let’s look at a slightly more complicated counter. This time, I generated sequences of the form
(N a’s with X’s randomly sprinkled in, followed by a delimiter Y, followed by N b’s). The LSTM still has to count the number of a’s, but this time needs to ignore the X’s as well.
Here’s the full LSTM. We expect to see a counting neuron, but one where the input gate is zero whenever it sees an X. And we do!
Above is the cell state of Neuron 20. It increases until it hits the delimiter Y, and then decreases to the end of the sequence – just like it’s calculating a num_bs_left_to_print variable that increments on a’s and decrements on b’s.
If we look at its input gate, it is indeed ignoring the X’s:
Interestingly, though, the candidate memory fully activates on the irrelevant X’s – which shows why the input gate is needed. (Although, if the input gate weren’t part of the architecture, the network would presumably have learned to ignore the X’s some other way, at least for this simple example.)
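To make the role of the input gate concrete, here is a toy counting cell with hand-picked gate and candidate values (assumptions for illustration, not values read off the trained network). It applies the standard cell update c = f * c + i * g, with the candidate firing on X’s but the input gate closed on them.

```python
# Hand-picked, idealized values per input character (hypothetical).
INPUT_GATE = {"a": 1.0, "X": 0.0, "Y": 0.0, "b": 1.0}
CANDIDATE = {"a": 1.0, "X": 1.0, "Y": 0.0, "b": -1.0}
FORGET_GATE = 1.0  # always remember the running count

def run_selective_counter(sequence):
    """Return the cell state after each character of the sequence."""
    cell = 0.0
    trace = []
    for ch in sequence:
        cell = FORGET_GATE * cell + INPUT_GATE[ch] * CANDIDATE[ch]
        trace.append(cell)
    return trace

print(run_selective_counter("aXaaXYbbb"))
```

Setting `INPUT_GATE["X"]` to 1.0 would make the count drift upward on every X, which is the failure the gate prevents in this sketch.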
Let’s also look at Neuron 10.
This neuron is interesting as it only activates when reading the delimiter “Y” – and yet it still manages to encode the number of a’s seen so far in the sequence. (It may be hard to tell from the picture, but when reading Y’s belonging to sequences with the same number of a’s, all the cell states have values either identical or within 0.1% of each other. You can see that Y’s with fewer a’s are lighter than those with more.) Perhaps some other neuron sees Neuron 10 slacking and helps a buddy out.
Next, I wanted to look at how LSTMs remember state. I generated sequences of the form
(i.e., an “A” or “B”, followed by 1-10 x’s, then a delimiter “Y”, ending with a lowercase version of the initial character). This way the network needs to remember whether it’s in an “A” or “B” state.
We expect to find a neuron that fires when remembering that the sequence started with an “A”, and another neuron that fires when remembering that it started with a “B”. We do.
For example, here is an “A” neuron that activates when it reads an “A”, and remembers until it needs to generate the final character. Notice that the input gate ignores all the “x” characters in between.
Here is its “B” counterpart:
One interesting point is that even though knowledge of the A vs. B state isn’t needed until the network reads the “Y” delimiter, the hidden state fires throughout all the intermediate inputs anyways. This seems a bit “inefficient”, but perhaps it’s because the neurons are doing a bit of double-duty in counting the number of x’s as well.
Finally, let’s look at how an LSTM learns to copy information. (Recall that our Java LSTM was able to memorize and copy an Apache license.)
(Note: if you think about how LSTMs work, remembering lots of individual, detailed pieces of information isn’t something they’re very good at. For example, you may have noticed that one major flaw of the LSTM-generated code was that it often made use of undefined variables – the LSTMs couldn’t remember which variables were in scope. This isn’t surprising, since it’s hard to use single cells to efficiently encode multi-valued information like characters, and LSTMs don’t have a natural mechanism to chain adjacent memories to form words. Memory networks and neural Turing machines are two extensions to neural networks that help fix this, by augmenting with external memory components. So while copying isn’t something LSTMs do very efficiently, it’s fun to see how they try anyways.)
For this copy task, I trained a tiny 2-layer LSTM on sequences of the form
(i.e., a 3-character subsequence composed of a’s, b’s, and c’s, followed by a delimiter “X”, followed by the same subsequence).
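A small sketch of what the copy-task data could look like. Enumerating every possible subsequence is an assumption here; the original data may have been sampled instead, and the function name is hypothetical.

```python
import itertools

def make_copy_examples(alphabet="abc", length=3, delimiter="X"):
    """Build every sequence of the form <subsequence> X <subsequence>."""
    examples = []
    for chars in itertools.product(alphabet, repeat=length):
        sub = "".join(chars)
        examples.append(sub + delimiter + sub)
    return examples

copy_examples = make_copy_examples()
print(len(copy_examples), copy_examples[:3])
```

With three characters and length 3 there are only 27 distinct sequences, so the network has very little to memorize.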
I wasn’t sure what “copy neurons” would look like, so in order to find neurons that were memorizing parts of the initial subsequence, I looked at their hidden states when reading the delimiter X. Since the network needs to encode the initial subsequence, its states should exhibit different patterns depending on what they’re learning.
The graph below, for example, plots Neuron 5’s hidden state when reading the “X” delimiter. The neuron is clearly able to distinguish sequences beginning with a “c” from those that don’t.
For another example, here is Neuron 20’s hidden state when reading the “X”. It looks like it picks out sequences beginning with a “b”.
Interestingly, if we look at Neuron 20’s cell state, it almost seems to capture the entire 3-character subsequence by itself (no small feat given its one-dimensionality!):
Here are Neuron 20’s cell and hidden states, across the entire sequence. Notice that its hidden state is turned off over the entire initial subsequence (perhaps expected, since its memory only needs to be passively kept at that point).
However, if we look more closely, the neuron actually seems to be firing whenever the next character is a “b”. So rather than being a “the sequence started with a b” neuron, it appears to be a “the next character is a b” neuron.
As far as I can tell, this pattern holds across the network – all the neurons seem to be predicting the next character, rather than memorizing characters at specific positions. For example, Neuron 5 seems to be a “next character is a c” predictor.
I’m not sure if this is the default kind of behavior LSTMs learn when copying information, or what other copying mechanisms are available as well.
States and Gates
To really home in on the purpose of the different states and gates in an LSTM, let’s repeat the previous section with a small pivot.
Cell State and Hidden State (Memories)
We originally described the cell state as a long-term memory, and the hidden state as a way to pull out and focus these memories when needed.
So when a memory is currently irrelevant, we expect the hidden state to turn off – and that’s exactly what happens for this sequence copying neuron.
The forget gate discards information from the cell state (0 means to completely forget, 1 means to completely remember), so we expect it to fully activate when it needs to remember something exactly, and to turn off when information is never going to be needed again.
That’s what we see with this “A” memorizing neuron: the forget gate is firing on all cylinders to remember that it’s in an “A” state while it passes through the x’s, and turns off once it’s ready to generate the final “a”.
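The same idea as a toy cell with idealized gates. All values are hand-picked assumptions, including the choice to close the forget gate on the final lowercase character rather than on the “Y”.

```python
# (forget gate, input gate, candidate) per input character, all hypothetical.
GATES = {
    "A": (0.0, 1.0, 1.0),  # start fresh and store "we are in an A state"
    "B": (0.0, 0.0, 0.0),  # start fresh, store nothing in this neuron
    "x": (1.0, 0.0, 1.0),  # remember everything, save nothing new
    "Y": (1.0, 0.0, 0.0),  # still remembering: the answer is needed now
    "a": (0.0, 0.0, 0.0),  # done, the memory can be discarded
    "b": (0.0, 0.0, 0.0),
}

def run_a_memory(sequence):
    """Return the cell state of the toy "A" neuron after each character."""
    cell = 0.0
    trace = []
    for ch in sequence:
        forget, save, candidate = GATES[ch]
        cell = forget * cell + save * candidate
        trace.append(cell)
    return trace

print(run_a_memory("AxxxYa"))
print(run_a_memory("BxxYb"))
```

If the forget gate on “x” were lowered to, say, 0.5, the stored value would halve at every step and be nearly gone after ten x’s, which is why the gate has to sit close to 1 while the memory is still needed.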
Input Gate (Save Gate)
We described the job of the input gate (what I originally called the save gate) as deciding whether or not to save information from a new input. Thus, it should turn off at useless information.
And that’s what this selective counting neuron does: it counts the a’s and b’s, but ignores the irrelevant x’s.
Now let’s recap how you could have discovered LSTMs by yourself.
First, many of the problems we’d like to solve are sequential or temporal of some sort, so we should incorporate past learnings into our models. But we already know that the hidden layers of neural networks encode useful information, so why not use these hidden layers as the memories we pass from one time step to the next? And so we get RNNs.
But we know from our own behavior that we don’t keep track of knowledge willy-nilly; when we read a new article about politics, we don’t immediately believe whatever it tells us and incorporate it into our beliefs of the world. We selectively decide what information to save, what information to discard, and what pieces of information to use to make decisions the next time we read the news. Thus, we want to learn how to gather, update, and apply information – and why not learn these things through their own mini neural networks? And so we get LSTMs.
And now that we’ve gone through this process, we can come up with our own modifications.
- For example, maybe you think it’s silly for LSTMs to distinguish between long-term and working memories – why not have one? Or maybe you find separate remember gates and save gates kind of redundant – anything we forget should be replaced by new information, and vice-versa. And now you’ve come up with one popular LSTM variant, the GRU.
- Or maybe you think that when deciding what information to remember, save, and focus on, we shouldn’t rely on our working memory alone – why not use our long-term memory as well? And now you’ve discovered Peephole LSTMs.
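As a rough illustration of the first idea, here is a scalar GRU step: a single state h, and one update gate z that decides both how much of the old state to drop and how much new information to take in. The parameter names are hypothetical, and note that references differ on whether z or 1 - z multiplies the old state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p is a dict of (hypothetical) weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Whatever fraction of the old state is forgotten is replaced by new content.
    return (1.0 - z) * h_prev + z * h_cand

params = dict(wz=0.0, uz=0.0, bz=0.0, wr=0.0, ur=0.0, br=0.0,
              wh=1.0, uh=0.0, bh=0.0)
print(gru_step(1.0, 0.5, params))
```

Pushing `bz` strongly negative makes the step copy `h_prev` through almost unchanged, while a strongly positive `bz` overwrites it with the candidate.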
Making Neural Nets Great Again
Let’s look at one final example, using a 2-layer LSTM trained on Trump’s tweets. Despite the tiny dataset, it’s enough to learn a lot of patterns.
For example, here’s a neuron that tracks its position within hashtags, URLs, and @mentions. Here’s a proper noun detector (note that it’s not simply firing at capitalized words), an auxiliary verb + “to be” detector (“will be”, “I’ve always been”, “has never been”), a quote attributor, and even a MAGA and capitalization neuron.
And here are some of the proclamations the LSTM generates (okay, one of these is a real tweet; I’ll let you guess which):
Unfortunately, the LSTM merely learned to ramble like a madman.
That’s it. To summarize, here’s what you’ve learned:
Here’s what you should save:
And now you’ve earned yourself that donut!
Thanks to Chen Liang for some of the TensorFlow code I used, Ben Hamner and Kaggle for the Trump dataset, and, of course, Schmidhuber and Hochreiter for their original paper. If you want to explore the LSTMs yourself, feel free to play around!