
RNN Fever Dream

·7 mins
Building Skippy - This article is part of a series.
Part 2: This Article

Three years after my Markov chain bot called someone a potato in IRC, Google published a paper where a neural network debated the meaning of life. Trained on movie subtitles. The exchange went something like:

Human: What is the purpose of life?
Machine: To serve the greater good.
Human: What is the purpose of living?
Machine: To live forever.

That wasn’t a lookup table. That wasn’t two words predicting a third. Something had changed.

The Seq2Seq Moment

Sutskever, Vinyals, and Le dropped their sequence-to-sequence paper in late 2014 and it rewired how everyone thought about language generation. The idea was elegant. Take one LSTM network, feed it an input sequence, let it compress that into a fixed-length vector (the “thought vector,” which is a hell of a name for a tensor). Then hand that vector to a second LSTM that decodes it into an output sequence.

Encoder reads. Decoder writes. The vector in the middle is the entire understanding.
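The shape of the idea fits in a few lines. Here is a toy NumPy sketch with made-up sizes and untrained random weights, a plain RNN standing in for the paper's LSTMs, and a decoder that skips feeding back its own outputs; the point is only the data flow: variable-length input in, fixed-length vector in the middle, sequence out.

```python
# Toy seq2seq data flow in NumPy. Sizes and weights are illustrative only;
# the real model used LSTMs and learned these matrices from data.
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 20, 8            # tiny vocabulary and hidden size

# Encoder parameters: input-to-hidden and hidden-to-hidden.
W_xh = rng.normal(0, 0.1, (hidden, vocab))
W_hh = rng.normal(0, 0.1, (hidden, hidden))

def encode(token_ids):
    """Read the input one token at a time. The final hidden state is the
    fixed-length 'thought vector' -- same size no matter the input length."""
    h = np.zeros(hidden)
    for t in token_ids:
        x = np.zeros(vocab)
        x[t] = 1.0                              # one-hot token
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

# Decoder parameters: hidden-to-hidden and hidden-to-vocab.
U_hh = rng.normal(0, 0.1, (hidden, hidden))
W_hy = rng.normal(0, 0.1, (vocab, hidden))

def decode(thought, steps):
    """Unroll from the thought vector, greedily emitting one token per step."""
    h, out = thought, []
    for _ in range(steps):
        h = np.tanh(U_hh @ h)
        out.append(int(np.argmax(W_hy @ h)))    # most likely token
    return out

v = encode([3, 7, 1, 12])
print(v.shape)                  # (8,) -- a four-token input and a one-token
print(encode([5]).shape)        # (8,) -- input compress to the same size
print(decode(v, 3))
```

Everything the decoder knows about the input has to fit in that one vector, which is both the elegance and, as it turned out, the catch.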

The original application was machine translation, but Vinyals and Le turned it on conversations. Train the encoder on one line of dialogue, train the decoder on the response. Do that across enough movie scripts and the network learns something that looks a lot like how people talk. Not because it understands conversation. Because conversation has statistical patterns just like everything else.

Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks” post landed around the same time and it was the thing that actually got me to sit down and try this. He trained character-level RNNs on Shakespeare, Linux kernel source, LaTeX papers. The outputs were wrong in every factual sense but structurally right in ways that shouldn’t have been possible from a model that only sees one character at a time.

His char-rnn generated C code with proper bracket matching. Functions with correct indentation and realistic (but fake) variable names. It learned to open and close quotes. It tracked whether it was inside a URL. All of this emerged from raw characters. No tokenizer, no grammar rules, no hand-coded anything.
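How can something that sees one character at a time balance quotes? The recurrent hidden state acts as memory. Here is the mechanism reduced to a single hand-set memory unit; to be clear, char-rnn learns weights that end up behaving like this, nothing is hand-coded there:

```python
# One "neuron" of state, toggled by quote characters. A trained char-rnn
# discovers units with this behavior on its own; this hand-built version
# just shows that a carried hidden state is enough to track the property.
def inside_quotes(text):
    state = 0.0                  # 0.0 = outside quotes, 1.0 = inside
    trace = []
    for ch in text:              # strictly one character per step
        if ch == '"':
            state = 1.0 - state  # crossing a quote flips the memory
        trace.append(state)
    return trace

print(inside_quotes('a"bc"d'))   # [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```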

That was the moment where I went from “this is a fun toy” to “this is going to be a thing.”

neuralconvo on a Saturday Afternoon

Marc-André Cournoyer had an open source project called neuralconvo that implemented the Google conversational model in Torch7. Two LSTM layers, trained on the Cornell Movie-Dialogs Corpus. Lua, because that’s what Torch7 used, which was its own adventure.

I had upgraded the homelab by this point. Still the same rack, but better hardware. A GTX 780 I pulled from my gaming rig because training on CPU was going to take a week and I didn’t have that kind of patience.

Getting neuralconvo running was a solid afternoon of dependency wrangling. Torch7 needed specific versions of nn, rnn, penlight. CUDA drivers had to match the Torch build. The Cornell dataset needed downloading and extracting into the right directory structure. None of this was hard, exactly. Just fiddly in the way that ML tooling in 2015 was always fiddly.

I kicked off training on 50,000 dialogue pairs with a hidden size of 1,000. The GTX 780 churned through it. Three days for 20 epochs.

First Contact

The first time the trained model responded to a question, I just sat there.

> What is the purpose of life?
A gift.

Two words. Not from a database. Not pattern-matched from a script. Generated fresh by a network that had compressed thousands of movie conversations into weight matrices and was now producing novel output.

It was wrong, obviously. It wasn’t wise. It was a statistical artifact of Hollywood screenwriters being dramatic. But it felt different from the Markov chain in a way I couldn’t immediately articulate.

The Markov bot remixed surface patterns. Word A follows Word B. This was doing something deeper. The encoder was reading my entire input, building a representation of it, and the decoder was constructing a response from that representation. The “thought vector” in the middle meant the model could, in theory, hold the meaning of a sentence, not just its last two words.

In practice, it held the vibe more than the meaning. Ask it something the movie corpus covered well (relationships, conflict, existential questions) and the responses were eerily coherent. Ask it something specific or technical and it fell apart. It had never seen a conversation about Linux permissions in a movie script, so it had nothing to work with.

Torch7, Theano, and the Framework Wars

2015 was wild for ML frameworks. Torch7 was Lua-based, fast, and had the best GPU support at the time. Theano was Python, more academic, slower to iterate but you could drop into the math more easily. Caffe existed but was mostly for vision. TensorFlow had just been announced and nobody trusted it yet.

I bounced between Torch7 and Theano depending on the project. Torch7 for anything that needed speed and had a working Lua implementation I could fork. Theano for anything where I wanted to understand the gradients and didn’t mind waiting.

The ecosystem was fragmented in a way that’s hard to explain now. Nothing was standardized. Every project had its own data loading format, its own training loop conventions, its own way of saving checkpoints. You couldn’t just pip install a model. You cloned repos, read READMEs that were half-wrong, and debugged shape mismatches at 1am.

But the energy was incredible. Every week someone posted a new result that seemed impossible the month before. Image captioning. Style transfer. Dialogue generation. The hardware was barely keeping up. I was training models on a single consumer GPU that a year earlier would have required a cluster.

The Gap Narrowed

The Markov chain had a gap between “appears to understand” and “actually understands” that was wide enough to laugh at. The RNN narrowed it. Not to zero. Not even close. But enough that the laugh caught in your throat sometimes.

The responses weren’t just statistically plausible word sequences anymore. They had something like coherence. The model could track a topic across a sentence. It could generate responses that were contextually appropriate, not just grammatically possible. When it worked, it felt like talking to someone distracted, not someone absent.

When it didn’t work, it was still obviously a machine. It would loop. It would contradict itself within two sentences. It would respond to a question about breakfast with a line about death because that’s what the movie corpus gave it. The failure modes were different from Markov chains, more subtle, harder to spot immediately, but just as fundamental.

What Changed in My Head

The Markov chain taught me that training data is everything. The RNN taught me that architecture matters too.

Same training data through a different structure produced qualitatively different output. The Cornell Movie-Dialogs Corpus through a Markov chain gave you remixed movie quotes. Through a seq2seq model, it gave you something that felt like a conversation. The data was identical. The structure made it think (or whatever the machine equivalent of thinking is) differently.

I also learned that the “thought vector” was both the breakthrough and the bottleneck. Compressing an entire input into a single fixed-length vector was elegant, but it meant long inputs lost information. The model remembered the gist, not the details. Ask it a two-word question and the response was sharp. Ask it a paragraph and it got fuzzy. There was a ceiling, and the architecture put it there.
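That forgetting is easy to demonstrate on a toy untrained RNN (NumPy, made-up sizes, random weights with small enough norm that the recurrence is contractive): encode two inputs that differ only in their first token, and the gap between their thought vectors shrinks toward nothing as shared tokens pile on top.

```python
# Fixed-length bottleneck demo: with a contractive recurrence, the trace of
# an early token decays geometrically as later tokens overwrite the state.
# Sizes and weights are illustrative, not from any trained model.
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden = 20, 8
W_xh = rng.normal(0, 0.3, (hidden, vocab))
W_hh = rng.normal(0, 0.1, (hidden, hidden))   # small norm => contraction

def encode(tokens):
    h = np.zeros(hidden)
    for t in tokens:
        x = np.zeros(vocab)
        x[t] = 1.0
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

suffix = [5, 9, 2, 14] * 5                    # 20 shared trailing tokens
a = encode([0] + suffix)                      # inputs differ only in
b = encode([1] + suffix)                      # their first token

early_gap = np.linalg.norm(encode([0]) - encode([1]))
late_gap = np.linalg.norm(a - b)
print(early_gap, late_gap)   # late_gap is far smaller: the first token's
                             # trace has mostly washed out of the vector
```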

That bottleneck would take a couple more years and a paper called “Attention Is All You Need” to break. But I didn’t know that yet. What I knew was that the itch from 2012 had gotten worse. The question wasn’t hypothetical anymore. Something with actual structure could hold context and generate coherent responses. I’d seen it do it. The question now was how much further it could go.
