<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.1">Jekyll</generator><link href="https://mdcramer.github.io/feed/deep-speeling-blog.xml" rel="self" type="application/atom+xml" /><link href="https://mdcramer.github.io/" rel="alternate" type="text/html" /><updated>2026-04-27T11:57:27-07:00</updated><id>https://mdcramer.github.io/feed/deep-speeling-blog.xml</id><title type="html">Hackin’ and Tinkerin’ | Deep-speeling-blog</title><subtitle>A collection of blogs related to some of my work on GitHub and elsewhere</subtitle><author><name>Mark Cramer</name></author><entry><title type="html">Conclusion</title><link href="https://mdcramer.github.io/deep-speeling-blog/conclusion/" rel="alternate" type="text/html" title="Conclusion" /><published>2018-10-06T00:00:00-07:00</published><updated>2018-10-06T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/conclusion</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/conclusion/"><![CDATA[<p>For a while there, thanks to taking the <a href="https://www.udacity.com/course/deep-learning-nanodegree--nd101">Udacity Deep Learning Nanodegree</a>, which is what got me hooked on deep learning in the first place, I had a bunch of free usage on AWS. A couple of months ago, however, my pool of free usage ran dry and, even though my instance has been dormant for a long time, I started getting charged ~$4 a month. Certainly not a lot of money, but I haven’t worked on this project in almost a year, so I felt it was time to pull the plug.</p>

<p>Despite all my effort, in the end I was not able to reproduce the results on Tal Weiss’ <a href="https://machinelearnings.co/deep-spelling-9ffef96a24f6">Deep Spelling</a> blog post; he described getting 95.5% accuracy on the Validation set and I was never able to get much more than 75%. Apparently reproducing other people’s results <a href="http://blog.kaggle.com/2018/09/19/help-i-cant-reproduce-a-machine-learning-project">can be challenging</a>. <a href="https://www.linkedin.com/in/wang-dong-69b8771a/">Dong Wang</a>, a software engineer who used to work at Pinterest, was able to develop a <a href="https://medium.com/@yaoyaowd/rnn-spelling-correction-to-crack-a-nut-with-a-sledgehammer-7f5aa442c08c">spelling correction RNN</a> that reached 90% accuracy, but he was using a different data set.</p>

<p>The experience, however, was far from being a failure. Not only did I learn a ton from banging the code together myself and then tweaking and running the models, it was also a lot of fun. Every time I ran a model it was quite a thrill to watch the numbers come in and see what happened. If my energies weren’t being devoted to a new job and other projects, I might be inclined to continue to push on this.</p>

<p><strong>Update: 17 October 2018</strong> - In April 2018 The Atlantic published an interesting article about how the <a href="https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/">Scientific Paper is Obsolete</a>. In it, the author had this to say:</p>

<blockquote>
  <p>“The more sophisticated science becomes, the harder it is to communicate results. Papers today are longer than ever and full of jargon and symbols. They depend on chains of computer programs that generate data, and clean up data, and plot data, and run statistical models on data. These programs tend to be both so sloppily written and so central to the results that it’s contributed to a replication crisis, or put another way, a failure of the paper to perform its most basic task: to report what you’ve actually discovered, clearly enough that someone else can discover it for themselves.”</p>
</blockquote>

<p>If the results in scientific papers are difficult to reproduce, it should be no surprise that blog posts on the internet would be even more difficult. For what it’s worth, all of my code and data are available, so if anyone who tries to replicate my results runs into difficulty, please let me know.</p>]]></content><author><name>Mark Cramer</name></author><category term="goals" /><category term="conclusion" /><summary type="html"><![CDATA[This was an amazing experience, but today I pulled the plug.]]></summary></entry><entry><title type="html">Dropout Experiment with Only Popular Words</title><link href="https://mdcramer.github.io/deep-speeling-blog/dropout-experiment-with-only-popular-words/" rel="alternate" type="text/html" title="Dropout Experiment with Only Popular Words" /><published>2017-11-05T00:00:00-07:00</published><updated>2017-11-05T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/dropout-experiment-with-only-popular-words</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/dropout-experiment-with-only-popular-words/"><![CDATA[<p><a href="/deep-speeling-blog/no-hablo-espanol/">As mentioned earlier</a>, I am becoming increasingly skeptical of the possibility of reaching 90% accuracy after 12 hours of training using <a href="https://github.com/MajorTal/DeepSpell/blob/master/keras_spell.py">Mr. Weiss’ methodology</a> and the publicly available <a href="http://research.google.com/pubs/pub41880.html">billion word dataset</a> released by Google. However, I am a stubborn individual and so I thought I would experiment with adjusting the Regularization to see if that helps.</p>

<p>The fact that the previous experiment, where I removed all of the <a href="/deep-speeling-blog/eliminating-uncommon-words-makes-things-worse/">sentences with uncommon words</a>, produced an inferior result makes no sense whatsoever; the problem was significantly simplified and yet the validation results during training were considerably worse. Upon reflection, if the ‘size’ of the problem is reduced and yet the network performs worse, perhaps it is over-fitting.</p>

<p>As such, I decided to adjust the Dropout from the 30% I’ve been using to a more aggressive 50%. (Mr. Weiss’ <a href="https://machinelearnings.co/deep-spelling-9ffef96a24f6">blog</a> says that he used 30% although his code has 20%, so perhaps it’s worth experimenting.) If you’ve been following along you can probably guess that, naturally, the result was even worse (below).</p>
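<p>For reference, “30% Dropout” means each unit is zeroed with probability 0.3 during training. Below is a minimal pure-Python sketch of the inverted-dropout scheme (the helper function is my own illustration, not the project’s TensorFlow code, which does this via <code class="language-plaintext highlighter-rouge">keep_prob</code>):</p>

```python
import random

def inverted_dropout(activations, keep_prob, rng=None):
    """Zero each unit with probability (1 - keep_prob) and scale the
    survivors by 1 / keep_prob so the expected activation is unchanged."""
    rng = rng or random.Random()
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# 30% Dropout means keep_prob = 0.7; this experiment tries keep_prob = 0.5
acts = [1.0] * 10
dropped = inverted_dropout(acts, keep_prob=0.5, rng=random.Random(0))
```

<p>With keep_prob = 0.5, every surviving unit is doubled, which is why aggressive Dropout can destabilize training on a problem that is already hard.</p>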

<p>Continuing, if increasing the Dropout has a significant negative impact on training, then perhaps reducing it would help. Logically this makes little sense to me, as the scope of the problem was just reduced, but I decided to try nonetheless. I launched the training before heading out for a day at an <a href="http://aifrontiers.com/">AI Conference</a>, and things were initially looking <em>great</em> during the first hour before I left the house, but by the time I got home the training had gone sideways. I decided to run a full 5 epochs just to see how it would finish and at the end I achieved 70.2% accuracy. It should be noted that this is much lower than the <a href="/deep-speeling-blog/running-on-ec2/">75.1% accuracy</a> achieved prior to implementing He normal initialization, extracting non-English sentences from the dataset and removing sentences with uncommon words and numbers.</p>

<p>Every time I’ve tried to implement an improvement it has degraded the network’s ability to learn. I’m again not quite sure where to go from here. Perhaps I’ll dig deeper into the dataset or perhaps I should build the RNN using Keras, just in case something is implemented differently there.</p>

<figure>
    <img src="/assets/images/dropout-experiment-with-popular.png" alt="Validation loss for experiment varying amount of Dropout" /><figcaption>Continuing with the dataset that contains only popular words, adjusting the Dropout produces an effect, but does not improve the results significantly enough.</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="dropout" /><category term="common words" /><category term="uncommon words" /><summary type="html"><![CDATA[Continuing from the previous experiment eliminating uncommon words, adjusting the Dropout produces an effect, but the improvement is not significant enough.]]></summary></entry><entry><title type="html">Eliminating Uncommon Words Makes Things Worse</title><link href="https://mdcramer.github.io/deep-speeling-blog/eliminating-uncommon-words-makes-things-worse/" rel="alternate" type="text/html" title="Eliminating Uncommon Words Makes Things Worse" /><published>2017-11-03T00:00:00-07:00</published><updated>2017-11-05T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/eliminating-uncommon-words-makes-things-worse</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/eliminating-uncommon-words-makes-things-worse/"><![CDATA[<p>This is frustrating.</p>

<p>In the <a href="/deep-speeling-blog/motivation-and-goals/">quest for 90% accuracy</a> I decided to take another look at <a href="https://github.com/MajorTal/DeepSpell/blob/master/keras_spell.py">Mr. Weiss’ code</a> and noticed that in <code class="language-plaintext highlighter-rouge">preprocesses_split_lines4()</code> he is “… selecting only sentences with most-common words.” The call to the function is commented out, and more than half of the function is commented out as well (it appears to be using <a href="https://www.tensorflow.org/tutorials/word2vec">word2vec</a>, but I’m not seeing how), but it got me to thinking:</p>

<ul>
  <li>Perhaps this ‘problem’ can be simplified by removing sentences with very uncommon words, such as those that only ever appear once. It’s unreasonable to think that a neural network could ‘know’ how to spell a word without ever seeing it previously. Those very uncommon words probably contain many proper names and misspellings (which I should check at some point).</li>
  <li>Additionally, while I’m at it, the problem could potentially be simplified even further by removing sentences with numbers. How would a human being, let alone a neural network, know that “I have 41 apples” was, in fact, a transposition of “I have 14 apples”?</li>
</ul>

<p>Therefore, after stripping out all digits and punctuation, I created a frequency dictionary of every ‘word’ in the dataset (“don’t” gets represented as “dont”, which is fine, and “egg-beater” becomes “eggbeater”, which is arguably better) and then removed all sentences containing a ‘word’ that only appears once (i.e. a unique word). Additionally, I removed any sentence containing digits. (This process, by the way, took 30 hours to run.)</p>
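<p>The preprocessing described above can be sketched as follows (the function names are my own and this is an illustration of the logic, not the actual 30-hour script):</p>

```python
import re
from collections import Counter

def strip_to_words(line):
    """Lowercase 'words' with digits and punctuation stripped, so that
    "don't" becomes "dont" and "egg-beater" becomes "eggbeater"."""
    return re.sub(r"[^a-z\s]", "", line.lower()).split()

def filter_sentences(lines):
    """Drop sentences containing digits, and sentences containing any
    word that appears only once across the whole dataset."""
    counts = Counter(w for line in lines for w in strip_to_words(line))
    return [line for line in lines
            if not any(ch.isdigit() for ch in line)
            and all(counts[w] > 1 for w in strip_to_words(line))]
```

<p>Note that the frequency dictionary is built over the full dataset before any sentences are removed, so a word that appears twice survives even if one of its sentences is later dropped for containing a digit.</p>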

<p>While previously my dataset had 4,161,772 lines, the new one has only 3,463,951 lines, for a reduction of about 17%, which seems pretty sizable. As such, I was giddily anticipating a considerable improvement in accuracy. Naturally, as has been my experience so far, the outcome (below) was exactly the opposite.</p>

<p>How is this possible? How can removing the ‘words’ that are obviously difficult to spell from the dataset result in a significantly worse training outcome? For the moment, I don’t even have a theory as to what might have happened here.</p>

<figure>
    <img src="/assets/images/popular-words-only.png" alt="Validation loss for experiment with dataset that contains only popular words" /><figcaption>Removing sentences with uncommon words from the dataset significantly reduces training performance.</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="datasets" /><category term="common words" /><category term="uncommon words" /><summary type="html"><![CDATA[Eliminating sentences with uncommon words (i.e. those that only appear once) from the dataset does not improve the accuracy of the training, for reasons unknown.]]></summary></entry><entry><title type="html">Reducing the Validation and Test datasets</title><link href="https://mdcramer.github.io/deep-speeling-blog/reducing-the-validation-and-test-datasets/" rel="alternate" type="text/html" title="Reducing the Validation and Test datasets" /><published>2017-10-20T00:00:00-07:00</published><updated>2017-10-23T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/reducing-the-validation-and-test-datasets</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/reducing-the-validation-and-test-datasets/"><![CDATA[<p>Before starting the process of training, it is common practice to first randomly select a portion of the data to set aside for testing after training is complete. Then it is common practice to randomly select a portion of the remaining data for validation during training. The percentages used to split the data will <a href="https://www.youtube.com/watch?v=oJzDsnPq4vU">generally</a> be a function of the amount of data available and the complexity of the model.</p>

<p>Out of habit, I chose to slice off 10% of the data to set aside for the Test dataset. When analyzing accuracy after each epoch, however, I began to notice that the accuracy percentages converged on a relatively stable determination long before I finished running the Test dataset (see below). This seemed like a waste of both data and time, given that it took ~30 minutes to run the Test dataset. After consulting my colleagues at <a href="https://stats.stackexchange.com/questions/304977/can-i-use-a-tiny-validation-set">CrossValidated</a>, I decided to drop the amount of data set aside for the Test dataset from 10% to 2%. (The same thing could conceivably be done with the Validation dataset used during training, but I’ve been using a single batch for that.)</p>
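<p>The split itself is simple; here is a minimal sketch (the function name and default seed are my own, not the project’s actual code) of shuffling once and slicing off 2% for the Test set:</p>

```python
import random

def split_dataset(lines, test_frac=0.02, seed=42):
    """Shuffle once, slice off test_frac for the Test set, and keep
    the remainder for training."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(1000))
```

<p>Fixing the seed makes the split reproducible across runs, which matters when comparing accuracy between experiments.</p>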

<p>The result: A little more data for training and a little less time spent computing accuracy after each epoch. Meh, but why not?</p>

<figure>
    <img src="/assets/images/test-accuracy.png" alt="Measurements of accuracy with Test dataset after each epoch of training" /><figcaption>Extensive evaluation of the Test dataset does not appreciably improve the determination of accuracy after a certain point.</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="validation" /><category term="test" /><category term="datasets" /><summary type="html"><![CDATA[Depending on the situation, using smaller Validation and Test datasets can leave more data for training, and reduce time spent computing accuracy, without any impact on determining accuracy.]]></summary></entry><entry><title type="html">No hablo español</title><link href="https://mdcramer.github.io/deep-speeling-blog/no-hablo-espanol/" rel="alternate" type="text/html" title="No hablo español" /><published>2017-10-16T00:00:00-07:00</published><updated>2017-10-16T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/no-hablo-espanol</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/no-hablo-espanol/"><![CDATA[<p>As mentioned <a href="/deep-speeling-blog/motivation-and-goals/">at the beginning</a>, the first goal is to reproduce the results produced by Mr. Weiss. After implementing virtually everything in his post, including the <a href="/deep-speeling-blog/he-normal-makes-things-worse/">He normal initialization</a>, and building a highly similar model, the target of 90% accuracy (let alone 95.5%) is proving elusive.</p>

<p>I am thus going to start branching out.</p>

<h2 id="removing-non-english-sentences-from-the-dataset">Removing non-English sentences from the dataset</h2>
<p>It is critical to know the data set. It is also critical to clean the dataset, which is what data scientists <a href="https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/">spend most of their time doing</a>.</p>

<p>Since the inception of this project I’ve been peeking into the dataset and have noticed, despite the “.en” in the source filename, a considerable number of non-English sentences, like <code class="language-plaintext highlighter-rouge">Toujours aussi inconstant, le Brésil, tombé au 19e rang du classement FIFA, a certes réagi après l'ouverture du score de la tête de Gonzalez (7).</code> While I would imagine that a neural network could be trained to simultaneously correct spelling for multiple different languages, this would only increase the complexity of the problem. Additionally, a small smattering of other languages is going to be nothing more than noise. Thus, I’ve decided to scrub the data set of anything non-English.</p>

<p>Using <a href="https://pypi.python.org/pypi/langdetect">langdetect</a> I constructed a routine to iterate through the entire file and remove any sentences that were not identified as English. In the process, I discovered a few things about langdetect:</p>
<ul>
  <li>It is <a href="https://stackoverflow.com/a/38752290/852795">not entirely accurate</a>. Sentences like <code class="language-plaintext highlighter-rouge">You made it home!</code> return “fr”. To mitigate this effect I used <code class="language-plaintext highlighter-rouge">detect_langs</code>, which returns a probability distribution of possible languages, and accept anything that has a non-zero chance of being English.</li>
  <li>It is not consistent. Repeatedly processing <code class="language-plaintext highlighter-rouge">Hello, I'm christiane amanpour.</code> returned <code class="language-plaintext highlighter-rouge">[it:0.8571401485770536, en:0.14285811674731527]</code> then <code class="language-plaintext highlighter-rouge">[it:0.8571403121803622, fr:0.14285888197332486]</code> and then <code class="language-plaintext highlighter-rouge">[it:0.999995562246093]</code>, all of which are incorrect, by the way. (It’s unclear why “Christiane Amanpour” isn’t capitalized, but that’s the way it is in the source file.)</li>
  <li>It throws an error when processing text it cannot “identify,” such as URLs and emails. This actually turns out to be a happy result since things like <code class="language-plaintext highlighter-rouge">http://abcn.ws/11JABPu</code> and <code class="language-plaintext highlighter-rouge">rswilloughby@pomlaw.com</code> are not going to help much with training, so it’s good to get rid of them anyway.</li>
  <li>It is slow. It took 24 hours to process 21,688,362 lines, which works out to about 250 lines a second. I guess that’s not too bad.</li>
</ul>
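<p>The filtering loop can be sketched as follows. The detector is passed in as a parameter so the sketch stays self-contained; in practice it would be langdetect’s <code class="language-plaintext highlighter-rouge">detect_langs</code>, and the function name is my own:</p>

```python
def keep_english(lines, detect_langs):
    """Keep any line with a non-zero probability of being English;
    lines the detector cannot identify (it raises) are dropped,
    which conveniently discards URLs and email addresses."""
    kept = []
    for line in lines:
        try:
            if any(guess.lang == "en" for guess in detect_langs(line)):
                kept.append(line)
        except Exception:
            pass
    return kept
```

<p>Accepting any non-zero English probability, rather than requiring English to be the top guess, is what mitigates the misclassifications described above.</p>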

<p>Perhaps my next project should be to build a language detection neural network. I digress.</p>

<p>In any event, the end result is that 848,326 lines, including 345 “errors,” were removed from the source file. That’s only 3.9% of the lines; however, after further processing to remove sentences that contain characters outside the top 75, the size of the source dataset dropped from 4,154,135 lines to 3,793,771, which is a reduction of 8.7%.</p>

<h2 id="thats-the-theory-anyway">That’s the theory, anyway</h2>
<p>The actual result, after two days of training, is disappointing. In my first run (orange-brown dots) I inadvertently also changed the minimum input size from 5 to 10, which had a pretty negative impact on training so I cut it short. It’s often helpful to not change multiple things at the same time.</p>

<p>The second run (dark blue dots) actually did well, although not better than the run a couple weeks ago when I changed the number of allowable characters from 100 to 75 and before I implemented <a href="/deep-speeling-blog/he-normal-makes-things-worse/">He normal initialization</a>. The final validation accuracy was 70.0%. This is another disappointing result and I continue to be perplexed as to how Mr. Weiss got to 95.5% accuracy.</p>

<p>I have not yet, however, run out of ideas.</p>

<figure>
    <img src="/assets/images/english_only.png" alt="After removing non-English source sentences" />
    <figcaption>Validation loss for various hyperparameter configurations</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="english" /><category term="non-english" /><category term="langdetect" /><category term="language detection" /><summary type="html"><![CDATA[Remove non-English text from the data set.]]></summary></entry><entry><title type="html">He normal makes things worse</title><link href="https://mdcramer.github.io/deep-speeling-blog/he-normal-makes-things-worse/" rel="alternate" type="text/html" title="He normal makes things worse" /><published>2017-10-01T00:00:00-07:00</published><updated>2017-10-07T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/he-normal-makes-things-worse</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/he-normal-makes-things-worse/"><![CDATA[<h2 id="initializing-weights-is-important">Initializing weights is important</h2>
<p>Initializing the weights properly can make the difference between a model that trains nicely and converges to a generalized solution, and one that either explodes, never quite gets there or <a href="https://plus.google.com/+SoumithChintala/posts/RZfdrRQWL6u">trains more slowly</a>. The Udacity <a href="https://github.com/mdcramer/deep-learning/tree/master/seq2seq">sequence-to-sequence RNN example</a> used a random uniform initialization (<code class="language-plaintext highlighter-rouge">tf.random_uniform_initializer(-0.1, 0.1, seed=2)</code>) for the encoder and decoder and then a truncated normal initialization (<code class="language-plaintext highlighter-rouge">tf.truncated_normal_initializer(mean=0.0, stddev=0.1)</code>) for the decoder dense layer, while <a href="https://medium.com/@majortal/deep-spelling-9ffef96a24f6">Weiss</a> used a Gaussian initialization scaled by fan-in, also known as <a href="https://arxiv.org/abs/1502.01852">He normal initialization</a>.</p>

<p>Again, the training results were disappointing. Not only did the He normal initialization (achieved via <code class="language-plaintext highlighter-rouge">tf.contrib.layers.variance_scaling_initializer()</code> - the default is He normal) not surpass the very basic random uniform initialization, but it performed worse. This is both disappointing and surprising, given that normal initializers typically surpass uniform ones.</p>
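<p>Concretely, He normal draws each weight from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in). A pure-Python sketch (the helper names are mine; the project itself uses TensorFlow’s variance-scaling initializer):</p>

```python
import math
import random

def he_std(fan_in):
    """He normal: weights drawn from a zero-mean Gaussian whose
    standard deviation is sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def he_normal(fan_in, fan_out, seed=0):
    """Build a fan_in x fan_out weight matrix with He normal scaling."""
    rng = random.Random(seed)
    std = he_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

# For a 512-unit layer the scale works out to exactly 0.0625
```

<p>The factor of 2 compensates for ReLU zeroing half of its inputs on average, which is why He et al. proposed it for ReLU networks in particular.</p>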

<h2 id="this-is-odd">This is odd</h2>
<p>Also interesting is how, unlike the previous configurations, with the He normal initialization there are a number of validation losses that are significantly outside of the trend. You can see the couple dozen blue dots in the white space above the He normal validation loss curve. I have no explanation or theory for that.</p>
<figure>
	<img src="/assets/images/he-normal.png" alt="He normal initialization" />
	<figcaption>Validation loss for various hyperparameter configurations</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="he normal" /><category term="initializer" /><summary type="html"><![CDATA[Disappointing that a fancy, modern initialization technique not only didn't help, but made things worse]]></summary></entry><entry><title type="html">Goldilocks batch size</title><link href="https://mdcramer.github.io/deep-speeling-blog/goldilocks-batch-size/" rel="alternate" type="text/html" title="Goldilocks batch size" /><published>2017-10-01T00:00:00-07:00</published><updated>2017-10-04T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/goldilocks-batch-size</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/goldilocks-batch-size/"><![CDATA[<h2 id="batch-size-of-64-is-too-small">Batch size of 64 is too small</h2>
<p>As we learned in a previous post, a <a href="/deep-speeling-blog/smaller-batches-and-dropout/">batch size of 256</a> is too large. Since then I’ve been running with a batch size of 128 when I got to thinking: if 128 is better than 256, then perhaps 64 is better still.</p>

<p>It is not.</p>

<p>For whatever reason, training with a batch size of 64 progressed significantly slower than all of the training with a batch size of 128. It was so poor that I cut it off before even completing the first epoch. I’ll be going back to 128 now because, apparently, that one is “just right.”</p>
<figure>
	<img src="/assets/images/batch_size-64.png" alt="Batch size 64" />
	<figcaption>Validation loss for various hyperparameter configurations</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="batch_size" /><summary type="html"><![CDATA[Some batch sizes are too big and some are too small. It's important to find the one that is just right.]]></summary></entry><entry><title type="html">Running on EC2</title><link href="https://mdcramer.github.io/deep-speeling-blog/running-on-ec2/" rel="alternate" type="text/html" title="Running on EC2" /><published>2017-09-27T00:00:00-07:00</published><updated>2017-10-16T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/running-on-ec2</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/running-on-ec2/"><![CDATA[<p>Even with a <a href="/deep-speeling-blog/my-rig/">local GPU</a> there is a certain convenience to training in the cloud. For starters, you can turn off your machine, which is especially handy if your training takes multiple days (or weeks). If training causes your laptop’s fan to whir, there’s also the benefit to some quiet.</p>

<h2 id="getting-off-the-ground">Getting off the ground</h2>
<p>The first step is to create an account at AWS. Since I took the <a href="https://www.udacity.com/course/deep-learning-nanodegree-foundation--nd101">Udacity Deep Learning Nanodegree</a> I already had one.</p>

<p>Next, you’ve got to <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/LaunchingAndUsingInstances.html">launch an instance</a> on EC2. Here things can get complicated. If you just grab a generic <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html">accelerated computing instance</a>, like the g2.2xlarge, you’ll then have to set it up. There are a number of instructional websites on <a href="http://ramhiser.com/2016/01/05/installing-tensorflow-on-an-aws-ec2-instance-with-gpu-support/">getting things set up</a>, which generally involve installing CUDA, cuDNN, Tensorflow and then whatever else you need, but you can also grab an <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">AMI</a> that <a href="https://aws.amazon.com/marketplace/pp/B01EYKBEQ0?qid=1505924878587&amp;sr=0-1&amp;ref_=srh_res_product_title">already has everything</a> you need. I decided to run with the AMI provided by the Udacity course.</p>

<p>Finally, you need to log into your instance, load up your code, typically accomplished by cloning a Github repo, and then run. If you’re going to be training for any length of time, <a href="http://linux.101hacks.com/unix/nohup-command/">nohup</a> is your friend. I put a number of helpful commands at the top of the .ipynb file in the Github repo.</p>

<h2 id="phoning-home">Phoning home</h2>
<p>One thing which is decidedly inconvenient about running in the cloud is that it’s difficult to know how things are going. You could log into the instance and check an output file, but that can be a hassle. For fun (isn’t all of this fun?) I decided to have my script periodically email me updates. Setting up email on EC2 isn’t too hard, so now at the end of each epoch I get an email like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Batch size: 128
RNN size  : 512
Num layers: 2
Enc. size : 512
Dec. size : 512
Keep prob.: 0.7
Learn rate: 0.001

Epoch   3/4 Batch    100/32453 Inputs (000)    8320 - Loss:  0.053 - Validation loss:  0.037
Epoch   3/4 Batch    200/32453 Inputs (000)    8333 - Loss:  0.078 - Validation loss:  0.056
Epoch   3/4 Batch    300/32453 Inputs (000)    8346 - Loss:  0.052 - Validation loss:  0.044
Epoch   3/4 Batch    400/32453 Inputs (000)    8359 - Loss:  0.052 - Validation loss:  0.044
Epoch   3/4 Batch    500/32453 Inputs (000)    8371 - Loss:  0.065 - Validation loss:  0.042
...
...
...
Epoch   3/4 Batch  32100/32453 Inputs (000)   12416 - Loss:  0.057 - Validation loss:  0.034
Epoch   3/4 Batch  32200/32453 Inputs (000)   12429 - Loss:  0.046 - Validation loss:  0.036
Epoch   3/4 Batch  32300/32453 Inputs (000)   12442 - Loss:  0.036 - Validation loss:  0.035
Epoch   3/4 Batch  32400/32453 Inputs (000)   12455 - Loss:  0.039 - Validation loss:  0.035
Epoch   3/4 Batch  32453/32453 Inputs (000)   12461 - Loss:  0.041 - Validation loss:  0.035

Model training for 21h:2m:48s and saved.
Current accuracy = 70.4%
</code></pre></div></div>
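<p>Sending the report can be sketched like this. The formatting helper, sender/recipient addresses, and the assumption of a local mail server on the instance are all mine, not the project’s actual code:</p>

```python
import smtplib
from email.mime.text import MIMEText

def format_report(hparams, progress_lines, elapsed, accuracy):
    """Render hyperparameters, recent progress lines, training time,
    and accuracy into a plain-text body like the one above."""
    header = "\n".join("{:<10}: {}".format(k, v) for k, v in hparams)
    return "{}\n\n{}\n\nModel training for {} and saved.\nCurrent accuracy = {:.1%}\n".format(
        header, "\n".join(progress_lines), elapsed, accuracy)

def send_report(body, sender="train@example.com", recipient="me@example.com"):
    """Hand the report to a mail server assumed to run on the instance."""
    msg = MIMEText(body)
    msg["Subject"] = "Epoch finished"
    msg["From"], msg["To"] = sender, recipient
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)
```

<p>Calling <code class="language-plaintext highlighter-rouge">send_report(format_report(...))</code> at the end of each epoch is all it takes once the EC2 instance can relay mail.</p>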

<p>I don’t know why, but getting email updates from EC2 on the status of my neural network training has been more thrilling than I would ever have anticipated.</p>

<h2 id="making-it-happen">Making it happen</h2>
<p>After getting everything set up on EC2, I trained for 3 epochs on my local GPU (using a batch size of 128 and 30% dropout) before transferring the whole thing, minus the source and validation files but including the current state of the graph, into the cloud. The fourth and final epoch was then successfully run on EC2. Yay.</p>

<p>The result below, however, is a bit odd. The discontinuity between the 3rd and 4th epochs is certainly due to the fact that the source and validation files were regenerated before training continued, which leads me to wonder to what extent training can be affected by the random sort when generating those files. It also makes me wonder if any efficiencies can be obtained by regenerating those files after every epoch, which is not something I have been doing. Those are ideas I will try to explore later.</p>

<p>The final accuracy was 75.1% after over 24 hours of total training. This is quite a bit lower than the goal of 90% after 12 hours, so there is more work to be done.</p>

<figure>
    <img src="/assets/images/ec2.png" alt="Training on EC2" />
    <figcaption>Validation loss for various hyperparameter configurations</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="ec2" /><category term="aws" /><category term="email" /><summary type="html"><![CDATA[Getting up and running in the cloud with email updates from AWS]]></summary></entry><entry><title type="html">Saving for future training</title><link href="https://mdcramer.github.io/deep-speeling-blog/saving-for-future-training/" rel="alternate" type="text/html" title="Saving for future training" /><published>2017-09-24T00:00:00-07:00</published><updated>2017-10-07T00:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/saving-for-future-training</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/saving-for-future-training/"><![CDATA[<p>The Udacity <a href="https://github.com/mdcramer/deep-learning/tree/master/seq2seq">sequence to sequence project</a> that I used as a starting point for the code actually had code to save and then reload the graph. I’m not sure why they decided to do that, but the graph is saved at the end of training and then reloaded right before it is used to make a prediction. So simply going back to continue training after the graph is saved should be easy, right?</p>

<p>Nothing is ever easy.</p>

<h2 id="getting-help-from-stack-overflow">Getting help from Stack Overflow</h2>
<p>The problem was that while the Udacity project saved the graph, it only ‘named’ those variables necessary for the predictions. It did not ‘name’ those variables that are also necessary for continued training. This problem vexed me for days, so I finally went to <a href="https://stackoverflow.com/questions/46374113/indexerror-when-loading-saved-tensorflow-graph-to-continue-training">Stack Overflow</a> for assistance. While there were a number of generous individuals who tried to give me a hand, in the end I had to figure it out myself. For the benefit of others who might stumble across the same problem, I went ahead and answered my own question.</p>

<h2 id="setting-up-the-script-to-run-automatically">Setting up the script to run automatically</h2>
<p>While I was fiddling with the script I figured I would also get the thing to run ‘automatically’ from top to bottom. Part of the problem with Jupyter (or perhaps one of its benefits) is that you can manually execute the cells one at a time. When doing so it’s easy to jump over any cells that shouldn’t be run, for whatever reason.</p>

<p>Therefore, I arranged the code so that not only could it run from top to bottom automatically, but I could export the script to a .py file and run it from the command line. This will be handy for running in the background on EC2. Additionally, I added the ability to insert the command line switch “small” to indicate that the script should run on the small data. This turns out to be very handy for testing and debugging.</p>]]></content><author><name>Mark Cramer</name></author><category term="training" /><category term="saving" /><category term="stack overflow" /><summary type="html"><![CDATA[When training can last for days, and days, it's nice to be able to save the graph where it is an then pick it up again later.]]></summary></entry><entry><title type="html">Smaller batches and Dropout</title><link href="https://mdcramer.github.io/deep-speeling-blog/smaller-batches-and-dropout/" rel="alternate" type="text/html" title="Smaller batches and Dropout" /><published>2017-09-16T00:00:00-07:00</published><updated>2017-10-06T17:00:00-07:00</updated><id>https://mdcramer.github.io/deep-speeling-blog/smaller-batches-and-dropout</id><content type="html" xml:base="https://mdcramer.github.io/deep-speeling-blog/smaller-batches-and-dropout/"><![CDATA[<h2 id="batch-size-of-256-is-too-large">Batch size of 256 is too large</h2>

<p>I used a batch size of 256 during my first crack at training the RNN and, as can be seen below, it crashed before completing the first epoch. I’m not sure how long it took to get to that point, but it was probably only a couple hours in.</p>

<h2 id="adding-dropout">Adding Dropout</h2>

<p>Not only did dropping the batch size down to 128 improve training performance (the validation loss came down more quickly), but the RNN was able to successfully train for 6 hours on my local machine. That being said, it’s generally a good idea to use some form of <a href="https://en.wikipedia.org/wiki/Regularization_(mathematics)">Regularization</a> to prevent over-fitting. The Udacity project didn’t have any Regularization, so I added 30% <a href="https://en.wikipedia.org/wiki/Dropout_(neural_networks)">Dropout</a>. This had the happy effect of initially increasing the rate at which the validation loss dropped, but by the end of the epoch it was still about the same as without Dropout.</p>

<p>The spelling correction is also not working particularly well. As an example, “he had dated forI much of the past” becomes “Sou had teap for much to heads tap”. There’s obviously more work to be done.</p>
<figure>
<img src="/assets/images/batch256-and-dropout.png" alt="256 batch size and 30% Dropout" />
<figcaption>Validation loss for various hyperparameter configurations</figcaption>
</figure>]]></content><author><name>Mark Cramer</name></author><category term="batch_size" /><category term="dropout" /><summary type="html"><![CDATA[Decreasing the batch size improved performance and adding Dropout helped even more.]]></summary></entry></feed>