Rails Machine - Views and Actions http://featherplane.com/bcms_news/articles/feed en-us Passenger 4.0.1 Is Out! <p> <a href="http://blog.phusion.nl/2013/05/06/phusion-passenger-4-0-1-final-release/">Passenger 4.0.1 is out</a> and that means we need to add support to <a href="https://github.com/railsmachine/moonshine">Moonshine</a> for it! &nbsp;If you update Moonshine, you&#39;ll have everything you need to deploy Passenger 4.0.1 - I highly recommend it. &nbsp;Why? &nbsp;Out of band work (garbage collection)! &nbsp;The Phusion guys <a href="http://blog.phusion.nl/2013/01/22/phusion-passenger-4-technology-preview-out-of-band-work/">blogged about it during one of the betas</a>, but their instructions for setting it up don&#39;t work very well with the final release. &nbsp;I think this is my favorite new feature in Passenger and could end up being extremely useful for folks running larger apps that struggle with garbage collection. &nbsp;I got it working this morning on one of our internal apps and decided I would save you the trouble of figuring it out yourself. &nbsp;</p> <p> First, you need to add passenger to your Gemfile. &nbsp;I don&#39;t normally put it in mine since it lives outside of the Rails stack, but you&#39;ll need it for this:</p> <pre> <code>gem &#39;passenger&#39;, &#39;4.0.1&#39;</code></pre> <p> Now we need to add some cool stuff to your config/application.rb. Near the top, after you require rails, add this:</p> <pre> <code>if [&#39;staging&#39;,&#39;production&#39;].include?(Rails.env)
  require &#39;phusion_passenger/rack/out_of_band_gc&#39;
  require &#39;phusion_passenger/public_api&#39;
end</code></pre> <p> And inside the module where you have all your other configuration, this:</p> <pre> <code>if [&#39;staging&#39;,&#39;production&#39;].include?(Rails.env)
  config.middleware.use PhusionPassenger::Rack::OutOfBandGc, 5
  PhusionPassenger.on_event(:oob_work) do
    # Phusion Passenger has told us that we&#39;re ready to perform OOB work.
    t0 = Time.now
    GC.start
    Rails.logger.info &quot;Out-Of-Band GC finished in #{Time.now - t0} sec&quot;
  end
end</code></pre> <p> And one last chunk. In app/controllers/application_controller.rb, you need to add a before_filter to add the header telling Passenger to do its thing:</p> <pre> <code>before_filter :add_out_of_band_header

private

def add_out_of_band_header
  response.headers[&quot;X-Passenger-Request-OOB-Work&quot;] = &quot;true&quot;
end</code></pre> <p> Now, after you deploy that (to staging first, I hope), you should be able to tail the log and see the app running GC every 5 requests. If your app is super busy, you may want to change the setting to more requests, but 5 is probably a good place to start.</p> <p> I can foresee a lot of other uses for out of band work, almost anything that the user shouldn&#39;t have to wait for but that needs to be done while you have access to the request (assuming that&#39;s something out of band work can do - I haven&#39;t checked yet).</p> Mon, 06 May 2013 00:00:00 +0000 http://featherplane.com/articles/2013/05/06/passenger-4-0-1-is-out http://featherplane.com/articles/2013/05/06/passenger-4-0-1-is-out Watch That First Request, It's a Doozy <p> I have a love/hate relationship with the asset pipeline. &nbsp;Ok, it&#39;s mostly hate. &nbsp;I applaud what it tries to do, but it takes something relatively simple - javascript and CSS - and makes it painful, mostly by making deploys take <em>forever </em>(even with <a href="https://github.com/ndbroadbent/turbo-sprockets-rails3">turbo-sprockets-rails3</a>).</p> <p> For example, I was working on something yesterday and doing deploys to the staging environment. &nbsp;I noticed that it took forever to be able to start serving requests. &nbsp;Passenger-status showed all three instances were running, and were all stuck serving their first request. &nbsp;A little digging revealed that there were a bunch of node.js processes running compiling assets! &nbsp;What the?! 
&nbsp; I thought I was precompiling everything! &nbsp;Nope, apparently not.</p> <p> After a little digging, I found this setting:</p> <pre> <code>config.assets.compile = true</code></pre> <p> Set that to <code>false</code> and redeployed, and all of a sudden, I was seeing 500 responses for assets that weren&#39;t precompiled! &nbsp;I ended up spending about 30 minutes adding a bunch of new javascript and CSS files to the <code>config.assets.precompile</code> list and eventually the app started up fine and I got rid of the unpleasant &quot;first request takes forever&quot; symptom in Passenger.</p> <p> I think this is most likely going to bite you if you use a gem that provides its own assets and doesn&#39;t add them to the precompile list on its own (which it should).</p> <p> I hope that helps someone avoid some downtime / slow requests in the future, because it took me a while to figure out!</p> Fri, 08 Mar 2013 00:00:00 +0000 http://featherplane.com/articles/2013/03/08/watch-that-first-request-it-s-a-doozy http://featherplane.com/articles/2013/03/08/watch-that-first-request-it-s-a-doozy 30 Minutes in the Life of Troubleshooting Weird Rails Stuff <p> Managing hundreds of Rails apps gives us the opportunity to see all of the amazing ways that Ruby and Rails can misbehave. &nbsp;I had an experience so weird last night that I think it deserves its own blog post.</p> <p> An app that hadn&#39;t been deployed in two weeks all of a sudden returned a 500 error for the URL we use to monitor it. &nbsp;Looking at the error, Passenger couldn&#39;t find the Rails gem. &nbsp;This is a 2.3 app, and doesn&#39;t use Bundler yet, so... &nbsp;this was weird. &nbsp;I&#39;ve been working with Rails for almost 8 years now and have never seen this before in an app that hadn&#39;t been deployed to recently. 
&nbsp;I could understand if you changed the RAILS_GEM_VERSION line in environment.rb but hadn&#39;t actually installed that version of Rails, but that didn&#39;t happen in this case.</p> <p> What did I do? &nbsp;First, I tried the console to see if that could find the rails gem. &nbsp;It worked fine. &nbsp;</p> <p> Having proved to myself that the rails gem was installed and at least working for one part of the app, I concentrated on Passenger. &nbsp;I uninstalled all the old versions of the gem, then reinstalled the version the app was using, re-ran passenger-install-apache2-module and restarted apache. &nbsp;No luck.</p> <p> Then I decided to remove rails from the equation and ran rake rails:freeze:gems, which moves all the rails gems to vendor/gems. &nbsp;After doing that, I got the real error. &nbsp;Somehow, rack 1.5.2 had gotten installed, which apparently hates old Rails apps. &nbsp;I removed it, restarted apache (just for luck) and ta-da, the app was alive again and I could go back to sleep.</p> <p> Other than it happening at 3 this morning, this is why I love working at Rails Machine. &nbsp;If it can go wrong with Rails, we have a front row seat for it. &nbsp;Wait, no, that&#39;s wrong. &nbsp;We&#39;re on the stage, fixing it and making sure apps stay up and running. &nbsp;</p> <p style="text-align: center;"> <img alt="wow! bravo!" 
src="http://ha.lwvr.net/reactions/wow_bravo.gif" style="width: 500px; height: 282px;" /></p> Thu, 21 Feb 2013 00:00:00 +0000 http://featherplane.com/articles/2013/02/21/30-minutes-in-the-life-of-troubleshooting-weird-rails-stuff http://featherplane.com/articles/2013/02/21/30-minutes-in-the-life-of-troubleshooting-weird-rails-stuff Moonshine, Now 92.6% More Enterprisey <div class="article-body"> Last week, we added support for <a href="https://www.phusionpassenger.com/enterprise">Passenger Enterprise</a> to <a href="http://github.com/railsmachine/moonshine">Moonshine</a>, allowing anyone to easily deploy apps with Passenger Enterprise and making sure that it&rsquo;s compiled correctly, licensed and configured to give your app all the shiny new features (<span class="caps">OMG</span>, rolling restarts!!). <p> Upgrading is straightforward. After you purchase your license, download the passenger-enterprise-server gem from your dashboard and throw it in vendor/gems. Then, download your license and put it in app/manifests/templates. Now, the passenger section of your moonshine.yml should look something like this:</p> <pre> <code>:passenger:
  :enterprise: true
  :gemfile: vendor/gems/passenger-enterprise-server-3.0.18.gem
  :version: 3.0.17
  :rolling_restarts: true</code></pre> <p> &nbsp;</p> <p> We&rsquo;ve also added support for the new Enterprise-only features:</p> <ul> <li> :rolling_restarts: Defaults to false, but you should <em>really</em> set it to true.</li> <li> :max_request_time: Defaults to 0. Be careful with this one because there appears to be a bug that kills off passenger instances as they start if they take longer to spin up than the value of this setting. I&rsquo;d recommend setting it to something extreme like 120 for now (hopefully your app doesn&rsquo;t take 2 minutes to start up).</li> <li> :memory_limit: Defaults to 0. 
We haven&rsquo;t set this one in production yet, but I&rsquo;d set it to the absolute maximum you&rsquo;re comfortable with your app using (like 500M).</li> <li> :resist_deployment_errors: Defaults to false.</li> <li> :debug_log_file: This is a great feature. It moves all the apache errors related to startup or stopped connections to a separate log file. We&rsquo;ve been setting this to /srv/<span class="caps">APPNAME</span>/current/log/debug.log.</li> </ul> <p> All of these are on the shiny new <a href="https://github.com/railsmachine/moonshine/wiki/Default-Configuration">Default Configuration</a> page on the Moonshine wiki along with all the other settings you can tweak with Moonshine.</p> <p> We&rsquo;ve had several Rails Machine customers roll out Passenger Enterprise recently and we&rsquo;ve heard glowing reports from everyone, especially for the rolling restarts.</p> <p> Are you using Passenger Enterprise? What do you think?</p> </div> Thu, 06 Dec 2012 00:00:00 +0000 http://featherplane.com/articles/2012/12/06/moonshine-now-92-6-more-enterprisey http://featherplane.com/articles/2012/12/06/moonshine-now-92-6-more-enterprisey Heyo From the New Guy! <div class="article-body"> <p> Hi! My name&rsquo;s <a href="http://railsmachine.com/about/#ritbreisler">Rit Breisler</a>, and I&rsquo;m the newish guy at <a href="http://railsmachine.com">Rails Machine</a>. I&rsquo;ve got about six weeks under my belt, and the <em>tl;dr</em> of it is that this is a seriously remarkable shop to work in, and I couldn&rsquo;t have asked for a better bunch of folks to work with.</p> <p> The slightly longer take is that I came to Rails Machine because I was looking for particular qualities in a shop; I was seeking a place of great challenge where the people were real thinkers, and just as importantly, real doers. 
I was looking to be in a state of motion as much as possible, and advancing my art toward what I believe is the most important goal for <a href="https://twitter.com/search?q=%23opslife&amp;src=typd">Ops life</a>.</p> <p> I come from an Ops background, and (as you probably have gathered from my <a href="http://railsmachine.com/about/#ritbreisler">about</a> page on the site) I strongly believe that everyone who does Web Ops (be you Dev, be you Ops) must strive to bridge the gap between what used to be considered very separate disciplines. My role at Rails Machine is allowing me to pursue that goal, and at a rate which is frankly terrifying, but thrilling, all while being surrounded by people who are thoughtful, expert in their disciplines, and most importantly, give a damn. That last one might cause a double-take in some of you, but it&rsquo;s actually my number one desired quality in a place of employment, and something that I&rsquo;ve found to be quite rare. Rails Machine has it.</p> <p> Another brand new thing is that I&rsquo;m remote for the first time (I call Boston home, and mostly work from there; Rails Machine HQ is in Savannah). If you are considering taking on that challenge as well, I&rsquo;ve learned a thing or two so far which I&rsquo;ll share:</p> <ul> <li> <a href="http://www.gnu.org/software/screen/"><span class="caps">GNU</span> Screen</a> is an extremely powerful tool for managing shells on remote hosts, supporting on-the-go ops, but it also happens to be an outstanding remote pairing tool. Learn all the screen-fu you can deal with.</li> <li> Use <a href="http://campfirenow.com/">Campfire</a> for team chat. I cannot tell people enough how vastly superior group chat is to what I refer to as &ldquo;silo chat&rdquo; (individuals all having essentially invisible ad-hoc conversations from which no one else on the team can benefit).</li> <li> Engage in ChatOps. 
Get a bot going for your Campfire, like <a href="http://hubot.github.com/"><span class="caps">HUBOT</span></a> (see our <span class="caps">CTO</span> Josh <a href="http://confreaks.com/videos/1262-rockymtnruby2012-chatops-using-chat-as-a-command-line-for-your-company">talk about it</a> , or Github&rsquo;s Jesse Newland <a href="http://www.youtube.com/watch?feature=player_embedded&amp;v=DH2twW0dmrM">as well</a> , for some persuasive arguments).</li> <li> Escalate to video chat when you need it, because greater than 80 percent of communication is nonverbal. No, I&rsquo;m serious. Text is great for pretty much everything, but sometimes you really just need to be face to face. FaceTime, Skype, Google Hangouts, whatever you prefer, but do these (very yes).</li> </ul> <p> Keeping sharp and staying able to ingest all of my new various -Fu isn&rsquo;t easy, but I&rsquo;ve been keeping up with the usual suspects (I&rsquo;ve been reading <a href="http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/ref=sr_1_1?ie=UTF8&amp;qid=1351869407&amp;sr=8-1&amp;keywords=web+operations">Web Operations</a> again), practicing, and of course trying to keep balance with all of the normal things (sleep plenty, eat your green vegetables &ndash; y&rsquo;know, stuff we learn from Spider-Man). Still, I&rsquo;ve learned more in six weeks than I have in the past year. It&rsquo;s absurd. And it&rsquo;s great &ndash; exactly what I was seeking.</p> <p> Every day I get to work on exciting software and hardware, learn all kinds of rad things, establish relationships with and deliver value for our customers, and I feel privileged to be here. 
I was seeking challenge, caring, and geeky camaraderie; I&rsquo;ve found that in Rails Machine, and I&rsquo;m delighted to be on board!</p> </div> Fri, 16 Nov 2012 00:00:00 +0000 http://featherplane.com/articles/2012/11/16/heyo-from-the-new-guy http://featherplane.com/articles/2012/11/16/heyo-from-the-new-guy A Web Operations View of Election Day <div class="article-body"> <p> <img alt="" src="http://cdn.railsmachine.com/images/Screen_Shot_2012_11_15_at_3_13_38_PM.png" /></p> <p> Yes, I know the election was last week, but it&rsquo;s taken us that long to recover from the <em>sheer awesomeness</em> of how well it went for our customers. No, not the outcome of the election, but their availability and performance throughout.</p> <p> We have two customers we worked closely with to make sure they were ready for the election rush: <a href="http://mobilecommons.com">Mobile Commons</a> and <a href="http://nationbuilder.com">NationBuilder</a>. Each one will probably end up being their own blog post, but this time, I wanted to talk about how we worked with NationBuilder to make sure things went smoothly last week.</p> <p> For NationBuilder, last week was about both making sure the main product was ready for everything their customers would throw at it in the days leading up to Election Day, but also planning, provisioning and deploying everything needed to launch their <a href="http://elections.nationbuilder.com">Election Center</a> product &ndash; a first-of-its-kind national voter registration database open to everyone.</p> <p> We&rsquo;re really proud to be involved in our little way with both products, and I want to break down a little of what I think went into making last week so successful (as in, no one got woken up in the middle of the night and things were actually faster than the week before).</p> <p> I think all of these are applicable to most relationships between Development and Operations teams &ndash; it just happens that in this case, the operations team is 
a separate company:</p> <ol> <li> <strong>Teamwork &amp; Emphasis on Performance</strong>: We spent the summer and early part of the fall working with NationBuilder to add capacity, <em>and</em> on making sure their performance was everything it could be. We spent a lot of time looking at Scout and NewRelic to see what could be improved and then worked together to make it happen. By election day, by working together to attack hot spots, we were able to <strong>cut NationBuilder&rsquo;s average response time by 33%</strong>.</li> <li> <strong>Capacity Planning</strong>: With both customers, they were very clear about their goals, and we were able to measure our progress against those goals. We made sure they had the right pieces in place to support the traffic they expected &ndash; and the traffic no one could predict.</li> <li> <strong>Frequent Checkpoints</strong>: We met weekly with NationBuilder to make sure we were all on track with not only deployment tasks, but performance and feature development. The NationBuilder team did <em>an amazing job</em> fixing performance issues when we pointed them out &ndash; to the point that when I pulled the transaction report from NewRelic, almost all the requests were new to the report.</li> <li> <strong>Fix the Problem</strong>: The NationBuilder folks are great under duress. It doesn&rsquo;t happen often, but when things go wrong, we work together to fix it in the moment, and save the root cause analysis for after.</li> <li> <strong>Move Quickly but Deliberately</strong>: NationBuilder launched an entirely new product right before the election, <a href="https://elections.nationbuilder.com/">The NationBuilder Election Center</a>, and wanted it to be able to handle a <em>lot</em> of load right away. This started as a prototype and ended up being a fairly complex app with a lot of dependencies and a fairly large hardware footprint. 
By relying on our existing good communication, we were able to scale it up on all new hardware and launch with a lot of press attention and use.</li> </ol> <p> And it warms our nerdy little hearts when we get feedback like this from NationBuilder&rsquo;s <span class="caps">CEO</span>:</p> <blockquote> I just wanted to say thanks to everyone at Rails Machine for your efforts, particularly the last couple of weeks as we&rsquo;ve barreled toward election night. The last 5 days were so immensely critical to us. We are working really hard to establish trust in our company, and political tech is notorious for crashing on election day. We didn&rsquo;t even have a blip, so we earned a lot of cred. You helped make that happen in a big way. Thank you.</blockquote> <p> Like I said, we&rsquo;re really proud to have played a small part in the democratic process and in helping NationBuilder and Mobile Commons meet their goals for performance and availability. Now it&rsquo;s time to get ready for the holiday season!</p> </div> <p> &nbsp;</p> Thu, 15 Nov 2012 00:00:00 +0000 http://featherplane.com/articles/2012/11/15/a-web-operations-view-of-election-day http://featherplane.com/articles/2012/11/15/a-web-operations-view-of-election-day Red(is) Alert! <div class="article-body"> <p> We have a <em>lot</em> of customers who use <a href="http://redis.io">Redis</a> as a part of their infrastructure. Most of them start off using it with <a href="https://github.com/defunkt/resque">Resque</a> to process background jobs. Then, they figure they&rsquo;ve got keys and values they want to keep somewhere, and well, Redis is sitting there looking all friendly and you&rsquo;ve already got it configured for Resque, so why not use it for storing things?</p> <p> <img alt="" src="http://ha.lwvr.net/castle_wat.gif" /></p> <p> And that&rsquo;s when things get interesting. You see, Redis keeps <em>everything</em> in memory and saves to disk periodically based on configuration. 
Unfortunately, Redis kicks off a new process during a save, which uses up as much memory as the main process. So, in essence, <em>you always have to use less than 50% of the available memory on your server</em> when using Redis.</p> <p> Why? Well, as soon as your Redis process grows large enough to hit the magical 50% memory limit, <em>it can no longer save</em>. When this happens, the only thing you can do is foreground save, which leaves Redis unavailable until the save is finished (which on large installations can be more than a minute) or you delete keys until things can save again.</p> <p> <img alt="" src="http://ha.lwvr.net/inc.jpg" /></p> <p> The neckbeards in the audience may be feeling superior at this point, thinking to themselves, &ldquo;It&rsquo;s your own fault. You didn&rsquo;t allocate memory correctly in the first place&rdquo; or &ldquo;that&rsquo;s what monitoring is for.&rdquo; And you&rsquo;d be partially correct. The problem is that web apps <em>grow</em> over time, and sometimes get popular very quickly. That means you may end up with more jobs than you have workers to promptly handle &ndash; which means the queue grows &ndash; which means Redis grows &ndash; which means someone&rsquo;s waking up in the middle of the night to babysit a sick Redis that can&rsquo;t save because it ate too much. 
Or, true story, someone (let&rsquo;s say Resque) decides to log stack traces to Redis and you somehow trigger the <em>world&rsquo;s biggest stack trace</em> over and over again, causing Redis to explode, leaving little bits of stack traces dripping off the ceiling.</p> <p> There are things you can do to avoid these disasters, and most of them are not hard (although they may be expensive), and some things to watch out for when monitoring Redis to make sure you&rsquo;re looking at the right things (because, on top of a tendency to be a naughty little imp when it gets full, it also <em>lies</em>).</p> <h3> Over-Provision Everything</h3> <p> The argument from Redis fans is that if you only configured your servers correctly to begin with, then you wouldn&rsquo;t have any problems with it. And in exactly one way, they&rsquo;re right. So go ahead and provision your Redis server like your future in-laws told you to buy diamonds and houses &ndash; always get the one twice as big as you think you&rsquo;ll need because it will eventually feel small.</p> <p> Get more memory than you think you&rsquo;ll need, more disk, and have more workers available than you think you&rsquo;ll need. Just do it.</p> <h3> Always Have a Backup Plan</h3> <p> One way to get around foreground saving on the primary server is to have a standby that you can foreground save on without affecting workers or users. This allows you to babysit a sad Redis without making any other part of your infrastructure sad. Yes, you&rsquo;re spending for another server that&rsquo;s the <em>same size as the primary</em> (that part is really important &ndash; it needs to be able to take over in case the primary goes away), but your peace of mind is worth it, right?</p> <p> Also, if you need to restart Redis for any reason, and you have a lot of data, restarting can take <em>a long time</em>. 
So, you break the slave relationship, have your app talk to the slave Redis, restart the master and then make it a slave of the old slave. You can then reverse that process if you need to restart the former slave.</p> <p> The other backup plan you need to have is to know what you can delete without causing irreparable harm to your app. Know how to get those keys and delete them (like Resque stack traces). When you hit that 50% memory mark, this should be the first thing you do to try to get back to your happy place.</p> <h3> Monitoring the Right Things at the Right Time</h3> <p> The <strong><span class="caps">INFO</span></strong> command returns <em>a lot</em> of information, and a lot of it is actually useful. Some of it, though, is wrong. For example, never trust the <strong>used_memory_human</strong> entry, because I&rsquo;ve seen it be off by more than 500 MB. Always look at <strong>used_memory</strong> for monitoring and do the conversion yourself if you need to.</p> <p> You need to set up alerts for yourself (using <a href="http://scoutapp.com">Scout</a> maybe) that trigger <em>before</em> Redis gets to 50% of available memory, because once Redis passes that mark, things are going to be very sad.</p> <p> You should also monitor <strong>last_save_time</strong> and alert if it hasn&rsquo;t saved in a reasonable amount of time, where <em>reasonable</em> is defined by you. Another option to get this info is to monitor the Redis log file for failed saves (we use the <a href="https://scoutapp.com/plugin_urls/341-log-watcher">Log Watcher</a> plugin in Scout to watch for &ldquo;Can&rsquo;t save in background&rdquo;).</p> <p> If you&rsquo;re using Redis for Resque (and you probably are), you need to closely watch the size of your queues, know your average velocity on completing jobs, and do some math to figure out how efficient your workers are. Why? 
Because you need to know when your workers will no longer be able to keep up with the pace of incoming jobs with enough time to spin up more workers if you need them. You don&rsquo;t want to get into a state where you have queues on fire (putting Redis into a bad state) and not enough workers to put out that fire.</p> <h3> Look Into Append-Only File</h3> <p> Redis has an alternate <a href="http://redis.io/topics/persistence">persistence strategy</a>, called <span class="caps">AOF</span> that eases some of the pain of forking the process to save a snapshot, but not all of it, since the folks at Redis suggest using both. If you&rsquo;d like to read more on Redis persistence, <a href="http://oldblog.antirez.com/post/redis-persistence-demystified.html">antirez wrote an epic blog post about it</a>.</p> <h3> Separate Functions</h3> <p> Don&rsquo;t run your workers on your Redis server. Workers use up <span class="caps">RAM</span> that Redis needs to be able to save things. Workers should have their own servers, just like your app server is a separate thing from your Redis server (please tell me it&rsquo;s separate).</p> <p> If you&rsquo;re happily using Redis for Resque, and then want to use Redis for some other nefarious purpose, <em>get yourself another Redis server</em>. Don&rsquo;t combine them. The two uses are completely different and grow differently.</p> <h3> Have an Expiration Strategy</h3> <p> Redis is in a weird place between &ldquo;hot&rdquo; caching like memcached where things fall out of the bottom of the cache when memory is needed and disk-based data stores where you don&rsquo;t need to store everything in memory. So, the storage strategy in Redis is a little hard to get your head around. You have it because it&rsquo;s persistent, which is good. But, you also have to have enough memory to keep <em>everything in memory all the time</em>. 
So, you can&rsquo;t keep <em>everything</em> in it <em>forever</em>, because it&rsquo;s also not horizontally scalable (yet).</p> <p> You need to come up with an expiration plan: how you&rsquo;re going to expire things out of Redis and either forget them forever or move them somewhere less volatile and less &ldquo;expensive&rdquo; (disk is cheaper than <span class="caps">RAM</span>). You could do it by activity, by time, whatever, just have a plan and have it built into the code so it can be run if you need to free up memory to make Redis happy again.</p> <p> Because Redis will get sad. It just will.</p> <p> <img alt="" src="http://ha.lwvr.net/sad_busey.gif" /></p> <h3> In Conclusion</h3> <p> Redis is a great in-memory database. It&rsquo;s extremely flexible, developer-friendly and easy to get started with. The key with <em>any</em> piece of software is to use it correctly, know what it looks like when it&rsquo;s about to fail, and know how to keep it well-fed and happy. I could have written (and probably will write) this blog post about any of the software our customers use, because we&rsquo;ve had problems with <em>all</em> of them. I hope this post saves you some time and makes you a happier and healthier Redis user!</p> </div> Mon, 22 Oct 2012 00:00:00 +0000 http://featherplane.com/articles/2012/10/22/red-is-alert http://featherplane.com/articles/2012/10/22/red-is-alert New Relic + Rails Machine partnership brings free monitoring to customers <div class="article-body"> <div class="nr-pr-img"> <p> <img class="nr-pr-img-nr" src="http://cdn.railsmachine.com/images/newreliclogobugrgbhex.png" style="margin-right: 30px;" /></p> </div> <p> It&rsquo;s with great excitement that we get to announce our partnership with New Relic! This will allow us to provide New Relic Standard for free, forever, as long as you&rsquo;re a Rails Machine customer. We want your applications to kick ass and run well. 
New Relic allows us to peer inside the applications we manage and make sure they&rsquo;re doing just that.</p> <p> &ldquo;New Relic, Inc. is the all-in-one web application performance management provider for the cloud and the datacenter. Its SaaS solution combines real user monitoring, server monitoring, application monitoring, and availability monitoring in a single solution built from the ground up. It changes the way organizations manage web application performance in real-time, enabling developers and operations teams to quickly and cost effectively monitor, troubleshoot, and tune application performance.&rdquo;</p> <p> And the best part is that it&rsquo;s free for all customers! To get started, head on over to <a href="http://newrelic.com/railsmachine">http://newrelic.com/railsmachine</a> and claim your free New Relic Standard account today.</p> <p> Official Press Release: <a href="http://www.marketwatch.com/story/rails-machine-adds-free-application-performance-monitoring-from-new-relic-to-its-application-management-platform-for-rails-2012-10-18" target="_blank">New Relic Press Release</a></p> <p> <a class="signupnow-nr" href="http://newrelic.com/railsmachine" style="margin-bottom: 20px;">Sign Up Now</a></p> </div> <p> &nbsp;</p> Thu, 18 Oct 2012 00:00:00 +0000 http://featherplane.com/articles/2012/10/18/new-relic-rails-machine-partnership-brings-free-monitoring-to-customers http://featherplane.com/articles/2012/10/18/new-relic-rails-machine-partnership-brings-free-monitoring-to-customers Adobe's Source Code Pro <div class="article-body"> <p> Source Code Pro, A new monospaced font to help cure those tired eyes.</p> <p> Yesterday, Paul D. Hunt, announced that Adobe was releasing the open sourced monospace font, Source Code Pro. (Adobe announced a lot of <a href="http://blog.typekit.com/2012/09/24/introducing-adobe-edge-web-fonts/">new, exciting things</a> yesterday.) 
And today, the entire Rails Machine team updated their text editor preferences to take full advantage of the pretty.</p> <p> Source Code Pro seeks to remove the &ldquo;monotonous rhythm&rdquo; that seems to pervade many monospaced fonts by relying on the clarity of Adobe&rsquo;s <a href="http://blogs.adobe.com/typblography/2012/08/source-sans-pro.html">Source Sans Pro</a>. It ships as a family with six different weights (we used regular in our text editors) and really makes a difference when used in Terminal. Read about the <a href="http://blogs.adobe.com/typblography/2012/09/source-code-pro.html">insight</a> that went into Adobe&rsquo;s first open source font family and then head on over to <a href="http://sourceforge.net/projects/sourcecodepro.adobe/">Source Forge</a> to get started. I put it to use on the Rails Machine site via <a href="http://www.google.com/webfonts/specimen/Source+Code+Pro">Google Web Fonts</a> earlier this morning and it&rsquo;s looking great:</p> <p> &nbsp;</p> <pre> <code>Source Code Pro in the wild.</code></pre> <p> If you&rsquo;ve never installed a font before, I&rsquo;ll step you through the easy process using <a href="http://support.apple.com/kb/HT2509">Font Book</a> on a Mac:</p> <ol> <li> <a href="http://sourceforge.net/projects/sourcecodepro.adobe/">Download Source Code Pro via Source Forge</a></li> <li> Open Font Book</li> <li> Hit &lsquo;command + o&rsquo; to locate the font</li> <li> Find the Source Code Pro folder in your downloads folder (you may need to unzip the folder first)</li> <li> &lsquo;command click&rsquo; all the <span class="caps">OTF</span> font files and click &lsquo;Open&rsquo;</li> <li> Update the fonts in your favorite editor</li> </ol> <p> Enjoy and thanks to Paul and Adobe for their hard work! 
You can find Paul on Twitter via <a href="https://twitter.com/pauldhunt">@pauldhunt</a></p> </div> Tue, 25 Sep 2012 00:00:00 +0000 http://featherplane.com/articles/2012/09/25/adobe-s-source-code-pro http://featherplane.com/articles/2012/09/25/adobe-s-source-code-pro MongoDB Manipulation, Mastery and Monkey Business <div class="article-body"> <p> I&rsquo;m a big fan of <a href="http://mongodb.org">MongoDB</a>. I used it for a product at my last company and found it to be easy to manage and deploy and fun to develop with. We have a couple of customers here at Rails Machine that are also big fans of MongoDB and we helped one of them upgrade their replica set today from two nodes with an arbiter to three <em>gigantic</em> servers. You&rsquo;d think that completely replacing the hardware that runs your database would be painful&hellip; but with Mongo, it&rsquo;s really not. Here&rsquo;s how we did it.</p> <p> Since these were new servers, and our customer has a staging environment, we did a <em>lot</em> of testing of the new servers by adding them to the staging replica set. It was really easy to move things around using the <a href="http://github.com/railsmachine/moonshine_mongodb">Moonshine MongoDB plugin</a>. Some of the tests we ran:</p> <ul> <li> How long does it take for a new node to become a fully functioning secondary? <ul> <li> Depends on the amount of data, but not as long as you&rsquo;d expect.</li> </ul> </li> <li> What happens when a secondary unexpectedly drops out? <ul> <li> Not much. If there was an election, it was so quick we didn&rsquo;t notice.</li> </ul> </li> <li> What happens when the primary unexpectedly drops out? <ul> <li> There&rsquo;s an election and things &ldquo;flap&rdquo; for a few seconds. We did this a few times, and elections took as little as 2 seconds and as long as 10.</li> </ul> </li> <li> Does it take longer to sync two new secondaries than one? 
<ul> <li> As long as there are as many secondaries as there are new nodes, then no.</li> </ul> </li> </ul> <p> After we were happy that the servers were ready, we removed them from the staging replica set and deleted their data. With Moonshine, removing them from staging and adding them to the production deployment was just moving a few lines of configuration around.</p> <p> I&rsquo;ve changed the names of the servers and IP addresses for this example, so let&rsquo;s pretend that things look like this:</p> <ul> <li> Existing replica set: <ul> <li> arbiter: 10.0.0.1</li> <li> donald: 10.0.0.2</li> <li> daisy: 10.0.0.3 &ndash; the current primary (pretty much forced to be primary because we set its priority to 2, which will come in later)</li> </ul> </li> <li> New &lsquo;super&rsquo; mongo servers: <ul> <li> huey: 10.0.0.4</li> <li> dewey: 10.0.0.5</li> <li> louie: 10.0.0.6</li> </ul> </li> </ul> <p> We did this over two days to make sure the new servers were &ldquo;happy&rdquo; and ready to take over all the traffic, but here&rsquo;s what we did:</p> <ol> <li> We deployed to all the servers to make sure the correct iptables rules were in place so that everything that needed to talk to the new Mongos could, and also to add the three new nodes to the app config (they should be in the config <em>after</em> the two current nodes).</li> <li> Confirmed that the new servers could connect to, and be connected to by, the old ones (you&rsquo;ll get an error if things are wrong; otherwise you&rsquo;ll connect and can run queries): <ul> <li> from huey: <ul> <li> <code>mongo 10.0.0.2</code></li> <li> <code>mongo 10.0.0.3</code></li> </ul> </li> <li> from daisy: <ul> <li> <code>mongo 10.0.0.4</code></li> <li> <code>mongo 10.0.0.5</code></li> <li> <code>mongo 10.0.0.6</code></li> </ul> </li> </ul> </li> <li> Now you need to connect to the current primary (daisy for this story) and reconfigure things. 
We always wanted there to be a quorum of &ldquo;up&rdquo; nodes in the replica set to keep things from going south (as in, not enough functioning nodes to elect a primary, which I&rsquo;ve seen before and is unpleasant), so we first need to remove the arbiter and add one of the new nodes: <ol> <li> <code>config = rs.conf()</code></li> <li> <code>huey = {_id:4,host:&#39;10.0.0.4:27017&#39;,hidden:true,priority:0}</code></li> <li> <code>config.members.push(huey)</code></li> <li> <code>config.members.splice(INDEX,1)</code> where <span class="caps">INDEX</span> is the index of the arbiter in the members array.</li> <li> <code>config.version++</code></li> <li> Now, before we commit this, we need to look at the config variable and make sure things make sense: <ul> <li> Are all the nodes you want to be in the list of members?</li> <li> Do they all have their host field set to &ldquo;IP:PORT&rdquo;?</li> <li> Are the new nodes set to hidden?</li> <li> Do the new nodes all have unique _id fields?</li> <li> Do the new nodes all have priority set to 0?</li> </ul> </li> <li> <code>rs.reconfig(config)</code></li> </ol> </li> <li> Now you get to obsessively run <code>rs.status()</code> over and over again until huey is all synced up and a full-fledged member of the set (which will happen when it&rsquo;s no longer <span class="caps">RECOVERING</span> and says <span class="caps">SECONDARY</span>). You may see a few things that look like error messages while this is happening: <ul> <li> &ldquo;errmsg&rdquo; : &ldquo;initial sync need a member to be primary or secondary to do our initial sync&rdquo; &ndash; This almost always means the election is taking place. Wait a few seconds and run <code>rs.status()</code> again.</li> <li> &ldquo;errmsg&rdquo; : &ldquo;initial sync cloning db: <span class="caps">DBNAME</span>&rdquo; &ndash; This is good! 
That means the sync is happening.</li> <li> &ldquo;errmsg&rdquo; : &ldquo;syncThread: 10278 dbclient error communicating with server: 10.0.0.5:27017&rdquo; &ndash; I saw this one a couple times right after I triggered an election. I think this is the normal &ldquo;I&rsquo;ve just triggered an election and am switching connections&rdquo; message.</li> <li> If you see any other errors, google them, because I didn&rsquo;t see them.</li> </ul> </li> <li> Once the new node is a <span class="caps">SECONDARY</span>, you can add the other two (because there are now three healthy nodes, adding two won&rsquo;t cause an imbalance): <ol> <li> <code>config = rs.conf()</code></li> <li> <code>dewey = {_id:5,host:&#39;10.0.0.5:27017&#39;,hidden:true,priority:0}</code></li> <li> <code>louie = {_id:6,host:&#39;10.0.0.6:27017&#39;,hidden:true,priority:0}</code></li> <li> <code>config.members.push(dewey)</code></li> <li> <code>config.members.push(louie)</code></li> <li> <code>config.version++</code></li> <li> Before we commit this, go through the checklist we went through the first time we did this and make sure we don&rsquo;t have any typos or other mistakes. If things are cool:</li> <li> <code>rs.reconfig(config)</code></li> <li> Again, obsessively run <code>rs.status()</code> until everything&rsquo;s happy.</li> </ol> </li> <li> Once the new nodes are all listed as <span class="caps">SECONDARY</span> in rs.status(), you&rsquo;ve successfully added the new nodes to the replica set. This is where we stopped on day one so we could watch things to make sure everything was fine with the new nodes. 
But, once the new nodes are secondaries, you can continue with dropping the old nodes: <ol> <li> We created a pull request that had the updated app config that removes the two old nodes.</li> <li> We also stopped all the resque workers at this point to make sure we didn&rsquo;t cause any jobs to fail during the election.</li> <li> Open up the mongo console on the current primary and get rolling!</li> <li> <code>config = rs.conf()</code></li> <li> For each of the hidden nodes in the members list: <ul> <li> <code>config.members[x].hidden = false</code></li> <li> <code>config.members[x].priority = 1</code></li> </ul> </li> <li> Now we need to give one of the new nodes a priority <em>higher</em> than the current primary&rsquo;s, so: <ul> <li> <code>config.members[3].priority = 3</code></li> </ul> </li> <li> <code>config.version++</code></li> <li> Look at the config variable again and make sure it&rsquo;s got all the right stuff in it and then:</li> <li> <code>rs.reconfig(config)</code></li> <li> Do the rs.status() dance until the new primary is elected. There may be a considerable amount of &ldquo;flapping&rdquo; during the election. I&rsquo;ve seen it take as little as 2 seconds and as long as 10 for a new primary to be elected. Just keep checking <code>rs.status()</code> until things calm down.</li> </ol> </li> <li> After the new primary is elected, we merged the pull request that removes the old nodes from the app config and deployed.</li> <li> Once the deploy was done and the app was up and running talking to the new primary, it was time to remove the old nodes from the replica set. <ol> <li> Connect to the <em>new</em> primary&rsquo;s mongo console.</li> <li> <code>config = rs.conf()</code></li> <li> And now we need to splice out the old nodes. 
For each of the old nodes: <ul> <li> <code>config.members.splice(INDEX,1)</code> (where <span class="caps">INDEX</span> is the index of the old node we&rsquo;re removing)</li> </ul> </li> <li> <code>config.version++</code></li> <li> Go through the checklist again, and this time, make sure the old nodes are no longer in the members array.</li> <li> <code>rs.reconfig(config)</code></li> <li> There might be some more flapping here as it disconnects the old nodes. We definitely saw a few seconds of &ldquo;weirdness&rdquo; when we did it.</li> </ol> </li> <li> That&rsquo;s pretty much it!</li> </ol> <p> Overall, the entire process went <em>very</em> smoothly. The only issue we had was that when we removed the old nodes, the apps lost their connection to MongoDB entirely and refused to connect. An Apache restart fixed that issue. We think it was a &ldquo;failed&rdquo; restart during the deploy that didn&rsquo;t restart all of the passenger instances. That was the only real downtime during the entire migration and it lasted for only a couple of minutes while we restarted Apache.</p> <p> Having done this process a few times now, and having done this with other database systems in the past, I&rsquo;m really impressed with how easy it is to manipulate replica sets with MongoDB. 
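</p> <p> Every reconfiguration in the walkthrough follows the same pattern: take the config, change the members array, bump the version, run the checklist, and only then hand the result to <code>rs.reconfig()</code>. As a rough sketch of that pattern (not the exact commands we ran), the helpers below work on a plain object shaped like the output of <code>rs.conf()</code>. The function names, the <code>rs0</code> set name, the starting version number, and the choice to remove a member by host instead of by array index are all assumptions made for illustration.</p>

```javascript
// Sketch only: these helpers never talk to MongoDB. In the mongo shell,
// `current` would come from rs.conf() and `next` would go to rs.reconfig().

// Return a copy of `config` with a hidden, priority-0 member appended.
function addHiddenMember(config, id, host) {
  const member = { _id: id, host: host, hidden: true, priority: 0 };
  return Object.assign({}, config, { members: config.members.concat([member]) });
}

// Return a copy of `config` without the member whose host matches.
// (The post uses splice(INDEX, 1); matching on host is a swapped-in variant.)
function removeMember(config, host) {
  const members = config.members.filter((m) => m.host !== host);
  return Object.assign({}, config, { members: members });
}

// Return a copy of `config` with the version incremented once.
function bumpVersion(config) {
  return Object.assign({}, config, { version: config.version + 1 });
}

// The pre-reconfig checklist as code. Returns a list of problems found.
function checkNewMembers(config, newHosts) {
  const problems = [];
  const ids = config.members.map((m) => m._id);
  if (new Set(ids).size !== ids.length) problems.push("duplicate _id");
  for (const host of newHosts) {
    const member = config.members.find((m) => m.host === host);
    if (!member) {
      problems.push("missing member: " + host);
      continue;
    }
    if (!/^[^:\s]+:\d+$/.test(member.host)) problems.push("host is not IP:PORT: " + host);
    if (member.hidden !== true) problems.push("not hidden: " + host);
    if (member.priority !== 0) problems.push("priority is not 0: " + host);
  }
  return problems;
}

// Hypothetical starting point, using the pretend addresses from the post.
const current = {
  _id: "rs0",
  version: 7,
  members: [
    { _id: 0, host: "10.0.0.1:27017", arbiterOnly: true },
    { _id: 1, host: "10.0.0.2:27017" },
    { _id: 2, host: "10.0.0.3:27017", priority: 2 },
  ],
};

// Day one, first reconfig: add huey, drop the arbiter, bump the version.
const next = bumpVersion(
  removeMember(addHiddenMember(current, 4, "10.0.0.4:27017"), "10.0.0.1:27017")
);

// Prints [] because the new member passes every check.
console.log(checkNewMembers(next, ["10.0.0.4:27017"]));
```

<p> Removing by host sidesteps having to work out which index the arbiter sits at, and because each helper returns a copy, the original config stays untouched until you decide the new one looks right.</p> <p>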
It&rsquo;s a lot easier than I originally thought it would be and while running three instances for a replica set is more expensive than the regular master/slave setup you see with traditional databases, it makes a lot of sense and works really well in the &ldquo;real world&rdquo;.</p> <p> I&rsquo;d love to hear how other folks have done this kind of thing with MongoDB!</p> <p> And as a congratulations for getting to the bottom of this post, here&rsquo;s a photo of a corgi:</p> <p> <a href="http://www.flickr.com/photos/dainec/59778701/" title="Corgi in the leaves by Aine D, on Flickr"><img alt="Corgi in the leaves" height="207" src="http://farm1.staticflickr.com/27/59778701_f8aa71a20d_m.jpg" width="240" /></a></p> </div> Tue, 28 Aug 2012 00:00:00 +0000 http://featherplane.com/articles/2012/08/28/mongodb-manipulation-mastery-and-monkey-business http://featherplane.com/articles/2012/08/28/mongodb-manipulation-mastery-and-monkey-business A New Machine <div class="article-body"> <p> It&rsquo;s that time of year again. Kids are being shuffled back to school after what was hopefully a long and eventful Summer break. Parents are adjusting their morning routines by packing pencils, paper, and lunches &mdash; or is it iPads and styluses &mdash; into backpacks to begin hitting the road to do battle with the other sleep deprived parents, or maybe they&rsquo;re exploring plans for a new Fall season of experiments, internal apps, traveling, rebranding, redesigning, reducing, and ping pong tournaments? Because that&rsquo;s how Rails Machine plans to spend the rest of this year, by improving all the things.</p> <p> And when I say all the things, I mean all the things. We&rsquo;ve looked at everything from how we can become more efficient and productive in our efforts to complete customer requests to how we can increase our output and support for our open source deployment tool, Moonshine. 
We&rsquo;ve looked at how we can improve the capacity and speed of our servers while reducing our energy consumption. We&rsquo;re going to be including more automation in our daily tasks and we&rsquo;re even sending members of our team out on the road to talk about some of the great things we have in the works.</p> <p> The best part: our customers are going to benefit the most of all from our internal tinkering. We can&rsquo;t wait to roll up our sleeves and roll out the new machine.</p> </div> Tue, 14 Aug 2012 00:00:00 +0000 http://featherplane.com/articles/2012/08/14/a-new-machine http://featherplane.com/articles/2012/08/14/a-new-machine ChatOps - Using New Relic in Campfire <div class="article-body"> <p> A while back, I wrote about <a href="http://railsmachine.com/articles/2012/05/23/building-a-bot-with-hubot/">how to build a Campfire bot with hubot</a>. I&rsquo;ve been doing a lot of work on our very own hubot since then (Claptrap, by name), and it&rsquo;s been a pretty rad experience.</p> <p> You&rsquo;d think that hubot is all fun &amp; lulz if you take a look at the <a href="http://hubot-script-catalog.herokuapp.com/">Hubot Script Catalog</a>, but there are definitely some &lsquo;productive&rsquo; hubot-scripts in the mix. Today, we&rsquo;re going to check out one in particular that is totally <a href="https://twitter.com/#!/search/%23opslife">#opslife</a>.</p> <h3> Meet newrelic.coffee</h3> <p> We are huge fans of <a href="http://newrelic.com/">New Relic</a> in general, but it&rsquo;s even better when you can have easy access to it from Campfire. Check it:</p> <p> <img alt="" src="http://cdn.railsmachine.com/images/abe_newrelic_me1.jpg" /></p> <p> There are just a few steps to get going. 
First, we head to New Relic to collect some information:</p> <ul> <li> Log in at <a href="https://rpm.newrelic.com">https://rpm.newrelic.com</a></li> <li> You&rsquo;ll be redirected to a <span class="caps">URL</span> like https://rpm.newrelic.com/accounts/<span class="caps">XXXXXX</span>/applications <ul> <li> Make note of <span class="caps">XXXXXX</span>, we&rsquo;ll be using it for <code>HUBOT_NEWRELIC_ACCOUNT_ID</code></li> </ul> </li> <li> Click through to one of your applications, and you&rsquo;ll be at a <span class="caps">URL</span> like https://rpm.newrelic.com/accounts/<span class="caps">XXXXXX</span>/applications/<span class="caps">YYYYYY</span> <ul> <li> Make note of <span class="caps">YYYYYY</span>, we&rsquo;ll be using it for <code>HUBOT_NEWRELIC_APP_ID</code></li> </ul> </li> <li> From the topbar, click on your account name, then Account Settings <ul> <li> Select the &lsquo;Data sharing&rsquo; tab</li> <li> &lsquo;Enable <span class="caps">API</span> access&rsquo; if it&rsquo;s not enabled</li> <li> Make note of the <span class="caps">API</span> Key: 12345, we&rsquo;ll be using it for <code>HUBOT_NEWRELIC_API_KEY</code></li> <li> If you don&rsquo;t see Account Settings, you&rsquo;ll need to get an account admin to do this, or make you an admin</li> </ul> </li> </ul> <p> With this information, we can apply it to the shell environment:</p> <pre> <code>export HUBOT_NEWRELIC_ACCOUNT_ID=XXXXXX
export HUBOT_NEWRELIC_APP_ID=YYYYYY
export HUBOT_NEWRELIC_API_KEY=12345</code></pre> <p> And then update <code>hubot-scripts.json</code> to include <code>newrelic.coffee</code> alongside any scripts you already list there (the file is plain <span class="caps">JSON</span>, so entries are double-quoted strings and comments aren&rsquo;t allowed):</p> <pre> <code>[
  &quot;newrelic.coffee&quot;
]</code></pre> <p> Restart hubot, and you&rsquo;re good to go!</p> <h3> Our Special Sauce</h3> <p> One 
thing you may have noticed about the New Relic hubot-script is that you can only configure a single application. We highly recommend New Relic for our managed hosting customers, so this was immediately a problem for us.</p> <p> To work around this, I&rsquo;ve taken the normal script and lovingly seasoned it with data from our internal servers. By doing this, we can go from any server on our infrastructure to the correct application in New Relic:</p> <p> <img alt="" src="http://cdn.railsmachine.com/images/newrelic_script_with_special_sauce.jpg" /></p> <h3> Wrapping up</h3> <p> All in all, we&rsquo;re pretty happy with this action so far. There&rsquo;s always room for improvement, but fortunately, we&rsquo;re <a href="https://github.com/github/hubot-scripts/blob/master/src/scripts/newrelic.coffee#L32-41">just consuming</a> the <a href="https://newrelic.com/docs/docs/rest-api-users-guide">New Relic <span class="caps">REST</span> <span class="caps">API</span></a>, so it&rsquo;s only a matter of writing some code to have even more awesomeness in Campfire.</p> </div> <p> &nbsp;</p> Wed, 18 Jul 2012 00:00:00 +0000 http://featherplane.com/articles/2012/07/18/chatops-using-new-relic-in-campfire http://featherplane.com/articles/2012/07/18/chatops-using-new-relic-in-campfire When Things Go From Good to Sad... to Seriously Serious <div class="article-body"> <p> <em>(this was written by our intrepid Director of Operations, Travis Graham, who was too busy to proofread and edit it, so I did it for him. I swear he wrote, like, 95% of it. I added the top tips and the thing about open office hours. I certainly didn&rsquo;t have to fix any typos or weird comma things. &mdash; Kevin)</em></p> <p> Some of the most common alerts that come in are http timeout, memory utilization, server down, and ssh timeout alerts. The customer, and how they&rsquo;ve built out their application infrastructure, gives me an initial idea of where to look. 
Having the hands on experience of dealing with thousands of these types of alerts and fixing them creates a &ldquo;run book&rdquo;, of sorts, to know where to start investigating and what things to look for in order to fix the problem before it becomes <em>seriously serious</em>. Here are some tips for dealing with these basic alerts and how to find and fix them.</p> <h3> <span class="caps">HTTP</span> Alerts</h3> <p> The most common cause of http alerts is the passenger global queue has backed up causing the http check to fail.</p> <p> The first thing I like to check is &ldquo;top&rdquo; to get an overview of the current running passenger processes: total number running, their uptime, and memory footprint. If there are no passenger processes, I check to see if a recent deploy has gone out and look to see if apache was restarted or is even running at all. Sometimes a typo makes its way into the code base which causes apache to fail to start. You are deploying to staging; aren&rsquo;t you!?</p> <p> A couple quick tips on top: You can change the column the list is sorted on by using &lt; and &gt;, and see what command is actually being run by hitting <strong>c</strong>. Knowing which <span class="caps">URL</span> is misbehaving in your app is <em>priceless</em> and being able to sort by <span class="caps">CPU</span> and memory usage is nice too.</p> <p> If there are passenger processes and they look normal across the board, I jump to check passenger-status to see if the global queue has backed up. Most often, this is the case, and a simple apache restart will clear the global queue, and the site will start loading again. If an apache restart clears things up, you&rsquo;re good to go. If the global queue fills up quickly after restarting apache, check the number of requests each passenger process has served and make sure they are incrementing. 
If they aren&rsquo;t incrementing and you see &ldquo;Sessions: 1&rdquo;, this means something about the request is either long running or blocking. Debugging passenger processes is a post in itself; so, more on that at a later date.</p> <p> Sometimes, there may be an abundance of connections to the server and apache gets overloaded. You may just be getting a legitimate increase in traffic; so, I would start by increasing the number of MaxClients in your apache config and see if that gets your app back up and running. If you think the traffic might be malicious in nature, you&rsquo;ll want to track down the connections and count them to see if there&rsquo;s an extremely high number of connections from a single IP or a few IPs. If so, you&rsquo;ll want to see if it&rsquo;s a valid user that&rsquo;s abusing your app or someone up to no good. Running <code>netstat -an | grep -e &#39;:80 &#39; -e &#39;:443&#39; | awk &#39;{ print $5 }&#39; | awk -F: &#39;{ print $1 }&#39; | sort | uniq -c | sort -n | tail -n 20</code> will give a sorted list of the 20 addresses with the most connections to the server via http/https. You can then investigate the IPs and make a judgement call on blocking connections. Running <code>sudo iptables -I INPUT -s 178.172.178.172 -j DROP</code> will drop all traffic coming from that address.</p> <h3> Memory Alerts</h3> <p> Oftentimes, passenger processes have gone off the reservation, have become leaky, and need to be killed. We have a <a href="https://github.com/railsmachine/moonshine_passenger_monitor/">moonshine plugin that monitors passenger processes for memory utilization</a> and ensures they are still being maintained by the passenger application spawner. 
If a passenger process is no longer being managed by the spawner, it can&rsquo;t be killed when traffic decreases or, depending on your passenger settings, when it maxes out on requests served or times out.</p> <p> If your application hasn&rsquo;t been broken out into separate roles per server, you may have overloaded what your current server is capable of. Oftentimes, a database gets big or resque processes start to grow as they fork. Restarting these services can release memory that&rsquo;s not had a chance to be released and will buy some time to find a longer term solution, such as debugging memory leaks, planned downtime for a server upgrade, or separating your database out onto its own server.</p> <p> Another common memory hog is a late night cronjob that kicks off a rake task. Just because it&rsquo;s 2am doesn&rsquo;t mean you can load the whole database into memory to update a few million rows. Don&rsquo;t let nasty rake tasks keep you or your ops team up at night. Find and fix the memory hungry code or break the single task into smaller parts.</p> <p> One thing to be careful of is the very real chance the <span class="caps">OOM</span> killer starts killing off important processes. While working to reduce memory usage, sometimes you need to ensure data integrity. I like to set the <span class="caps">OOM</span> killer to ignore things, like mysql or redis-server, if the server looks close to OOMing and I need more time. You may also want to protect your <span class="caps">SSH</span> session; otherwise, you&rsquo;ll be disconnected and can&rsquo;t fix the problem without rebooting the server. You can do this using <code>echo -17 &gt; /proc/$PID/oom_adj</code>, where <code>$PID</code> is the <span class="caps">PID</span> of the process you want to protect. 
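</p> <p> If you have several processes to protect at once, the same write can be scripted. The sketch below is a Node.js take on the <code>echo</code> command, not something from our run book: the function names and the example PIDs are made up, and the write function is injectable so you can dry-run it first. The real write needs root, and newer kernels also expose <code>/proc/$PID/oom_score_adj</code>, so check what your kernel expects before relying on this.</p>

```javascript
// Sketch only: a scripted version of `echo -17 > /proc/$PID/oom_adj`.
const fs = require("fs");

// Build the procfs path for a PID, rejecting anything that is not a positive integer.
function oomAdjPath(pid) {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new RangeError("pid must be a positive integer, got " + pid);
  }
  return "/proc/" + pid + "/oom_adj";
}

// Write -17 for each PID and return the paths that were written.
// `writeFile` defaults to fs.writeFileSync; pass your own function to dry-run.
function protectFromOomKiller(pids, writeFile = fs.writeFileSync) {
  return pids.map((pid) => {
    const target = oomAdjPath(pid);
    writeFile(target, "-17\n");
    return target;
  });
}

// Dry run with hypothetical PIDs: log the writes instead of touching /proc.
protectFromOomKiller([1234, 5678], (path, value) => {
  console.log("would write " + value.trim() + " to " + path);
});
```

<p> Calling <code>protectFromOomKiller</code> with just the list of PIDs performs the real writes.</p> <p>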
Once you have your processes protected, kill off anything else that isn&rsquo;t essential that might be using memory to buy enough time for a clean shutdown or restart.</p> <h3> Server Down</h3> <p> Yep, not much you can do to rescue this type of alert unless it&rsquo;s a false positive. If the server is down, restart it and begin digging into the logs. The most common cause of a server down alert is such a rapid growth of memory that the server OOMs before a memory alert can come in. It&rsquo;s hard to pinpoint a cause for these alerts, but good monitoring can be a life saver. We use Scout to monitor our servers, so we can check our Scout graphs for a rapid increase in memory usage or possibly disk usage that took the server down. Some other interesting pieces of information about what might have led up to the server going down can be gathered from traffic graphs and passenger graphs to see if there was a sudden increase in traffic that caused too many passenger processes to spin up. Checking through the server logs and apache logs for a common thread that leads to the cause can be tedious, because when a server OOMs and goes down, a lot of valuable information might be lost when the logs can&rsquo;t be written to.</p> <p> If your server is up but your app isn&rsquo;t responding, you might be getting DoS&rsquo;d or flooded. If you can connect to your server, possibly via <span class="caps">KVM</span> or <span class="caps">IPMI</span>, you can still troubleshoot the cause and block traffic that&rsquo;s preventing your ping check from being green. You&rsquo;ll want to use the netstat command I mentioned earlier, but tailor it to fit the need of the moment.</p> <h3> <span class="caps">SSH</span> Alerts</h3> <p> Normally, <span class="caps">SSH</span> alerts take some time to investigate because you have to wait for the server to become available again to check out what&rsquo;s happening. 
You might get lucky and make it in on the first or second attempt. Jump to the log directory and check the auth log or security log for failed <span class="caps">SSH</span> attempts. If you have a script kiddie that happens to be doing a dictionary brute force attack on your server, you&rsquo;ll see a plethora of failed attempts for either the same user with bad passwords, or an alphabetical list of random usernames. We use <span class="caps">SSH</span> keys and no passwords to increase the security of our servers; so, a valid user trying to log in and failing that many times is very unlikely. You&rsquo;ll see the IP address of the machine trying to connect, so you&rsquo;ll be able to block it in the firewall. Using iptables on the server is an easy way to quickly block the offending IP address from making future attempts until iptables is restarted or a reboot happens. Use the iptables command mentioned earlier to block this traffic.</p> <p> A long term fix would be installing and configuring something like fail2ban and letting it automatically block IPs based on the rules you set up. This can be a premature optimization, so I would wait until <span class="caps">SSH</span> alerts become a problem before spending time setting it up. More often than not, the quick block using iptables is sufficient. If you do go this route, be sure to configure your hosts.allow correctly by whitelisting your IP so you don&rsquo;t lock yourself out late at night while making a typo run on your password.</p> <p> To wrap things up with a word of advice, all these solutions to common problems we see are based on having monitoring set up and alerting when a set of conditions has been met. Sometimes, you&rsquo;re able to set the thresholds for alerts low enough so you have time to react and fix things. Sometimes, it&rsquo;s just not possible to be proactive. The best way ahead is with a plan and remaining calm. 
If you don&rsquo;t do the basic investigative steps to correctly identify the problem, you may make mistakes and &ldquo;fix&rdquo; things that are not the real problem and possibly cause more harm. While things can be stressful, knowing the basic steps of investigation and paths to fix common problems goes a long way to keeping your app up and running and your customers happy.</p> <p> This is also probably a good time to mention that we&rsquo;re doing <a href="http://railsmachine.com/articles/2012/06/29/open-office-hours/">open office hours</a> on July 12th from 2-3PM Eastern if you&rsquo;ve got questions about anything in this post or pretty much anything else.</p> </div> Thu, 05 Jul 2012 00:00:00 +0000 http://featherplane.com/articles/2012/07/05/when-things-go-from-good-to-sad-to-seriously-serious http://featherplane.com/articles/2012/07/05/when-things-go-from-good-to-sad-to-seriously-serious Open Office Hours <div class="article-body"> <p> <strong>Come hang with us! We are excited to announce our new Open Office Hours in July</strong> where you can ask anything you want about our open source products, like <a href="http://railsmachine.com/projects">Moonshine</a> for example, advice on how to be ready when you end up on the homepage of Reddit, or why we do the things we do.</p> <p> Open Office Hours will be held Thursday, July 12th at 2pm through a Google Hangout, spots are limited to 9 people, but you will still be able to view the hangout. You can also participate via Twitter using the <strong>#oohrm</strong> hashtag to send us questions and comments.</p> <p> We hope that you&rsquo;ll join us. 
We look forward to meeting everyone and making the web a better place.</p> <p> <a href="https://plus.google.com/events/calrhdebm7s7gs53m0qd6dj3k9c/109692651803328489891">Open Office Hours</a><br /> <strong>Thursday, July 12th</strong><br /> <strong>2pm&ndash;3pm</strong></p> <p> You&rsquo;ll need a Google Plus account to join in the fun; add us to your circles via the Google Badge below.</p> <div class="googlebadge"> <div class="googlebadgeinner"> <plus height="69" href="https://plus.google.com/109692651803328489891" rel="author" width="200"></plus></div> </div> <hr /> </div> Fri, 29 Jun 2012 00:00:00 +0000 http://featherplane.com/articles/2012/06/29/open-office-hours http://featherplane.com/articles/2012/06/29/open-office-hours Upcoming July Events <div class="article-body"> <p> Among the many things we look forward to at the Rails Machine office are events in Savannah that bring some awareness to the local tech scene, and it just so happens that two such events are coming up in July: the TechCrunch Mini-Meetup and the Downtown Tech Crawl.</p> <h3> TechCrunch Mini-Meetup</h3> <p> <img alt="" src="http://cdn.railsmachine.com/images/tcmm.jpg" /></p> <p> <strong>July 6th&ndash;11th, TechCrunch is doing a tour of tech companies in the Southeast</strong>, blogging about it along the way, and Savannah is their first stop of the tour this year. From there they&rsquo;ll hit Atlanta and then head right up the coast until they reach Charlotte, NC. <a href="http://creativecoast.org">The Creative Coast</a> is lending their great space, we&rsquo;re going to be supplying the drinks, and the food will be supplied by the farm to fork liaison, <a href="http://revivalfoods.com">Revival Foods</a>, so we hope for a spirited evening of spreading the good word about what Savannah has to offer the tech world.</p> <blockquote> <p> &ldquo;The goal here is networking and connecting with the exciting projects happening in your cities. 
These mini-meetups are a great way to get noticed and to chat about what you&rsquo;re working on and, in addition, get some advice on next steps.&rdquo;</p> </blockquote> <p> <a href="http://techcrunch.com/events/southeast-mini-meetup-savannah/">TechCrunch Mini-Meetup</a><br /> Friday, July 6th<br /> at The Creative Coast Office<br /> 15 W. York St.<br /> 6pm &ndash; 10pm</p> <hr /> <h3> Downtown Tech Crawl</h3> <p> <img alt="" src="http://cdn.railsmachine.com/images/mdsav.jpg" /></p> <p> <strong>July 17th, Made in Savannah is putting together the first Downtown Tech Crawl</strong> to give curious Savannah residents a chance to peer into the local tech scene, <em>via trolley rides no less.</em> We&rsquo;re the last stop of the tour before the closing party starts at Thinc and we&rsquo;re excited to be a part of the event. The variety of companies participating in the inaugural crawl makes it a great way to get a bird&rsquo;s eye view of what Savannah&rsquo;s creative companies are all about. If you&rsquo;re thinking about planning a trip to this haunted city, this would be a great way to start your trip. You can reserve a free spot on the trolley via <a href="http://downtowntechcrawl.eventbrite.com/">Eventbrite</a>. Oh, did I mention that the whole crawl is free of charge? Free food, drinks, and transportation: how cool is that!? We hope to see you there.</p> <blockquote> <p> &ldquo;Hop on an <a href="http://www.oldsavannahtours.com/">Old Savannah Tours</a> and get an inside look into the growing technology community in downtown Savannah. Established Start-up Savannah Technology companies and start-up ventures will open their workspaces and give a quick show and tell of their product and team. 
Curious Savannahians will be able to check out their digs, hear about their products and meet the people leading the technology movement in Savannah.&rdquo;</p> </blockquote> <p> <a href="http://savnh.com/tech-crawl">Downtown Tech Crawl</a><br /> July 17th<br /> Start at 35 Barnard St.<br /> 4pm &ndash; 7:30pm</p> <hr /> </div> <p> &nbsp;</p> Thu, 28 Jun 2012 00:00:00 +0000 http://featherplane.com/articles/2012/06/28/upcoming-july-events http://featherplane.com/articles/2012/06/28/upcoming-july-events