Blew Technology Thoughts and snippets.

19Aug/100

Proactive vs. Reactive Monitoring

Inevitably, after you've got some important stuff running on your servers, you'll discover that things aren't working like you're expecting, or perhaps not working at all.  Sounds like it's time to think about how you're going to monitor that stuff you've spent so much time tending to.

At a high level, monitoring falls into two categories: proactive and reactive.  Reactive monitoring happens after an event has taken place, while proactive monitoring gives you a heads up some time before the event is likely to occur.  Reactive monitoring is generally low-hanging fruit once you've got a monitoring framework in place that fits your environment.  You can set up alerts that fire when a service is no longer available, a filesystem has filled up, etc.  If you're paying attention, you can often build new proactive alerts based on reactive alerts that you've had to address...
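To make the distinction concrete, here is a minimal Python sketch using the filesystem example.  It's not tied to any particular monitoring framework: the function names, the sample format (hour, bytes used) and the 24 hour warning window are all assumptions made up for illustration.  The reactive check fires once the disk is already full; the proactive check fits a straight line through recent usage and warns if the disk looks likely to fill soon.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hours since first sample, bytes used) - assumed format


def hours_until_full(samples: List[Sample], capacity: float) -> Optional[float]:
    """Project hours until the filesystem is full using a least-squares line.

    Returns None when there is too little data, or when usage is flat or
    shrinking, because no fill time can be projected in those cases.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    last_used = samples[-1][1]
    return max(0.0, (capacity - last_used) / slope)


def classify(samples: List[Sample], capacity: float, warn_hours: float = 24.0) -> str:
    """Return 'reactive' if already full, 'proactive' if filling soon, else 'ok'."""
    if samples[-1][1] >= capacity:
        return "reactive"
    eta = hours_until_full(samples, capacity)
    if eta is not None and eta <= warn_hours:
        return "proactive"
    return "ok"
```

A straight-line fit is a crude model.  Real growth is often bursty, so treat the projection as a hint that it's time to look, not as a prediction.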

Good proactive monitoring grows out of root cause analysis.  By paying attention to the log messages and performance metrics that you've collected before and during an event, you can often create proactive alerts that clue you in to a problem before it becomes serious.  It's also important to sift through this data with team members so you can share knowledge and troubleshooting techniques.  Different folks interpret error messages differently (and not always correctly).  Here are some tips for making the knowledge transfer and analysis go smoothly:

  • Draw out a timeline of events.  Start the timeline in the middle of the whiteboard because you'll likely go back farther to find the root cause than you might initially think.
  • Get a meeting room with a projector to make it easy for everyone to see what you see (if possible).
  • Gather as much data as possible from the time period in question, including data from related systems.  The information in the logs is generally your primary troubleshooting resource.
  • Divide and conquer the log review process among team members in the room (if possible).
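If the logs from the related systems carry timestamps, the whiteboard timeline can be seeded by merging them into one ordered list.  The sketch below is only an illustration: it assumes every line starts with a "YYYY-MM-DD HH:MM:SS" timestamp followed by a space, and that each individual log is already in time order.  The function names and the sample log lines are made up.

```python
import heapq
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, Tuple

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; adjust to your logs
TIMESTAMP_WIDTH = 19                     # length of "YYYY-MM-DD HH:MM:SS"

Event = Tuple[datetime, str, str]  # (when, which system, message)


def parse_lines(source: str, lines: Iterable[str]) -> Iterator[Event]:
    """Yield (timestamp, source, message) for each line of one system's log."""
    for line in lines:
        stamp = line[:TIMESTAMP_WIDTH]
        message = line[TIMESTAMP_WIDTH + 1:].rstrip()
        yield datetime.strptime(stamp, TIMESTAMP_FORMAT), source, message


def build_timeline(logs: Dict[str, List[str]]) -> List[Event]:
    """Merge several per-system logs into a single list ordered by time."""
    streams = [parse_lines(name, lines) for name, lines in logs.items()]
    # heapq.merge relies on each input stream already being sorted.
    return list(heapq.merge(*streams))


if __name__ == "__main__":
    example = {
        "web": ["2010-08-19 10:00:05 502 from upstream"],
        "db": [
            "2010-08-19 09:58:00 slow query",
            "2010-08-19 10:00:01 too many connections",
        ],
    }
    for when, source, message in build_timeline(example):
        print(when, source, message)
```

Seeing the database complaint land a few seconds before the web error is exactly the kind of ordering that is hard to spot when everyone is reading a different file.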

Making group analysis a cultural norm can really help you deal with future problems faster.  You can learn more from failure than from success.

11Aug/100

Django fixtures with ManyToMany fields

This evening I learned how to set up a ManyToMany value in a fixture.  I ended up using the "dumpdata" feature of manage.py to figure out the syntax.

I've got a ManyToMany field for one of my models called "district".  I wanted to link the model instance to one district object (district ID #1):

"district": [1]

If I wanted to add it to more than one district, it would look like this:

"district": [1, 2, 5]
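
For context, here is a rough sketch of where that line sits in a complete JSON fixture entry.  It builds the fixture with plain Python so it can be run without Django.  The "myapp.school" model label and the "name" field are hypothetical stand-ins for my real model; only the "district" list of primary keys comes from the example above.

```python
import json
from typing import Any, Dict, Iterable, List


def make_fixture_entry(pk: int, name: str, district_ids: Iterable[int]) -> Dict[str, Any]:
    """Build one fixture record in the model/pk/fields layout that dumpdata emits."""
    return {
        "model": "myapp.school",  # hypothetical app_label.model_name
        "pk": pk,
        "fields": {
            "name": name,  # hypothetical ordinary field
            # A ManyToMany field is just a list of related primary keys.
            "district": list(district_ids),
        },
    }


fixture: List[Dict[str, Any]] = [
    make_fixture_entry(1, "One district", [1]),
    make_fixture_entry(2, "Several districts", [1, 2, 5]),
]

# Text that could be saved as a .json fixture file and loaded with loaddata.
fixture_json = json.dumps(fixture, indent=2)

if __name__ == "__main__":
    print(fixture_json)
```

The district objects with those IDs need to exist already (or appear in a fixture loaded earlier) for the links to resolve.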
25May/100

Ubuntu 10.04 nvidia Driver Screen Resolution Issue

Recently at work I installed Ubuntu 10.04 on a workstation under my desk.  After getting it installed I couldn’t get the monitor working at 1920×1200 as it was designed to do, so I hunted around on the intertubes for a while and found an option that got it running correctly without having to do lots of modeline weirdness.

In the video card section of /etc/X11/xorg.conf add the following line:

Option	    "ModeValidation" "NoMaxPClkCheck"

This disables the maximum pixel clock check, which probably only really matters for CRT displays (don’t quote me on that).

18Nov/090

Why your daily team meeting sucks

Note: this is written from the "old school" ops perspective.  Sprint meetings are kind of a different animal.

Your daily team meeting sucks.  You know it and your fellow team members know it.  Deep down it's eating at you that you spend somewhere between 5 and 50 minutes in a room going over your hopes and dreams for the work day.  This can be especially disheartening if your coffee hasn't kicked in and/or you're coming down from that meth bender you were on earlier in the week.

Here's why it sucks.

Nobody can stay on topic

You can't and your boss can't.  Since there's no specific agenda for the meeting, discussions go off on wild tangents quickly.  You start out talking about a random task for a customer and suddenly you're all bitching about how much you hate toe socks.  Maybe not toe socks specifically (though there's not much to like about toe socks).

50% of the team isn't ready for the meeting because they're caught up with more pressing issues

Disks are on fire!  The customer needs their issue fixed ASAP (for reals)!  These things happen in the morning, especially if you're working for a west coast company with east coast customers.  More than likely your team meeting will end up being right in the middle of the morning when you're caught up with these sorts of issues.  Sometimes you can sit out if it's a really urgent issue, but since nobody's taking notes and everyone assumes that someone else will debrief you, the information shared in that meeting is likely lost.

The biggest reason your daily team meeting sucks?

It's a coverup

Your daily team meeting is a cover for a lack of process around the work you're doing.  If your workflow processes were well implemented and properly thought out, you'd know when to talk to your co-workers about interdependencies because the workflow would call these things out.  It may be a project manager that helps make sure the communication is in place, or it could be automatic ticket creation for canned or often repeated tasks.

It's not all bad though

Your DTM likely does do some good things though.  It's an opportunity for team members to be in the same communication space (conference bridge, meeting room, whatever).  As good as your workflow is you still need face time with co-workers, especially if they work on the same kinds of things you do.

If you're unable to fix the root problems, a daily meeting is a necessary thing.  The communication has to happen one way or another.

Generally speaking, meetings aren't always the real problem, and meetings aren't necessarily a solution.  Meetings are a means to an end, facilitating communication needed to get your "real work" done.

So go do it already.

24Jan/090

mysqldump – INSERTs too big

So recently I was attempting to migrate some rather large tables from one (slow) database host to another.  I was running mysqldump piped into a mysql client on localhost.  Unfortunately, I ran into a snag:

mysqldump: Error 2013: Lost connection to MySQL server during query when dumping table `SOME_TABLE` at row: 14913098

I had two things working against me in this situation.

  1. I was forking a mysqldump process for each table in the database, so I was running 100+ mysqldump processes at the same time.
  2. The host the data was dumping from was slow.

So since mysqldump returned the error, the issue seems to have originated on the host I was dumping from.  This is sometimes due to a max_allowed_packet issue, but max_allowed_packet was set at 16M on both hosts.  I also found this blog entry that sounded similar:
http://jeremy.zawodny.com/blog/archives/000690.html

Unfortunately, -q is enabled by default with --opt.  Foiled again!  I found some mentions of setting timeouts really really high on the database server, which made me think "What if the host is so slow it's not able to return data before the session timeout is hit?"  So how do I make mysqldump return the data more often...

I started playing with the different options.  Changing max_allowed_packet still produces a large INSERT.  Setting --no-extended-insert would get me smaller INSERTs, but that could more than double my migration time (already projected to be several days).  Then I found this only slightly documented option:

--net_buffer_length

The default setting seems to be 1M in my installation, so setting this down to 128K or 64K will reduce the size of the INSERTs generated.  Data is also flushed out to the client more often, which avoids having to set the timeouts obscenely high.  Even if something is really causing the source database host to crunch and respond slowly, we'll probably return data fast enough to avoid hitting the timeouts.  If you've got rows bigger than what you set net_buffer_length to, mysqldump is smart enough to adjust the buffer for that row so you won't get a partial result.
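
As a rough sketch of how this fits into the dump-and-load pipe, here's a small Python wrapper.  The host, database and table names are placeholders, the helper functions are made up for this example, and it assumes credentials come from an option file such as ~/.my.cnf.  Only --net_buffer_length (given here in bytes) is the option discussed above.

```python
import subprocess
from typing import List


def build_dump_command(source_host: str, database: str, table: str,
                       net_buffer_length: int = 64 * 1024) -> List[str]:
    """Build the mysqldump argument list for a single table."""
    return [
        "mysqldump",
        f"--host={source_host}",
        # Smaller buffer means smaller INSERTs that are sent more often.
        f"--net_buffer_length={net_buffer_length}",
        database,
        table,
    ]


def copy_table(source_host: str, database: str, table: str,
               net_buffer_length: int = 64 * 1024) -> int:
    """Pipe mysqldump from the source host into a local mysql client.

    Returns 0 when both processes succeed, otherwise a non-zero exit code.
    """
    dump = subprocess.Popen(
        build_dump_command(source_host, database, table, net_buffer_length),
        stdout=subprocess.PIPE,
    )
    load = subprocess.Popen(["mysql", database], stdin=dump.stdout)
    dump.stdout.close()  # let mysqldump see a broken pipe if mysql exits early
    load_rc = load.wait()
    dump_rc = dump.wait()
    return dump_rc or load_rc
```

Calling copy_table for a handful of tables at a time, rather than forking one process per table all at once, would also take some pressure off a slow source host.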

5Nov/080

Sun 2540 Network Timeouts – RESOLVED

We recently took delivery of two 2540 storage arrays to be used with MySQL and ZFS.  These are great little boxes that offer a lot of bang for the buck.

After getting them on-site and online I started seeing lots of dropped packets between the CAM host and other devices on the same VLAN when CAM was attempting to communicate with the arrays.  Initially we thought this was a bad edge switch, but as it turns out, it looks to be related to the bug described here:

http://sunsolve.sun.com/search/document.do?assetkey=1-66-240105-1

Originally we were only upgrading the arrays to the firmware that comes with the latest release of the 6.0.x branch because a Sun tech was concerned about false positives on broken drives/controllers with the 6.1.whatever CAM software.  However, after running into this network issue, I've upgraded CAM to 6.1.2.8.  I'm guessing that the issues he was concerned about are gone since it's been a few months since the ticket was opened.

To get the arrays with more disks to upgrade correctly, I've actually had to connect both controllers to separate network ports on the back of the CAM host (on different subnets to avoid routing issues).  This method didn't actually work until I un-registered and re-registered the array within CAM.  What a pain!

I hope this helps someone else out there if they run into this issue.

8Aug/080

Everyone into the (Server) Pool

So you've got a web server...you've started your new internet application and you're going to strike it rich.  Awesome.  Next thing you know, you've grown and you need more web servers.  Congratulations!  You've probably built or purchased some sort of load balancer to put in front of your web servers.

Next thing you know, you've diversified and you want to run a few different apps, or if you're lucky, a few different domains using the same codebase.  You think "I probably need a few new servers to run this hot new application".  You're probably right.  You may think "I should probably set up a new server pool/farm for this new domain/app/whatever".  Think twice!

Additional server pools or farms need justification.  They're additional overhead when you've probably already got enough on your plate.  Here are some problems with additional server pools:

Additional management overhead on your load balancer.  Put enough pools on there and you could actually increase latency and/or decrease throughput.  Sure, with modern hardware this isn't a huge deal.  You might need a thousand pools before you start seeing a reduction in performance, but if you have a really complex rule or transform set applied to URLs for a pool, this could come to a head sooner.

Additional management overhead on your servers.  Keeping track of which servers are running which domain or which application can be difficult, especially if your growth has been organic.  Every customer facing web server should be able to run any domain you host.  This probably doesn't apply to a major hosting company hosting different customer URLs since you may not get that much control over the application, but if you're fortunate enough to have customers running something written to be domain/vhost aware, it'll pay dividends later.
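
Here's a tiny sketch of what "domain/vhost aware" can mean in practice: the application picks its per-site settings from the Host header of each request, so any server in the pool can answer for any domain.  The domain names, the settings keys and the function name are made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical per-domain settings; every server in the pool carries all of them.
SITES: Dict[str, Dict[str, str]] = {
    "shop.example.com": {"theme": "shop", "database": "shop_db"},
    "blog.example.com": {"theme": "blog", "database": "blog_db"},
}


def site_for_host(host_header: str) -> Optional[Dict[str, str]]:
    """Look up the site settings for a request's Host header.

    Strips an optional port and a trailing dot and ignores case.
    Returns None for a domain this codebase doesn't host.
    """
    hostname = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    return SITES.get(hostname)
```

With something like this in place, the load balancer only needs one pool and doesn't have to know which domain lives where.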

This methodology doesn't apply to everything.  You may not want to mix presentation layer services with your business logic services unless they're both pretty lightweight (though since you can squeeze 32GB of RAM in 1U these days, it may not be a big deal).  You'll probably want to run your database on your front-end web servers even less.  If you have extra cycles to burn, this may be a way to get better utilization out of your machines, but this doesn't work for everything.  When it comes down to it, you (should) know your application best.

Another point: make sure you understand how your application behaves.  Benchmarking will save you when you suddenly need to know how many more servers you need to buy.  Do certain things make the memory footprint expand dramatically?  Do some hits generate a lot of extra CPU load (maybe those URLs will need their own pool soon)?
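
Once you have a benchmark number, the "how many servers" question is mostly arithmetic.  Here's a back-of-the-envelope sketch.  The function name, the headroom fraction and the spare-server count are assumptions for illustration; plug in whatever your own benchmarks and comfort level say.

```python
import math


def servers_needed(peak_rps: float, per_server_rps: float,
                   headroom: float = 0.3, spare_servers: int = 1) -> int:
    """Estimate how many servers a pool needs for a given peak load.

    peak_rps:       expected peak requests per second across all domains
    per_server_rps: benchmarked requests per second one server can sustain
    headroom:       fraction of each server's capacity to leave unused
    spare_servers:  extra machines so one can be down for maintenance
    """
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_rps = per_server_rps * (1.0 - headroom)
    return math.ceil(peak_rps / usable_rps) + spare_servers
```

For example, a 1000 requests per second peak against servers benchmarked at 200 each, keeping half of every server in reserve, works out to ten servers plus one spare.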

In my experience, getting into this habit will make your life easier.  If you've got half a dozen domains and half a dozen servers, taking half the farm down for maintenance behind the scenes becomes much simpler.