A major milestone in WebGUI 8 development was reached this week: a dry run of
the WebGUI 8 upgrade ran successfully against the plainblack.com database.
This means the only thing remaining before releasing an 8.0.0 alpha is updating
all the custom code on http://plainblack.com and
http://webgui.org. As always, plainblack.com and
webgui.org will be the first sites running the latest bleeding-edge version of
WebGUI (unless one of you wants to beat me to the punch).
This month, I also gave a presentation to Madison.PM about building
applications in WebGUI 8, a quick introduction to Assets and an overview of
the most important changes to how they work. The slides are available at
http://preaction.github.com/ and the code samples are linked at the end.
On an unrelated topic, I really enjoyed using S5 to build my slides, SHJS to
highlight the code inside, and GitHub Pages to host the whole thing. I plan on
doing the same for all my presentations: they look good, they're readable and
editable without a special program, anyone can fork and update them, and
they're served by a nice, fast, free host.
The Server Is Down
- No it isn't, I didn't get paged.
- Wait a minute, why didn't I get paged?
- FUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU--
- CHALLENGE ACCEPTED
Diagnosis
The client reported that the site sometimes took more than a minute to load.
It doesn't respond slowly for me, and the pager is only primed to ping me if
there is sustained downtime (hiccups are not something I want to wake up for
every night at 3:00am).
Strangely, load hovered around 7 most of the time, only spiking to 13 every
few minutes. With a 16-core processor, this was well within operating
parameters, if just a little worrisome. Nothing in the log files.
Oops, now I get a slow page load. Takes 30 seconds to load a page. Refresh
again, and the page loads just fine. Clear browser cache, and the page still
loads just fine.
top kept MySQL at the top of the CPU list. Not surprising, as this server is
the master database server for a two-node cluster. So I keep an eye on top as
I poll MySQL for its process list.
A pattern emerges: The load spikes and server goes unresponsive when this happens:
This table shows 12 different processes are trying to update the same cache
location (process ID 2-3, 5-8, 10, 12-13, 18, 23, and 26). Because of MyISAM's
table-level locking, any request to read from the cache has to wait for 12
REPLACE INTO requests to complete. They've already taken 1 second; if each
REPLACE takes 2 seconds, that's 24 seconds of non-responsive website.
These 12 processes all saw that the cache item had expired and are trying to
update it. This is called a "cache stampede". Only one of them needs to update
the cache; the rest are just wasting resources. Worse, they're doing all the
work to update the cache, which is much more expensive than getting the value
from the cache. If it's expensive enough, the site goes down hard.
Management
How can we stop the cache stampede? One way is to mildly randomize the actual
expiration date when checking if the cache is expired:
sub is_expired {
    my ( $self, $key ) = @_;
    my $expires = $self->get_expires( $key );
    # Randomize the expiration by up to 5% +/-
    # by first removing 5% and then adding 0-10%
    $expires = $expires - ( $expires * 0.05 ) + ( $expires * 0.10 * rand );
    # Expired once the randomized expiration is no longer in the future
    return $expires <= time;
}
In this very simple case, if you are within 5% of the expiration time, you
have a chance to have an expired cache item. The chance grows as time passes,
reaching 50% at the actual expiration time, and 100% at 5% past the expiration
time.
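To make those odds concrete, here is a minimal sketch of the same idea in plain Perl. It assumes the variance window is measured against the item's lifetime (a hypothetical `$ttl` in seconds) rather than against the absolute timestamp, and `expired_chance` and `is_expired_at` are illustrative names of my own, not part of any cache module.

```perl
use strict;
use warnings;

# Chance (0..1) that an item is treated as expired at time $now.
# $expires is the nominal expiration time, $ttl the item's lifetime in
# seconds, and $variance the fraction of $ttl to spread the expiry over
# on each side of $expires. All of these names are hypothetical.
sub expired_chance {
    my ( $now, $expires, $ttl, $variance ) = @_;
    my $window = $ttl * $variance;

    # No variance: a hard cutoff at the nominal expiration
    return $now >= $expires ? 1 : 0 if $window <= 0;

    # Linear ramp from 0 at ($expires - $window) to 1 at ($expires + $window)
    my $chance = ( $now - ( $expires - $window ) ) / ( 2 * $window );
    return $chance < 0 ? 0 : $chance > 1 ? 1 : $chance;
}

# Roll the dice once for a single request
sub is_expired_at {
    my ( $now, $expires, $ttl, $variance ) = @_;
    return rand() < expired_chance( $now, $expires, $ttl, $variance );
}
```

With a 1000-second lifetime and 5% variance, the ramp is 100 seconds wide: no request sees the item as expired 50 seconds before the nominal time, about half do at the nominal time, and all do 50 seconds after. The first request to lose the roll recomputes the value, and everyone else keeps reading the fresh copy.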
Rather than add this expiration variance to our custom database cache, I
instead opted to move this site over to CHI, which has this protection
built-in.
my $cache = CHI->new(
    driver           => 'DBI',
    namespace        => 'localhost',
    dbh              => $dbh,
    expires_variance => '0.10',
);
This stops the cache stampede, but we're still hitting the database a lot.
Remember we have two web nodes hitting one database node. The fewer database
hits we make, the better performance we can get without having to ask for more
hardware from the client (which takes time, and forms, and more forms, and
meetings, and forms, and more meetings, and probably some forms).
Because this is a distributed system, we need a distributed, synchronized
cache. We cannot use memcached, as WebGUI 7.x does not support it (but WebGUI
8 does). So for now we must use the database as our synchronized cache, but
what if we put a faster, local cache in front of the slower, synchronized
cache?
CHI has an awesome way to do this: add an l1_cache.
my $cache = CHI->new(
    driver           => 'DBI',
    namespace        => 'localhost',
    dbh              => $dbh,
    expires_variance => '0.10',
    l1_cache         => {
        driver   => 'FastMmap',
        root_dir => '/tmp/cache',
    },
);
Now we're using FastMmap to share an in-memory cache between our web
processes, and if the L1 cache is expired or missing, we look for content from
the DBI cache. If that cache is missing or expired, we have a cache miss and
have to recompute the value.
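The lookup order described above can be sketched without CHI at all. This is only an illustration of the flow, not how CHI implements it: two plain hashes stand in for the FastMmap and DBI caches, and `fresh` and `layered_get` are hypothetical helper names.

```perl
use strict;
use warnings;

# Two plain hashes stand in for the fast local cache (L1) and the
# slower shared cache (L2). Each entry is { value => ..., expires => ... }.
my ( %l1, %l2 );

# Return the entry if it exists and has not expired, otherwise undef
sub fresh {
    my ( $store, $key, $now ) = @_;
    my $entry = $store->{$key};
    return $entry && $entry->{expires} > $now ? $entry : undef;
}

# Look in L1, then L2, and only recompute when both layers miss
sub layered_get {
    my ( $key, $ttl, $compute, $now ) = @_;
    $now = time unless defined $now;

    my $entry = fresh( \%l1, $key, $now );
    return $entry->{value} if $entry;

    $entry = fresh( \%l2, $key, $now );
    if ( !$entry ) {
        # Full cache miss: do the expensive work and fill the shared layer
        $entry = { value => $compute->(), expires => $now + $ttl };
        $l2{$key} = $entry;
    }

    # Copy into the local layer, keeping the original expiration
    $l1{$key} = $entry;
    return $entry->{value};
}
```

The point of the arrangement is that the expensive branch only runs when both layers miss, and an L2 hit warms L1 so that the next request on the same node never touches the database.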
Hurdles
I had to install the DB tables myself, which was not difficult, just
undocumented (bug report filed). MySQL only allows a 1000-byte key, and the
CHI::Driver::DBI tries to create a 600-character key. This is fine in the
Latin-1 charset, but MySQL complains if you're using UTF-8 by default.
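The arithmetic behind that complaint is quick to check. This sketch assumes MySQL's utf8 charset reserves up to 3 bytes per character for index purposes, and `key_bytes` is just a throwaway helper name:

```perl
use strict;
use warnings;

# Worst-case index size in bytes for a VARCHAR key column
sub key_bytes {
    my ( $chars, $bytes_per_char ) = @_;
    return $chars * $bytes_per_char;
}

my $limit = 1000;    # the key length limit quoted above

for my $case ( [ 600, 1, 'latin1' ], [ 600, 3, 'utf8' ], [ 255, 3, 'utf8' ] ) {
    my ( $chars, $width, $charset ) = @$case;
    my $bytes = key_bytes( $chars, $width );
    printf "%3d chars in %-6s = %4d bytes (%s)\n",
        $chars, $charset, $bytes, $bytes <= $limit ? 'fits' : 'too long';
}
```

A 600-character key is 600 bytes in Latin-1 but 1800 bytes in UTF-8, well over the limit, while a 255-character UTF-8 key needs 765 bytes and fits.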
The driver also tries to create a TEXT field to hold the cache value, but
MySQL expects a text field to hold characters in a known character set. After
noticing that my cache values were empty, I changed to a LONGBLOB.
The full create table statements are below:
-- primary cache table: chi_<namespace> --
CREATE TABLE IF NOT EXISTS `chi_localhost` (
    `key` VARCHAR(255),
    `value` LONGBLOB,
    PRIMARY KEY ( `key` )
);
-- CHI metacache table --
CREATE TABLE IF NOT EXISTS `chi__CHI_METACACHE` (
    `key` VARCHAR(255),
    `value` LONGBLOB,
    PRIMARY KEY ( `key` )
);
Results
The server is stable again! Spikes do not turn into out-of-control loads and
an unresponsive server. We'll see how things go tomorrow during normal business
hours (the peak time for this site), but right now it looks like CHI has saved
the day!
By far the biggest change we've made in WebGUI 8 is the new Admin Console.
Though parts of it may look familiar, it has been completely rewritten from
the ground up to be a flexible, extensible, responsive JavaScript application
making calls to JSON services in Perl.
I could talk about how to use the admin interface, but I don't think that's
why you would read this blog, so instead I'm going to talk about how you can add functionality to it.
Caching is a tricky business. Having just one kind of cache won't work, because
the production environment largely determines the most efficient caching
system. A distributed production environment would be best-served with a
distributed cache. A smaller, single-server environment could use a simple
shared memory cache.
Enter Jonathan Swartz's CHI module, the greatest Perl module to provide a
unified caching interface. CHI is the DBI of caching: It presents an API, and
delegates to CHI::Driver modules to perform the heavy lifting. It
provides a layered caching system, allowing you to have a faster, more
volatile cache in front of a slower, more persistent cache. It also provides a
variable expiration time, preventing a "miss stampede" where all processes try
to recompute an expired cache item at the same time.
By integrating CHI cache into WebGUI, we have the ability to provide any
caching strategy that CHI can provide. We get Memcached, FastMmap, and DBI
drivers (and more drivers can be written).
I wrote a CHI cache driver for WebGUI 7.9 that we've been using on many of our
shared hosting servers. The performance increase using FastMmap through CHI
over the old Storable+DBI cache module is dramatic: 2-5 times faster with
CHI and FastMmap.
Using CHI in WebGUI
The fewer wrappers that WebGUI has around CPAN modules we use, the less code I
have to write, and the more features will be available to our users without
having to change WebGUI to use them.
To that end, you can write a section of the configuration file that gets
passed directly to CHI->new. Some massaging occurs to make sure a DBI cache
driver gets the right $dbh, but otherwise you can fully configure CHI directly
from the WebGUI config file:
# The new default cache for WebGUI, FastMmap
{
    "cache" : {
        "driver" : "FastMmap",
        "root_dir" : "/tmp/WebGUICache",
        "expires_variance" : 0.5
    }
}
# Set up a memcached cache with local memory in front
{
    "cache" : {
        "driver" : "Memcached::libmemcached",
        "servers" : [ "10.0.0.100:11211", "10.0.0.110:11211" ],
        "l1_cache" : {
            "driver" : "Memory"
        }
    }
}
When you want to use the cache in your code, you can get a CHI object with
$session->cache. CHI's interface is sufficiently simple, with some fun tricks:
my $cache = $session->cache;
# Get a value, computing and storing it on a cache miss
my $value = $cache->get('cache_key');
if ( !defined $value ) {
    $value = compute_value();
    $cache->set( 'cache_key', $value );
}
# Or combine get and set with intelligence
$value = $cache->compute( 'cache_key', \&compute_value );
Future Plans
With a single unified cache that performs well and layers like CHI, we can
take our current stow and scratch APIs and move them to the cache. In the case
of stow, we remove a redundant API. In the case of scratch, we remove database
hits.
We've also been exploring cache-only sessions: instead of updating the session
in the database every time a page is requested, we would update only the cache
and flush to the database later (or not at all). The fewer DB calls we make
per page, the better performance will be.
Special thanks go out to Jonathan Swartz for such a wonderful solution.
Stay tuned for next time when I explore our new Admin Interface. Lots of
pretty screenshots!
Following The Path
If you installed WebGUI 0.9.0 back in August of 2001 (the first public
release), you've had a stable upgrade path through WebGUI 7.10.8 (January
2011) and beyond. Plainblack.com has been through every upgrade for the last
10 years, a shining testament to our upgradability.
A WebGUI 7.10 user would not even recognize a WebGUI 6.0 database, much less
the database used by the 1.x series, but slowly, gradually, our upgrade system
brought new features to every WebGUI site that wanted them.
The Ancient Way
Our old upgrade system was quite simple:
docs/upgrade_2.9.0-3.0.0.pl
docs/upgrade_3.0.0-3.0.1.sql
docs/upgrade_3.0.0-3.0.1.pl
Our upgrade.pl script would check for docs/upgrade_*, compare version numbers,
and then execute the .sql and .pl scripts in order until there were no more
upgrades left.
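As a rough illustration of that loop (not the actual upgrade.pl), the ordering logic might look like the sketch below. It assumes file names of the form upgrade_FROM-TO.sql or upgrade_FROM-TO.pl and that the .sql file runs before the .pl file within a step; `plan_upgrades` is a hypothetical name.

```perl
use strict;
use warnings;

# Given the installed version and a list of file names, return the
# files to run, in order. Assumes names look like upgrade_FROM-TO.sql
# or upgrade_FROM-TO.pl, and that .sql runs before .pl within a step.
sub plan_upgrades {
    my ( $current, @files ) = @_;

    # Index the files by the version they upgrade from
    my %step;
    for my $file (@files) {
        my ( $from, $to, $ext )
            = $file =~ m{upgrade_([\d.]+)-([\d.]+)\.(sql|pl)$}
            or next;
        $step{$from}{to}   = $to;
        $step{$from}{$ext} = $file;
    }

    # Walk the chain until no step starts at the current version
    my @plan;
    while ( my $next = delete $step{$current} ) {
        push @plan, grep { defined } @{$next}{qw( sql pl )};
        $current = $next->{to};
    }
    return @plan;
}
```

Called with '2.9.0' and the three files listed above, this returns the 2.9.0-3.0.0 script first, then the 3.0.0-3.0.1 .sql and .pl files. Called with '3.0.1', it returns nothing, because no upgrade starts from that version.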
Because each .pl script was executed individually, there was a considerable
amount of boilerplate in each script (123 lines). Because there was only one script per
version, some scripts could get quite long. We had conventions to manage these
limitations, but it was still a bit of a mind-twist to write an upgrade
routine.
Later, when we moved to simultaneous beta and stable trees, it became even
more difficult to manage these huge upgrade scripts. Collecting the new
features from the beta tree to apply to the stable tree was a time-consuming
manual task that some poor coder had to perform, back hunched over a dimly-lit
screen in the wee hours of the night, testing and re-testing the upgrade to
make sure stable lived up to its expectations.
Though our upgrade system had performed admirably, it was time for a fresh
look at the problem.
The Modern Vision
Using individual files for upgrades was working quite well, but it didn't go far
enough. Our new upgrade system has one file per upgrade step. Each sub from an
old upgrade script would be one file in the new upgrade system. What's more,
additional file types would be supported:
$ ls share/upgrades/7.10.4-8.0.0/
addNewAdminConsole.pl
admin_console.wgpkg
facebook_auth.sql
migrateToNewCache.pl
moveMaintenance.pl
moveRequiredProfileFields.pl
So now, instead of a single file for an upgrade, we have an entire directory.
In this directory, the .pl files are scripts to be run, the .wgpkg files are
WebGUI assets to add to the site, the .sql files are SQL commands to run, and
any .txt files will be shown as a confirmation message to the user for gotchas
like "All your users have been logged out as a result of this upgrade. Deal
with it.".
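One way to picture this is a dispatch table keyed on file extension. The sketch below is not WebGUI's implementation: the handler names are placeholders, and running the files in sorted name order is my assumption.

```perl
use strict;
use warnings;

# Map each file extension to the kind of step it represents.
# The handler names are placeholders, not the real WebGUI internals.
my %handler_for = (
    pl    => 'run_script',
    sql   => 'run_sql',
    wgpkg => 'import_package',
    txt   => 'show_message',
);

# Return [ handler, file ] pairs for every recognized file, sorted by name
sub classify_steps {
    my (@files) = @_;
    my @steps;
    for my $file ( sort @files ) {
        my ($ext) = $file =~ m{\.(\w+)$} or next;    # no extension: skip
        my $handler = $handler_for{$ext} or next;    # unknown type: skip
        push @steps, [ $handler, $file ];
    }
    return @steps;
}
```

Supporting a new kind of upgrade file then only means adding one entry to the table.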
So now, if you want to add your own custom upgrade routine, you just add
another file to the directory, which means less worrying about conflicts. When
we need to build another new stable version release, we can just move the
unique upgrade files from beta to the new upgrade.
The best part of the new upgrade system is how the .pl scripts are written.
When you are in a .pl, you have a bunch of sugar to make the basic tasks much
easier.
# Old upgrade routine. Just another day in a session
sub migrateToNewCache {
    my $session = shift;
    print "\tMigrating to new cache " unless $quiet;
    use File::Path;
    rmtree "../../lib/WebGUI/Cache";
    unlink "../../lib/WebGUI/Workflow/Activity/CleanDatabaseCache.pm";
    unlink "../../lib/WebGUI/Workflow/Activity/CleanFileCache.pm";
    my $config = $session->config;
    $config->set("cache", {
        driver => 'FastMmap',
        expires_variance => '0.10',
        root_dir => '/tmp/WebGUICache',
    });
    $config->set("hotSessionFlushToDb", 600);
    $config->delete("disableCache");
    $config->delete("cacheType");
    $config->delete("fileCacheRoot");
    $config->deleteFromArray("workflowActivities/None", "WebGUI::Workflow::Activity::CleanDatabaseCache");
    $config->deleteFromArray("workflowActivities/None", "WebGUI::Workflow::Activity::CleanFileCache");
    my $db = $session->db;
    $db->write("drop table cache");
    $db->write("delete from WorkflowActivity where className in ('WebGUI::Workflow::Activity::CleanDatabaseCache','WebGUI::Workflow::Activity::CleanFileCache')");
    $db->write("delete from WorkflowActivityData where activityId in ('pbwfactivity0000000002','pbwfactivity0000000022')");
    print "DONE!\n" unless $quiet;
}
If you're familiar with the WebGUI session, this is pretty standard, but it is
still a lot of boilerplate and convention. The new scripts remove the
boilerplate and enforce what was once merely convention.
# New upgrade routine. migrateToNewCache.pl
use WebGUI::Upgrade::Script;
use Module::Find;
start_step "Migrating to new cache";
rm_lib
    findallmod('WebGUI::Cache'),
    'WebGUI::Workflow::Activity::CleanDatabaseCache',
    'WebGUI::Workflow::Activity::CleanFileCache',
;
config->set("cache", {
    'driver' => 'FastMmap',
    'expires_variance' => '0.10',
    'root_dir' => '/tmp/WebGUICache',
});
config->set('hotSessionFlushToDb', 600);
config->delete('disableCache');
config->delete('cacheType');
config->delete('fileCacheRoot');
config->deleteFromArray('workflowActivities/None', 'WebGUI::Workflow::Activity::CleanDatabaseCache');
config->deleteFromArray('workflowActivities/None', 'WebGUI::Workflow::Activity::CleanFileCache');
sql 'DROP TABLE IF EXISTS cache';
sql 'DELETE FROM WorkflowActivity WHERE className in (?,?)',
    'WebGUI::Workflow::Activity::CleanDatabaseCache',
    'WebGUI::Workflow::Activity::CleanFileCache',
;
sql 'DELETE FROM WorkflowActivityData WHERE activityId IN (?,?)',
    'pbwfactivity0000000002',
    'pbwfactivity0000000022',
;
done;
The first thing we do in our new upgrade script is use
WebGUI::Upgrade::Script. Now, instead of using the session for everything, we
have subs imported for various tasks. This means that many times we can run an
entire upgrade script without opening a WebGUI session, or creating a version
tag unnecessarily.
If we do need a session, or a version tag, they will be automatically assigned
relevant information describing what we're doing. When we're done, they will
be automatically cleaned up and committed. What once was done with
boilerplate, and subject to random deletion or subversion, is now enforced
policy.
In all other respects, a WebGUI upgrade script is a Perl script. You can add
modules, write subroutines, and do anything necessary to move WebGUI into the
future.
The Internet is always evolving. With the WebGUI 8 upgrade system, we've made
it easier to evolve with it.
Stay tuned for next time, when I'll show off our CHI-based caching system.