Adventures in Debugging C/XS 2: Debugging Boogaloo

... or "Ask Not To Whom The Pointer Points, It Points To Thee."

TL;DR: A pointer is not a reference. A pointer knows nothing about the data being pointed to. Returning multiple values requires actual work.

Everything went wrong when I wanted a string with a NUL character inside it. C strings are not Perl scalars: they don't know how long they are, so C marks the end of a string with the NUL character, \0. The strcpy function copies from your source to your destination up to, and including, the first \0. When you want a \0 inside your string that does not mark its end, you have to know exactly how long the string is. That's not difficult, but it means the function that creates your string also has to return its length.
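
To make that concrete, here's a minimal sketch of the difference between strcpy and copying by explicit length (the buffer contents are illustrative):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* 12 bytes: "Hello", an embedded \0, "World", and the terminating \0 */
    const char src[] = "Hello\0World";
    char dst[sizeof(src)];

    strcpy(dst, src);               /* stops at the embedded \0 */
    printf("%zu\n", strlen(dst));   /* prints 5: only "Hello" arrived */

    memcpy(dst, src, sizeof(src));  /* copies all 12 bytes, but only
                                       because we knew the length */
    return 0;
}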

C functions do not have more than one return value.

(char* buffer, int bufferSize) = get_string_with_nuls();
// You thought it could be that easy?

So for your function to produce more than one value, you have to pass in pointers for the function to fill in with the actual values.

char* buffer;
int bufferSize = get_string_with_nuls( buffer );
// C programmers will already know what I did wrong here

Thinking like a Perl programmer, I thought I could just pass in the pointer to the function and the function could fill it with data. Two problems:

  1. I passed in the pointer itself, not a pointer to the pointer: &buffer
  2. I did not initialize the pointer to anything.

A more correct way would be:

char* buffer = malloc( 128 * sizeof( char ) );
int bufferSize = get_string_with_nuls( &buffer );

But this suffers from another problem: I have to guess how big my string is going to be and allocate that much memory up front.

The way I finally succeeded was:

int bufferSize;
char* buffer = get_string_with_nuls( &bufferSize );

This way, get_string_with_nuls can handle the malloc with exactly the correct size and give it to me. I don't have to guess at a size beforehand.
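
For illustration, here is roughly what such a function could look like inside (a hypothetical body; the real one pulls its data from the C++ library):

#include <stdlib.h>
#include <string.h>

char* get_string_with_nuls( int* bufferSize ) {
    const char data[] = "Hello\0World";   /* stand-in for the real data */
    char* buffer = malloc( sizeof(data) );
    if ( buffer == NULL ) {
        *bufferSize = 0;
        return NULL;
    }
    memcpy( buffer, data, sizeof(data) );
    *bufferSize = sizeof(data);
    return buffer;   /* the caller is responsible for free()ing this */
}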

Of course, a struct could do this better, or since I'm actually in C++, an object. I'll be planning a new API as soon as I confirm this one actually works and has proper tests (written in Perl, of course).
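
If the new API goes the struct route, it might look something like this (purely hypothetical at this point):

/* Hypothetical: bundle the pointer and its length together */
struct nul_string {
    char* data;  /* may contain embedded \0 bytes */
    int   size;  /* the length travels with the pointer */
};

struct nul_string get_string_with_nuls( void );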

Adventures in Debugging C/XS

Originally posted as: Adventures in Debugging C/XS on blogs.perl.org.

... or Why A Good Perl Developer Is Not Automatically A Good C Developer, the Story of C Programming via Google.

My tests failed, but only sometimes. I was building an XS module to interface with a C wrapper around a C++ library (wrapper unnecessary? probably). make test was failing with exit code 11. Some quick searching revealed that I had an intermittent segfault: calling a function as_xml would fail with a SEGV in strlen(). This only happened in perl, after as_xml returned, when perl was making an SV out of the return value. It also mostly happened during make test: running prove myself would succeed 19 times out of 20, while make test would fail 19 times out of 20. Worse, my C test program never failed at all.
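
The strlen() happens when perl turns the returned char* into an SV; the default conversion looks roughly like this (an illustrative sketch, not the module's actual code):

#include "EXTERN.h"
#include "perl.h"

/* Passing a length of 0 tells perl to find the length itself, so
 * newSVpv() calls strlen() on the pointer. If that pointer is not
 * NUL-terminated, or has already been freed, strlen() walks off the
 * end of the buffer, sometimes into an unmapped page: SEGV. */
SV* sv_from_c_string(pTHX_ const char* str) {
    return newSVpv(str, 0);
}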


WebGUI 8 Status Report

Originally posted as: WebGUI 8 Status Report on blogs.perl.org.

A major milestone in WebGUI 8 development was reached this week: a dry run of the WebGUI 8 upgrade was successfully performed against the plainblack.com database. This means the only thing remaining before releasing an 8.0.0 alpha is updating all the custom code on http://plainblack.com and http://webgui.org. As always, plainblack.com and webgui.org will be the first sites running the latest bleeding-edge version of WebGUI (unless one of you wants to beat me to the punch).

This month, I also gave a presentation to Madison.PM about building applications in WebGUI 8, a quick introduction to Assets and an overview of the most important changes to how they work. The slides are available at http://preaction.github.com/ and the code samples are linked at the end.

On an unrelated topic, I really enjoyed using S5 to build my slides, SHJS to highlight the code inside, and Github Pages to host the whole thing. I plan on doing the same for all my presentations: they look good, they're readable and editable without any special program, anyone can fork and update them, and they're served by a nice, fast, free host.

CHI Saves The Day

Originally posted as: CHI Saves The Day on blogs.perl.org.

The Server Is Down

  1. No it isn't, I didn't get paged.
  2. Wait a minute, why didn't I get paged?
  3. FUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU--
  4. CHALLENGE ACCEPTED

Diagnosis

The client reported that the site sometimes took more than a minute to load. It wasn't responding slowly for me, and the pager is only primed to ping me on sustained downtime (hiccups are not something I want to wake up for every night at 3:00am).

Strangely, load hovered around 7 most of the time, only spiking to 13 every few minutes. With a 16-core processor, this was well within operating parameters, if just a little worrisome. Nothing in the log files.

Oops, now I get a slow page load. Takes 30 seconds to load a page. Refresh again, and the page loads just fine. Clear browser cache, and the page still loads just fine.

top kept MySQL at the top of the CPU list. Not surprising, as this server is the master database server for a two-node cluster. So I kept an eye on top while polling MySQL for its process list.
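
Polling, in this case, just meant re-running one query in a mysql shell:

-- show every running query and how long each has been running --
SHOW FULL PROCESSLIST;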

A pattern emerges: the load spikes and the server goes unresponsive whenever the process list fills up with identical cache writes.

The process list showed 12 different processes trying to update the same cache location (process IDs 2-3, 5-8, 10, 12-13, 18, 23, and 26). Because of MyISAM's table-level lock, any request to read from the cache has to wait for those 12 REPLACE INTO queries to complete. They had already taken 1 second; if each replace takes 2 seconds, that's 24 seconds of unresponsive website.

These 12 processes all saw that the cache item had expired, and each tried to update it. This is called a "cache stampede". Only one of them needs to update the cache; the rest are just wasting resources. Worse, each one is doing all the work of rebuilding the cached value, which is much more expensive than reading it from the cache. If it's expensive enough, the site goes down hard.

Management

How can we stop the cache stampede? One way is to mildly randomize the actual expiration date when checking if the cache is expired:

sub is_expired {
    my ( $self, $key ) = @_;
    my $expires = $self->get_expires( $key );
    # Randomize the expiration by up to 5% +/-
    # by first removing 5% and then adding 0-10%
    $expires = $expires - ( $expires * 0.05 ) + ( $expires * 0.10 * rand );
    # Expired once the current time passes the randomized expiration
    return time >= $expires;
}

In this very simple case, if you are within 5% of the expiration time, there is a chance the cache item will be considered expired. That chance grows as time passes, reaching 50% at the actual expiration time and 100% at 5% past it. (For a real implementation, the percentages should be taken from the item's lifetime rather than from a raw epoch timestamp; CHI's expires_variance, below, works on the lifetime.)

Rather than add this expiration variance to our custom database cache, I opted to move this site over to CHI, which has this protection built in.

my $cache   = CHI->new(
    driver              => 'DBI',
    namespace           => 'localhost',
    dbh                 => $dbh,
    expires_variance    => '0.10',
);

This stops the cache stampede, but we're still hitting the database a lot. Remember we have two web nodes hitting one database node. The fewer database hits we make, the better performance we can get without having to ask for more hardware from the client (which takes time, and forms, and more forms, and meetings, and forms, and more meetings, and probably some forms).

Because this is a distributed system, we need a distributed, synchronized cache. We cannot use memcached, as WebGUI 7.x does not support it (but WebGUI 8 does). So for now we must use the database as our synchronized cache, but what if we put a faster, local cache in front of the slower, synchronized cache?

CHI has an awesome way to do this: add an l1_cache:

my $cache   = CHI->new(
    driver              => 'DBI',
    namespace           => 'localhost',
    dbh                 => $dbh,
    expires_variance    => '0.10',
    l1_cache            => {
        driver      => 'FastMmap',
        root_dir    => '/tmp/cache',
    },
);

Now we're using FastMmap to share an in-memory cache between our web processes, and if the L1 cache is expired or missing, we look for content from the DBI cache. If that cache is missing or expired, we have a cache miss and have to recompute the value.
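
None of this changes the calling code, which is part of CHI's appeal. A typical read-through lookup might go like this (get_popular_posts() is a hypothetical stand-in for the expensive work):

my $posts = $cache->get( 'popular_posts' );
if ( !defined $posts ) {
    $posts = get_popular_posts();   # the expensive recomputation
    $cache->set( 'popular_posts', $posts, '10 minutes' );
}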

Hurdles

I had to install the DB tables myself, which was not difficult, just undocumented (bug report filed). MySQL only allows a 1000-byte key, and CHI::Driver::DBI tries to create a 600-character key. This is fine in the Latin-1 charset (600 characters is 600 bytes), but with UTF-8 as the default character set, 600 characters can take up to 1,800 bytes, and MySQL complains.

The driver also tries to create a TEXT field to hold the cache value, but the cached values are serialized binary data, and MySQL expects a TEXT field to hold valid characters in a known character set. After noticing that my cache values came back empty, I changed the column to a LONGBLOB.

The full CREATE TABLE statements are below:

-- primary cache table: chi_<namespace> --
CREATE TABLE IF NOT EXISTS `chi_localhost` (
    `key` VARCHAR(255),
    `value` LONGBLOB,
    PRIMARY KEY ( `key` )
);

-- CHI metacache table --
CREATE TABLE IF NOT EXISTS `chi__CHI_METACACHE` (
    `key` VARCHAR(255),
    `value` LONGBLOB,
    PRIMARY KEY ( `key` )
);

Results

The server is stable again! Spikes no longer turn into out-of-control load and an unresponsive server. We'll see how things go tomorrow during normal business hours (the peak time for this site), but right now it looks like CHI has saved the day!