2013-10-24

Nicholas Clark writes:

As per my grant conditions, here is a report for the August period.

Multiple cores are the future! And we need to enjoy this, because we aren't going to get any choice about it. The current hardware for the hot backup for perl5.git.perl.org has 24 cores, and even mobile phones are thinking about going quad-core.

The upshot of this is that the more that you can get the computer to do parallel, the faster the entire task will complete. On *nix the make has been able to run in parallel for quite a while, and back in 2008 I hacked away on TAP::Harness until we got the tests running in parallel too. (Sadly only on *nix - Microsoft's divergent implementation of socket() frustrates the same approach being used on Win32.)

Well, to be strictly accurate, the make has been able to run in parallel except when it can't, due to race conditions we were previously unaware of. (Just like the tests, actually.) We chip away at them as we find them, and I guess that the best we can hope for is "no known problems", and an increasingly long time since the last "unknown unknown" turned up.

Frustratingly there had been one "known problem" that we hadn't figured out a way to solve. SDBM_File contains a subdirectory in which is the source code to libsdbm.a, and the Makefile.PL for SDBM_File ends up generating a Makefile which has two rules, both of which recurse into ext/SDBM_File/sdbm/ and try to build the same target. Run in parallel, and you can end up with one process deleting libsdbm.a at just the same time that a second process is trying to link it, resulting in that second process bailing out with an error and the build failing. One rule was the default "build my subdirectory" rule generated by ExtUtils::MakeMaker. The second was a custom rule generated by the Makefile.PL to say "to build the dependency libsdbm.a, you need to do this..." and the difficulty was that the rule that we needed to keep was the second one.

What has eluded all of us for a very long time is that actually it's pretty easy to stop ExtUtils::MakeMaker generating that default rule - if you know how. So the fix turns out to be quite simple, once you know the secret, and now we think that this bug is finally fixed, and as no others have cropped up in the Makefile for a while, it's starting to look like the Makefile might be bug free and reliable in parallel.

If multiplying is good for cores, then it must be good for BSDs too, as they seem to be breeding like rabbits*. The problem with this is that each BSD ends up having a hints file to tell Configure about its peculiarities and foibles, those hints files keep getting bigger, and when (for example) Bitrig forks off OpenBSD, then the Bitrig hints file starts out as a near copy of OpenBSD's. (Although, of course, not identical, as inevitably something key will change, such as the version numbers)

The upshot of adding support for Bitrig was that now we had 3 hints files (OpenBSD, MirBSD and Bitrig) each of which contained a 59 line test to determine if the C library has broken 64 bit functions inherited from OpenBSD.

Triplication is never good, and as the code in question looked to be pretty much portable and likely to pass on any sane system, I looked into moving it into Configure, and hence running it everywhere.

Just for a change, this all went strangely smoothly. AIX and HP-UX, both of which have choked on these sorts of things before, built it just fine, as did the less problematic Solaris, FreeBSD and Linux. Various smoke-me smokers reported it clean on Darwin, NetBSD and Cygwin, and BinGOs manually tested on both 32 and 64 bit Cygwin without problems. So it went "in", and the code base got (net) 124 lines shorter.

It would be wonderful to parameterise all the BSDs into one master hints file and a set of options ("oh, it's basically FreeBSD-like, but they use git for a VCS, the NetBSD ports system, but default to ZFS"), but I suspect that that's more work than it's worth. Still, it remains that the hints files for *BSD are already more than twice as long as the Linux hints file, and sadly I don't think that that ratio is going to improve.

Less conclusive was my investigation into dl_unload_all_files(). This feature of DynaLoader dates back 13 years, to just before v5.6.0 was released, and the open ticket is of a similar vintage, "ID 20000402.003" in the old system (migrated to RT as #2968).

The problem that dl_unload_all_files() is attempting to solve is that in an embedded environment, the perl interpreter's exit may not always correspond to the process exit. Worse still, the perl interpreter might be be started a second time in the same process, and this ought to work. So to make everything work smoothly, it's important that when embedded, the interpreter frees up all resources that it was responsible for allocating.

One of the more obscure resources that had got forgotten was shared libraries that it had loaded, typically by DynaLoader when loading extensions dynamically. So dl_unload_all_files() was part of an attempt to have DynaLoader track all the shared objects that it loaded, and free them all at interpreter exit. Only it didn't quite work out, and despite the documentation claiming that it was active, it quietly got disabled unless DynaLoader.xs was compiled with DL_UNLOAD_ALL_AT_EXIT.

The summary is that "it's a mess".

The curious good news is that even building the 13-year-old code with the current gcc ASAN shows remarkably few problems. The quality of the core's code has been consistently good for over a decade.

But back to the bad news. Even for the single-threaded case, there are problems. The implementation of dl_unload_all_files() isn't robust. It uses perl_atexit() is called after object destruction, but before all the unblessed SVs gets destroyed. The problem is that this is too early for this use case - anything used later that has a pointer into something dynamically loaded instead has a pointer into garbage. For example anything that has tie magic will have the vtable point to static memory (in the sense that it's a static object allocated by the C compiler). If the tie was set up by XS code and that memory was in the shared object, that memory isn't there any more, but the SV still points to where it was. When the full global destruction happens it will attempt to call the relevant vtable routine to clear up the tied scalar, follow the pointer into the abyss and then crash (or worse). The actual problem that highlighted this was the OP free hook in File::Glob, which again is a pointer from something "owned" by the interpreter into memory "owned" by the shared object.

It might be possible to fix these global destructions problems by adding a hook that gets run much much later, after all the SVs are destroyed, pretty much as the last thing before perl_destruct() returns. However, this won't fix the multi-threaded case. It's clear that no-one is building with it with ithreads (or using it with Win32 pseudo-forks), because there is an explosion of fail if you do. The problem is that DynaLoader is using perl_atexit() to unmap shared libraries, and that's per interpreter exit. So as soon as an ithread is spawned, its exit triggers the unmapping of all shared objects, and the next time the parent thread tries to use one, kaboom!

I think to properly solve the "at interpreter exit" cleanup we'd really need to change dl_load_file() to internally track what it's been called for (tracking shared across threads), and (I think) how often, with ownership being upped on thread clone. At which point, interpreter exit is safe to call dl_unload_all_files(). But only really late, when the top level parent interpreter exits. And given that we've lived with this bug for over 13 years now, it doesn't seem to be a high priority to fix.

Something else that proved to have more "depth" than was initially obvious was a bug for threads::shared (RT #119089), where the test case demonstrated that shared references returned from subroutines could disappear on a return. The submitter's concise test case enabled Dave Mitchell to quickly determine that the reported problem had been fixed by a commit in the core C code back in January 2012, and Dave also suggested a work around to avoid the bug. Thus it seemed like case closed.

However, I had my suspicions about that change (6f48390ab209d16e), and whether it was just fixing the symptoms rather than the cause:

[perl #95548] Returned magical temps are not copied

return and leavesub, for speed, were not copying temp variables with a refcount of 1, which is fine as long as the fact that it was not copied is not observable.

With magical variables, that can be observed, so we have to forego the optimisation and copy the variable if it's magical.

This obviously applies only to rvalue subs.

What raised my concerns was that if assertions are enabled then the test case triggers an assertion failure on v5.14.0. If assertions are not enabled, it crashes. So I went looking back in history to find what introduced the assertion failure, it turns out that there's a heck of a lot more to this than first meets the eye.

So, what caused the assertions to start failing between v5.12 and v5.14? A simple bisect suggests that it's a commit to use cBOOL for some casts, which has nothing directly to do with threads shared, or (not?) copying values on subroutine return. It turns out after a bit of digging that it's this section of that commit matters:

deep in the routine that deals with get magic.

It's not obvious from the code above, but SvTEMP returns either 0 or 0x00080000, and in 2010, C
was typedef'd as C
, so the above code always set was_temp to 0, due to truncation. The cast was added to avoid a warning about the truncation. The (buggy) truncation itself was added when the type of was_temp was changed from a suitably large integer to a bool as part of this commit whose only description was "Adding const qualifiers" (ie the commit message doesn't even note that some variables' types were changed)

The variable in question is used to enable Perl_mg_get() to retain the SVs_TEMP bit set on values it is processing. This is needed because Perl_sv_2mortal() conflates "temporary" with "mortal" and automatically sets the SVs_TEMP flag bit on everything. Hence this code seeks to remove SVs_TEMP if Perl_sv_2mortal() set it. The upshot of the bug was that SVs_TEMP was always being removed, so code in subroutine entry that does not copy TEMPs is skipped because the value is becoming unmarked. Hence the values are always copied.

So, all that conceals what actually going on. When that truncation bug was first introduced, it didn't matter that TEMPs weren't copied. When that bug was fixed, it mattered a lot that TEMPs weren't copied. What changed in the middle?

Finding out that involved bisecting between the two commits, patching the bug described above. Porting/bisect.pl makes even this pretty easy, as it can automatically apply user supplied patches to the tree before building. The true trigger of the behaviour regression was a change in March 2010 to how magic is handled on elements from magical arrays and hashes, which fixed bugs most visible with pos. The key part of that change was 1 line which got rid of a copy of a value. Hence that change removes one place that copied the return value, and the change with the cast fix removes the other place, hence why both are needed to trigger the problem. And hence why the identified patch fixes it, by re-instating a copy.

So, I assumed that taking blead, and reverting the patch that fixes the test case, would cause the test case to fail again. This is easy to check, and so I verified this assumption.

The assumption is wrong. More bisecting revealed that subsequently this change also has an influence:

Magic flags harmonization.

because part of it removes the entire was_temp (micro?) optimisation logic mentioned above, meaning that values will always be copied.

(Remember, the bug became exposed because the value ended up not being copied. If it's copied once, or the twice because a copy is copied, then the bug is hidden. And it's this insensitivity to the number of copies that makes it such a crafty little critter)

So that explains the given test case. But the story is still not over.

Whilst temporary values should be copied (by default), LVALUEs returned from :lvalue subs should not. Knowing this, I could easily convert the test case to use :lvalue subroutines and make everything crash (v5.14.0, blead, all in between).

I believe that the problem is because threads::shared wrongly assumes that the proxy for a shared aggregate will always outlive any LVALUE proxies for its elements. This is generally true, but not for element proxies returned from :lvalue subroutines for aggregate proxies which go out of scope with the subroutine.

I believe that the fix is to change threads::shared to deal with cleanup differently, so that the last magic struct which frees up the relevant mg_obj is responsible for cleaning up shared space. This is a reasonably involved change on some complex code, and it's still pending waiting for a confident second opinion that I got it right and it's good to merge.

The full explanation is in the middle of RT #119089, and most easily viewed as https://rt.perl.org/rt3/Ticket/Attachment/1236123/643477/

Last month it was HP-UX doing something strange. This month AIX, as an innocent looking test had started failing on it, but not other Unix systems. Specifically, t/op/fork.t, which had not been modified, was now failing in two places, when previously it used to pass. This implies that some unrelated change had unfortunate side effects, but what? And why?

Fortunately this is just the sort of job that git bisect excels at, and now that most of it is automated it only takes small amount of setup effort to get the computer busy figuring out the answer. Which turned out to be a perfectly innocent looking commit to a harness function that t/op/fork.t was using which fixed a frustrating bug:

commit 684b0ecaad281760b04dd2da317ee0459cafebf6

Author: Tony Cook

Date: Tue Jul 16 14:57:20 2013 +1000

[perl #116190] feed an empty stdin to run_multiple_progs() programs

Two tests for -a were attempting to read stdin and blocking with the -a implies -n change.

So why does this cause a problem? And why only on AIX?

run_multiple_progs() runs its programs using backticks. (backticks turn out to be the only sufficiently portable way to easily capture output.) The code it calls already had a feature to feed things to STDIN, which for various reasons is done by using a pipeline from another perl process. So what that commit does is change the generated backtick command from

to

so that stdin is at EOF, and the thing we're testing can't end up hanging if it inadvertently reads STDIN.

It turns out that this causes fun on AIX with two tests, because on AIX /bin/sh is actually ksh, and ksh does pipes differently (with one fewer process). With sh, for the latter command line the sh process forks two children, which use exec to start the two perl processes. The parent shell process persists for the duration of the pipeline, and the second perl process starts with no children. With ksh (and zsh), the shell saves a process by forking a child for just the first perl process, and execing itself to start the second. This means that the second perl process starts with one child which it didn't create. This breaks the tests assume that wait (or waitpid) will only return information about processes started within the test. One can replicate this is on Linux:

Aside from who gets the child process, the behaviour is identical, because both structures end up with the exit status of the pipeline being that of the last command - in the ksh case because the last command is the parent, in the /bin/sh case because the sh process ensures that it calls exit() with the exit status of the child that it reaped.

Problem is that the tests in t/op/fork.t are sensitive to who gets the child process, because they are counting how many children get reaped. I thought about fixing the tests to make them immune to unexpected extra child processes, but then realised that it was probably easier to change the generated backtick command to be

given that all that mattered was that reads on STDIN avoid hanging, not that they return promptly return anything specific (ie eof instead of error).

I wasn't sure whether this approach would work on VMS, so I tested it there. To be sure that only things that I expected to change changed, I first ran the test suite unmodified. This revealed that t/lib/croak.t, lib/charnames.t and lib/warnings.t failed. These failures are relatively new, and all look like this:

ie the test runs a program, and whilst it is getting the output it expects, it no longer gets the exit status it expects. Craig Berry explained why:

The root cause of the problem on VMS is that command-line redirection is done by Perl and not by the shell. Tony's addition of the stdin parameter to runperl gives us the equivalent of:

The Perl process with an exit value of 0 is a child of the one that has an exit value of 2 so the final status we see in runperl is the exit value of the parent, not of the child. But the child is actually the test program whose exit value we're interested in and we don't get it this way.

Craig had already observed these failures resulting from Tony's (quite reasonable) change and had come up with a fix involving another pipe to capture the child's exit status, but because of the ugly complexity that entailed he'd been "sitting on it for a bit hoping to think of something better".

So it turned out that the solution of setting up STDIN as EOF actually got us two fixes for the price of one. Strictly, we haven't "solved" the incompatibility of the VMS pipe implementation when it comes to exit statuses, because instead we've avoided the tests relying on it. But as the VMS behaviour has been that way for over a decade now, it seems unlikely that it's going to change any time soon.

I need to thank Zefram for helping diagnose the problem.

Finally, did you know that ${MPEN} used to be treated as a synonym for ${OPEN}? No, and I didn't either. This was caused by a missing break; statement in Perl_gv_fetchpvn_flags(), in the check for variables starting with M. What should have happened is that if the name isn't MATCH it behaves like any other unused multi character control variable. What was going wrong was that it was falling through into the code for variables starting with ^O, so the check there for 'are the next 3 characters "PEN"' would pass, and hence the odd behaviour. Now fixed in blead.

And this came about as a side effect of a bug report that Carp::confess loses $! on Windows. Which it does, as it turns out, due to an unhelpful combination of Microsoft's atoi() function being buggy (it resets errno. That's naughty), and the regular expression match variables making a call to atoi() each time they are read. No, I didn't know that. That seems mighty inefficient, so I looked to fix it, hence I ended up also looking at the code for ${^MATCH}, and spotted the above problem.

As to $1 and friends...

Well, it transpires that since forever (perl 5.0 alpha 2) runtime lookup of the numbered match variables has been implemented by finding their name (as a string) and converting it to a number (the aforementioned call to atoi()) to get the array index to use. Which works, but sure doesn't feel like the most efficient way of doing it (even if it didn't also cause strange side effects due to buggy libraries). I think that it was done this way because it's the same approach as used by all the other punctuation variables - they all have magic type '\0' attached to them, and then the routines called for reads and writes on such variables uses switches based on the name to perform the correct action.

It happens that the "name" is stored as a pointer, length pair. It's also the case that nothing (else) relies on the name, and that the name is never NULL. All this is useful. We have an integer field (the "length") and no existing flag value of NULL. So it was a S.M.O.P. to change the representation such that for match variables the pointer is now NULL, and the length is used to store the array index, with the string to number conversion done once at compile time.

More pleasingly everything seemed to come together nicely. Other routines that handle magic (such as thread duping) already work smoothly if the length is non-zero even though the pointer is NULL, so nothing needed changing there. Not needing to store the name until runtime is a memory saving. Thanks to some refactoring Dave did for the regex engine, all the match variables are actually now implemented as integers - $& is 0, $' is -1, etc, which means that they too can be implemented this way, not just $1 upwards. Which means that the C code is 23 lines shorter as a result (although the object code is about the same size - you can't have it all).

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in http://perl5.git.perl.org/perl.git

RT #... is a bug in https://rt.perl.org/rt3/
CPAN #... is a bug in https://rt.cpan.org/Public/

ID YYYYMMDD.### is an bug number in the old bug system. The RT # is given afterwards. You can look up the old IDs at https://rt.perl.org/perlbug/

Hours

Activity

0.50

${^MPEN}

0.50

/[#$x]/

10.25

AIX t/op/fork.t

AIX t/op/fork.t (VMS runperl)

0.25

CPAN #88258, CPAN #88259, CPAN #88260

0.50

CPAN client and modules removed from core.

0.50

HP-UX 64bit

0.50

ID 20020301.005 (RT #8732)

0.50

LVALUE macros

9.50

NULL PL_curcop

0.50

PerlIO_sprintf, PerlIO_vsprintf

0.50

RT #107816

0.25

RT #116118

1.00

RT #116907

0.75

RT #118987

RT #118987 (SDBM_File's Makefile)

13.25

RT #119089

0.25

RT #119097

1.50

RT #119155

0.25

RT #119161

0.25

RT #119181

0.25

RT #119189

1.00

RT #119195

RT #119195, RT #119197

0.25

RT #119351

1.50

RT #119409

0.50

RT #119437

0.75

RT #119445

0.25

RT #119497

0.50

RT #119515

20.00

RT #2968

RT #2968 (dl_unload, etc)

RT #2968 (dl_unload, etc) ID 20000402.003

0.25

RT #55896

0.25

RT #92446

1.50

Test::PerlRun on VMS

0.25

VMS

0.50

\Q \E

9.00

atoi

atoi ($1, $2, $& etc)

atoi (RT #116118)

atoi (RT #81586)

0.25

bisect.pl

2.50

de-duping BSD Call Back Units

1.50

dl_dld.xs

0.50

global destruction

1.00

lib/perl5db.t

0.75

process, scalability, mentoring

22.50

reading/responding to list mail

2.50

runperl

2.00

t/re/pat_thr.t

0.25

threads::shared leaks

2.50

untangling PAUSE permissions

114.50 hours total

* credit Leon Timmermans for that analogy. It feels sadly apt :-(

Show more