Saturday, March 13, 2010

Comparing the Ruby/PHP/Python C Interpreters

The other day I went poking around the Ruby and PHP interpreters (the current stable versions). I hadn't looked inside PHP since the 4.x series and Ruby I had never checked out. Like CPython the internals of both PHP and Ruby look something like their resulting language, but in C. For each interpreter I just compiled it and looked at how core types and extension types were implemented.

Ruby 1.9.1


Ruby-the-language has lots of syntax and its core types are just as extensible at run time as classes written in ruby (you can monkey patch core types). The compile was clean and runs with -Wall, generating just a couple warnings. All the unit tests passed. The grammar is implemented with lex/yacc and the resulting parse.c file took 10 minutes to compile on my 1.5GHz machine. Did I mention the grammar is big?

There is no difference between ruby core types and extension types written in C. That is mostly true in python but ruby goes all the way. The C-struct that holds information about the ruby type has a hash map that contains all the type's methods - and I mean all of them. Here is the interface for adding the equivalent of Python's __add__ method (rb_cBignum is the core arbitrary-precision integer type):

rb_define_method(rb_cBignum, "+", rb_big_plus, 1);

The "+" is the not-so-magic name for the addition operator. The type's hash uses "+" as a key that points to the value of the addition function. That is a beautiful interface compared to CPython, where you have to put the __add__ method in the right place in a struct[1]. As an optimization the "hash" is actually a list if the number of methods is small; method strings are interned and assigned a number - I'm not sure why this is faster than just keeping the hashkey on the string and always using a dict, but I assume someone benchmarked it.
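To make the idea concrete, here is a toy model in Python (not Ruby's actual C code) of a per-type method table keyed by interned method names. MethodTable, intern_name, and the cutoff of 8 are all made up for illustration; Ruby's real data structures and thresholds are different.

```python
# Toy model of a per-type method table. Every name here is hypothetical.
_interned = {}  # method name -> small integer id


def intern_name(name):
    """Return the integer id for a method name, assigning one on first use."""
    return _interned.setdefault(name, len(_interned))


class MethodTable:
    SMALL_LIMIT = 8  # assumed cutoff, chosen only for this sketch

    def __init__(self):
        self._small = []  # list of (id, function) pairs while the table is small
        self._big = None  # dict keyed by id once the table outgrows the list

    def define(self, name, func):
        key = intern_name(name)
        if self._big is not None:
            self._big[key] = func
            return
        # Redefining an existing method replaces the old entry.
        for i, (existing, _) in enumerate(self._small):
            if existing == key:
                self._small[i] = (key, func)
                return
        self._small.append((key, func))
        if len(self._small) > self.SMALL_LIMIT:
            # Too many methods for a linear scan: switch to a real hash.
            self._big = dict(self._small)
            self._small = []

    def lookup(self, name):
        key = _interned.get(name)
        if key is None:
            return None  # a name that was never interned cannot be defined
        if self._big is not None:
            return self._big.get(key)
        for existing, func in self._small:
            if existing == key:
                return func
        return None


# Rough analogue of rb_define_method(rb_cBignum, "+", rb_big_plus, 1)
bignum_methods = MethodTable()
bignum_methods.define("+", lambda a, b: a + b)
bignum_methods.define("-", lambda a, b: a - b)
```

Looking up "+" and calling the result with two numbers adds them; the operator is just another string key, which is the part I like about the Ruby interface.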

PHP 5.2.13


[NB, I should have looked at the 5.3.x release but the 5.2.13 release was at the top of the homepage when I went looking]

I hadn't looked at PHP since the 4.x series (see my why I started using Python post). PHP has added some nice features since then, like namespaces, but the interpreter looks much the same. The compile uses a custom wrapper around gcc and is very spammy: a dozen -I include directories on each line for hundreds of C files. It does not use -Wall by default so if you want really really spammy turn that on. After compiling PHP I ran the unit tests and 7 failed[2]. All 7 had to do with bad conversions between signed and unsigned numbers (a negative signed int is a positive unsigned int). This is a production release so those failures are not confidence inspiring.

Like PHP-the-language the C interpreter makes a big distinction between core types and extension types. The core types are int, string, and list/hash (a hybrid). The C-struct holds a union that is either an integer, string, list/hash, or "resource" (everything else). Extension types can't do operator overloading so the interpreter has if/else clauses for handling the core types. Methods are added by registering them by resource number in a global registry.
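A rough Python sketch of that shape: a tagged value, an if/else chain for the core kinds, and a global registry for everything else. This is a toy and not PHP's real struct layout; Value, add, register_method, call_method, and the resource number 42 are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Any

# Tags for the kinds of value the toy interpreter knows about.
INT, STRING, ARRAY, RESOURCE = "int", "string", "array", "resource"


@dataclass
class Value:
    kind: str     # which member of the "union" is live
    payload: Any  # the int, str, dict, or resource number itself


def add(a, b):
    """The '+' operator, hard-wired for core kinds only (toy semantics)."""
    if a.kind == INT and b.kind == INT:
        return Value(INT, a.payload + b.payload)
    elif a.kind == ARRAY and b.kind == ARRAY:
        merged = dict(a.payload)
        for key, item in b.payload.items():
            merged.setdefault(key, item)  # keys from the left operand win
        return Value(ARRAY, merged)
    else:
        # Extension types land here: no hook for overloading the operator.
        raise TypeError("unsupported operand kinds: %s + %s" % (a.kind, b.kind))


# Global registry: resource number -> {method name: function}
_registry = {}


def register_method(resource_number, name, func):
    _registry.setdefault(resource_number, {})[name] = func


def call_method(value, name, *args):
    if value.kind != RESOURCE:
        raise TypeError("only resources have registered methods in this sketch")
    return _registry[value.payload][name](*args)


register_method(42, "describe", lambda: "file handle")
handle = Value(RESOURCE, 42)
```

Adding two INT values works, and calling "describe" on the handle goes through the registry, but add(handle, handle) raises TypeError because the if/else chain has no branch for it.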

Objects get passed around in the core as pointers to pointers, and sometimes as pointers to pointers to pointers. I'm not sure why, but this can't be good for speed.

Python 2.5+ 3.x



I'll lump all releases of Python after 2.5 together because the internals are very similar. The AST (abstract syntax tree) that the byte compiler uses was rewritten and simplified for the 2.5 release and there haven't been any big changes to the internals since then. The 3.x releases made some big simplifications, but they still use the same framework.

Like Ruby, Python compiles cleanly and uses -Wall, generating few warnings. The test suite passes. Python doesn't make a distinction between core types and extension types: if you copied Objects/dictobject.c and renamed it "mydict" [insert dict joke here] you could ship it as a module and "import mydict". The only difference is that the byte compiler knows that when you type "d = {}" you mean "d = dict()".
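You can see the byte compiler's special knowledge of dict with the standard dis module. This is a small sketch that assumes CPython; the helper name opnames is mine, and exact opcode names vary between versions, so it only checks for the map-building opcode versus a name lookup.

```python
import dis


def opnames(source):
    """Compile a snippet and return the names of the opcodes it produces."""
    code = compile(source, "<example>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]


literal_ops = opnames("d = {}")      # built directly by a dedicated opcode
call_ops = opnames("d = dict()")     # looks up the name 'dict' and calls it

print("BUILD_MAP" in literal_ops, "BUILD_MAP" in call_ops)
print(type({}) is dict)
```

The literal never looks up the name "dict" at all, which is why rebinding dict in your module doesn't change what {} gives you. A copied "mydict" module would get the call path and not the literal syntax.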

The C-struct for python types is a bit more complicated than the ruby one. It has specific slots for all the magic methods like __add__ instead of keeping them in a hash map like it does for pure-python classes. Like PHP the execution loop does have some if/elses for core types like integer, but unlike PHP this is just a speed hack and not a requirement (I assume Ruby does something similar).
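The slots-versus-hash difference is visible from Python itself. A minimal sketch, assuming CPython (the Money class is just a made-up example): a pure-python class keeps __add__ in its class dict and can be patched at run time, while int exposes its C slot through a wrapper and refuses the assignment.

```python
class Money:
    """A made-up pure-python class used only for this illustration."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        return Money(self.cents + other.cents)


# Pure-python classes keep magic methods in an ordinary dict on the class.
in_class_dict = "__add__" in vars(Money)

# On CPython, int.__add__ is a thin wrapper around the C-level slot.
slot_wrapper = type(int.__add__).__name__

# Monkey patching a pure-python class works...
Money.__sub__ = lambda self, other: Money(self.cents - other.cents)

# ...but a core type written in C rejects it, unlike in Ruby.
try:
    int.__add__ = lambda a, b: 0
    patched_int = True
except TypeError:
    patched_int = False
```

So the "mostly true" in my Ruby comparison: core and extension types are built the same way in C, but neither is as open at run time as a class written in python.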

Conclusion



So there you have it. All three interpreters look much like their parent language once you get under the hood. I'd mention the perl interpreter too but it's been years since I dove into that one; but guess what? It looks like perl.

[1] python-dev has several threads about adding a similar simple interface. Someone just has to do the work (at PyCon Hastings said he's exploring it).
[2] I downloaded PHP 5.3.2 and the 7 test failures I saw are fixed, but I get 9 new and different failures.

PS, blogger hates H4 tags. Why the extra newline?

Friday, March 5, 2010

Some Odd Observations as a PyCon Speaker

1) Your answer to the first question after your talk will be simple, neat, and wrong.
2) That question will have been asked by Larry Hastings.

Extrapolation from my experience might fail in your particular circumstances because Larry isn't omnipresent. It might fail for my talks too: the pycon video archives only go back to last year.

NB "wrong" in the sense of "less than optimally correct." I included a fuller answer on my published slides both years. Which no one will see.

Tuesday, March 2, 2010

PyCon Wrapup II: Python Stuff

[People stuff is trees-and-forest, so here is a post on what was done about Python at PyCon]

The talks were good and the 5x (as opposed to 4x) tracks didn't seem to hurt. Worst case: any talk you missed you can watch on pycon.blip.tv. Speakers were aware that their talks would be recorded so using laser pointers (instead of highlights or spoken words) is going away. This is a sideways move - laser pointers were useful right up until they weren't.

The Language Summit was far more boring than last year. Python 3.x issues are mostly settled from the core-dev standpoint so the big issues were distutils (how, and at what level should python packages care about packaging) and alternate implementations. Unladen Swallow was the talk of the town not because they are the first alternate implementation of Python but because they are the first implementation that plans to ship with benefits and no tradeoffs. Did I say no tradeoffs? It was unanimous that both distutils2 and Unladen Swallow would be integrated once the tradeoffs were wholly positive. Who can't get behind that?

PyCon sprints were smaller for python-dev this year but I couldn't tell if that was true for other major groups like Twisted and Django (the rooms were more broken up this year). I can say for certain the python-dev sprint had both fewer regulars and fewer pure-newbies. One local EE postgrad wanted to help out and simultaneously tried to make me care that his badge and his name didn't match; he went by "Cedric" but his badge said something else (I believe he was Caribbean). I have a long standing amusement with first names: my birth certificate doesn't say "Jack," Titus Brown's doesn't start with "Titus," and Alex Martelli's doesn't say "Alex." For that matter Guido's may say "Guido" but he doesn't care how you pronounce it. In fact all the groups I participate in and care about most don't care who you are legally, and don't ask for ID at the door (to riff on my last post, caring about legal ID is a "negative trust" cue).

The PSF (Python Software Foundation) exists to serve two different classes of users: end-users and People-who-hire-end-users. To put it differently the PSF is a single purpose organization that wants more users both from the bottom-up and top-down. The bottom-up stuff has been easier to organize in the form of "your-locality-here" Cons. While the PSF wants to help people to do more of that they also want to aid the corporate users who have an interest in Python. Getting companies to spend money and organize sprints has happened quite successfully before, but very irregularly (see the Need for Speed sprint). Quite happily I can say that if half the events that were spit-balled at PyCon come to be then sprints will be even more prolific in the near future (both bottom-up and top-down) and they will be just as free but even more topic specific (2to3 porting, hardcore dev stuff). At least four groups have intentions to do an event in Boston, for instance.

The CPython bug tracker has 2000 outstanding issues (a mix of bug reports/feature requests/doc requests). A new status field named "languishing" was added because there are a lot of bugs that have +1/-1 comments by core-devs but no resolution; it is a classic "middle school dance"* deadlock where no one feels they have authority and everyone is just waiting for someone else to pronounce. AMK and I closed about 20 of these during sprints (some applied, some rejected) but there are still a ton of these bugs outstanding. They just need a champion (for or against) to get resolution. Alex Gaynor recently did a post about who-gets-what-commit-rights on various python projects (not including python-dev). Python-dev can be dysfunctional because there are 120 committers and everyone assumes there is someone else who knows better for any particular bug (the "middle school dance").

Steve Holden and Michael Foord both have the audio bug: imagine what camera-crazy people spend on digital SLR gear but apply that to audio eq. Between the two they captured tens of hours of audio at PyCon and some of that should start showing up soon. Editing is the hard part of turning raw audio into something of general interest so maybe four or five hours of that will appear for general publication. The blackmail snippets are easy to produce; if any exist you've already received them and the adjoining demands (Foord sensibly priced his at slightly less than a trans-Atlantic plane ticket; Holden has yet to publish a price list).

* "middle school dance" is comp-sci jargon for a deadlock where party A is waiting for party B to do something and the reverse. The allusion is to boys standing on one side of the gym waiting for the girls to ask them to dance and the girls standing on the other side, etc.