Tuesday, May 15, 2007

The Europa Train - Blessing or Curse?

A bit of a controversial topic today...

I've always had mixed feelings about the simultaneous releases of Eclipse. It's definitely great to be able to go to one place and download a family pack of nuggets of Eclipse goodness (mmm.... nuggets...), and know that they all work together. Each year the release makes a big splash in the media too, which is great for Eclipse and everyone involved. That kind of publicity will likely make Eclipse.org the victim of its own self-generated denial-of-service attack as the download servers are brought to their knees by the throngs of downloads. That is definitely the kind of popularity and success that we all want.

One thing I've never liked about it though is all the peer pressure to make sure the train is on time. Don't get me wrong, I love having the train on time just as much as everyone else. It's nice to be able to know when the train is going to arrive so that you can plan your ride. Your boss is not going to be happy if you tell him you have absolutely no clue when you will be in to work because you don't know when the train is coming. But, if you knew that parts of the train were broken, wheels were missing, etc., wouldn't you want the mechanics to take the time to fix it, rather than have it break down mid-trip, forcing you to whip out your multi-tool and a pack of gum and MacGyver a solution? (BTW, I hereby declare MacGyvering to be a word.)

My point is, the train sometimes feels less like a train and more like a juggernaut. Last year for Callisto, we shipped a CDT 3.1.0 that had a lot of problems with it. CDT 3.1 was a big release that had a lot of new functionality, and it's inevitable that when you deliver a lot of new content there are going to be lots of problems, through absolutely no fault of the committers and contributors. Even good code inevitably has a certain number of bugs, and so it follows, at least to a certain extent, that the more new code you put in, the more bugs there will be. As some for-instances, there were a lot of problems with the indexer, and search was pretty much broken at the time. Yet, we shipped anyway, because the train had to be on time. It's often said to ISVs that consume CDT that they shouldn't take the dot-zero because it will be buggy, and should instead take the dot-one. I hate to say it, but really IMHO CDT 3.1.0 should have been baked for a while longer.

I don't think this is the greatest situation. Individual projects ought to be able to hold up the train if they need to. However, along the lines of a recent post by Doug Schaefer, I think Eclipse sometimes gets itself into a situation where it's a victim of its own hype. The date for the release is picked a year in advance. We spend so much time in that year hyping the release that by the time we start getting into the bug fixing cycle, there is already so much pressure to release on time that we couldn't hold things up if we wanted to. We shouldn't be releasing things that we know for a fact that people shouldn't be using. Speaking in practical terms, a dot-zero release is never going to be flawless, but if you're shipping with major functionality broken, or with crippling bugs that preclude widespread adoption, then I think we have somehow lost sight of the purpose of the release. A release that can't really be used is somewhat pointless.

Now, I'm sure someone is going to reply to this and say something along the lines of "the release is at the whims of the committers", and that we really have the power to hold things up. Technically it may be true by the letter of the process, but if you really believe that I suggest you try it and see what happens. Short of the Platform or JDT being horribly busted, I'm pretty sure you will get voted down.

I think that what needs to happen is that the release needs to be bug count driven. This process is not flawless either, but the idea is that you do what we do on the CDT milestones: we don't do the build until all the bugs targeted for that milestone are fixed. When we reach Zarro Boogs, then it's Go Time. Sure, nefarious people can still play games by spuriously marking bugs RESOLVED - INVALID, or by fiddling with severities and target milestones or what have you, but I think that on the whole the idea works. This way, the date of the release is driven from the bottom up by the committers, and is not imposed on them from above for them to deal with after the fact. Sure, you still need to give a rough estimate to people as to when they can expect something (e.g. "Summer 2008"), and the committers shouldn't be given license to delay as long as they please without reason, but at least then there is some flexibility built into the plan.

Don't misread what I am saying here either. CDT 4.0, which is coming out on the Europa train, is shaping up to be both the most feature rich and yet most robust version of CDT yet. I'm not currently anticipating a recurrence of what happened on 3.1.0, and I would definitely recommend to users of previous versions of CDT that they move up to 4.0 and take advantage of the scores of bug fixes and new features it includes. And, I also think having a release train is on the whole a good thing. But, I think there are some things we can do to make sure everyone gets a say in how the train operates.

I'm curious to see how next year's release will unfold...

Monday, May 7, 2007

Language Extensibility In CDT 4.0

This screenshot is really cool. But why?


We'll get back to the screenshot I promise.

Well, CDT 4.0 RC0 just went out the door last week, marking our first feature complete build for Europa. Confusingly enough our next build this week is going to be marked as M7, so we have the odd situation where we have a milestone build after our first RC, but the team felt it was important to keep the naming convention consistent with the Europa build of which we will be a part, and we didn't want users getting confused about which build of CDT to use with Europa M7.

There are a few cool new features that my team here at IBM has been working on that a number of ISVs and other language tools authors are hopefully going to find useful. It's always been fairly easy to add support to CDT for compiling different languages via CDT's Managed Build and Standard Make projects, but we've been working recently to make it easier to integrate new C-like languages into the Core so that cool features like search, open declaration, and content assist all work.

For a while now, it's been possible to contribute definitions for new languages into the CDT core. Circa CDT 3.1, we added an extension point to CDT to allow you to contribute new languages via the ILanguage interface, and to map those ILanguages to an Eclipse content type. Each ILanguage has methods it must provide that let you parse a file and get an Abstract Syntax Tree (AST) out of it as a result. Once you have an AST all those cool features I mentioned earlier start working, provided you use CDT's DOM AST APIs.
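To make the shape of that contract concrete, here is a minimal sketch in Python. It is not CDT's actual API (which is Java): the `Language` class, the `AstNode` type, the toy `LineLanguage`, and the content type id are all hypothetical stand-ins, assumed only for illustration. The idea it shows is the one described above: a language knows how to turn source text into an AST, and a content type is mapped to a language.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class AstNode:
    """Toy AST node; a stand-in for CDT's DOM AST node types."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


class Language(ABC):
    """Hypothetical analogue of ILanguage: one parser per language."""

    @abstractmethod
    def language_id(self) -> str:
        ...

    @abstractmethod
    def parse(self, source: str) -> AstNode:
        ...


class LineLanguage(Language):
    """Trivial 'language' whose AST has one node per non-blank line."""

    def language_id(self) -> str:
        return "example.lines"

    def parse(self, source: str) -> AstNode:
        root = AstNode("translation_unit")
        for line in source.splitlines():
            if line.strip():
                root.children.append(AstNode("line", line.strip()))
        return root


# Content type id -> language: the association the extension point sets up.
# The id below is made up for this sketch.
LANGUAGE_FOR_CONTENT_TYPE = {"example.contentType.lines": LineLanguage()}


def parse_file(content_type: str, source: str) -> AstNode:
    """Look up the language for a content type and parse with it."""
    return LANGUAGE_FOR_CONTENT_TYPE[content_type].parse(source)
```

Anything that consumes the tree (search, outline, content assist) only needs the AST, which is why a new language gets those features once it can produce one.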

This worked great for clients such as the Photran project (who do the Fortran language IDE integration for Eclipse), but it was a bit problematic if you actually wanted to override what the language was for C and C++ files. CDT would look for extensions to the extension point, but would stop looking once it found the first one for any given content type (I'm simplifying things but this is how it would appear from the user's point of view). Hence, there was no deterministic way to make sure that your language was the one that got used.

In CDT 4.0 we've now added the concept of language mappings to the workbench preferences and the project properties. What this means is that the user can go in and change which language is mapped to which content type, even down to the level of the individual file.
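One way to picture the lookup is as a chain of mappings where the most specific one wins. The sketch below assumes that precedence (file, then project, then workspace, then a default); the post does not spell out the exact order CDT uses, and the function name, language ids, and content type ids are hypothetical.

```python
def resolve_language(
    content_type: str,
    file_path: str,
    file_mappings: dict,
    project_mappings: dict,
    workspace_mappings: dict,
    default_language: str,
) -> str:
    """Return the language id for a file; the most specific mapping wins.

    Assumed precedence (not confirmed by the post): per-file mapping,
    then project properties, then workbench preferences, then a default.
    """
    if file_path in file_mappings:
        return file_mappings[file_path]
    if content_type in project_mappings:
        return project_mappings[content_type]
    if content_type in workspace_mappings:
        return workspace_mappings[content_type]
    return default_language


# Hypothetical ids, for illustration only.
workspace = {"example.cSource": "example.gnu_c"}
project = {"example.cSource": "example.c99"}
files = {"src/kernel.c": "example.upc"}

# The per-file mapping overrides the project and workspace settings.
kernel_language = resolve_language(
    "example.cSource", "src/kernel.c", files, project, workspace, "example.default"
)
```

The point of the design is that the answer no longer depends on which extension happened to be found first; the user's explicit choice decides.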

The language mapping feature is great for those that compile the same project on multiple platforms with different compilers that all support slightly different variations of C or C++. Now if you have a build configuration for each platform you can set the language mappings on each configuration individually, and your code will be parsed properly in each configuration (provided that you have an ILanguage to handle those scenarios). It's also great for embedded vendors, as most of them tend to have slightly different variations on the C programming language to enable you to do some cool things like handle interrupts, etc. This way they can define their own ILanguage which can handle these differences.

Another big thing we've been working on is making it easier to create the parsers for those language variants. The most frequently encountered use case for this stuff belongs to people like those embedded vendors I mentioned. For the most part the language they are implementing is nearly identical to C or C++, and they just need to add a couple of keywords or a few new constructs. Up until now they've pretty much had to write a whole new parser for that from the ground up. The GNU C and C++ parsers that are bundled with CDT are lean, mean, hardcoded state machines, and they are pretty difficult to get your head around if you are brave enough to crack open the code; difficult enough that most people who wanted to integrate a new language variant into CDT pretty much gave up right there. Don't get me wrong, those parsers are great at what they do (and without them we'd have been parserless for years), but they were designed with performance in mind, and not readability or maintainability. If you tried to extend from the concrete classes in order to modify the behaviour of the parser you'd end up overriding big gnarly functions that do most of the work, and so if we ever fixed a bug in the original parser it probably wouldn't trickle down to your code unless you looked for it and cut and pasted it into the parser you created.

Enter the new parsers that my team has been working on. One of our core requirements from our internal customers here at IBM was support for new language variants. Since we knew we were going to have several variants to support over the next few years, it seemed like a worthwhile investment to create some kind of extensible parser framework. To keep things "simple" we started with C. The goal was to create a basic C parser based on the ISO C99 standard, and to make it reusable to support other language variants. In theory, language implementers would then get C parsing for "free" and could concentrate on just defining the delta of their language compared to the base language.

It seemed natural for us to use a parser generator for this. Parser generators take as input a grammar which specifies the rules of a language, and from that grammar they generate a parser that can handle that language. Just having a grammar will let you recognize whether a set of input abides by the rules of the language, but generally you want to do more than that. Typically you would also define semantic actions in your grammar that do interesting things, which in our case meant building up an AST with CDT's DOM AST APIs, so that once the language was parsed all those cool tools I mentioned earlier could recognize the structure of the code and do Cool Things(TM) with it.
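As a rough illustration of "grammar rules plus semantic actions", here is a tiny expression parser in Python. It is hand-written recursive descent, not generated code and not the bottom-up technique a generator like LPG uses; the grammar, the `on_*` action names, and the tuple-shaped AST are all assumptions made for the sketch. What it shows is the split: the `parse_*` methods mirror grammar rules, and each `on_*` function is the action that builds an AST node when its rule is recognized.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[+*()])")


def tokenize(text):
    """Split source text into numbers, names, and the symbols + * ( )."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


# Semantic actions: each returns the AST node for a recognized rule.
def on_number(tok):
    return ("literal", int(tok))


def on_name(tok):
    return ("name", tok)


def on_binary(op, left, right):
    return ("binary", op, left, right)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def parse_sum(self):
        # sum := product ('+' product)*
        node = self.parse_product()
        while self.peek() == "+":
            self.take()
            node = on_binary("+", node, self.parse_product())
        return node

    def parse_product(self):
        # product := atom ('*' atom)*
        node = self.parse_atom()
        while self.peek() == "*":
            self.take()
            node = on_binary("*", node, self.parse_atom())
        return node

    def parse_atom(self):
        # atom := NUMBER | NAME | '(' sum ')'
        tok = self.take()
        if tok == "(":
            node = self.parse_sum()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok.isdigit():
            return on_number(tok)
        if tok in {"+", "*", ")"}:
            raise SyntaxError(f"unexpected token {tok!r}")
        return on_name(tok)


def parse(text):
    """Parse an expression and return the AST built by the actions."""
    parser = Parser(tokenize(text))
    tree = parser.parse_sum()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return tree
```

Swapping the `on_*` functions for ones that call a real AST API changes what gets built without touching the rules, which is the property that makes the grammar reusable.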

So, what we did was create a C99 parser using the LPG parser generator, which has semantic actions in it to build up an AST for CDT. LPG is a parser generator built by some folks at IBM Research, which is being used for the parser in Eclipse's JDT, as well as for the SAFARI IDE Generator. The cool thing about LPG is that it has a notion of language inheritance. What this means is that if you take our C99 grammar file, you can do the equivalent of a #include in your own grammar for your own language to pull in our grammar. You can then add new rules or override our rules as you see fit, i.e. you get C parsing "for free".
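The inheritance idea can be sketched without LPG at all. The Python below is not LPG syntax and not how LPG works internally (LPG is bottom-up; this uses a naive top-down backtracking recognizer that cannot handle left-recursive rules). The toy grammar, the token names, and the `inherit` helper are all hypothetical. It only demonstrates the principle: a derived grammar pulls in every base rule and adds or replaces alternatives.

```python
def inherit(base, added=None, overridden=None):
    """Build a new grammar from base, replacing or extending rules."""
    grammar = {name: [list(alt) for alt in alts] for name, alts in base.items()}
    for name, alts in (overridden or {}).items():
        grammar[name] = [list(alt) for alt in alts]       # replace whole rule
    for name, alts in (added or {}).items():
        grammar.setdefault(name, []).extend(list(alt) for alt in alts)
    return grammar


def match(grammar, symbol, tokens, pos):
    """Yield every token position reachable after matching symbol at pos."""
    if symbol not in grammar:                              # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for alternative in grammar[symbol]:
        positions = [pos]
        for part in alternative:
            positions = [
                end
                for start in positions
                for end in match(grammar, part, tokens, start)
            ]
        yield from positions


def accepts(grammar, start, tokens):
    """True if the whole token list can be derived from the start symbol."""
    return len(tokens) in match(grammar, start, tokens, 0)


# A toy stand-in for the base C grammar (tokens are already classified).
BASE_C = {
    "statement": [
        ["ID", "=", "NUM", ";"],
        ["for", "(", "ID", ";", "ID", ";", "ID", ")", "statement"],
    ],
}

# The derived grammar only states its delta: one extra statement form.
UPC = inherit(BASE_C, added={"statement": [["upc_barrier", ";"]]})

loop_with_barrier = ["for", "(", "ID", ";", "ID", ";", "ID", ")", "upc_barrier", ";"]
```

Note that the base `for` rule was never edited, yet in the derived grammar its body may be the new `upc_barrier` statement, because the rule refers to `statement` by name. That is the "for free" part.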

The results of this were pretty amazing. One of our requirements which we got from the Eclipse Parallel Tools Platform people was to support a new programming language coming out, Unified Parallel C, which is a variant of C for massively parallel applications. The language adds new keywords and constructs which allow you to control the parallelization and synchronization of your program. By including the C99 grammar in our UPC grammar, we were able to get UPC working in a matter of days. Time to go back to our screenshot of the CDT editor, with a UPC file open:

There's a whole lot of cool stuff going on there:
  • Syntax highlighting of new keywords (upc_forall)
  • Outline view works
  • Content assist is finding constructs in the code
  • Content assist is working on constructs that are not normally legal C!!! It's a subtle point, but take a look at where the caret is in the upc_forall statement. This construct takes four expressions, not the usual three that your plain old ordinary for loop takes. Yet, content assist in that fourth expression just plain works!
Doing all this with the old parser would have taken a long time and been very error prone. I would definitely say that thus far this effort has been a resounding success.
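To see why that fourth expression matters to a parser, here is a small Python sketch that splits a loop header into its clauses and checks the count per keyword. It is a toy string splitter, not a real parser and not anything from CDT; the function name and the sample loop headers are made up for illustration. The only language fact it relies on is the one above: `for` takes three expressions, `upc_forall` takes four.

```python
EXPECTED_CLAUSES = {"for": 3, "upc_forall": 4}


def split_loop_header(source):
    """Return (keyword, clauses) for a header such as 'for (a; b; c)'."""
    keyword, _, rest = source.strip().partition("(")
    keyword = keyword.strip()
    if keyword not in EXPECTED_CLAUSES or not rest:
        raise SyntaxError(f"not a loop header: {source!r}")
    clauses, current, depth = [], [], 1
    for ch in rest:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:          # closing paren of the header itself
                break
        if ch == ";" and depth == 1:
            clauses.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    else:
        raise SyntaxError("unbalanced parentheses")
    clauses.append("".join(current).strip())
    if len(clauses) != EXPECTED_CLAUSES[keyword]:
        raise SyntaxError(
            f"{keyword} takes {EXPECTED_CLAUSES[keyword]} expressions, "
            f"got {len(clauses)}"
        )
    return keyword, clauses


keyword, clauses = split_loop_header("upc_forall (i = 0; i < N; i++; &a[i])")
```

A plain C parser would reject the header at the third semicolon, so nothing inside the fourth expression would ever reach the AST, and content assist would have nothing to work with there.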

After CDT 4.0 is out the door we're going to start looking at doing some more interesting things with parsers.

  • Firstly, we want to write a GNU C language variant on top of our C99 parser and see how that stacks up against the existing GNU parser in CDT in terms of correctness and performance. We're already re-using all of the parser JUnits on our parser, so I already have a warm fuzzy about correctness. If the speed is good enough then I would love to replace the old parser with one based on ours because then it will be a lot easier to maintain.
  • Secondly, we're going to start tackling C++. Parsing C++ properly is a very difficult problem, given all the ambiguities in the language itself. I know personally of teams of people using bottom-up parsing techniques to parse C++ so I know it can be done (LPG is bottom-up too), but we have to figure out how feasible this is to do with LPG. Luckily we have a good line of communication with the LPG authors, and they are keen to see LPG being used successfully on C++, so if we encounter any roadblocks hopefully we can work together to smash them down.
The future for language support in CDT is looking very bright :-)