Visualization of Ruby's Grammar

Posted by Nick Sieger Fri, 27 Oct 2006 16:48:00 GMT

As part of the momentum surrounding the Ruby implementer’s summit, I have decided to take on a pet project to understand Ruby’s grammar better, with the goal of contributing to an implementation-independent specification of the grammar. Matz mentioned during his keynote how parse.y was one of the uglier parts of Ruby, but just how ugly?

Well, judge for yourself. Below is a grammar dependency graph generated using ANTLRWorks and GraphViz. The steps I took are as follows. I took parse.y, stripped all C definitions, code and actions from it to give a bare YACC definition. Next, I did the equivalent of gsub(/[kt]([A-Z]+)/, '1') (since ANTLR’s convention is to have lexer tokens named starting with a capital letter). I then used the Bison-to-ANTLR converter to generate an ANTLR 2.x grammar, which I hand-modified to produce a v3 grammar. Opening the resulting grammar in ANTLRWorks allows you to generate a DOT file from which GraphViz can then generate a jpeg image. I’ve also included visualizations of the Java 1.5 and Javascript (ECMAScript) grammars for comparison.

I haven’t even begun to absorb all the meanings from this picture, but one stark difference between Ruby and the other two is the node in the middle of the picture with a high concentration of outgoing edges. That node is called primary in the grammar definition, and it is probably one of the reasons that Ruby syntax is so flexible and forgiving. A primary node’s direct children apparently represent a large portion of the syntax, and explain why in Ruby a single statement can either be a literal, a method invocation (or series of them), a standalone expression (such as a < b), all the way up to larger syntactic groupings such as if ... else ... end and begin ... rescue ... end, among many others.

Ruby

Java 1.5

Generated from Java 1.5 grammar on antlr.org

Javascript

Generated from ECMAScript grammar on antlr.org

Posted in ruby | Tags ruby | 34 comments | no trackbacks

Comments

Leave a response

nuke edit
evan said 34 minutes later:

This is really fascinating. Like you, I’m not sure what it means yet, though. Ruby seems like more of a general snarl compared to Java (3 or 4 smaller snarls) or ECMAScript (with its cascades of link-poor nodes).
nuke edit
Olle Jonsson said about 2 hours later:

Fellow grammar tourist here. Awesome snapshots from your language safari. Javascript comes out looking really ordered, here. Thanks for taking the time to make these images.
nuke edit
Nick said about 2 hours later:

“Snarly” is a good word to describe Ruby’s grammar. Now we just have to untangle it.
nuke edit
Pat Eyler said about 4 hours later:

Great stuff Nick! I can’t wait to see your next steps!
nuke edit
Adam said about 6 hours later:

Cool. Just wondering, do you have similar diagrams for C and/or Python? Or know where some might be found?

(Or, shudder, perl)
nuke edit
Michael Trier said about 6 hours later:

Nice work. Very interesting stuff indeed.
nuke edit
Mike O'Dell said about 9 hours later:

“Now we just have to untangle it.”

What makes people think, a priori, the structure implies there is a problem of some kind?

Just because the graph is complex doesn’t mean anything bad in and of itself. The generality of the language syntax, where a construction can be used in many different contexts, naturally gives rise to a production graph that has lots and lots of links. This is not inherently bad!

There may be other problems that really need cleaning up, like ambiguities which YACC sometimes “resolves” in ways which have surprising implications, but the high degree of non-planarity is not a concrete indictment.

“Just because you fixed it doesn’t mean it was broken to start with.”

very nice analytical work all told.
```
  cheers,
  -mo
```
nuke edit
Nick said about 11 hours later:

@Adam: check out this and this. I can’t even imagine what a Perl grammar would look like :)

@Mike: I never said Ruby’s grammar was broken. The “untangle” comment for me really was aimed more at delving into intricacies of the grammar to gain understanding. Other than that, I agree with you.
nuke edit
malcontent said about 18 hours later:

These are wonderful. Every fan of a language should make one for theirs and we should gather them someplace.

I am amazed at how much cleaner the python tree is considering it has most of the functionality of ruby.
nuke edit
Giles Morant said about 18 hours later:

The graphs for C and Python are very different to the ECMAscript, Ruby and Java.

To be honest though, I’m not sure what I’m looking at.

Is the ideal to be lots of many-linked items? Fewer items? Long lines of items?
nuke edit
phil said 1 day later:

“Bison-to-ANTLR converter” - had no idea that such a beast existed. Seems that there was a project to convert Ruby’s yacc parser to ANTLR, does this mean that it’s trivial to convert the Ruby parser over to ANTLR?

The ultimate goal of many of these projects is to be able to parse Ruby in Ruby - really looking forward to that.
nuke edit
Bruce Perens said 1 day later:

The centrality of the primary node in the parser indicate that it is complex and multivalent. I think the main implication of that is that the lexical analyzer does a lot of work on that node that does not appear in the parser.

Language gurus used to prefer languages that parsed simply - for example ones that were pure LR(1). It surely makes it easier to explain how the language is parsed. Ruby’s 8000-line parser would offend language purists, if there were any left. But over time we seem to have become more accepting of this degree of complexity, and nobody seems to care greatly about explaining how Ruby is parsed as long as it works.

Then again, Ruby doesn’t seem to have confusion-producing elements like C’s declarations. Certainly the parser doesn’t get in the way of the programmer.

Bruce
nuke edit
quag said 1 day later:

This is a little bit of a joke. I think the equivalent graph for Io is http://www.quag.geek.nz/io/message.png . I imagine that lisp, scheme and a few other languages would have similar graphs.
nuke edit
Jules said 1 day later:

Io has more syntax than that (literals for example), and so has Lisp.

Forth’s syntax is the simplest: words. Literals are handled from inside the language.
nuke edit
AC said 1 day later:

Really FOOLISH. Know and think about “pseudo-simplicity”
nuke edit
Michael Shigorin said 1 day later:

Thanks, rather interesting!
nuke edit
Robinganemccalla@gmail.com said 1 day later:

Is there any way I can get a higher resolution version of the image? It looks nice but I can’t actually see the text.
nuke edit
JD said 1 day later:

Interesting stuff. Being a comp.lang hobbyist, I’m curious at seeing the EBNF, C++, C#, and myriad of other language grammars that exist graphed out this way. (Yes I mean EBNF as language grammar itself)... Also I agree with the previous poster who indicated that “just because it’s complex doesn’t mean it’s inherently bad” (paraphrasing)

later
- JD
nuke edit
JD said 1 day later:

Oh yeah and chiming in with robin... making these already made graphs available in some format like SVG, PDF ...etc -- where zooming in to get a clear read on the textis possible -- would be a great addition to already great work...

-- JD
nuke edit
Ed Borasky said 2 days later:

That is interesting ... it seems like the existence of an IDE for Antlr grammars would be a good reason to start using Antlr as the tool set of choice for Ruby implementations ... anyone have an contrasting opinion?

You might want to post this to the ruby-core and YARV mailing lists.
nuke edit
Bil Kleb said 2 days later:

JD and Robin: follow the image links to Flickr, and you’ll find hi resolution originals. --Bil
nuke edit
Nick said 2 days later:

Rather than spend more time generating however many formats, I’ve uploaded the .dot files, so you can use GraphViz to generate whatever format you’d like.

ruby.dot c.dot java.dot ecmascript.dot python.dot

As Bil said, the originals are big enough to read the text, but possibly not hi-res enough for printing, if you want to do such a thing. Generating a pdf or svg from the dot file should serve you well in that respect.
nuke edit
rue said 2 days later:

Ed, take a peek at http://xruby.com/default.aspx. He supposedly has a full Ruby grammar written for ANTLR.
nuke edit
Robert Feldt said 2 days later:

Very nice visualisations!

One thing that should be noted about Ruby’s grammar is that the lexer is (mildly) context-dependent. So the full complexity of the grammar(s) is not captured in these diagrams. Also this is why the Ruby grammar is not LALR(1)/LL(1) or some other of the simple grammar classes.
nuke edit
tdphuc@centrum.cz said 4 days later:

download
nuke edit
jer said 27 days later:

Jules wrote: Io has more syntax than that (literals for example), and so has Lisp.

Literals in Io are messages just like any other, they’re not syntactically special. The simple fact that they have quotes for string literals or whatnot is just sugar and not strictly required.
nuke edit
Harald Korneliussen said 27 days later:

I wonder what the Ada 2005 syntax tree would look like... or one of the newer Fortrans, or C++. Interesting that Ruby’s grammar isn’t LALR(1)/LL(1). That confirmed my suspicion that modern languages aren’t always. I think the first language I came across that made me wonder “Won’t this require a rather more complex parser?” was SML. But for all I know I was wrong there.
nuke edit
Jules said 27 days later:

jer wrote: The simple fact that they have quotes for string literals or whatnot is just sugar and not strictly required.

Doh, but it’s in the language so it counts as extra syntax. You cannot implement string literals from inside the Io language (not 100% sure though).
nuke edit
eric said 27 days later:

NICK, could you also do ada 2005, D, and possibly a C++ to compare with your c??:) thanks man, hope to see D and Ada 2005 especially:D!!
nuke edit
wert said 29 days later:

“But over time we seem to have become more accepting of this degree of complexity,”

No it is just that knowledgeable people are not heard anymore in all the noise that the fan-boys make. Take for example ruby it is a total piece of shit but only because there is a lot of noise about it do people think it is good.
nuke edit
occam said 29 days later:

With the Ruby graph, it seems to me that the congested primary means that there’s less context to remember about using various items in the language. So, more of the language is a “first class citizen”. In other words, you can do almost anything from anywhere in Ruby which is probably why it seems so elegant and powerful (and concise).

Javascript seems very ordered graphically, but also very obstructionist since there are some very long paths to get to certain items. Lots of order, but lots of red tape too.

Java seems like somewhere in the middle.

That’s my first intuition from these graphs.
nuke edit
Joe said 32 days later:

And Tcl?
nuke edit
Brendan Eich said 56 days later:

occam: long paths simply mean precedence hierarchy, not “red tape” that obstructs programmers. Expressions using the minimal parenthesization work as in C (and Java, and ...).

Another wrinkle that does not make red tape for users: the “noIn” parameterization of productions used by ECMA-262, to forbid “in” expressions in the head of “for” loops. This forks the expression grammar, and uglies up the graph considerably, but again it is no hardship for users.

I’ve noted in keynotes about JavaScript this year that I was under orders from Netscape management to “make it look like Java” in 1995. If I had my druthers, it would have looked more like Self, Logo, Smalltalk, or even HyperTalk. In that elseworld, the graph for JS’s grammer would be simpler, for sure.

Many languages claim C as an ancestor; lots of hackers know (most of) C’s syntax (and gcc nags us to overparenthesize when we mix bitwise and equality ops, since the precedence is misordered in that part of the grammar).

JS is one such C-derived language. Nothing much to do about it at this point. It has its pluses and minuses.

/be
nuke edit
Anonymous Coward said 94 days later:

Yes, what Brendan said. The biggest difference between these graphs is that some of them, like ANSI C, have straight lines of nodes, and some, like JavaScript, have loopy-looking “chains” involving about twice as many nodes. However, this is totally an artifact of the particular grammar used by the artist, and it doesn’t have anything to do with the languages C and JavaScript themselves. (That is, you could easily rewrite the C grammar to produce loopy chains, or the JavaScript grammar to produce straight lines.)

The culprits are JavaScript’s foo, fooNoIn, fooTail, and fooNoTail productions, where C just has foo. Rewriting the grammar would get rid of those productions, but it would add complications to the yacc file that don’t show up in the diagram.

In short, the diagrams may look pretty to some people (not me), but they don’t say anything about the underlying language standards, and they don’t even give much useful information about the artist’s particular grammar files.