Custom syntax highlighting with Jekyll and Pygments

19th June 2014

Jekyll is a great tool for statically generating websites. Among its many neat features is a convenient template syntax for performing automatic syntax highlighting of code in your pages and posts:

<p>Check out this great code I wrote:</p>
{% highlight cpp %}
int main(int argc, const char* argv[]) {
	return 0;
}
{% endhighlight %}

Jekyll makes use of Pygments, a Python library for performing syntax highlighting. Pygments takes some code and marks it up according to its syntax. It tends to give only a rough approximation of the true syntax, and it doesn't necessarily account for every feature of a language. This means that if you're as meticulous as I am, you may not be completely happy with the results.

To customize Jekyll's syntax highlighting, you need to install a modified build of Pygments. You'll probably need to know a little Python to do this, but you may be able to get by with just copying existing code. The documentation for Python's regular expressions module may come in handy.

pygments.rb

Since Jekyll is a Ruby application, it actually uses pygments.rb, a Ruby wrapper for Pygments. In fact, pygments.rb contains a copy of the Pygments source within its own repository, so if you want to modify the syntax highlighting behavior, you need only modify pygments.rb.

Grab yourself a copy of the pygments.rb source. If you want to keep a copy of your changes on GitHub, I recommend forking tmm1/pygments.rb and then cloning your fork. If not, you can just clone the pygments.rb repository directly:

$ git clone https://github.com/tmm1/pygments.rb.git
$ cd pygments.rb

The Pygments source is stored in this repository in vendor/pygments-main/. The lexers, which you'll want to modify to customize your syntax highlighting, are in vendor/pygments-main/pygments/lexers/. For example, if you want to modify the C++ lexer, you'll find it as the class CppLexer in compiled.py in that directory.

How Pygments' lexers work

The Pygments developers have a helpful tutorial on writing lexers. In brief, most lexers inherit from RegexLexer, which allows you to specify the syntax highlighting rules with regular expressions. Each lexer has a few properties that give the lexer's name, aliases that can be used to identify the lexer (such as in the {% highlight %} tag), and associated file extensions and mimetypes.

The most important property of a lexer is its tokens, which defines the syntax highlighting rules. The tokens property maps each state to a list of rules that can be applied while the lexer is in that state. When the lexer begins parsing your code, it starts in the 'root' state. Each rule is a tuple of a regular expression to match, the token type to apply to that match, and, optionally, a new state to transition to. If the new state is omitted, the lexer stays in the same state; '#pop' returns it to the previous state.

For example, take a look at the 'root' state rules from the PHP lexer:

        'root': [
            (r'<\?(php)?', Comment.Preproc, 'php'),
            (r'[^<]+', Other),
            (r'<', Other)
        ],

In the PHP lexer's root state, if it finds an opening PHP tag (<?php or <?), it marks this code as a Comment.Preproc token and moves into the 'php' state. After this, only rules specified under the 'php' state will be applied.

The tokens that you can use are listed in vendor/pygments-main/pygments/token.py. Each token has a corresponding short name, which is used as the CSS class name for that token. For example, the opening PHP tag will be given the cp class name corresponding to the Comment.Preproc token. If no existing token is suitable for a lexer you're writing, you can simply add one to this list.
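If you want to check which class name a token will get, you can inspect the mapping from Python. This is a small sketch that assumes your Pygments version exposes the mapping as the STANDARD_TYPES dictionary in pygments.token, which is the name it has in the releases I am aware of:

```python
from pygments.token import STANDARD_TYPES, Comment, Keyword, String

# STANDARD_TYPES maps each standard token type to its short name,
# which the HTML formatter uses as the CSS class of the <span>.
for token_type in (Comment.Preproc, Keyword, String.Double):
    print(token_type, "->", STANDARD_TYPES[token_type])
```

The Comment.Preproc line should show the cp class name mentioned above.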

Pygments will simply apply the appropriate rules, consuming the text that matches the regular expressions at each step, until the entire input has been processed. The result is marked-up code in which each token is wrapped in a <span> element with the corresponding class value. Customizing the existing lexers is usually as simple as adding to or modifying the existing rules.
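The matching loop described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not Pygments' actual implementation: the tokenize function, the TOKENS table, and the string token names are made up for illustration, and the 'php' rules here are far cruder than the real PHP lexer's.

```python
import re

# Toy rule table in the same shape as a Pygments `tokens` property.
# Token names are plain strings here rather than Pygments token types.
TOKENS = {
    "root": [
        (r"<\?(php)?", "Comment.Preproc", "php"),
        (r"[^<]+", "Other"),
        (r"<", "Other"),
    ],
    "php": [
        (r"\?>", "Comment.Preproc", "#pop"),
        (r"\s+", "Text"),
        (r"\w+", "Name"),
        (r"[^\w\s]", "Punctuation"),
    ],
}


def tokenize(text):
    """Yield (token, value) pairs by walking the rule table."""
    stack = ["root"]  # the current state is always stack[-1]
    pos = 0
    while pos < len(text):
        for rule in TOKENS[stack[-1]]:
            match = re.compile(rule[0]).match(text, pos)
            if match is None:
                continue  # try the next rule in this state
            yield rule[1], match.group()
            pos = match.end()  # consume the matched text
            if len(rule) == 3:
                if rule[2] == "#pop":
                    stack.pop()  # return to the previous state
                else:
                    stack.append(rule[2])  # enter the new state
            break
        else:
            # No rule matched: emit one character as an error and move on.
            yield "Error", text[pos]
            pos += 1
```

Running list(tokenize("a<?php echo ?>b")) gives Other for a and b, Comment.Preproc for both PHP tags, Name for echo, and Text for the spaces in between. The first rule that matches wins, which is why rule order within a state matters.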

Adding a lexer

If you want to highlight a language that is not supported by Pygments, you can simply add your own lexer. It is easiest to use an existing lexer as a starting point. Just inherit from RegexLexer, add the name, aliases, filenames, and mimetypes properties as appropriate, and then write your own tokens rules.

You can add your lexer to one of the existing files in vendor/pygments-main/pygments/lexers or add your own file. Be sure to add the name of your lexer class to the file's __all__ variable. Once you've done this, your lexer needs to be registered with the library. This registry is stored in _mapping.py, which can be generated by running make mapfiles from within vendor/pygments-main/:

pygments.rb$ (cd vendor/pygments-main && make mapfiles)

Installing your custom pygments.rb

For Jekyll to use your customized pygments.rb, you need to build and install it. You can do that easily with the following commands:

pygments.rb$ gem build pygments.rb.gemspec
pygments.rb$ sudo gem install pygments.rb-VERSION.gem

Replace VERSION with the version of pygments.rb that you built. If you're not sure what it is, just look for the .gem file that the build command created in the current directory.