Markdown and Pygments

This blog is mainly being written as Markdown text stored in a database, and I thought it would be nice to add the ability to use Pygments to add syntax highlighting to various bits of code within the entries.

There are some DjangoSnippets entries on how to do this, notably #360 which first runs text through Markdown to generate HTML and then BeautifulSoup to extract parts marked up in the original pre-Markdown text as <pre class="foo">...</pre> to be run through Pygments and then re-inserted back into the overall Markdown-generated HTML.

The problem with this is that the text within <pre>...</pre> needs to valid HTML with things like: e_mail='<foo@bar.edu>' escaped as e_mail='&lt;foo@bar.edu>', otherwise BeautifulSoup thinks in that example that you have a screwed up <foo> tag and tries to fix that up.

Making sure all the <, &, and other characters special to HTML are escaped within a large chunk of code misses out on the convenience of using Markdown. I decided to go with an arrangement in which regular Markdown code blocks are used, but if the first line begins with pygments:<lexer>, then that block is pygmentized.

So if I enter something like:

Here is some code

    pygments:python
    if a < b:
        print a

It ends up as:


Here is some code

if a < b:
    print a

What I came up with is this derivative of Snippet #360

from htmlentitydefs import name2codepoint
from HTMLParser import HTMLParser
from markdown import markdown
from BeautifulSoup import BeautifulSoup
from pygments.lexers import LEXERS, get_lexer_by_name
from pygments import highlight
from pygments.formatters import HtmlFormatter

# a tuple of known lexer names
_lexer_names = reduce(lambda a,b: a + b[2], LEXERS.itervalues(), ())

# default formatter
_formatter = HtmlFormatter(cssclass='source')    

class _MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.text = []
    def handle_data(self, data):
        self.text.append(data)
    def handle_entityref(self, name):
        self.text.append(unichr(name2codepoint[name]))

def _replace_html_entities(s):
    """
    Replace HTML entities in a string
    with their unicode equivalents.  For
    example, '&amp;' is replaced with just '&'

    """
    mp = _MyParser()
    mp.feed(s)
    mp.close()
    return u''.join(mp.text)  

def markdown_pygment(txt):
    """
    Convert Markdown text to Pygmentized HTML

    """
    html = markdown(txt)
    soup = BeautifulSoup(html)
    dirty = False
    for tag in soup.findAll('pre'):
        if tag.code:
            txt = tag.code.renderContents()
            if txt.startswith('pygments:'):
                lexer_name, txt = txt.split('\n', 1)
                lexer_name = lexer_name.split(':')[1]
                txt = _replace_html_entities(txt)
                if lexer_name in _lexer_names:
                    lexer = get_lexer_by_name(lexer_name, stripnl=True, encoding='UTF-8')
                    tag.replaceWith(highlight(txt, lexer, _formatter))
                    dirty = True
    if dirty:
        html = unicode(soup)

    return html