This blog is mainly being written as Markdown text stored in a database, and I thought it would be nice to add the ability to use Pygments to add syntax highlighting to various bits of code within the entries.
There are some DjangoSnippets entries on how to do this, notably #360, which first runs the text through Markdown to generate HTML and then uses BeautifulSoup to extract the parts marked up in the original pre-Markdown text as <pre class="foo">...</pre>. Those parts are run through Pygments and then re-inserted into the overall Markdown-generated HTML.
The problem with this is that the text within <pre>...</pre>
needs to be valid HTML, with things like
e_mail='<email@example.com>' escaped as
e_mail='&lt;email@example.com&gt;'. Otherwise BeautifulSoup
thinks, in that example, that you have a screwed-up
<email@example.com> tag and tries to fix it up.
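To make the escaping requirement concrete, here is a minimal sketch of what that line would have to look like inside the <pre> block. It assumes Python 3 and the standard library's html.escape, neither of which is part of Snippet #360:

```python
from html import escape

code = "e_mail='<email@example.com>'"

# quote=False leaves the quote characters alone and escapes only &, < and >
escaped = escape(code, quote=False)
print(escaped)
```

Every line of code pasted into an entry would need the same treatment before BeautifulSoup sees it.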
Making sure all the
<, >, &, and other characters special to HTML
are escaped within a large chunk of code misses out on the
convenience of using Markdown.

I decided to go with an arrangement
in which regular Markdown code blocks are used, but if
the first line begins with
pygments:<lexer>, then that block is pygmentized.
So if I enter something like:
Here is some code

    pygments:python
    if a < b:
        print a

It ends up as:
Here is some code

    if a < b:
        print a

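The rule for spotting such a block can be sketched on its own, without Markdown or Pygments involved. This is only an illustration: split_lexer_header is a hypothetical helper name, not something from Snippet #360, and it works on the raw text of one code block.

```python
def split_lexer_header(block, prefix='pygments:'):
    """Return (lexer_name, code); lexer_name is None for ordinary blocks.

    Hypothetical helper: the name and signature are illustrative only.
    """
    first, sep, rest = block.partition('\n')
    # No newline, or no marker on the first line: leave the block untouched
    if not sep or not first.startswith(prefix):
        return None, block
    return first[len(prefix):].strip(), rest


name, code = split_lexer_header('pygments:python\nif a < b:\n    print a\n')
print(name)
print(code)
```

A block without the marker comes back unchanged with None as the lexer name, so it can be left as an ordinary Markdown code block.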
What I came up with is this derivative of Snippet #360:

    # Python 2 code (htmlentitydefs, unichr, BeautifulSoup 3)
    from htmlentitydefs import name2codepoint
    from HTMLParser import HTMLParser

    from markdown import markdown
    from BeautifulSoup import BeautifulSoup
    from pygments.lexers import LEXERS, get_lexer_by_name
    from pygments import highlight
    from pygments.formatters import HtmlFormatter

    # a tuple of known lexer names (element 2 of each LEXERS entry is its aliases)
    _lexer_names = reduce(lambda a, b: a + b[2], LEXERS.itervalues(), ())

    # default formatter
    _formatter = HtmlFormatter(cssclass='source')

    class _MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.text = []

        def handle_data(self, data):
            self.text.append(data)

        def handle_entityref(self, name):
            self.text.append(unichr(name2codepoint[name]))

    def _replace_html_entities(s):
        """
        Replace HTML entities in a string with their unicode
        equivalents.  For example, '&amp;' is replaced with just '&'
        """
        mp = _MyParser()
        mp.feed(s)
        mp.close()
        return u''.join(mp.text)

    def markdown_pygment(txt):
        """
        Convert Markdown text to Pygmentized HTML
        """
        html = markdown(txt)
        soup = BeautifulSoup(html)
        dirty = False
        for tag in soup.findAll('pre'):
            if tag.code:
                txt = tag.code.renderContents()
                if txt.startswith('pygments:'):
                    # first line is "pygments:<lexer>", the rest is the code
                    lexer_name, txt = txt.split('\n', 1)
                    lexer_name = lexer_name.split(':')[1]
                    # Markdown escaped the code; undo that before highlighting
                    txt = _replace_html_entities(txt)
                    if lexer_name in _lexer_names:
                        lexer = get_lexer_by_name(lexer_name, stripnl=True, encoding='UTF-8')
                        tag.replaceWith(highlight(txt, lexer, _formatter))
                        dirty = True
        if dirty:
            html = unicode(soup)
        return html
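
The entity-replacement step is there because Markdown escapes the contents of a code block, and Pygments escapes its own output again when it renders HTML. In Python 3, where htmlentitydefs and HTMLParser moved into the html package, the standard library's html.unescape looks like it covers what _replace_html_entities does by hand. A small sketch, assuming Python 3.4 or later, and not part of the snippet above:

```python
from html import unescape

# What a code block looks like after Markdown has escaped it
rendered = "if a &lt; b &amp;&amp; c &gt; d:"

# Turn the entities back into plain characters before highlighting
restored = unescape(rendered)
print(restored)
```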