XML in Python

A few notes on XML in Python.

After looking around (see the links at the bottom), I've concluded that, for now, the simplest way that still works in most cases to safely include text from arbitrary sources in XML is to combine xml.sax.saxutils.escape with encode('ascii', 'xmlcharrefreplace'), like so:

>>> from xml.sax import saxutils
>>> xml_snippet = u'ö < på > & бб лн'
>>> xml_snippet
u'\xf6 < p\xe5 > & \u0431\u0431 \u043b\u043d'
>>> saxutils.escape(xml_snippet)
u'\xf6 &lt; p\xe5 &gt; &amp; \u0431\u0431 \u043b\u043d'
>>> saxutils.escape(xml_snippet).encode('ascii', 'xmlcharrefreplace')
'&#246; &lt; p&#229; &gt; &amp; &#1073;&#1073; &#1083;&#1085;'
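The session above is Python 2. As a rough sketch, the same two steps can be wrapped in a small helper for Python 3. The name xml_text is my own and not part of any library. In Python 3, encode() returns bytes, so this version decodes back to an ASCII-only str.

```python
from xml.sax import saxutils


def xml_text(value):
    """Escape &, < and >, then replace non-ASCII characters with
    numeric character references (hypothetical helper, not a library call)."""
    escaped = saxutils.escape(value)
    # encode() yields bytes in Python 3; decode to get an ASCII-only str back
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(xml_text('ö < på > & бб лн'))
# &#246; &lt; p&#229; &gt; &amp; &#1073;&#1073; &#1083;&#1085;
```

Drop the final decode('ascii') if you want bytes to write straight to a file or socket.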

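The conversion of > to &gt; is not strictly required (XML only demands it when > appears in the sequence ]]>), but I prefer it, so I'm glad it's the default in saxutils. Also, we could encode to iso-8859-1 instead of ascii, but then the output only reads correctly if the document declares that encoding; as Mr. Lundh points out in [3], it's still safer to use ascii.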
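One way to see why ASCII output is the safer choice is a quick round trip, sketched here in Python 3. The <note> element and the variable names are made up for illustration. The escaped bytes are wrapped in a document under two different encoding declarations and parsed back with xml.etree.ElementTree; both should recover the original text, since the payload contains nothing but ASCII.

```python
import xml.etree.ElementTree as ET
from xml.sax import saxutils

original = 'ö < på > & бб лн'
body = saxutils.escape(original).encode('ascii', 'xmlcharrefreplace')

results = []
for declared in ('utf-8', 'iso-8859-1'):
    # Build a tiny document around the escaped bytes (hypothetical <note> element)
    header = ('<?xml version="1.0" encoding="%s"?>' % declared).encode('ascii')
    document = header + b'<note>' + body + b'</note>'
    # The parser resolves the character references back to the real characters
    results.append(ET.fromstring(document).text)

print(all(text == original for text in results))
```

Had the body been encoded to iso-8859-1 instead, the Cyrillic characters would still come out as character references, but ö and å would be raw bytes that only make sense under a matching declaration.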

links

[1] http://www.xml.com/pub/a/2005/06/15/py-xml.html
[2] http://www.diveintopython.org/xml_processing/unicode.html
[3] http://online.effbot.org/2003_10_01_archive.htm#20031016
[4] http://boodebr.org/main/python/all-about-python-and-unicode

keywords: python, XML, unicode
created 2007-04-24
last modified 2009-01-12