Python XML parsing

At work we’re using DocBook for our product documentation. We have a tool for spell-checking our documents, with these requirements:

  1. It uses a Document Type Definition (DTD) for validation and entity declarations.
  2. The DTD and other referenced files should be cached locally by using an XML catalog.
  3. For misspelled words the tool should print the line and column number.

Solving this with Python is harder than expected, because each of the obvious libraries is missing one piece:

  1. If you use xml.sax, you get line and column numbers from the Locator. But the default parser, expat, has no support for XML catalogs.

  2. libxml2 has support for XML catalogs, but is not installable from pip: it requires libxml2-dev to be installed. At least you can use it to set up the catalog and then use it with a custom resolver, as shown below.

  3. lxml is based on libxml2 itself, but only provides an interface to get the line via sourceline. The column number is not exposed.
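
To make the first point concrete, here is a minimal sketch of reading positions from the xml.sax Locator with the default expat parser. The class name `PositionHandler` and the sample document are made up for illustration, and exactly which position is reported for an event depends on the parser.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PositionHandler(ContentHandler):
    """Hypothetical handler that records where each start tag was seen."""

    def __init__(self) -> None:
        super().__init__()
        self.locator = None
        self.positions = []

    def setDocumentLocator(self, locator) -> None:
        # The parser hands over its Locator before the first event.
        self.locator = locator

    def startElement(self, name, attrs) -> None:
        # Ask the Locator for the current line and column.
        line = self.locator.getLineNumber()
        column = self.locator.getColumnNumber()
        self.positions.append((name, line, column))


# Made-up sample document with a misspelled word.
SAMPLE = b"<book>\n  <para>Helo world</para>\n</book>"

collector = PositionHandler()
xml.sax.parseString(SAMPLE, collector)

for name, line, column in collector.positions:
    print(f"<{name}> at line {line}, column {column}")
```

A spell checker would do the same lookup inside `characters()` and add the offset of the word within the text chunk.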

If you find a solution, please tell me.

My old solution was to use libxml2, with the above-mentioned pip drawback, like this:

from os import environ
from os.path import dirname, join
from xml.sax import handler, make_parser
import libxml2  # type: ignore

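class DocBookResolver(handler.EntityResolver):
    """Resolve external entities through the libxml2 XML catalog."""

    CATALOG = join(dirname(__file__), "catalog.xml")

    def __init__(self) -> None:
        # Honor an existing XML_CATALOG_FILES, else use the catalog.xml next to this file.
        catalog = environ.setdefault("XML_CATALOG_FILES", self.CATALOG)
        libxml2.loadCatalog(catalog)

    def resolveEntity(self, publicId: str, systemId: str) -> str:
        # Use the catalog entry if there is one, else keep the original system ID.
        return libxml2.catalogResolve(publicId, systemId) or systemId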


parser = make_parser()
resolver = DocBookResolver()
parser.setEntityResolver(resolver)
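
Where the libxml2 bindings are not available, the same resolver interface could in principle be backed by a plain dictionary. The following is only a sketch under that assumption, not a replacement for a real catalog: `MappingResolver`, `LOCAL_COPIES` and the local path are hypothetical, it does not read `catalog.xml`, and it only handles exact matches. It also switches on `feature_external_ges`, because the Python documentation notes that since 3.7.1 the SAX parser no longer processes external general entities by default.

```python
from typing import Dict, Optional
from xml.sax import make_parser
from xml.sax.handler import EntityResolver, feature_external_ges


class MappingResolver(EntityResolver):
    """Hypothetical stand-in for a catalog: exact-match lookups in a dict."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def resolveEntity(self, publicId: Optional[str], systemId: str) -> str:
        # Prefer the public ID, then the system ID, else leave it unchanged.
        return (
            self.mapping.get(publicId)
            or self.mapping.get(systemId)
            or systemId
        )


# Assumed local copy of the DTD; the path is illustrative only.
LOCAL_COPIES = {
    "-//OASIS//DTD DocBook XML V4.5//EN": "dtd/docbookx.dtd",
}

parser = make_parser()
# Without this feature the parser does not process external general entities.
parser.setFeature(feature_external_ges, True)
parser.setEntityResolver(MappingResolver(LOCAL_COPIES))
```

This keeps the column numbers from xml.sax and avoids the libxml2-dev dependency, at the price of maintaining the mapping by hand.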
Written on December 19, 2020