Python XML parsing
At work we’re using DocBook for our product documentation. We have a tool for spell-checking our documents.
- It uses a Document Type Definition (DTD) for validation and entity declarations.
- The DTD and other referenced files should be cached locally by using an XML catalog.
- For misspelled words the tool should print the line and column number.
Solving this with Python is not easy, as each candidate library falls short in some way:
- If you use xml.sax you get line and column numbers from the Locator. But the default parser expat has no support for XML Catalog.
- libxml2 has support for XML Catalog, but is not installable from PyPI: it requires libxml2-dev to be installed. At least you can use it to set up the catalog and then use it with a custom resolver as shown below.
- lxml is based on libxml2 itself, but only provides an interface to get the line via sourceline. The column number is not exposed.
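To illustrate the first point, here is a minimal sketch of reading positions from the Locator using only the standard library. The names PositionHandler, element_positions and SAMPLE are made up for this example. It only records where each start tag begins and does nothing about catalogs.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class PositionHandler(ContentHandler):
    """Record where each start tag begins, as (name, line, column)."""

    def __init__(self) -> None:
        super().__init__()
        self.locator = None
        self.positions: list[tuple[str, int, int]] = []

    def setDocumentLocator(self, locator) -> None:
        # The parser hands over its locator once, before any other event.
        self.locator = locator

    def startElement(self, name, attrs) -> None:
        # Ask the locator where the current event (this start tag) begins.
        self.positions.append(
            (name, self.locator.getLineNumber(), self.locator.getColumnNumber())
        )


def element_positions(xml: bytes) -> list[tuple[str, int, int]]:
    """Parse an in-memory document and return the recorded positions."""
    handler = PositionHandler()
    parseString(xml, handler)
    return handler.positions


# Hypothetical sample document with two misspelled words.
SAMPLE = b"<doc>\n  <para>Helo wrld</para>\n</doc>"
```

Calling element_positions(SAMPLE) yields one entry per element. With expat, lines count from 1 and columns from 0. A spell-checker would query the same locator from characters(), though the exact position reported for text chunks is worth verifying against your parser.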
If you find a solution, please tell me.
My old solution was to use libxml2, with the pip drawback mentioned above, like this:
from os import environ
from os.path import dirname, join
from xml.sax import handler, make_parser

import libxml2  # type: ignore


class DocBookResolver(handler.EntityResolver):
    # Catalog shipped next to this module; XML_CATALOG_FILES overrides it.
    CATALOG = join(dirname(__file__), "catalog.xml")

    def __init__(self) -> None:
        catalog = environ.setdefault("XML_CATALOG_FILES", self.CATALOG)
        libxml2.loadCatalog(catalog)

    def resolveEntity(self, publicId: str, systemId: str) -> str:
        # Fall back to the original system identifier if the catalog has no match.
        return libxml2.catalogResolve(publicId, systemId) or systemId


parser = make_parser()
resolver = DocBookResolver()
parser.setEntityResolver(resolver)
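For comparison, a catalog lookup can be approximated without libxml2. This is only a rough sketch under strong assumptions, not a replacement for a real catalog implementation: SimpleCatalogResolver is a hypothetical name, it reads a single catalog file, and it only understands plain public and system entries (no nextCatalog, rewrite or delegate entries, no prefer handling, no xml:base).

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urljoin
from xml.sax import handler

# Namespace used by OASIS XML Catalog files.
CATALOG_NS = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}"


class SimpleCatalogResolver(handler.EntityResolver):
    """Resolve entities from the public and system entries of one catalog."""

    def __init__(self, catalog: str) -> None:
        # Relative uri attributes are resolved against the catalog's location.
        base_uri = Path(catalog).resolve().as_uri()
        self.public: dict[str, str] = {}
        self.system: dict[str, str] = {}
        for entry in ET.parse(catalog).getroot().iter():
            uri = entry.get("uri")
            if uri is None:
                continue
            target = urljoin(base_uri, uri)
            if entry.tag == CATALOG_NS + "public":
                self.public[entry.get("publicId", "")] = target
            elif entry.tag == CATALOG_NS + "system":
                self.system[entry.get("systemId", "")] = target

    def resolveEntity(self, publicId, systemId):
        # System entries are consulted first, then public ones; if nothing
        # matches, hand back the original system identifier unchanged.
        return self.system.get(systemId) or self.public.get(publicId) or systemId
```

It plugs into the parser the same way as the libxml2-based resolver, for example parser.setEntityResolver(SimpleCatalogResolver("catalog.xml")), where the catalog path is whatever file you maintain locally.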