Python XML parsing
At work we’re using DocBook for our product documentation. We have a tool for spell-checking our documents.
- It uses a Document Type Definition (DTD) for validation and entity declarations.
- The DTD and other referenced files should be cached locally by using an XML catalog.
- For misspelled words the tool should print the line and column number.
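The last requirement can be sketched with the standard library alone. This is a minimal example, not the actual tool: `KNOWN_WORDS`, `SpellCheckHandler` and `check` are made-up names, and the word list stands in for a real dictionary. It relies on the SAX locator, which with the default expat parser reports the position where the current text chunk starts.

```python
import re
from xml.sax import handler, parseString

# Hypothetical stand-in for a real dictionary lookup.
KNOWN_WORDS = {"the", "quick", "brown", "fox"}


class SpellCheckHandler(handler.ContentHandler):
    """Collect (line, column, word) for every word not in KNOWN_WORDS."""

    def __init__(self) -> None:
        super().__init__()
        self.locator = None
        self.misspelled = []

    def setDocumentLocator(self, locator) -> None:
        # The parser hands over a locator before parsing starts.
        self.locator = locator

    def characters(self, content: str) -> None:
        # Position of the first character of this text chunk.
        line = self.locator.getLineNumber()
        column = self.locator.getColumnNumber()
        for match in re.finditer(r"[^\W\d_]+", content):
            word = match.group()
            if word.lower() in KNOWN_WORDS:
                continue
            before = content[: match.start()]
            newlines = before.count("\n")
            if newlines:
                # The word sits on a later line of the same chunk.
                word_line = line + newlines
                word_column = len(before) - before.rfind("\n") - 1
            else:
                word_line = line
                word_column = column + len(before)
            self.misspelled.append((word_line, word_column, word))


def check(document: bytes):
    """Parse the document and return the misspelled words with positions."""
    checker = SpellCheckHandler()
    parseString(document, checker)
    return checker.misspelled
```

Calling `check(b"<para>The quick\nbrwon fox</para>")` reports `brwon` on line 2. Line numbers start at 1, while the columns reported by expat start at 0. A word that the parser delivers in two chunks, for example around an entity reference, would be checked as two separate words in this sketch.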
Solving this with Python is not easy for obscure reasons.
libxml2 has support for XML catalogs, but is not installable from pip alone: it requires
libxml2-dev to be installed. At least you can use it to set up the catalog and then use it with a custom resolver as shown below.
If you find a solution, please tell me.
My old solution was to use libxml2, with the above-mentioned pip drawback, like this:
from os import environ
from os.path import dirname, join
from xml.sax import handler, make_parser

import libxml2  # type: ignore


class DocBookResolver(handler.EntityResolver):
    # Catalog file shipped next to this module.
    CATALOG = join(dirname(__file__), "catalog.xml")

    def __init__(self) -> None:
        # Respect an existing XML_CATALOG_FILES setting, else use our catalog.
        catalog = environ.setdefault("XML_CATALOG_FILES", self.CATALOG)
        libxml2.loadCatalog(catalog)

    def resolveEntity(self, publicId: str, systemId: str) -> str:
        # Fall back to the original system ID if the catalog has no match.
        return libxml2.catalogResolve(publicId, systemId) or systemId


parser = make_parser()
resolver = DocBookResolver()
parser.setEntityResolver(resolver)
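For comparison, here is a rough sketch of the same resolver idea without libxml2, reading the catalog with `xml.etree.ElementTree`. It is not a full XML catalog implementation: it only understands `public` and `system` entries whose `uri` is a file path relative to the catalog, and ignores `rewriteSystem`, `delegatePublic`, `nextCatalog`, `xml:base` and everything else. The names `load_catalog` and `SimpleCatalogResolver` are made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.sax import handler

# Namespace of OASIS XML catalog files, in ElementTree's {uri}tag notation.
CATALOG_NS = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}"


def load_catalog(path):
    """Return (public, system) lookup tables built from a catalog file."""
    base = Path(path).resolve().parent
    public, system = {}, {}
    for element in ET.parse(path).getroot().iter():
        uri = element.get("uri")
        if uri is None:
            continue
        # Assumption: uri is a file path relative to the catalog itself.
        target = str(base / uri)
        if element.tag == CATALOG_NS + "public":
            public[element.get("publicId")] = target
        elif element.tag == CATALOG_NS + "system":
            system[element.get("systemId")] = target
    return public, system


class SimpleCatalogResolver(handler.EntityResolver):
    """Resolve entities from a catalog, handling only exact ID matches."""

    def __init__(self, catalog_path) -> None:
        self.public, self.system = load_catalog(catalog_path)

    def resolveEntity(self, publicId, systemId):
        # Try the system ID first, then the public ID, else leave it as is.
        return self.system.get(systemId) or self.public.get(publicId) or systemId
```

It is installed the same way, with `parser.setEntityResolver(SimpleCatalogResolver("catalog.xml"))`. Note that since Python 3.7.1 `xml.sax` skips external entities by default, so any resolver is only consulted after `parser.setFeature(handler.feature_external_ges, True)`.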