Unified Document Processing Model - Confession 55

2015.05.17 00:05:51


A good while ago Robert Strandh approached me with a project idea: a unified document processor that could be used for any document format. His argument was that a new document format basically boils down to a question of syntax, yet everything else about a document gets reimplemented anew for each new syntax, which wastes a huge amount of effort. What if there was a unified processing model that a syntax parser could be plugged on top of?

This is an idea that I put quite a bit of thought into, but I simply do not have the time to allocate to it right now. Regardless, I thought my ideas would at least be interesting enough to share with the world, so that someone more capable than I may pick up on them, or at least be inspired by them. So, without further ado, here is my proposal:

When a parser scans a document, it transforms it into a graph of directives. This step is done as directly as possible, mapping the document almost 1:1 to the graph structure without any further modification or reordering. After this step is done, we have a representation of the document in memory. The next question is what to do with it, and that's where our library steps in. My core idea boils down to a system of transformers and a flow graph.
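To make the idea of a directive graph a bit more concrete, here is a minimal sketch of what the parser's output could look like. The `Directive` class, its fields, and the directive names are all assumptions made up for illustration; nothing here is an existing library.

```python
from dataclasses import dataclass, field


@dataclass
class Directive:
    """One node of the document graph (hypothetical shape)."""
    name: str
    attrs: dict = field(default_factory=dict)
    # Children are nested Directives or plain text strings, in source order.
    children: list = field(default_factory=list)


def directive_names(node):
    """Collect the set of directive names used in a subtree."""
    found = {node.name}
    for child in node.children:
        if isinstance(child, Directive):
            found |= directive_names(child)
    return found


# What a parser might produce for a title followed by "Some *bold* text.",
# mapped almost 1:1 and without any reordering.
doc = Directive("document", children=[
    Directive("heading", {"level": 1}, ["Title"]),
    Directive("paragraph", children=[
        "Some ",
        Directive("strong", children=["bold"]),
        " text.",
    ]),
])
```

Calling `directive_names(doc)` on this tree gives the set of directives that the rest of the system would then have to reduce to whatever the output format supports.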

The library consists of a very simple model: For any action, the user specifies the set of resulting directives that they would like to have. In the example case of wanting to output HTML, you could supply a list of supported HTML tags in the form of directives. The library then needs to figure out how to boil down our document into one that only consists of directives in our desired set (if that is even possible).

Figuring this out is done through the implicit flow graph represented by the transformers. A transformer takes an input directive and returns one or more output directives. Along with that it has a suitability number, comparable to a priority, that declares how preferable it is to use this transformer to translate a directive. Depending on the desired set, transformers might very well need to be chained together in very complex ways, even allowing for an almost complete restructuring of the document.
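The search through the transformers could be sketched roughly as follows. This is a simplification that only works on directive names, not on whole directives, and it greedily tries the most suitable transformer first and falls back to the next one if the chain cannot be completed. The `Transformer` class, the `resolve` function, and the example directives are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformer:
    source: str        # directive name this transformer accepts
    outputs: tuple     # directive names it produces
    suitability: int   # higher means more preferable


def resolve(name, targets, transformers, visiting=frozenset()):
    """Reduce `name` to a list of directive names in `targets`, or None."""
    if name in targets:
        return [name]
    if name in visiting:  # guard against transformers that form a cycle
        return None
    candidates = sorted(
        (t for t in transformers if t.source == name),
        key=lambda t: t.suitability,
        reverse=True,
    )
    for t in candidates:
        result = []
        for out in t.outputs:
            sub = resolve(out, targets, transformers, visiting | {name})
            if sub is None:
                break  # this transformer leads to a dead end, try the next
            result.extend(sub)
        else:
            return result
    return None


transformers = [
    Transformer("note", ("aside",), 10),
    Transformer("note", ("paragraph", "emphasis"), 3),
    Transformer("paragraph", ("p",), 10),
    Transformer("emphasis", ("em",), 10),
]

rich = resolve("note", {"p", "em", "aside"}, transformers)
plain = resolve("note", {"p", "em"}, transformers)
```

With `aside` in the desired set, the high-level transformer wins and `rich` is `["aside"]`. Without it, the less suitable transformer is chained with two others and `plain` is `["p", "em"]`. A real implementation would probably want to weigh the suitability of a whole chain rather than decide greedily at each step.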

In return, this system offers a lot of flexibility and reusability. Since you only have to write a transformer once, it will be available for any document format you might want to input or output henceforth. The set of available directives is limited by what the output format supports, but the system could stand in for unsupported directives through less suitable transformers that translate them into multiple directives giving the same visual appearance in the end.

Driving this to the ultimate point, this could essentially replace TeX and comparable models in their entirety, by providing transformers that can boil everything down to PostScript instructions or similar. At the same time it would still be just as usable for source-to-source translations, as the suitability numbers of the transformers would see to it that the most high-level transformation possible is always picked.

Once a document has been transformed, it is then rendered according to a format specification that describes how a directive is turned into the format's required data, thus ensuring that an output is possible for all formats as well.
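As a rough illustration of this last step, a format specification could be little more than a table that says what text each directive turns into, plus a rule for escaping plain text. The `FormatSpec` class and the tuple-based document shape below are assumptions chosen to keep the sketch short.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable


@dataclass
class FormatSpec:
    rules: dict                         # directive name -> (opening, closing)
    escape_text: Callable[[str], str]   # how plain text is written out


def render(node, spec):
    """Render a (name, children) tree; plain strings are escaped text."""
    if isinstance(node, str):
        return spec.escape_text(node)
    name, children = node
    if name not in spec.rules:
        # The transformation step should have removed such directives.
        raise KeyError(f"format has no rule for directive {name!r}")
    opening, closing = spec.rules[name]
    body = "".join(render(child, spec) for child in children)
    return opening + body + closing


html_spec = FormatSpec(
    rules={
        "document": ("", ""),
        "p": ("<p>", "</p>"),
        "em": ("<em>", "</em>"),
    },
    escape_text=escape,
)

doc = ("document", [("p", ["Fish & ", ("em", ["chips"])])])
output = render(doc, html_spec)
```

Here `output` is `<p>Fish &amp; <em>chips</em></p>`. Since the renderer only ever sees directives from the desired set, it can stay this dumb; all the interesting decisions were already made by the transformers.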

Of course, while this system is rather simple, it would take a huge effort to provide the number of transformers and directives needed to make it work well. Still, I think investing something into this idea could prove very worthwhile: once the base system is done, adding new input and output formats would be an almost trivial task, and it would even allow for consistent and predictable formatting of features that do not exist in certain formats.

That's all I have to say on this for now. Who knows, maybe I'll get the time to actually get started on this some day, but it certainly isn't right now, and not any time soon either. I'm much too busy with university, drawing, and other programming projects that are more urgent in some fashion. If you have any comments on this approach, do let me know.

Written by shinmera