unified-doc: a unified document renderer for content
In my previous job at Roam, I worked on a variety of NLP tools that used novel JS implementations to render and support annotations for text/html documents. We created some really cool stuff, but a number of challenges came up:
- Custom implementations do not always maintain high fidelity with the source content.
- Customizing the rendered document is difficult, depending on how far it deviates from the source content. Canvas-based solutions are basically impossible to customize.
- Annotation algorithms are coupled with the rendering implementations developed, and are usually fairly complex.
- Annotation algorithms are imperfect: substring calculations on content that spans incomplete DOM nodes fail, which breaks the split HTML.
- For new content types, we find ourselves needing to implement a new custom parser/annotator/renderer.
- Writing a custom parser is hard...
And so I started researching...
I learned about unified a month ago and was really excited about the ecosystem's potential to solve the problems mentioned above. I do not have a CS background, so getting ramped up on the basic ideas of parsers, syntax trees (and trees in general) took some time. After browsing through tons of utilities and reading up on the basic principles of unified, I was confident that the unified approach would solve all the separate problems above in a single 'unified' way!
- Parsing, annotating, and rendering are decoupled operations in the unified ecosystem.
- There is already established support for parsing html content types into a unified
- We can now write the annotation algorithm as a hast utility (hast-util-annotate) so that the transformation is applied only to the syntax tree.
- Careful thought about the annotation algorithm led to the following requirements:
- It should be decoupled from the rendering process.
- It should not affect the layout of the source document.
- It should be a pure additive operation to the hast tree for annotated nodes.
- It should be simple and declarative.
- It should support common annotation interactions (click, hover).
- Based on the requirements, the tl;dr for the algorithm is to apply annotations only on text nodes, and add semantic <mark /> tags that can be customized through CSS/styles. Nothing else about the source document changes.
- Rendering the hast tree is easily accomplished with already available libraries such as
- Supporting new content types and compilers/renderers is easily done with plugins that are decoupled.
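To make the annotation idea above a bit more concrete, here is a minimal TypeScript sketch of annotating only text nodes. It is not the actual hast-util-annotate implementation or API. The node types are a simplified stand-in for hast, and the Annotation shape (character offsets into the document's text content), the annotate function, and the data-annotation-id property are all hypothetical names made up for illustration. It also assumes annotations do not overlap (if they do, the earlier one wins).

```typescript
// Simplified hast-like node types (an assumption, not the official hast typings).
type TextNode = { type: "text"; value: string };
type ElementNode = {
  type: "element";
  tagName: string;
  properties: Record<string, string>;
  children: HastNode[];
};
type HastNode = TextNode | ElementNode;

// Hypothetical annotation shape: offsets into the concatenated text content.
type Annotation = { id: string; start: number; end: number };

// Returns a new tree in which annotated text ranges are wrapped in <mark>.
// Only text nodes are split; elements are copied as-is, so layout is untouched.
function annotate(root: ElementNode, annotations: Annotation[]): ElementNode {
  const sorted = [...annotations].sort((a, b) => a.start - b.start);
  let offset = 0; // running offset into the document's text content

  const visit = (node: HastNode): HastNode[] => {
    if (node.type === "element") {
      const children: HastNode[] = [];
      for (const child of node.children) children.push(...visit(child));
      return [{ ...node, children }];
    }
    const nodeStart = offset;
    const nodeEnd = offset + node.value.length;
    offset = nodeEnd;

    const out: HastNode[] = [];
    let cursor = nodeStart;
    for (const a of sorted) {
      const start = Math.max(a.start, cursor);
      const end = Math.min(a.end, nodeEnd);
      if (start >= end) continue; // annotation does not touch this text node
      if (start > cursor) {
        out.push({ type: "text", value: node.value.slice(cursor - nodeStart, start - nodeStart) });
      }
      out.push({
        type: "element",
        tagName: "mark",
        properties: { "data-annotation-id": a.id },
        children: [{ type: "text", value: node.value.slice(start - nodeStart, end - nodeStart) }],
      });
      cursor = end;
    }
    // Keep whatever text is left (or the untouched node if nothing matched).
    if (cursor < nodeEnd || out.length === 0) {
      out.push({ type: "text", value: node.value.slice(cursor - nodeStart) });
    }
    return out;
  };

  return visit(root)[0] as ElementNode;
}

// Concatenates all text in a tree; used to check that annotating is purely additive.
function textContent(node: HastNode): string {
  return node.type === "text" ? node.value : node.children.map(textContent).join("");
}

// <p>Hello <b>big</b> world</p>, with one annotation spanning two text nodes.
const tree: ElementNode = {
  type: "element",
  tagName: "p",
  properties: {},
  children: [
    { type: "text", value: "Hello " },
    { type: "element", tagName: "b", properties: {}, children: [{ type: "text", value: "big" }] },
    { type: "text", value: " world" },
  ],
};
const annotated = annotate(tree, [{ id: "a1", start: 3, end: 8 }]);
```

Because the annotation crosses the boundary of the <b> element, it ends up as two <mark> elements (one per text node) rather than one <mark> that cuts through the <b>. That is the property I care about: the element structure of the source is never split, and the text content of the annotated tree is identical to the original. Interactions such as click or hover could then be attached by a renderer using the id stored on each <mark>.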
Knowledge is unified abstractly across humanity. We all share common goals of acquiring, understanding, storing, and sharing knowledge. Content represents the physical manifestation of storing knowledge, and it is accomplished with many digital formats in the modern computing age.
Various software programs act on content types to parse, process, and render the underlying data for human consumption. Many solutions try to be interoperable, but are largely limited by the lack of a common interface across content types and programs. These solutions can be largely described as API interactions between software, and not as interactions with the actual content. The unified initiative addresses this problem by representing content in unified syntax trees where programs can work closely with the structured content data.
unified-doc is a document renderer, with associated utilities, that uses the unified ecosystem to render supported content types into HTML-based markup in a unified way. It represents content as structured data and preserves the fidelity of the original source content in the rendered document, while supporting powerful features that enrich the document (e.g. annotations). Outputting HTML markup allows the document to be viewed and enriched interoperably with standard web technologies.
It has been an exciting and fun month learning about unified and applying it to old, unsolved problems. The possibilities here are endless and I am excited to see what we can do with this ecosystem. The day when we are able to parse/transform/render popular proprietary content types (e.g.