What is atlas_doc_parser?
Introduction
atlas_doc_parser is a Python library for parsing and transforming Atlassian Document Format (ADF) - the rich text format used in Confluence pages and Jira issue fields.
The Problem
Atlassian products like Confluence and Jira store rich text content in a proprietary JSON format called ADF (Atlassian Document Format). When you retrieve a Confluence page or Jira ticket via API, the body content is returned as a complex nested JSON structure.
For example, a simple paragraph with bold text looks like this in ADF:
    {
        "type": "doc",
        "version": 1,
        "content": [
            {
                "type": "paragraph",
                "content": [
                    {
                        "type": "text",
                        "text": "Hello "
                    },
                    {
                        "type": "text",
                        "text": "world",
                        "marks": [{"type": "strong"}]
                    }
                ]
            }
        ]
    }
Working directly with this raw JSON is cumbersome. You need to:
Navigate deeply nested structures
Handle dozens of different node and mark types
Deal with optional attributes and edge cases
Convert to human-readable formats for processing
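To make the first two points concrete, here is a minimal sketch of pulling plain text out of the raw JSON by hand, using only the standard library. The helper name extract_text is a hypothetical illustration, not part of atlas_doc_parser, and it ignores marks and attributes entirely.

```python
import json

# The ADF example from above, as a JSON string.
RAW_ADF = """
{
  "type": "doc",
  "version": 1,
  "content": [
    {
      "type": "paragraph",
      "content": [
        {"type": "text", "text": "Hello "},
        {"type": "text", "text": "world", "marks": [{"type": "strong"}]}
      ]
    }
  ]
}
"""


def extract_text(node: dict) -> str:
    """Recursively concatenate the text of every "text" node under `node`.

    Hypothetical helper for illustration only; marks are ignored.
    """
    if node.get("type") == "text":
        return node.get("text", "")
    # "content" is optional on many node types, so default to an empty list.
    return "".join(extract_text(child) for child in node.get("content", []))


adf = json.loads(RAW_ADF)
plain = extract_text(adf)
print(plain)
```

Even this small helper has to guess which keys may be missing, and it would need a new branch for every node or mark type that should render differently.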
The Solution
atlas_doc_parser provides a clean abstraction layer:
Parse: Deserialize ADF JSON into strongly-typed Python dataclasses
Navigate: Work with an Abstract Syntax Tree (AST) structure with proper types
Transform: Convert to other formats (currently Markdown, more to come)
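    from atlas_doc_parser.nodes.node_doc import NodeDoc

    # adf_json is the ADF document as a Python dict,
    # e.g. the example above loaded with json.loads().
    # Parse ADF JSON into Python objects
    doc = NodeDoc.from_dict(adf_json)

    # Now you have a proper AST
    for node in doc.content:
        if node.type == "paragraph":
            for child in node.content:
                # Inline nodes such as hard breaks carry no text
                if child.type == "text":
                    print(child.text)

    # Convert to Markdown
    markdown = doc.to_markdown()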
Why Markdown?
In the era of AI and Large Language Models, knowledge stored in Confluence and Jira needs to be accessible for:
AI Training: Feed structured documentation to language models
RAG Systems: Build retrieval-augmented generation pipelines
Knowledge Bases: Create searchable, AI-friendly documentation
Content Migration: Move content between platforms
Markdown is the lingua franca for these use cases - it’s plain text, universally supported, and preserves document structure while remaining human-readable.
Future Directions
While to_markdown() is the primary transformation today, the architecture supports adding more conversions:
to_html(): Direct HTML output
to_rst(): reStructuredText for Sphinx documentation
to_docx(): Microsoft Word documents
to_pdf(): PDF generation
Custom transformations via the visitor pattern
The core value is the parsed AST - once you have structured Python objects, you can transform them to any format you need.
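As a rough illustration of the visitor idea, the sketch below renders a tiny tree to HTML. The Text, Paragraph, and Doc classes and the HtmlVisitor are stand-ins invented for this example; they are not the library's own node classes or API, and a real visitor would need to handle many more node and mark types.

```python
from dataclasses import dataclass, field
from html import escape


# Stand-in node classes for illustration, NOT atlas_doc_parser's dataclasses.
@dataclass
class Text:
    text: str
    marks: list[str] = field(default_factory=list)


@dataclass
class Paragraph:
    content: list[Text] = field(default_factory=list)


@dataclass
class Doc:
    content: list[Paragraph] = field(default_factory=list)


class HtmlVisitor:
    """Dispatch on the node's class name: Doc -> visit_Doc, and so on."""

    def visit(self, node) -> str:
        method = getattr(self, f"visit_{type(node).__name__}")
        return method(node)

    def visit_Doc(self, node: Doc) -> str:
        return "".join(self.visit(child) for child in node.content)

    def visit_Paragraph(self, node: Paragraph) -> str:
        inner = "".join(self.visit(child) for child in node.content)
        return f"<p>{inner}</p>"

    def visit_Text(self, node: Text) -> str:
        # Escape first, then wrap, so user text cannot inject markup.
        html = escape(node.text)
        if "strong" in node.marks:
            html = f"<strong>{html}</strong>"
        return html


doc = Doc(content=[Paragraph(content=[Text("Hello "), Text("world", marks=["strong"])])])
html_output = HtmlVisitor().visit(doc)
print(html_output)
```

Keeping the rendering logic in a separate visitor class means a new output format can be added without touching the node classes themselves.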