What is atlas_doc_parser?

Introduction

atlas_doc_parser is a Python library for parsing and transforming Atlassian Document Format (ADF), the rich text format used in Confluence pages and Jira issue fields.

The Problem

Atlassian products like Confluence and Jira store rich text content in a proprietary JSON format called ADF (Atlassian Document Format). When you retrieve a Confluence page or Jira ticket through the API, the body content comes back as a deeply nested JSON structure.

For example, a simple paragraph with bold text looks like this in ADF:

{
    "type": "doc",
    "version": 1,
    "content": [
        {
            "type": "paragraph",
            "content": [
                {
                    "type": "text",
                    "text": "Hello "
                },
                {
                    "type": "text",
                    "text": "world",
                    "marks": [{"type": "strong"}]
                }
            ]
        }
    ]
}

Working directly with this raw JSON is cumbersome. You need to:

Navigate deeply nested structures
Handle dozens of different node and mark types
Deal with optional attributes and edge cases
Convert to human-readable formats for processing
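
To make these pain points concrete, here is a minimal sketch of pulling text and bold spans out of the raw structure by hand, using only plain Python dicts. The helper name extract_text is hypothetical, and it handles only paragraph nodes, text nodes, and the strong mark; every other node or mark type would need its own branch.

```python
# Hypothetical helper: walks a raw ADF dict by hand (no library involved).
adf_json = {
    "type": "doc",
    "version": 1,
    "content": [
        {
            "type": "paragraph",
            "content": [
                {"type": "text", "text": "Hello "},
                {"type": "text", "text": "world", "marks": [{"type": "strong"}]},
            ],
        }
    ],
}


def extract_text(doc: dict) -> list:
    """Return (text, is_bold) pairs for every text node inside paragraphs."""
    spans = []
    for block in doc.get("content", []):
        if block.get("type") != "paragraph":
            continue  # every other block type would need its own branch
        for inline in block.get("content", []):
            if inline.get("type") != "text":
                continue  # non-text inline nodes are ignored in this sketch
            marks = inline.get("marks", [])  # "marks" is optional
            is_bold = any(m.get("type") == "strong" for m in marks)
            spans.append((inline.get("text", ""), is_bold))
    return spans


spans = extract_text(adf_json)
```

Even for this two-node document, the code needs defensive lookups at every level, which is the overhead a typed parsing layer is meant to remove.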

The Solution

atlas_doc_parser provides a clean abstraction layer:

Parse: Deserialize ADF JSON into strongly-typed Python dataclasses
Navigate: Work with an Abstract Syntax Tree (AST) structure with proper types
Transform: Convert to other formats (currently Markdown, more to come)

from atlas_doc_parser.nodes.node_doc import NodeDoc

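# adf_json is the ADF document as a dict, for example the JSON shown
# earlier after loading it with json.loads() from an API response body.
# Parse ADF JSON into Python objects
doc = NodeDoc.from_dict(adf_json)

# Now you have a proper AST
for node in doc.content:
    if node.type == "paragraph":
        for child in node.content:
            # Not every inline node carries text, so check the type first
            if child.type == "text":
                print(child.text)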

# Convert to Markdown
markdown = doc.to_markdown()
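
For readers curious what "strongly-typed dataclasses" can mean in practice, the following is a rough, self-contained sketch of the parse, navigate, and transform steps. It is not the library's implementation: the classes SketchText and SketchParagraph are hypothetical stand-ins, and they cover only text nodes with an optional strong mark.

```python
from dataclasses import dataclass, field


@dataclass
class SketchText:
    """Hypothetical stand-in for a typed text node."""

    text: str
    marks: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "SketchText":
        # Keep only the mark type names, e.g. ["strong"]
        return cls(
            text=data.get("text", ""),
            marks=[m["type"] for m in data.get("marks", [])],
        )

    def to_markdown(self) -> str:
        return f"**{self.text}**" if "strong" in self.marks else self.text


@dataclass
class SketchParagraph:
    """Hypothetical stand-in for a typed paragraph node."""

    content: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "SketchParagraph":
        children = [
            SketchText.from_dict(child)
            for child in data.get("content", [])
            if child.get("type") == "text"  # other inline types are skipped here
        ]
        return cls(content=children)

    def to_markdown(self) -> str:
        return "".join(child.to_markdown() for child in self.content)


para = SketchParagraph.from_dict(
    {
        "type": "paragraph",
        "content": [
            {"type": "text", "text": "Hello "},
            {"type": "text", "text": "world", "marks": [{"type": "strong"}]},
        ],
    }
)
rendered = para.to_markdown()
```

Once the JSON is deserialized, attribute access such as para.content[1].marks replaces chains of dictionary lookups, and each node type owns its own rendering logic.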

Why Markdown?

In the era of AI and Large Language Models, knowledge stored in Confluence and Jira needs to be accessible for:

AI Training: Feed structured documentation to language models
RAG Systems: Build retrieval-augmented generation pipelines
Knowledge Bases: Create searchable, AI-friendly documentation
Content Migration: Move content between platforms

Markdown is the lingua franca for these use cases: it is plain text, universally supported, and preserves document structure while remaining human-readable.

Future Directions

While to_markdown() is the primary transformation today, the architecture supports adding more conversions:

to_html() - Direct HTML output
to_rst() - reStructuredText for Sphinx documentation
to_docx() - Microsoft Word documents
to_pdf() - PDF generation
Custom transformations via the visitor pattern

The core value is the parsed AST: once you have structured Python objects, you can transform them to any format you need.
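
As an illustration of how a custom transformation could look, here is a small visitor sketch that renders HTML. It is an assumption-laden example rather than the library's API: the class HtmlVisitor is hypothetical, and it walks plain ADF dicts to stay self-contained, although the same dispatch-by-type idea would apply to parsed node objects.

```python
from html import escape


class HtmlVisitor:
    """Hypothetical visitor: dispatches on each node's "type" field."""

    def visit(self, node: dict) -> str:
        # Look up visit_<type>; fall back to visit_unknown if none is defined
        handler = getattr(self, f"visit_{node.get('type')}", self.visit_unknown)
        return handler(node)

    def visit_children(self, node: dict) -> str:
        return "".join(self.visit(child) for child in node.get("content", []))

    def visit_doc(self, node: dict) -> str:
        return self.visit_children(node)

    def visit_paragraph(self, node: dict) -> str:
        return f"<p>{self.visit_children(node)}</p>"

    def visit_text(self, node: dict) -> str:
        out = escape(node.get("text", ""))
        for mark in node.get("marks", []):
            if mark.get("type") == "strong":
                out = f"<strong>{out}</strong>"
        return out

    def visit_unknown(self, node: dict) -> str:
        # Unsupported node types render only their children
        return self.visit_children(node)


html_out = HtmlVisitor().visit(
    {
        "type": "doc",
        "version": 1,
        "content": [
            {
                "type": "paragraph",
                "content": [
                    {"type": "text", "text": "Hello "},
                    {"type": "text", "text": "world", "marks": [{"type": "strong"}]},
                ],
            }
        ],
    }
)
```

Supporting a new output format then means writing another visitor class with its own visit_ methods, leaving the parsing step untouched.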