What is atlas_doc_parser?
==============================================================================


Introduction
------------------------------------------------------------------------------
``atlas_doc_parser`` is a Python library for parsing and transforming Atlassian Document Format (ADF) - the rich text format used in Confluence pages and Jira issue fields.


The Problem
------------------------------------------------------------------------------
Atlassian products like Confluence and Jira store rich text content in a proprietary JSON format called ADF (Atlassian Document Format). When you retrieve a Confluence page or Jira ticket via the API, the body content is returned as a complex nested JSON structure.

For example, a simple paragraph with bold text looks like this in ADF::

    {
        "type": "doc",
        "version": 1,
        "content": [
            {
                "type": "paragraph",
                "content": [
                    {
                        "type": "text",
                        "text": "Hello "
                    },
                    {
                        "type": "text",
                        "text": "world",
                        "marks": [{"type": "strong"}]
                    }
                ]
            }
        ]
    }

Working directly with this raw JSON is cumbersome. You need to:

- Navigate deeply nested structures
- Handle dozens of different node and mark types
- Deal with optional attributes and edge cases
- Convert to human-readable formats for processing


The Solution
------------------------------------------------------------------------------
``atlas_doc_parser`` provides a clean abstraction layer:

1. **Parse**: Deserialize ADF JSON into strongly-typed Python dataclasses
2. **Navigate**: Work with an Abstract Syntax Tree (AST) structure with proper types
3. **Transform**: Convert to other formats (currently Markdown, more to come)

.. code-block:: python

    from atlas_doc_parser.nodes.node_doc import NodeDoc

    # Parse ADF JSON into Python objects
    # (adf_json is the ADF document as a Python dict, e.g. a page body from the API)
    doc = NodeDoc.from_dict(adf_json)

    # Now you have a proper AST
    for node in doc.content:
        if node.type == "paragraph":
            for child in node.content:
                print(child.text)

    # Convert to Markdown
    markdown = doc.to_markdown()


Why Markdown?
------------------------------------------------------------------------------
In the era of AI and Large Language Models, knowledge stored in Confluence and Jira needs to be accessible for:

- **AI Training**: Feed structured documentation to language models
- **RAG Systems**: Build retrieval-augmented generation pipelines
- **Knowledge Bases**: Create searchable, AI-friendly documentation
- **Content Migration**: Move content between platforms

Markdown is the lingua franca for these use cases - it's plain text, universally supported, and preserves document structure while remaining human-readable.


Future Directions
------------------------------------------------------------------------------
While ``to_markdown()`` is the primary transformation today, the architecture supports adding more conversions:

- ``to_html()`` - Direct HTML output
- ``to_rst()`` - reStructuredText for Sphinx documentation
- ``to_docx()`` - Microsoft Word documents
- ``to_pdf()`` - PDF generation
- Custom transformations via the visitor pattern

The core value is the parsed AST - once you have structured Python objects, you can transform them to any format you need.
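
To give a feel for what a custom transformation could look like, here is a minimal sketch of a visitor-style traversal. It works on raw ADF dictionaries rather than on the library's node classes, and every name in it (``visit``, ``VISITORS``, the ``render_*`` helpers) is hypothetical and not part of ``atlas_doc_parser``. It only handles ``doc``, ``paragraph``, and ``text`` nodes plus the ``strong`` mark.

.. code-block:: python

    from typing import Any, Callable, Dict

    # Illustrative only: these helpers are NOT part of atlas_doc_parser.
    AdfNode = Dict[str, Any]


    def render_text(node: AdfNode) -> str:
        """Render a text node, wrapping it in ** if it has a strong mark."""
        text = node.get("text", "")
        marks = {mark.get("type") for mark in node.get("marks", [])}
        return f"**{text}**" if "strong" in marks else text


    def render_children(node: AdfNode) -> str:
        """Visit each child node and concatenate the rendered results."""
        return "".join(visit(child) for child in node.get("content", []))


    def render_paragraph(node: AdfNode) -> str:
        """Render a paragraph as its children followed by a blank line."""
        return render_children(node) + "\n\n"


    # Dispatch table: ADF node type -> handler function.
    VISITORS: Dict[str, Callable[[AdfNode], str]] = {
        "doc": render_children,
        "paragraph": render_paragraph,
        "text": render_text,
    }


    def visit(node: AdfNode) -> str:
        """Dispatch on node type; unknown types just render their children."""
        handler = VISITORS.get(node.get("type", ""), render_children)
        return handler(node)


    adf = {
        "type": "doc",
        "version": 1,
        "content": [
            {
                "type": "paragraph",
                "content": [
                    {"type": "text", "text": "Hello "},
                    {"type": "text", "text": "world", "marks": [{"type": "strong"}]},
                ],
            }
        ],
    }

    print(visit(adf).strip())  # Hello **world**

Supporting another output format in this style would mean swapping in a different dispatch table. The traversal itself stays the same, which is the appeal of separating the parsed tree from the rendering logic.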