Export / Import Functionality¶

Exports¶

OpenContracts supports both exporting and importing corpuses. This functionality is disabled on the public demo because it can be bandwidth-intensive. If you want to experiment with these features in your own deployment, the export action appears when you right-click on a corpus.

You can access your exports from the user dropdown menu in the top-right corner of the screen. Once an export is complete, you can download a zip archive containing all the documents, their PAWLS layers, and the corpus data you created, including all annotations.

Imports¶

If you've enabled corpus imports (the boolean toggle is REACT_APP_ALLOW_IMPORTS in the frontend env file), you'll see an import action when you click the action button on the corpus page.

Export Format¶

OpenContracts Export Format Specification¶

The OpenContracts export is a zip archive containing:

1. A data.json file with metadata about the export
2. The original PDF documents
3. Exported annotations "burned in" to the PDF documents
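As a rough illustration of working with such an archive, the sketch below opens an export with Python's standard zipfile module, parses data.json, and lists the PDF members. The function name load_export_manifest is hypothetical, and the sketch assumes data.json sits at the root of the archive, which this page does not state. A small in-memory archive stands in for a real export.

```python
import io
import json
import zipfile


def load_export_manifest(zip_source):
    """Return (parsed data.json, list of PDF member names) from an export zip.

    `zip_source` may be a filesystem path or a binary file-like object.
    Assumption: data.json is stored at the archive root.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = archive.namelist()
        with archive.open("data.json") as handle:
            data = json.load(handle)
    pdf_names = [name for name in names if name.lower().endswith(".pdf")]
    return data, pdf_names


# Build a tiny stand-in archive in memory so the sketch runs without a real export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.json", json.dumps({"annotated_docs": {"document1.pdf": {}}}))
    archive.writestr("document1.pdf", b"%PDF-1.4 placeholder")
buffer.seek(0)

data, pdf_names = load_export_manifest(buffer)
print(sorted(data["annotated_docs"]), pdf_names)
```

For a real export, pass the path of the downloaded zip instead of the in-memory buffer.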

data.json Format¶

The data.json file contains a JSON object with the following fields:

annotated_docs (dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document.
doc_labels (dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels.
text_labels (dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels.
corpus (OpenContractCorpusType): Metadata about the exported corpus, with fields:
- id (int): ID of the corpus
- title (string)
- description (string)
- icon_name (string): Filename of the corpus icon image
- icon_data (string): Base64 encoded icon image data
- creator (string): Email of the corpus creator
- label_set (string): ID of the labelset used by this corpus
label_set (OpenContractsLabelSetType): Metadata about the label set, with fields:
- id (int)
- title (string)
- description (string)
- icon_name (string): Filename of the labelset icon
- icon_data (string): Base64 encoded labelset icon data
- creator (string): Email of the labelset creator
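A quick way to sanity-check a parsed data.json against the field list above is to look for missing keys. This is a minimal sketch: the helper missing_keys is hypothetical, and it treats every listed field as required, which this page does not confirm (the icon fields, for example, may be optional in practice).

```python
REQUIRED_TOP_LEVEL_KEYS = ("annotated_docs", "doc_labels", "text_labels", "corpus", "label_set")
CORPUS_KEYS = ("id", "title", "description", "icon_name", "icon_data", "creator", "label_set")
LABEL_SET_KEYS = ("id", "title", "description", "icon_name", "icon_data", "creator")


def missing_keys(data: dict) -> list[str]:
    """Return dotted paths of documented keys that are absent from a parsed data.json."""
    missing = [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in data]
    corpus = data.get("corpus", {})
    label_set = data.get("label_set", {})
    missing += [f"corpus.{key}" for key in CORPUS_KEYS if key not in corpus]
    missing += [f"label_set.{key}" for key in LABEL_SET_KEYS if key not in label_set]
    return missing


# A deliberately incomplete object: no label_set, and a corpus with only two fields.
partial = {
    "annotated_docs": {},
    "doc_labels": {},
    "text_labels": {},
    "corpus": {"id": 1, "title": "Example Corpus"},
}
print(missing_keys(partial))
```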

OpenContractDocExport Format¶

Each document in annotated_docs is represented by an OpenContractDocExport object with fields:

doc_labels (list[string]): List of document label names applied to this doc
labelled_text (list[OpenContractsAnnotationPythonType]): List of text annotations
title (string): Document title
content (string): Full text content of the document
description (string): Description of the document
pawls_file_content (list[PawlsPagePythonType]): PAWLS parse data for each page
page_count (int): Number of pages in the document
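To show how these fields fit together, the following sketch builds a one-line summary per exported document. The function summarize_documents and the summary keys are illustrative names, not part of OpenContracts, and the sample data is a cut-down stand-in for a real annotated_docs entry.

```python
def summarize_documents(annotated_docs: dict) -> list[dict]:
    """Build one summary row per document from the OpenContractDocExport fields."""
    rows = []
    for filename, doc in annotated_docs.items():
        rows.append({
            "filename": filename,
            "title": doc["title"],
            "page_count": doc["page_count"],
            "doc_labels": list(doc["doc_labels"]),
            "annotation_count": len(doc["labelled_text"]),
            # Number of pages that actually carry PAWLS parse data.
            "parsed_pages": len(doc["pawls_file_content"]),
        })
    return rows


sample_docs = {
    "document1.pdf": {
        "doc_labels": ["Contract", "NDA"],
        "labelled_text": [{"annotationLabel": "Effective Date"}],
        "title": "Nondisclosure Agreement",
        "content": "This Nondisclosure Agreement is made...",
        "description": "Standard mutual NDA",
        "pawls_file_content": [
            {"page": {"width": 612, "height": 792, "index": 0}, "tokens": []}
        ],
        "page_count": 5,
    }
}
rows = summarize_documents(sample_docs)
print(rows[0])
```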

OpenContractsAnnotationPythonType Format¶

Represents an individual text annotation, with fields:

id (string): Optional ID
annotationLabel (string): Name of the label for this annotation
rawText (string): Raw text content of the annotation
page (int): 0-based page number the annotation is on
annotation_json (dict): Maps page numbers (serialized as string keys in JSON) to OpenContractsSinglePageAnnotationType

OpenContractsSinglePageAnnotationType Format¶

Represents the annotation data for a single page:

bounds (BoundingBoxPythonType): Bounding box of the annotation on the page
tokensJsons (list[TokenIdPythonType]): List of PAWLS tokens covered by the annotation
rawText (string): Raw text of the annotation on this page
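An annotation can span several pages, so consumers typically walk annotation_json page by page. The sketch below, with the hypothetical helper tokens_by_page, collects the token references for each page and converts the page keys back to integers, since JSON object keys are always strings.

```python
def tokens_by_page(annotation: dict) -> dict[int, list[tuple[int, int]]]:
    """Map each page number to the (pageIndex, tokenIndex) pairs the annotation covers."""
    result = {}
    for page_key, page_data in annotation["annotation_json"].items():
        # JSON object keys are strings ("0", "1", ...); convert to an int page number.
        page_number = int(page_key)
        result[page_number] = [
            (ref["pageIndex"], ref["tokenIndex"]) for ref in page_data["tokensJsons"]
        ]
    return result


annotation = {
    "id": "1",
    "annotationLabel": "Effective Date",
    "rawText": "January 1, 2023",
    "page": 0,
    "annotation_json": {
        "0": {
            "bounds": {"top": 100, "bottom": 120, "left": 50, "right": 500},
            "tokensJsons": [
                {"pageIndex": 0, "tokenIndex": 5},
                {"pageIndex": 0, "tokenIndex": 6},
            ],
            "rawText": "January 1, 2023",
        }
    },
}
print(tokens_by_page(annotation))  # {0: [(0, 5), (0, 6)]}
```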

BoundingBoxPythonType Format¶

Represents a bounding box with fields:

top (int)
bottom (int)
left (int)
right (int)

TokenIdPythonType Format¶

References a PAWLS token by page and token index:

pageIndex (int)
tokenIndex (int)
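Token references only become useful once they are resolved against the PAWLS page data. The sketch below assumes that tokenIndex is the position of the token within that page's tokens list and that pageIndex matches the page's index field; both are common PAWLS conventions but are not spelled out on this page. The helper resolve_token_text is a hypothetical name, and joining tokens with a single space is a simplification.

```python
def resolve_token_text(token_refs: list[dict], pawls_pages: list[dict]) -> str:
    """Look up each referenced token and join the token texts with spaces."""
    # Index pages by their declared page index rather than by list position.
    pages_by_index = {page["page"]["index"]: page for page in pawls_pages}
    words = []
    for ref in token_refs:
        page = pages_by_index[ref["pageIndex"]]
        words.append(page["tokens"][ref["tokenIndex"]]["text"])
    return " ".join(words)


pawls_pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 50, "y": 100, "width": 60, "height": 10, "text": "This"},
            {"x": 120, "y": 100, "width": 100, "height": 10, "text": "agreement"},
        ],
    }
]
refs = [{"pageIndex": 0, "tokenIndex": 0}, {"pageIndex": 0, "tokenIndex": 1}]
print(resolve_token_text(refs, pawls_pages))  # This agreement
```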

PawlsPagePythonType Format¶

Represents PAWLS parse data for a single page:

page (PawlsPageBoundaryPythonType): Page boundary info
tokens (list[PawlsTokenPythonType]): List of PAWLS tokens on the page

PawlsPageBoundaryPythonType Format¶

Represents the page boundary with fields:

width (float)
height (float)
index (int): Page index

PawlsTokenPythonType Format¶

Represents a single PAWLS token with fields:

x (float): X-coordinate of token box
y (float): Y-coordinate of token box
width (float): Width of token box
height (float): Height of token box
text (string): Text content of the token
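Token boxes and annotation bounding boxes use different shapes: tokens carry x, y, width, and height, while BoundingBoxPythonType carries top, bottom, left, and right. The sketch below derives an enclosing box from a set of tokens. It assumes (x, y) is the top-left corner of the token box and that y grows downward, which matches the example values on this page but is not stated explicitly. The helper enclosing_box is hypothetical; because the box fields are documented as int and the token fields as float, real code may also need to round the result.

```python
def enclosing_box(tokens: list[dict]) -> dict:
    """Return the smallest top/bottom/left/right box that contains every token.

    Assumes (x, y) is the top-left corner and y increases down the page.
    Raises ValueError if `tokens` is empty.
    """
    return {
        "top": min(token["y"] for token in tokens),
        "bottom": max(token["y"] + token["height"] for token in tokens),
        "left": min(token["x"] for token in tokens),
        "right": max(token["x"] + token["width"] for token in tokens),
    }


tokens = [
    {"x": 50, "y": 100, "width": 60, "height": 10, "text": "This"},
    {"x": 120, "y": 100, "width": 100, "height": 10, "text": "agreement"},
]
print(enclosing_box(tokens))
# {'top': 100, 'bottom': 110, 'left': 50, 'right': 220}
```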

AnnotationLabelPythonType Format¶

Defines an annotation label with fields:

id (string)
color (string): Hex color for the label
description (string)
icon (string): Icon name
text (string): Label text
label_type (LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL
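Because label_type is restricted to four values, it maps naturally onto an enumeration. The sketch below is one possible way to model it on the consumer side and to flag label definitions with an unrecognized type. The class and helper shown here are illustrative and are not claimed to mirror the OpenContracts source.

```python
from enum import Enum


class LabelType(str, Enum):
    """The four label_type values listed above."""
    DOC_TYPE_LABEL = "DOC_TYPE_LABEL"
    TOKEN_LABEL = "TOKEN_LABEL"
    RELATIONSHIP_LABEL = "RELATIONSHIP_LABEL"
    METADATA_LABEL = "METADATA_LABEL"


def labels_with_unknown_type(labels: dict) -> list[str]:
    """Return the names of label definitions whose label_type is not one of the four."""
    known = {member.value for member in LabelType}
    return [name for name, label in labels.items() if label.get("label_type") not in known]


text_labels = {
    "Effective Date": {"id": "3", "text": "Effective Date", "label_type": "TOKEN_LABEL"},
    # Made-up invalid value, included only to exercise the check.
    "Mystery": {"id": "9", "text": "Mystery", "label_type": "SPAN_LABEL"},
}
print(labels_with_unknown_type(text_labels))  # ['Mystery']
```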

Example data.json¶

{
  "annotated_docs": {
    "document1.pdf": {
      "doc_labels": ["Contract", "NDA"],
      "labelled_text": [
        {
          "id": "1",
          "annotationLabel": "Effective Date",
          "rawText": "This agreement is effective as of January 1, 2023",
          "page": 0,
          "annotation_json": {
            "0": {
              "bounds": {
                "top": 100,
                "bottom": 120,
                "left": 50,
                "right": 500
              },
              "tokensJsons": [
                {
                  "pageIndex": 0,
                  "tokenIndex": 5
                },
                {
                  "pageIndex": 0,
                  "tokenIndex": 6
                }
              ],
              "rawText": "January 1, 2023"
            }
          }
        }
      ],
      "title": "Nondisclosure Agreement",
      "content": "This Nondisclosure Agreement is made...",
      "description": "Standard mutual NDA",
      "pawls_file_content": [
        {
          "page": {
            "width": 612,
            "height": 792,
            "index": 0
          },
          "tokens": [
            {
              "x": 50,
              "y": 100,
              "width": 60,
              "height": 10,
              "text": "This"
            },
            {
              "x": 120,
              "y": 100,
              "width": 100,
              "height": 10,
              "text": "agreement"
            }
          ]
        }
      ],
      "page_count": 5
    }
  },
  "doc_labels": {
    "Contract": {
      "id": "1",
      "color": "#FF0000",
      "description": "Indicates a legal contract",
      "icon": "contract",
      "text": "Contract",
      "label_type": "DOC_TYPE_LABEL"
    },
    "NDA": {
      "id": "2",
      "color": "#00FF00",
      "description": "Indicates a non-disclosure agreement",
      "icon": "nda",
      "text": "NDA",
      "label_type": "DOC_TYPE_LABEL"
    }
  },
  "text_labels": {
    "Effective Date": {
      "id": "3",
      "color": "#0000FF",
      "description": "The effective date of the agreement",
      "icon": "calendar",
      "text": "Effective Date",
      "label_type": "TOKEN_LABEL"
    }
  },
  "corpus": {
    "id": 1,
    "title": "Example Corpus",
    "description": "A sample corpus for demonstration",
    "icon_name": "corpus_icon.png",
    "icon_data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==",
    "creator": "user@example.com",
    "label_set": "4"
  },
  "label_set": {
    "id": 4,
    "title": "Example Label Set",
    "description": "A sample label set",
    "icon_name": "label_icon.png",
    "icon_data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==",
    "creator": "user@example.com"
  }
}

This data.json file includes:

One annotated document (document1.pdf) with two document labels ("Contract" and "NDA") and one text annotation for the "Effective Date"
Definitions for the two document labels ("Contract" and "NDA") and one text label ("Effective Date")
Metadata about the exported corpus and labelset, including Base64 encoded icon data

The PAWLS token data and text content are truncated for brevity. In a real export, the pawls_file_content would include the complete token data for each page, and content would contain the full extracted text of the document.
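Since documents and annotations refer to labels by name, a consumer may want to confirm that every name used in annotated_docs has a matching definition in the top-level doc_labels or text_labels. The sketch below performs that check. It assumes the names are meant to line up this way, which the field descriptions suggest but do not guarantee, and the helper undefined_labels is a hypothetical name.

```python
def undefined_labels(data: dict) -> list[tuple[str, str, str]]:
    """Return (filename, label_dict, label_name) for each label used but not defined."""
    problems = []
    for filename, doc in data["annotated_docs"].items():
        for name in doc["doc_labels"]:
            if name not in data["doc_labels"]:
                problems.append((filename, "doc_labels", name))
        for annotation in doc["labelled_text"]:
            name = annotation["annotationLabel"]
            if name not in data["text_labels"]:
                problems.append((filename, "text_labels", name))
    return problems


# Cut-down stand-in for a parsed data.json; "Governing Law" is deliberately undefined.
data = {
    "annotated_docs": {
        "document1.pdf": {
            "doc_labels": ["Contract", "NDA"],
            "labelled_text": [
                {"annotationLabel": "Effective Date"},
                {"annotationLabel": "Governing Law"},
            ],
        }
    },
    "doc_labels": {"Contract": {}, "NDA": {}},
    "text_labels": {"Effective Date": {}},
}
print(undefined_labels(data))
# [('document1.pdf', 'text_labels', 'Governing Law')]
```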
