Export / Import Functionality¶
Exports¶
OpenContracts support both exporting and importing corpuses. This functionality is disabled on the public demo as it can be bandwidth intensive. If you want to experiment with these features on your own, you'll see the export action when you right-click on a corpus:
You can access your exports from the user dropdown menu in the top right corner of the screen. Once your export is complete, you should be able to download a zip containing all the documents, their PAWLs layers, and the corpus data you created - including all annotations.
Imports¶
If you've enabled corpus imports (see the frontend env file for the boolean toggle to do this - it's REACT_APP_ALLOW_IMPORTS
), you'll see an import action when you click the action button on the corpus page.
Export Format¶
OpenContracts Export Format Specification¶
The OpenContracts export is a zip archive containing: 1. A data.json
file with metadata about the export 2. The original PDF documents 3. Exported annotations "burned in" to the PDF documents
data.json Format¶
The data.json
file contains a JSON object with the following fields:
-
annotated_docs
(dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document. -
doc_labels
(dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels. -
text_labels
(dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels. -
corpus
(OpenContractCorpusType): Metadata about the exported corpus, with fields:id
(int): ID of the corpustitle
(string)description
(string)icon_name
(string): Filename of the corpus icon imageicon_data
(string): Base64 encoded icon image datacreator
(string): Email of the corpus creatorlabel_set
(string): ID of the labelset used by this corpus
-
label_set
(OpenContractsLabelSetType): Metadata about the label set, with fields:id
(int)title
(string)description
(string)icon_name
(string): Filename of the labelset iconicon_data
(string): Base64 encoded labelset icon datacreator
(string): Email of the labelset creator
OpenContractDocExport Format¶
Each document in annotated_docs
is represented by an OpenContractDocExport object with fields:
doc_labels
(list[string]): List of document label names applied to this doclabelled_text
(list[OpenContractsAnnotationPythonType]): List of text annotationstitle
(string): Document titlecontent
(string): Full text content of the documentdescription
(string): Description of the documentpawls_file_content
(list[PawlsPagePythonType]): PAWLS parse data for each pagepage_count
(int): Number of pages in the document
OpenContractsAnnotationPythonType Format¶
Represents an individual text annotation, with fields:
id
(string): Optional IDannotationLabel
(string): Name of the label for this annotationrawText
(string): Raw text content of the annotationpage
(int): 0-based page number the annotation is onannotation_json
(dict): Maps page numbers to OpenContractsSinglePageAnnotationType
OpenContractsSinglePageAnnotationType Format¶
Represents the annotation data for a single page:
bounds
(BoundingBoxPythonType): Bounding box of the annotation on the pagetokensJsons
(list[TokenIdPythonType]): List of PAWLS tokens covered by the annotationrawText
(string): Raw text of the annotation on this page
BoundingBoxPythonType Format¶
Represents a bounding box with fields:
top
(int)bottom
(int)left
(int)right
(int)
TokenIdPythonType Format¶
References a PAWLS token by page and token index:
pageIndex
(int)tokenIndex
(int)
PawlsPagePythonType Format¶
Represents PAWLS parse data for a single page:
page
(PawlsPageBoundaryPythonType): Page boundary infotokens
(list[PawlsTokenPythonType]): List of PAWLS tokens on the page
PawlsPageBoundaryPythonType Format¶
Represents the page boundary with fields:
width
(float)height
(float)index
(int): Page index
PawlsTokenPythonType Format¶
Represents a single PAWLS token with fields:
x
(float): X-coordinate of token boxy
(float): Y-coordinate of token boxwidth
(float): Width of token boxheight
(float): Height of token boxtext
(string): Text content of the token
AnnotationLabelPythonType Format¶
Defines an annotation label with fields:
id
(string)color
(string): Hex color for the labeldescription
(string)icon
(string): Icon nametext
(string): Label textlabel_type
(LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL
Example data.json¶
{
"annotated_docs": {
"document1.pdf": {
"doc_labels": ["Contract", "NDA"],
"labelled_text": [
{
"id": "1",
"annotationLabel": "Effective Date",
"rawText": "This agreement is effective as of January 1, 2023",
"page": 0,
"annotation_json": {
"0": {
"bounds": {
"top": 100,
"bottom": 120,
"left": 50,
"right": 500
},
"tokensJsons": [
{
"pageIndex": 0,
"tokenIndex": 5
},
{
"pageIndex": 0,
"tokenIndex": 6
}
],
"rawText": "January 1, 2023"
}
}
}
],
"title": "Nondisclosure Agreement",
"content": "This Nondisclosure Agreement is made...",
"description": "Standard mutual NDA",
"pawls_file_content": [
{
"page": {
"width": 612,
"height": 792,
"index": 0
},
"tokens": [
{
"x": 50,
"y": 100,
"width": 60,
"height": 10,
"text": "This"
},
{
"x": 120,
"y": 100,
"width": 100,
"height": 10,
"text": "agreement"
}
]
}
],
"page_count": 5
}
},
"doc_labels": {
"Contract": {
"id": "1",
"color": "#FF0000",
"description": "Indicates a legal contract",
"icon": "contract",
"text": "Contract",
"label_type": "DOC_TYPE_LABEL"
},
"NDA": {
"id": "2",
"color": "#00FF00",
"description": "Indicates a non-disclosure agreement",
"icon": "nda",
"text": "NDA",
"label_type": "DOC_TYPE_LABEL"
}
},
"text_labels": {
"Effective Date": {
"id": "3",
"color": "#0000FF",
"description": "The effective date of the agreement",
"icon": "calendar",
"text": "Effective Date",
"label_type": "TOKEN_LABEL"
}
},
"corpus": {
"id": 1,
"title": "Example Corpus",
"description": "A sample corpus for demonstration",
"icon_name": "corpus_icon.png",
"icon_data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==",
"creator": "user@example.com",
"label_set": "4"
},
"label_set": {
"id": "4",
"title": "Example Label Set",
"description": "A sample label set",
"icon_name": "label_icon.png",
"icon_data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==",
"creator": "user@example.com"
}
}
This data.json
file includes:
- One annotated document (
document1.pdf
) with two document labels ("Contract" and "NDA") and one text annotation for the "Effective Date" - Definitions for the two document labels ("Contract" and "NDA") and one text label ("Effective Date")
- Metadata about the exported corpus and labelset, including Base64 encoded icon data
The PAWLS token data and text content are truncated for brevity. In a real export, the pawls_file_content
would include the complete token data for each page, and content
would contain the full extracted text of the document.
Let me know if you have any other questions!