Semantic Compare on Server/Desktop

Overview

Apryse's semantic comparison feature enables the visualization of textual differences between two related PDF documents. The text processing is based on natural reading order, highlighting the differences as colored annotations.

The comparison is always performed between two versions of a document. The older version is called the Before file (document 1), while the new version is the After file (document 2).

A difference is defined as a consecutive block of text that was inserted, deleted or modified. Differences always come in pairs. When some text is deleted from the Before side, a corresponding placeholder is inserted into the After side to indicate the position where the text was deleted from.

Similarly, when some text is inserted into the After side, a corresponding placeholder is generated for the Before side to help identify its location of insertion.

Finally, when content is modified, the difference comes out as a pair of annotations consisting of a deletion on the Before side and an insertion on the After side.

These difference annotations are labeled with a unique identifier, so they can be paired up side-by-side at the application level. The Annotation Metadata section covers this in more detail.

When entire lines are inserted or deleted, the placeholder on the opposite side is drawn as a horizontal line.

Usage

The HighlightTextDiff method takes two PDF documents as input, one being the Before (1) and the other one being the After (2) document. It compares them to find any differences, then overlays the highlight annotations on top of the input documents, which can in turn be saved to files.

// Start with a PDFDoc (open source documents to compare)
using (PDFDoc doc1 = new PDFDoc("compare_before.pdf"))
using (PDFDoc doc2 = new PDFDoc("compare_after.pdf"))
{
    // Create an options object
    TextDiffOptions options = new TextDiffOptions();

    // Compare and highlight text differences in doc1 and doc2
    PDFDoc.HighlightTextDiff(doc1, doc2, options);

    // Save highlighted PDFs
    doc1.Save("diff_before.pdf", SDFDoc.SaveOptions.e_incremental);
    doc2.Save("diff_after.pdf", SDFDoc.SaveOptions.e_incremental);
}

// Start with a PDFDoc (open source documents to compare)
PDFDoc doc1("compare_before.pdf");
PDFDoc doc2("compare_after.pdf");

// Create an options object
TextDiffOptions options;

// Compare and highlight text differences in doc1 and doc2
PDFDoc::HighlightTextDiff(doc1, doc2, &options);

// Save highlighted PDFs
doc1.Save("diff_before.pdf", pdftron::SDF::SDFDoc::e_incremental);
doc2.Save("diff_after.pdf", pdftron::SDF::SDFDoc::e_incremental);

// Start with a PDFDoc (open source documents to compare)
doc1 := NewPDFDoc("compare_before.pdf")
doc2 := NewPDFDoc("compare_after.pdf")

// Create an options object
options := NewTextDiffOptions()

// Compare and highlight text differences in doc1 and doc2
PDFDocHighlightTextDiff(doc1, doc2, options)

// Save highlighted PDFs
doc1.Save("diff_before.pdf", uint(SDFDocE_incremental))
doc2.Save("diff_after.pdf", uint(SDFDocE_incremental))

// Start with a PDFDoc (open source documents to compare)
PDFDoc doc1 = new PDFDoc("compare_before.pdf");
PDFDoc doc2 = new PDFDoc("compare_after.pdf");

// Create an options object
TextDiffOptions options = new TextDiffOptions();

// Compare and highlight text differences in doc1 and doc2
PDFDoc.highlightTextDiff(doc1, doc2, options);

// Save highlighted PDFs
doc1.save("diff_before.pdf", SDFDoc.SaveMode.INCREMENTAL, null);
doc2.save("diff_after.pdf", SDFDoc.SaveMode.INCREMENTAL, null);

// Dispose PDFDoc objects
doc1.close();
doc2.close();

async function main() {
    // Start with a PDFDoc (open source documents to compare)
    const doc1 = await PDFNet.PDFDoc.createFromURL("compare_before.pdf");
    const doc2 = await PDFNet.PDFDoc.createFromURL("compare_after.pdf");

    // Create an options object
    const options = await PDFNet.PDFDoc.createTextDiffOptions();

    // Compare and highlight text differences in doc1 and doc2
    await doc1.highlightTextDiff(doc2, options);
}
PDFNet.runWithCleanup(main);

// Start with a PDFDoc (open source documents to compare)
PTPDFDoc *doc1 = [[PTPDFDoc alloc] initWithFilepath: @"compare_before.pdf"];
PTPDFDoc *doc2 = [[PTPDFDoc alloc] initWithFilepath: @"compare_after.pdf"];
// Create an options object
PTTextDiffOptions *options = [[PTTextDiffOptions alloc] init];
// Compare and highlight text differences in doc1 and doc2
[PTPDFDoc HighlightTextDiff: doc1 doc2: doc2 options: options];
// Save highlighted PDFs
[doc1 SaveToFile: @"diff_before.pdf" flags: e_ptincremental];
[doc2 SaveToFile: @"diff_after.pdf" flags: e_ptincremental];

Note that HighlightTextDiff is a static method within PDFDoc. The method returns the number of differences found, where each contiguous block of change is considered a single difference. If the two documents are identical, the function returns 0 and no annotations are added.

When the input PDFs already contain annotations or widgets, they are first flattened. When the function finishes, all annotations represent actual differences of text.

Facing Pages

As more words and paragraphs are inserted, the After version grows longer than the Before version, until page N in Before no longer matches page N in After. Displaying the two versions next to each other by page number would then put unrelated content side by side, which can be quite confusing.

Yet in certain situations, you can assume that pages generally line up between Before and After, especially with short documents with few changes between them.

We actually offer two separate APIs, AppendTextDiff and HighlightTextDiff. Depending on the situation, one might be preferable to the other.

Consider two versions of an input document: Before (yellow paper) and After (cyan paper).

The AppendTextDiff function generates a single output document where the Before and After pages are merged in an alternating order (page 1 of each, followed by page 2 of each, and so on). In this case, you are only saving a single output file, and you can use a single WebViewer in Double Page mode (also known as Two Page View in Acrobat), so that the same page numbers from the two versions are next to each other.

This is most suitable when you know that Before and After have approximately the same number of pages and the differences are on the small side. When one document is longer than the other, blank filler pages are automatically inserted, so that the last few pages will have a blank pair.
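The alternating page order can be pictured with a small standalone sketch. It does not call the Apryse SDK; the class name FacingPages and the text labels are hypothetical, and it assumes that the shorter side is the one that receives the blank filler pages.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models the facing-page order described above.
// FacingPages and its labels are hypothetical, not SDK names.
final class FacingPages {

    // Returns labels such as "Before 1", "After 1", "Before 2", ...
    // A side that has run out of pages contributes a "blank" filler
    // (assumption: the shorter document is the one that gets padded).
    static List<String> interleave(int beforePages, int afterPages) {
        List<String> order = new ArrayList<>();
        int pairs = Math.max(beforePages, afterPages);
        for (int n = 1; n <= pairs; n++) {
            order.add(n <= beforePages ? "Before " + n : "blank");
            order.add(n <= afterPages ? "After " + n : "blank");
        }
        return order;
    }

    public static void main(String[] args) {
        // Before has 2 pages, After has 3: the last pair has a blank on the left.
        // Prints [Before 1, After 1, Before 2, After 2, blank, After 3]
        System.out.println(interleave(2, 3));
    }
}
```

In Double Page mode each consecutive pair in this list would be shown as one left/right spread.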

The advantage is that a single WebViewer switched into Double Page view is easy to set up, and a single output file contains both versions in a compact format. However, the Before and After sides can't be scrolled independently, and the left and right pages can quickly get out of sync.

The HighlightTextDiff function treats the two inputs independently, and inserts the highlights directly into the Before and After documents. In this case, you are saving two output files that require two WebViewer controls side-by-side. This way the two documents can scroll independently, so that insertions and deletions line up perfectly, even when the page numbers are way out of sync.

We've already seen sample code for HighlightTextDiff. Here are some samples for AppendTextDiff:

// Start with a PDFDoc (open source documents to compare)
using (PDFDoc output = new PDFDoc())
using (PDFDoc doc1 = new PDFDoc("compare_before.pdf"))
using (PDFDoc doc2 = new PDFDoc("compare_after.pdf"))
{
    // Create an options object
    TextDiffOptions options = new TextDiffOptions();
    // Compare and highlight text differences in doc1 and doc2
    output.AppendTextDiff(doc1, doc2, options);
    // Save highlighted PDF
    output.Save("diff.pdf", SDFDoc.SaveOptions.e_incremental);
}

// Start with a PDFDoc (open source documents to compare)
PDFDoc output;
PDFDoc doc1("compare_before.pdf");
PDFDoc doc2("compare_after.pdf");
// Create an options object
TextDiffOptions options;
// Compare and highlight text differences in doc1 and doc2
output.AppendTextDiff(doc1, doc2, &options);
// Save highlighted PDF
output.Save("diff.pdf", pdftron::SDF::SDFDoc::e_incremental);

// Start with a PDFDoc (open source documents to compare)
output := NewPDFDoc()
doc1 := NewPDFDoc("compare_before.pdf")
doc2 := NewPDFDoc("compare_after.pdf")
// Create an options object
options := NewTextDiffOptions()
// Compare and highlight text differences in doc1 and doc2
output.AppendTextDiff(doc1, doc2, options)
// Save highlighted PDF
output.Save("diff.pdf", uint(SDFDocE_incremental))

// Start with a PDFDoc (open source documents to compare)
PDFDoc output = new PDFDoc();
PDFDoc doc1 = new PDFDoc("compare_before.pdf");
PDFDoc doc2 = new PDFDoc("compare_after.pdf");
// Create an options object
TextDiffOptions options = new TextDiffOptions();
// Compare and highlight text differences in doc1 and doc2
output.appendTextDiff(doc1, doc2, options);
// Save highlighted PDF
output.save("diff.pdf", SDFDoc.SaveMode.INCREMENTAL, null);
// Dispose PDFDoc objects
doc1.close();
doc2.close();
output.close();

async function main() {
    // Start with a PDFDoc (open source documents to compare)
    const output = await PDFNet.PDFDoc.create();
    const doc1 = await PDFNet.PDFDoc.createFromURL("compare_before.pdf");
    const doc2 = await PDFNet.PDFDoc.createFromURL("compare_after.pdf");
    // Create an options object
    const options = await PDFNet.PDFDoc.createTextDiffOptions();
    // Compare and highlight text differences in doc1 and doc2
    await output.appendTextDiff(doc1, doc2, options);
}
PDFNet.runWithCleanup(main);

// Start with a PDFDoc (open source documents to compare)
PTPDFDoc *output = [[PTPDFDoc alloc] init];
PTPDFDoc *doc1 = [[PTPDFDoc alloc] initWithFilepath: @"compare_before.pdf"];
PTPDFDoc *doc2 = [[PTPDFDoc alloc] initWithFilepath: @"compare_after.pdf"];
// Create an options object
PTTextDiffOptions *options = [[PTTextDiffOptions alloc] init];
// Compare and highlight text differences in doc1 and doc2
[output AppendTextDiff: doc1 doc2: doc2 options: options];
// Save highlighted PDF
[output SaveToFile: @"diff.pdf" flags: e_ptincremental];

Highlight Colors

Designers will be happy to learn that the Before and After annotation colors are customizable. Both the RGB value and the opacity can be adjusted. An opacity of 1.0 (100% opaque) can be too dark in combination with certain colors, in which case 0.5 (50% semi-transparent) may work better. An opacity value of 0.0 (full transparency) is completely invisible, so it makes sense to stay above values of 0.15.

The colors can be configured via the TextDiffOptions object. SetColorA and SetOpacityA control the Before document's annotation color. SetColorB and SetOpacityB adjust the After document's annotation color. Finally, the options object is passed to HighlightTextDiff as the third argument.

// Create an options object
TextDiffOptions options = new TextDiffOptions();
// Before color is 100% red, 25% opacity
options.SetColorA(new ColorPt(1.0, 0.0, 0.0));
options.SetOpacityA(0.25);
// After color is 100% blue, 25% opacity
options.SetColorB(new ColorPt(0.0, 0.0, 1.0));
options.SetOpacityB(0.25);
// Compare and highlight text differences
PDFDoc.HighlightTextDiff(doc1, doc2, options);

// Create an options object
TextDiffOptions options;
// Before color is 100% red, 25% opacity
options.SetColorA(ColorPt(1.0, 0.0, 0.0));
options.SetOpacityA(0.25);
// After color is 100% blue, 25% opacity
options.SetColorB(ColorPt(0.0, 0.0, 1.0));
options.SetOpacityB(0.25);
// Compare and highlight text differences
PDFDoc::HighlightTextDiff(doc1, doc2, &options);

// Create an options object
options := NewTextDiffOptions()
// Before color is 100% red, 25% opacity
options.SetColorA(NewColorPt(1.0, 0.0, 0.0))
options.SetOpacityA(0.25)
// After color is 100% blue, 25% opacity
options.SetColorB(NewColorPt(0.0, 0.0, 1.0))
options.SetOpacityB(0.25)
// Compare and highlight text differences in doc1 and doc2
PDFDocHighlightTextDiff(doc1, doc2, options)

// Create an options object
TextDiffOptions options = new TextDiffOptions();
// Before color is 100% red, 25% opacity
options.setColorA(new ColorPt(1.0, 0.0, 0.0));
options.setOpacityA(0.25);
// After color is 100% blue, 25% opacity
options.setColorB(new ColorPt(0.0, 0.0, 1.0));
options.setOpacityB(0.25);
// Compare and highlight text differences
PDFDoc.highlightTextDiff(doc1, doc2, options);

async function main() {
    // Create an options object
    const options = await PDFNet.PDFDoc.createTextDiffOptions();
    // Before color is 100% red, 25% opacity
    options.setColorA({R: 1, G: 0, B: 0});
    options.setOpacityA(0.25);
    // After color is 100% blue, 25% opacity
    options.setColorB({R: 0, G: 0, B: 1});
    options.setOpacityB(0.25);
    // Compare and highlight text differences
    await doc1.highlightTextDiff(doc2, options);
}
PDFNet.runWithCleanup(main);

// Create an options object
PTTextDiffOptions *options = [[PTTextDiffOptions alloc] init];
// Before color is 100% red, 25% opacity
[options SetColorA: [[PTColorPt alloc] initWithX: 1.0 y: 0.0 z: 0.0 w: 0.0]];
[options SetOpacityA: 0.25];
// After color is 100% blue, 25% opacity
[options SetColorB: [[PTColorPt alloc] initWithX: 0.0 y: 0.0 z: 1.0 w: 0.0]];
[options SetOpacityB: 0.25];
// Compare and highlight text differences in doc1 and doc2
[PTPDFDoc HighlightTextDiff: doc1 doc2: doc2 options: options];
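Design tools usually express colors as 8-bit RGB values, while the samples above pass components in the 0.0 to 1.0 range. The following standalone sketch shows one way to prepare such values before handing them to ColorPt and the opacity setters. HighlightStyle and its methods are hypothetical helpers, not part of the Apryse SDK, and the 0.15 floor is only the rule of thumb mentioned earlier in this section.

```java
// Hypothetical helper, not part of the Apryse SDK: prepares values in the
// 0.0-1.0 range used by the ColorPt and opacity calls in the samples above.
final class HighlightStyle {

    // Lowest opacity this sketch allows, taken from the rule of thumb above.
    static final double MIN_VISIBLE_OPACITY = 0.15;

    // Converts 8-bit RGB channels (0-255) to unit-range components.
    static double[] toUnitRgb(int r, int g, int b) {
        return new double[] { channel(r), channel(g), channel(b) };
    }

    private static double channel(int value) {
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException("channel out of range: " + value);
        }
        return value / 255.0;
    }

    // Clamps an opacity into [MIN_VISIBLE_OPACITY, 1.0].
    static double clampOpacity(double opacity) {
        return Math.max(MIN_VISIBLE_OPACITY, Math.min(1.0, opacity));
    }

    public static void main(String[] args) {
        double[] red = toUnitRgb(255, 0, 0);
        // Prints 1.0, 0.0, 0.0
        System.out.println(red[0] + ", " + red[1] + ", " + red[2]);
        // Prints 0.15 because 0.05 is below the floor
        System.out.println(clampOpacity(0.05));
    }
}
```

The three returned components would then be passed to the ColorPt constructor shown in the Java sample, and the clamped opacity to setOpacityA or setOpacityB.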

Exclusion Zones

Sometimes certain areas need to be excluded from text differencing. Most typically, headers and footers disrupt the flow of the logical content, which often shows up as false differences. Another example is an advertisement that is not part of the actual content.

For those reasons the semantic comparison API offers a way of setting up exclusion zones where text is not considered for comparison, so any differences will be ignored.

// Create an options object
TextDiffOptions options = new TextDiffOptions();
// Exclude footer area from page 1
RectCollection exclusion = new RectCollection();
exclusion.AddRect(new Rect(0, 0, 612, 72));
options.AddIgnoreZonesForPage(exclusion, 1);
// Compare and highlight text differences
PDFDoc.HighlightTextDiff(doc1, doc2, options);

// Create an options object
TextDiffOptions options;
// Exclude footer area from page 1
RectCollection exclusion;
exclusion.AddRect(Rect(0, 0, 612, 72));
options.AddIgnoreZonesForPage(exclusion, 1);
// Compare and highlight text differences
PDFDoc::HighlightTextDiff(doc1, doc2, &options);

// Create an options object
options := NewTextDiffOptions()
// Exclude footer area from page 1
exclusion := NewRectCollection()
exclusion.AddRect(NewRect(0.0, 0.0, 612.0, 72.0))
options.AddIgnoreZonesForPage(exclusion, 1)
// Compare and highlight text differences in doc1 and doc2
PDFDocHighlightTextDiff(doc1, doc2, options)

// Create an options object
TextDiffOptions options = new TextDiffOptions();
// Exclude footer area from page 1
RectCollection exclusion = new RectCollection();
exclusion.addRect(new Rect(0, 0, 612, 72));
options.addIgnoreZonesForPage(exclusion, 1);
// Compare and highlight text differences
PDFDoc.highlightTextDiff(doc1, doc2, options);

async function main() {
    // Create an options object
    const options = await PDFNet.PDFDoc.createTextDiffOptions();
    // Exclude footer area from page 1
    const exclusion = [{ x1: 0, y1: 0, x2: 612, y2: 72 }];
    options.addIgnoreZonesForPage(exclusion, 1);
    // Compare and highlight text differences
    await doc1.highlightTextDiff(doc2, options);
}
PDFNet.runWithCleanup(main);

// Create an options object
PTTextDiffOptions *options = [[PTTextDiffOptions alloc] init];
// Exclude footer area from page 1
PTPDFRectCollection *exclusion = [[PTPDFRectCollection alloc] init];
[exclusion AddRect: [[PTPDFRect alloc] initWithX1: 0 y1: 0 x2: 612 y2: 72]];
[options AddIgnoreZonesForPage: exclusion page_num: 1];
// Compare and highlight text differences in doc1 and doc2
[PTPDFDoc HighlightTextDiff: doc1 doc2: doc2 options: options];

Note that rectangles use PDF coordinates (measured in points, 1 point = 1/72 inch; origin is the page's bottom-left corner).
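Because the origin is at the bottom-left, a footer band starts at y = 0, while a header band has to be measured down from the page height. The standalone sketch below works out both rectangles from a band height in inches. ExclusionBands is a hypothetical helper, not part of the Apryse SDK, and it assumes an unrotated page whose page box starts at (0, 0).

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Apryse SDK: computes {x1, y1, x2, y2}
// in PDF points for header and footer bands.
// Assumes an unrotated page whose lower-left corner is at (0, 0).
final class ExclusionBands {

    static final double POINTS_PER_INCH = 72.0;

    // Band of the given height (in inches) along the bottom edge of the page.
    static double[] footer(double pageWidthPt, double heightInches) {
        return new double[] { 0, 0, pageWidthPt, heightInches * POINTS_PER_INCH };
    }

    // Band of the given height (in inches) along the top edge of the page.
    static double[] header(double pageWidthPt, double pageHeightPt, double heightInches) {
        double bandPt = heightInches * POINTS_PER_INCH;
        return new double[] { 0, pageHeightPt - bandPt, pageWidthPt, pageHeightPt };
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points (8.5 x 11 inches).
        // Prints [0.0, 0.0, 612.0, 72.0], the footer used in the samples above
        System.out.println(Arrays.toString(footer(612, 1)));
        // Prints [0.0, 720.0, 612.0, 792.0]
        System.out.println(Arrays.toString(header(612, 792, 1)));
    }
}
```

The four values map directly onto the Rect constructor arguments used in the samples above.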

Annotation Metadata

We mentioned earlier that differences always come in pairs. Each difference carries a unique numeric identifier, starting at 1 and incrementing by 1. Highlight annotations sharing the same identifier in both documents correspond to each other.

In addition, each annotation also carries information about the type of difference it represents, which can be either insertion, deletion or edit.

Usually each difference is represented by a single annotation object, which may consist of one or more rectangles. However, in certain situations an insertion or deletion may wrap across page boundaries. In those cases a single difference can consist of more than one highlight annotation, one per page, all instances sharing the same identifier and difference type.

The identifier and type information are stored as metadata within the annotation object under two custom keys:

TextDiffID: a unique number shared between the two PDF documents.
TextDiffType: may be either insert, delete or edit.

The easiest way to retrieve this information is via the Annot.GetCustomData method:

// Get page 1
Page page1 = doc1.GetPage(1);
// Get first annotation
Annot annot1 = page1.GetAnnot(0);
// Get custom data TextDiffID
string id = annot1.GetCustomData("TextDiffID");
// Get custom data TextDiffType
string type = annot1.GetCustomData("TextDiffType");

// Get page 1
Page page1 = doc1.GetPage(1);
// Get first annotation
Annot annot1 = page1.GetAnnot(0);
// Get custom data TextDiffID
UString id = annot1.GetCustomData("TextDiffID");
// Get custom data TextDiffType
UString type = annot1.GetCustomData("TextDiffType");

// Get page 1
Page page1 = doc1.getPage(1);
// Get first annotation
Annot annot1 = page1.getAnnot(0);
// Get custom data TextDiffID
String id = annot1.getCustomData("TextDiffID");
// Get custom data TextDiffType
String type = annot1.getCustomData("TextDiffType");

// Get page 1
PTPage *page1 = [doc1 GetPage: 1];
// Get first annotation
PTAnnot* annot1 = [page1 GetAnnot: 0];
// Get custom data TextDiffID
NSString *id = [annot1 GetCustomData: @"TextDiffID"];
// Get custom data TextDiffType
NSString *type = [annot1 GetCustomData: @"TextDiffType"];

Note that the identifier comes out as a string, but it can be interpreted as a number.
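Once the TextDiffID strings have been read from both documents, they can be parsed and grouped to pair up the two sides. The standalone sketch below assumes the ID strings were already collected with getCustomData as shown above. DiffPairing is a hypothetical helper, not part of the Apryse SDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Apryse SDK: groups TextDiffID strings
// that were already read from each document's annotations.
final class DiffPairing {

    // Maps each numeric ID to {count on the Before side, count on the After side}.
    // A count above 1 means the difference spans more than one page on that side.
    static Map<Integer, int[]> countById(List<String> beforeIds, List<String> afterIds) {
        Map<Integer, int[]> counts = new TreeMap<>();
        for (String id : beforeIds) {
            counts.computeIfAbsent(Integer.parseInt(id.trim()), k -> new int[2])[0]++;
        }
        for (String id : afterIds) {
            counts.computeIfAbsent(Integer.parseInt(id.trim()), k -> new int[2])[1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Difference 2 wraps across a page boundary in the Before document.
        List<String> before = Arrays.asList("1", "2", "2", "3");
        List<String> after = Arrays.asList("1", "2", "3");
        // Prints:
        // 1: before=1, after=1
        // 2: before=2, after=1
        // 3: before=1, after=1
        countById(before, after).forEach((id, c) ->
            System.out.println(id + ": before=" + c[0] + ", after=" + c[1]));
    }
}
```

A TreeMap keeps the differences in numeric order, which matches the order in which the identifiers are assigned.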
