To extract text from a PDF document.
async function main() {
  const doc = await PDFNet.PDFDoc.createFromURL(filename);
  const page = await doc.getPage(1);
  const txt = await PDFNet.TextExtractor.create();
  const rect = await page.getCropBox();
  await txt.begin(page, rect); // Read the page.

  // Extract words one by one.
  let line = await txt.getFirstLine();
  for (; (await line.isValid()); line = (await line.getNextLine())) {
    // Declare the loop variable so it does not leak as an implicit global.
    let word = await line.getFirstWord();
    for (; (await word.isValid()); word = (await word.getNextWord())) {
      const text = await word.getString(); // Text of the current word.
      console.log(text);
    }
  }
}
PDFNet.runWithCleanup(main);
Read a PDF File sample
Full sample code which illustrates the basic text extraction capabilities.
To extract text from under an annotation in the document.
async function main() {
  const doc = await PDFNet.PDFDoc.createFromURL(filename);
  const page = await doc.getPage(1);
  const annotation = await page.getAnnot(0); // First annotation on the page.
  const txt = await PDFNet.TextExtractor.create();
  const rect = await page.getCropBox();
  await txt.begin(page, rect); // Read the page.
  const textData = await txt.getTextUnderAnnot(annotation);
  console.log(textData);
}
PDFNet.runWithCleanup(main);
When we use the ElementReader class to read elements from a PDF document, we are often faced with partial data. For example, suppose we are attempting to extract the sentence "This is a sample sentence." from a PDF document. We could end up with two elements: "T" and "his is a sample sentence.". This is possible because, in a PDF document, text objects are not always cleanly organized into words, sentences, or paragraphs. The ElementReader class returns Element objects exactly as they are defined in the PDF page content stream.
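To make the fragment problem concrete, here is a minimal sketch of gluing adjacent fragments back together. It does not use the SDK and is not how TextExtractor works internally: the run shape `{ text, x, width }`, the `mergeRuns` helper, and the `maxGap` threshold are all hypothetical, and the runs are assumed to sit on one line, ordered left to right.

```javascript
// Hypothetical text-run shape: { text, x, width }, in page units.
// These objects stand in for fragments read from a page content stream.
function mergeRuns(runs, maxGap = 1.0) {
  const merged = [];
  for (const run of runs) {
    const last = merged[merged.length - 1];
    // Glue this run onto the previous one when it starts (almost) where
    // the previous run ended.
    if (last && run.x - (last.x + last.width) <= maxGap) {
      last.text += run.text;
      last.width = run.x + run.width - last.x;
    } else {
      merged.push({ ...run }); // Copy so the input array is not mutated.
    }
  }
  return merged;
}

const runs = [
  { text: 'T', x: 72, width: 6 },
  { text: 'his is a sample sentence.', x: 78, width: 120 },
];
const mergedRuns = mergeRuns(runs);
console.log(mergedRuns.map((r) => r.text));
```

A fixed gap threshold is the simplest possible rule. Real pages vary in font size and spacing, which is one reason a dedicated extractor is the more practical choice.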
An element of type e_text corresponds directly to a Tj element in the PDF document. Each e_text element represents a text run, that is, a sequence of text glyphs that share the same font and graphics attributes. For example, if the letters of a single word are each set in a different font, each letter is a separate text run. You may also encounter text runs that contain multiple words separated by spaces. The PDF format does not guarantee that text is presented in reading order.
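As an illustration of why reading order has to be reconstructed, the sketch below sorts out-of-order runs by position. Everything in it is an assumption made for the example: the run shape `{ text, x, y }`, the `sortIntoReadingOrder` helper, and the `lineTolerance` value. It only handles a single column of left-to-right text, with y growing upward as in PDF user space.

```javascript
// Hypothetical run shape: { text, x, y }, where y grows upward.
function sortIntoReadingOrder(runs, lineTolerance = 2) {
  // Sort top to bottom first (a larger y is higher on the page).
  const byHeight = [...runs].sort((a, b) => b.y - a.y);
  const lines = [];
  for (const run of byHeight) {
    const line = lines[lines.length - 1];
    // Runs whose vertical positions are within the tolerance share a line.
    if (line && Math.abs(line.y - run.y) <= lineTolerance) {
      line.runs.push(run);
    } else {
      lines.push({ y: run.y, runs: [run] });
    }
  }
  // Within each line, read left to right and join the runs with spaces.
  return lines.map((line) =>
    line.runs.sort((a, b) => a.x - b.x).map((r) => r.text).join(' ')
  );
}

const shuffled = [
  { text: 'second line', x: 72, y: 680 },
  { text: 'world', x: 110, y: 700 },
  { text: 'Hello', x: 72, y: 700 },
];
const orderedLines = sortIntoReadingOrder(shuffled);
console.log(orderedLines); // [ 'Hello world', 'second line' ]
```

Multi-column layouts, rotated text, and right-to-left scripts all break this simple rule.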
In short, using an ElementReader to extract text data from a PDF document is not guaranteed to return data in reading order. The most straightforward way to extract words and text from text runs is the pdftron.PDF.TextExtractor class, as shown in the TextExtract sample project (TextExtract Sample).
TextExtractor will assemble words, lines, and paragraphs, remove duplicate strings, reconstruct text reading order, etc. Using TextExtractor, you can also obtain bounding boxes for each word, line, or paragraph (along with style information such as font, color, etc.). This information can be used to search for corresponding text elements using ElementReader.
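One way bounding boxes can be put to work is to pick out the words that fall inside a region of the page. The sketch below runs on plain objects rather than SDK types. The word shape `{ text, bbox: { x1, y1, x2, y2 } }` and the helpers `rectsIntersect` and `wordsInRegion` are hypothetical, and every rectangle is assumed to be normalized (x1 <= x2, y1 <= y2).

```javascript
// Hypothetical word shape: { text, bbox: { x1, y1, x2, y2 } }.
// Returns true when two normalized rectangles overlap.
function rectsIntersect(a, b) {
  return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
}

// Returns the text of every word whose box overlaps the region.
function wordsInRegion(words, region) {
  return words
    .filter((w) => rectsIntersect(w.bbox, region))
    .map((w) => w.text);
}

const sampleWords = [
  { text: 'Invoice', bbox: { x1: 72, y1: 700, x2: 120, y2: 712 } },
  { text: 'Total', bbox: { x1: 72, y1: 100, x2: 100, y2: 112 } },
];
// Top band of a 612 x 792 point page.
const headerRegion = { x1: 0, y1: 650, x2: 612, y2: 792 };
const headerWords = wordsInRegion(sampleWords, headerRegion);
console.log(headerWords); // [ 'Invoice' ]
```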
PDF documents do not contain high-level structures, so finding this data requires the help of AI deep learning methods.
Utilizing Apryse.AI, we can extract tables, text, and reading order from existing PDF documents in the form of various outputs. It can also identify articles and forms, where the results are shown as annotations on the PDF.
The production version of Apryse.AI is a Docker installation that can be deployed on-premises. The REST API endpoints below are for trialing the software for demo purposes only.
Please visit Apryse.AI to learn more about using artificial intelligence for document understanding.
Or read our related blog on PDF article extraction.
The REST API demo is a POST request to https://ai-serve.pdftron.com/recog/predict. It returns zipped data as its response.
Here are example code snippets for uploading a PDF to the online demo using the API endpoint:
// XHR example
const originalFile = new File([fileData], 'mypdf.pdf'); // the PDF to upload
const xhttp = new XMLHttpRequest();
const endpoint = 'https://ai-serve.pdftron.com/recog/predict';
xhttp.open('POST', endpoint, true);
xhttp.setRequestHeader('File-Name', originalName || 'mypdf.pdf');
xhttp.setRequestHeader('Output-JSON', 'true');
// Register the handler before sending the request.
xhttp.onreadystatechange = () => this.handleResp(xhttp, originalFile, 'local'); // handle the zipped response data
xhttp.send(originalFile);

// Fetch example
fetch('https://ai-serve.pdftron.com/recog/predict', {
  method: 'POST',
  body: originalFile,
  headers: {
    'File-Name': originalName || 'mypdf.pdf',
    'Output-JSON': 'true'
  }
})
  .then((resp) => resp.blob())
  .then((blob) => this.handleResp(blob)); // handle the zipped response data
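The snippets above leave the body of handleResp unspecified. As one possible first step, a handler could confirm that the payload really is a ZIP archive before trying to unpack it. This is only a sketch: `looksLikeZip` is a hypothetical helper, and it assumes the "zipped data" is a standard ZIP archive, which this guide does not spell out.

```javascript
// Returns true when the bytes start with a ZIP signature:
// "PK" followed by 0x03 0x04, or 0x05 0x06 for an empty archive.
function looksLikeZip(bytes) {
  if (bytes.length < 4) return false;
  const [b0, b1, b2, b3] = bytes;
  return (
    b0 === 0x50 &&
    b1 === 0x4b &&
    ((b2 === 0x03 && b3 === 0x04) || (b2 === 0x05 && b3 === 0x06))
  );
}

// Stand-in data; a real handler would read the first bytes of the response.
const zipHeader = new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]);
const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
console.log(looksLikeZip(zipHeader), looksLikeZip(pdfHeader)); // true false
```

In a browser, the bytes could be obtained from the fetched blob with `new Uint8Array(await blob.arrayBuffer())`.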
The REST API demo is a POST request to https://ai-serve.pdftron.com/segment/predict. It returns zipped data as its response.
Here are example code snippets for uploading a PDF to the online demo using the API endpoint:
// XHR example
const originalFile = new File([fileData], 'mypdf.pdf'); // the PDF to upload
const xhttp = new XMLHttpRequest();
const endpoint = 'https://ai-serve.pdftron.com/segment/predict';
xhttp.open('POST', endpoint, true);
xhttp.setRequestHeader('File-Name', originalName || 'mypdf.pdf');
// Register the handler before sending the request.
xhttp.onreadystatechange = () => this.handleResp(xhttp, originalFile, 'local'); // handle the zipped response data
xhttp.send(originalFile);

// Fetch example
fetch('https://ai-serve.pdftron.com/segment/predict', {
  method: 'POST',
  body: originalFile,
  headers: {
    'File-Name': originalName || 'mypdf.pdf'
  }
})
  .then((resp) => resp.blob())
  .then((blob) => this.handleResp(blob)); // handle the zipped response data
The REST API demo is a POST request to https://ai-serve.pdftron.com/recog/predict. It returns zipped data as its response.
Here are example code snippets for uploading a PDF to the online demo using the API endpoint:
// XHR example
const originalFile = new File([fileData], 'mypdf.pdf'); // the PDF to upload
const xhttp = new XMLHttpRequest();
const endpoint = 'https://ai-serve.pdftron.com/recog/predict';
xhttp.open('POST', endpoint, true);
xhttp.setRequestHeader('File-Name', originalName || 'mypdf.pdf');
xhttp.setRequestHeader('Task', 'form');
xhttp.setRequestHeader('Output-XFDF', 'true');
xhttp.setRequestHeader('Output-JSON', 'true');
// Register the handler before sending the request.
xhttp.onreadystatechange = () => this.handleResp(xhttp, originalFile, 'local'); // handle the zipped response data
xhttp.send(originalFile);

// Fetch example
fetch('https://ai-serve.pdftron.com/recog/predict', {
  method: 'POST',
  body: originalFile,
  headers: {
    'File-Name': originalName || 'mypdf.pdf',
    'Task': 'form',
    'Output-XFDF': 'true', // header values are sent as strings
    'Output-JSON': 'true'
  }
})
  .then((resp) => resp.blob())
  .then((blob) => this.handleResp(blob)); // handle the zipped response data
Below is a list of headers accepted by the API, each followed by its accepted values.
Enable to force OCR on a document: true or false
Language used in the document for object content recognition: any language code, e.g. eng, fra
The page to start table recognition at: the page number to start recognition at, e.g. 1, 2
The page to end table recognition at: the page number to end recognition at, e.g. 1, 2
Set to true to output HTML: true or false
Set to true to output a DOCX file: true or false
Set to true to output an XLSX file: true or false
Set to true to return an XFDF output: true or false
Set to true to return a JSON output: true or false
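Because HTTP header values travel as strings, it can be convenient to build the header object in one place. The sketch below is one way to do that. `buildHeaders` is a hypothetical helper, and the only header names it uses (File-Name, Task, Output-XFDF, Output-JSON) are the ones that appear in the snippets above.

```javascript
// Builds a plain header object, converting every option value to a string
// and skipping options that are null or undefined.
function buildHeaders(fileName, options = {}) {
  const headers = { 'File-Name': fileName || 'mypdf.pdf' };
  for (const [name, value] of Object.entries(options)) {
    if (value === undefined || value === null) continue;
    headers[name] = String(value);
  }
  return headers;
}

const formHeaders = buildHeaders('form.pdf', {
  'Task': 'form',
  'Output-XFDF': true,
  'Output-JSON': true,
});
console.log(formHeaders);
```

The resulting object can be passed as the `headers` option of `fetch`, or applied to an XMLHttpRequest by calling `setRequestHeader` for each entry.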