In this section, we present the basic structure of a PDF document. For details, please refer to the PDF Reference Manual. Below is a listing of a very simple PDF document. It displays a "Hello World" string on a single page.
    %PDF-1.4
    1 0 obj <<
    /Parent 5 0 R
    /Resources 3 0 R
    /Contents 2 0 R
    >>
    endobj
    2 0 obj
    <<
    /Length 51
    >>
    stream
    BT
    /F1 24 Tf
    1 0 0 1 260 330 Tm
    (Hello World)Tj
    ET
    endstream
    endobj
    3 0 obj
    <<
    /ProcSet [/PDF/Text]
    /Font <</F1 4 0 R >>
    >>
    endobj
    4 0 obj <<
    /Type /Font
    /Subtype /Type1
    /Name /F1
    /BaseFont/Helvetica
    >>
    endobj
    5 0 obj
    <<
    /Type /Pages
    /Kids [ 1 0 R ]
    /Count 1
    /MediaBox [0 0 612 714]
    >>
    endobj
    6 0 obj
    <<
    /Type /Catalog
    /Pages 5 0 R
    >>
    endobj
    xref
    0 7
    0000000000 65535 f
    0000000009 00000 n
    0000000103 00000 n
    0000000204 00000 n
    0000000275 00000 n
    0000000361 00000 n
    0000000452 00000 n
    trailer
    <<
    /Size 7
    /Root 6 0 R
    >>
    startxref
    532
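A PDF document consists of four sections: a one-line header that identifies the PDF version (%PDF-1.4 in the sample), a body that holds the objects making up the document, a cross-reference table (xref) that records the byte offset of each indirect object, and a trailer that gives the location of the cross-reference table and of the root object.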
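To make the four-part layout concrete, here is a minimal sketch in plain Python that assembles a file in that order. It does not use the Apryse SDK. The helper name `build_pdf` and the two placeholder objects are assumptions made for illustration, and the result is only a structural skeleton with an empty page tree, not a reproduction of the sample listing.

```python
def build_pdf(objects):
    """Assemble header, body, cross-reference table and trailer.

    `objects` is a list of byte strings; item i becomes object "(i + 1) 0".
    The last object is assumed to be the document catalog.
    """
    out = bytearray(b"%PDF-1.4\n")                      # 1. header
    offsets = []
    for num, body in enumerate(objects, start=1):       # 2. body
        offsets.append(len(out))                        # byte offset of "n 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)                                 # 3. cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                     # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off                # each entry is 20 bytes long
    size, root = len(objects) + 1, len(objects)         # 4. trailer
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf = build_pdf([
    b"<< /Type /Pages /Kids [] /Count 0 >>",
    b"<< /Type /Catalog /Pages 1 0 R >>",
])
```

The first object lands at byte offset 9 because the header line is nine bytes long, which is the same value as the first in-use entry in the sample's cross-reference table.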
Note that objects refer to each other using a notation like "5 0 R". The "R" stands for reference; the two preceding numbers are the object number and the generation (revision) number of the object being referenced.
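The reference notation is simple enough to pick apart with a regular expression. The following is a naive sketch in plain Python, not an Apryse SDK call: the names `REF` and `find_references` are hypothetical, and the pattern makes no attempt to skip text inside strings or streams.

```python
import re

# An indirect reference is "<object number> <generation number> R".
REF = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def find_references(text):
    """Return (object_number, generation_number) pairs found in `text`."""
    return [(int(num), int(gen)) for num, gen in REF.findall(text)]


page = "<< /Parent 5 0 R /Resources 3 0 R /Contents 2 0 R >>"
print(find_references(page))  # [(5, 0), (3, 0), (2, 0)]
```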
Therefore, the file body consists of a collection of objects, each object potentially referencing any or all objects, including itself. This set of nodes and directed references constitutes a graph. We could represent the "Hello World" sample file using the following abstract graph representation.
Each object in the graph is represented with an ellipse and each object cross reference is represented with an arrow.
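The same graph can also be written down as an adjacency list. The sketch below is plain Python with the object dictionaries transcribed by hand from the "Hello World" listing (the bytes of the content stream are left out). `SAMPLE`, `graph` and `reachable` are names chosen for illustration.

```python
import re

# Object number -> dictionary text, transcribed from the sample listing.
SAMPLE = {
    1: "<< /Parent 5 0 R /Resources 3 0 R /Contents 2 0 R >>",
    2: "<< /Length 51 >>",
    3: "<< /ProcSet [/PDF /Text] /Font << /F1 4 0 R >> >>",
    4: "<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica >>",
    5: "<< /Type /Pages /Kids [ 1 0 R ] /Count 1 /MediaBox [0 0 612 714] >>",
    6: "<< /Type /Catalog /Pages 5 0 R >>",
}

REF = re.compile(r"(\d+)\s+\d+\s+R\b")

# One directed edge per indirect reference.
graph = {num: [int(t) for t in REF.findall(body)] for num, body in SAMPLE.items()}


def reachable(start):
    """Depth-first walk that tolerates cycles such as Page <-> Pages."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


print(graph[1])              # [5, 3, 2]
print(sorted(reachable(6)))  # [1, 2, 3, 4, 5, 6]
```

The `seen` set matters because the page (object 1) points to its parent (object 5), which points back to the page through its "Kids" array, so the graph contains a cycle.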
Each PDF document must have a "Root" node. It must reference a "Catalog" node which must reference a "Pages" node. The "Pages" node further branches and points to each of the pages in the document. Note that a "Pages" node points to a group of pages whereas a "Page" node represents a single page.
The "Page" node references the page's "Contents" and the page's "Resources". The resource dictionary, in turn, references the "Fonts" used on the page. The resource dictionary can reference many other resource types, including Color Spaces, Patterns, Shadings, Images, Forms, and more. The page contents stream contains markup operators used to draw the page.
Every PDF document is built on this basic object structure.
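The difference between a "Pages" node and a "Page" node shows up clearly when the tree is walked. The sketch below uses a hypothetical in-memory model in plain Python, where an integer stands in for an indirect reference to that object number. The three-page tree is invented for illustration and is larger than the one-page sample.

```python
# Hypothetical document: object number -> dictionary.
# An int value such as 2 stands in for the reference "2 0 R".
objects = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 4], "Count": 3},
    3: {"Type": "Pages", "Kids": [5, 6], "Count": 2, "Parent": 2},
    4: {"Type": "Page", "Parent": 2},
    5: {"Type": "Page", "Parent": 3},
    6: {"Type": "Page", "Parent": 3},
}


def iter_pages(num):
    """Yield the object numbers of single pages in document order."""
    node = objects[num]
    if node["Type"] == "Page":
        yield num                      # a leaf: one page
    else:
        for kid in node["Kids"]:       # a group: recurse into each child
            yield from iter_pages(kid)


page_numbers = list(iter_pages(objects[1]["Pages"]))
print(page_numbers)  # [5, 6, 4]
```

The number of leaves found this way agrees with the "Count" entry of the top-level "Pages" node.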
Before going into the details of the Apryse SDK SDF/COS object model, we should review the basics. For a detailed description of the SDF syntax and semantics, please refer to Chapter 3 (Syntax) of the PDF Reference Manual.
In PDF there are five atomic objects:
| Object | Description | Example |
|---|---|---|
| Number | PDF provides two types of numeric object: integer and real. | 1.03 612 |
| Bool | Boolean objects are identified by the keywords true and false. | true false |
| Name | A name object is an atomic symbol uniquely defined by a sequence of characters. Names always begin with "/" and can contain letters and numbers and a few special characters. | /Font /Info /PDFNet |
| String | Strings are sequences of bytes enclosed in "(" and ")". | (Hello World!) |
| Null | The null object has a type and value that are unequal to those of any other object. Usually it refers to a missing object. | null |
Also, there are three compound objects:
| Object | Description | Example |
|---|---|---|
| Array | An array object is a one-dimensional collection of objects arranged sequentially. Unlike arrays in typical computer languages, PDF arrays may be heterogeneous; that is, an array's elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. | [ true /Name ] |
| Dictionary | A dictionary object is a map containing pairs of objects, known as the dictionary's entries. The first element of each entry is the key and the second element is the value. The key must be a name. The value can be any kind of object, including another dictionary. | <</key /value >> |
| Stream | A stream is essentially a dictionary followed by a sequence of bytes. PDF streams are always indirect objects and thus always may be shared. | 1 0 obj << /Length 144 >> stream ........... endstream endobj |
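As a quick check on how the stream dictionary relates to the bytes that follow it, the sketch below rebuilds the content stream from the "Hello World" listing in plain Python. It assumes each line ends with a single line-feed byte, which is an assumption about the original file. `make_stream` is a hypothetical helper, not an Apryse SDK call.

```python
# Page content operators from the sample, one per line, LF line endings assumed.
content = (
    b"BT\n"
    b"/F1 24 Tf\n"
    b"1 0 0 1 260 330 Tm\n"
    b"(Hello World)Tj\n"
    b"ET\n"
)


def make_stream(obj_num, data):
    """Wrap raw bytes as an indirect stream object with a matching /Length."""
    head = b"%d 0 obj\n<< /Length %d >>\nstream\n" % (obj_num, len(data))
    # The line break before "endstream" is not counted in /Length.
    return head + data + b"\nendstream\nendobj\n"


stream_obj = make_stream(2, content)
print(len(content))  # 51
```

Under that line-ending assumption the byte count matches the /Length 51 entry in the sample's object 2.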
Objects can be arbitrarily nested using the dictionary and array compounding operations.
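One way to see the nesting is to map the object types onto Python values and serialize them recursively. This is a simplified sketch in plain Python, not the Apryse SDK object model: the `Name` class and the `to_pdf` function are hypothetical, strings are not escaped, and streams and indirect references are left out.

```python
class Name(str):
    """A PDF name such as /Font; a plain str is treated as a PDF string."""


def to_pdf(value):
    """Serialize a Python value to PDF object syntax (simplified)."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)               # fine for simple values; no exponent handling
    if isinstance(value, Name):
        return "/" + value
    if isinstance(value, str):
        return "(" + value + ")"         # no escaping of parentheses or backslashes
    if isinstance(value, list):
        return "[ " + " ".join(to_pdf(v) for v in value) + " ]"
    if isinstance(value, dict):          # keys are always written as names
        body = " ".join("/" + key + " " + to_pdf(val) for key, val in value.items())
        return "<< " + body + " >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


pages = {"Type": Name("Pages"), "Count": 1, "MediaBox": [0, 0, 612, 714]}
print(to_pdf(pages))  # << /Type /Pages /Count 1 /MediaBox [ 0 0 612 714 ] >>
```

Because `to_pdf` calls itself for array elements and dictionary values, any depth of nesting is handled by the same few lines.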
Except for the stream, all of the objects in the above tables are "direct objects" because they are not surrounded by the "obj" and "endobj" keywords. The body of the PDF document is actually made up of a sequence of "indirect objects". An indirect object is created by taking a single direct object (whether it be atomic or compound) and enclosing it with the "n m obj" and "endobj" keywords (where n and m are non-negative integers: the object number and the generation number).
Note that, because indirect objects are numbered and can be referenced by other objects, they can be shared — that is, referenced by more than one other object. However, since direct objects are not numbered, they can't be shared.
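At the text level, the difference between the two kinds of object is small. The sketch below is plain Python with hypothetical helper names (`make_indirect`, `make_reference`). It wraps one direct object to make it indirect, then shows that sharing means repeating a short reference, whereas reusing a direct object means writing it out again.

```python
def make_indirect(num, gen, direct):
    """Wrap the text of one direct object in "n m obj" ... "endobj"."""
    return f"{num} {gen} obj\n{direct}\nendobj\n"


def make_reference(num, gen):
    """Text that other objects use to point at an indirect object."""
    return f"{num} {gen} R"


font_dict = "<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica >>"
font_obj = make_indirect(4, 0, font_dict)
font_ref = make_reference(4, 0)

# Shared: both dictionaries point at the single numbered object.
fonts_a = f"<< /F1 {font_ref} >>"
fonts_b = f"<< /F1 {font_ref} >>"

# Not shared: a direct array has no number, so each use is a separate copy.
proc_set = "[/PDF /Text]"
resources_a = f"<< /ProcSet {proc_set} /Font {fonts_a} >>"
resources_b = f"<< /ProcSet {proc_set} /Font {fonts_b} >>"

print(font_ref)   # 4 0 R
print(fonts_a)    # << /F1 4 0 R >>
```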
In the above PDF example, the object "3 0 obj" is an indirect object because the "obj" and "endobj" keywords wrap a dictionary object containing two entries.
    3 0 obj
    <<
    /ProcSet [/PDF /Text]
    /Font << /F1 4 0 R >>
    >>
    endobj
The "ProcSet" key is mapped to an array, which is a direct object containing atomic direct objects. In a similar way, the "Font" key is mapped to a direct dictionary. On the other hand, "F1" in the inner dictionary is mapped to an indirect object with object number 4 and generation number 0. Because "F1" points to an indirect object, the same font resource can be shared across many different pages.
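The effect of sharing can be sketched with a small in-memory model in plain Python. Everything here is hypothetical and independent of the Apryse SDK: a tuple `("ref", n, g)` stands for the reference "n g R", `objects` plays the role of the file body, and the second page is invented to show two pages using one font.

```python
# Indirect objects, keyed by (object number, generation number).
objects = {
    (4, 0): {"Type": "Font", "Subtype": "Type1", "BaseFont": "Helvetica"},
}


def resolve(value):
    """Follow an indirect reference; return direct objects unchanged."""
    if isinstance(value, tuple) and value[:1] == ("ref",):
        return objects[value[1:]]
    return value


# Resource dictionaries for two pages; "ProcSet" and "Font" are direct,
# while "F1" is an indirect reference to object 4 0.
page1_resources = {"ProcSet": ["PDF", "Text"], "Font": {"F1": ("ref", 4, 0)}}
page2_resources = {"ProcSet": ["PDF", "Text"], "Font": {"F1": ("ref", 4, 0)}}

font1 = resolve(page1_resources["Font"]["F1"])
font2 = resolve(page2_resources["Font"]["F1"])

print(font1 is font2)                                            # True
print(page1_resources["ProcSet"] is page2_resources["ProcSet"])  # False
```

Both pages resolve "F1" to the very same font object, while each page carries its own copy of the direct "ProcSet" array.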