TextExtractor.GetAsXML Method (TextExtractor.XMLOutputFlags)
Gets the text content in the form of an XML string.
Namespace:
pdftron.PDF
Assembly:
pdftron (in pdftron.dll) Version: 255.255.255.255
Syntax
public string GetAsXML(
TextExtractorXMLOutputFlags flags
)
Public Function GetAsXML (
flags As TextExtractorXMLOutputFlags
) As String
public:
virtual String^ GetAsXML(
[InAttribute] TextExtractorXMLOutputFlags flags
) sealed
function GetAsXML(flags);
Parameters
- flags
- Type: pdftron.PDF.TextExtractor.XMLOutputFlags
Flags controlling the XML output. For more
information, see TextExtractor::XMLOutputFlags.
Return Value
Type: String
The string containing the XML output.
Remarks
XML output will be encoded in UTF-8 and will have the following
structure:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
        <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
        <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
        <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
        ...
      </Line>
    </Para>
  </Flow>
</Page>
The above XML output was generated by passing the following union of
flags in the call to GetAsXML():
(TextExtractor::e_words_as_elements | TextExtractor::e_output_bbox | TextExtractor::e_output_style_info)
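Once the word-level XML string has been returned, it can be processed with any standard XML parser. The following is a minimal Python sketch, not part of the PDFNet API, that reads each Word element and its 'box' attribute from a copy of the sample output. The helper names parse_box and words_with_boxes are hypothetical, and the sketch only splits the four 'box' numbers into floats; how they map to coordinates and dimensions is not stated on this page, so no interpretation is assumed.

```python
import xml.etree.ElementTree as ET

# Copy of the word-level sample output, with the "..." elision left out
# so that the string is well-formed XML.
SAMPLE_XML = """\
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
<Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
<Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
<Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
</Line>
</Para>
</Flow>
</Page>
"""


def parse_box(value):
    """Split a 'box' attribute such as '72, 708.075, 30.7614, 10.02' into floats."""
    return tuple(float(part) for part in value.split(","))


def words_with_boxes(xml_text):
    """Return (text, box) pairs for every <Word> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(word.text, parse_box(word.get("box"))) for word in root.iter("Word")]


if __name__ == "__main__":
    for text, box in words_with_boxes(SAMPLE_XML):
        print(text, box)
```

The same approach should apply to the string returned by GetAsXML() in any language binding, since only the XML text is needed.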
If no output flags are set in 'flags', the default XML output
looks as follows:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
      <Line>levels. Using the PDFNet PDF library, ...</Line>
      ...
    </Para>
  </Flow>
</Page>
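The default output carries the text directly inside Line elements, so paragraph text can be rebuilt by joining the lines of each Para. The sketch below is a hedged Python illustration working on a copy of the sample output; the helper name paragraphs is hypothetical, and joining lines with a single space is an assumption about how the text should be reflowed, not documented behavior.

```python
import xml.etree.ElementTree as ET

# Copy of the default sample output, with the standalone "..." elision
# between elements left out so that only real elements remain.
DEFAULT_XML = """\
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
<Line>levels. Using the PDFNet PDF library, ...</Line>
</Para>
</Flow>
</Page>
"""


def paragraphs(xml_text):
    """Return one string per <Para>, joining its <Line> texts with spaces."""
    root = ET.fromstring(xml_text)
    result = []
    for para in root.iter("Para"):
        # Line text may be None for an empty element, hence the fallback.
        lines = [(line.text or "").strip() for line in para.iter("Line")]
        result.append(" ".join(lines))
    return result


if __name__ == "__main__":
    for index, text in enumerate(paragraphs(DEFAULT_XML), start=1):
        print(index, text)
```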
See Also