How to extract XML from an XFA PDF form?

Suppose the XML data you want to extract is held in the XFA array within the AcroForm dictionary. The array alternates packet names (strings) with the streams that hold each packet's XML, so to extract all of the XFA data you need to iterate through the array and read every stream it contains.
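
As a rough sketch of that iteration, the snippet below walks the XFA array in name/stream pairs and prints the size of each decoded packet. It assumes PDFNet has already been initialized elsewhere, that `FilterReader.Read` returns the number of bytes read (0 at end of stream), and that `XfaDump`, `ReadAll` and `PrintPackets` are hypothetical names chosen for illustration. It also covers the case where the XFA entry is a single stream rather than an array.

```csharp
using System;
using System.IO;
using pdftron.PDF;
using pdftron.SDF;
using pdftron.Filters;

static class XfaDump
{
    // Reads a decoded stream to the end instead of relying on a fixed-size buffer.
    // Assumption: Read() returns the byte count, and 0 once the stream is exhausted.
    static byte[] ReadAll(Obj stream)
    {
        FilterReader reader = new FilterReader(stream.GetDecodedStream());
        byte[] chunk = new byte[4096];
        using (MemoryStream all = new MemoryStream())
        {
            int n;
            while ((n = reader.Read(chunk)) > 0)
            {
                all.Write(chunk, 0, n);
            }
            return all.ToArray();
        }
    }

    // Assumes PDFNet was initialized before this is called.
    public static void PrintPackets(string filename)
    {
        using (PDFDoc doc = new PDFDoc(filename))
        {
            Obj acroForm = doc.GetAcroForm();
            Obj xfa = acroForm == null ? null : acroForm.FindObj("XFA");
            if (xfa == null)
            {
                Console.WriteLine("No XFA entry found.");
                return;
            }

            // The XFA entry may be one stream holding the whole XML document.
            if (xfa.IsStream())
            {
                Console.WriteLine("single stream: " + ReadAll(xfa).Length + " bytes");
                return;
            }

            // Otherwise it is an array of (name, stream) pairs.
            for (int i = 0; i + 1 < xfa.Size(); i += 2)
            {
                string name = xfa.GetAt(i).GetAsPDFText();
                Obj packet = xfa.GetAt(i + 1);
                if (!packet.IsStream())
                {
                    continue; // skip anything that is not a stream
                }
                byte[] xml = ReadAll(packet);
                Console.WriteLine("index " + (i + 1) + " (" + name + "): " + xml.Length + " bytes");
            }
        }
    }
}
```

Printing the packet names this way is one option for finding out which array index holds the packet you care about before hard-coding it.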

The following example shows how to extract the XML data at one specific index in the array, modify it, and write it back to the document:

// Example code for extracting an xml string from the XFA form,
// and putting it back after an update.

PDFDoc doc = new PDFDoc(filename); 

//get the acroform dictionary
Obj acro_form = doc.GetAcroForm(); 

// This PDF document contains XFA forms... 
Obj obj = acro_form.FindObj("XFA"); 

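//We will collect the decoded XML bytes here. A fixed 4000-byte buffer would
//truncate larger packets and leave trailing zero bytes after smaller ones.
byte[] chunk = new byte[4096];
System.IO.MemoryStream xml_bytes = new System.IO.MemoryStream();

pdftron.Filters.Filter filter; 
pdftron.Filters.FilterReader fr; 

//The XFA entry in the PDF is an Array, so in this case,
//we want to read the xml string stored at index 5 of the Array (its sixth element)
filter = obj.GetAt(5).GetDecodedStream(); 
fr = new pdftron.Filters.FilterReader(filter); 
//Read until the stream is exhausted (Read is assumed to return the byte count)
int n;
while ((n = fr.Read(chunk)) > 0) {
    xml_bytes.Write(chunk, 0, n);
}
byte[] buff = xml_bytes.ToArray();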
//at this point, the xml string should be stored inside buff,
//and you can make whatever modifications you want

//Modify XML String HERE

//We create an indirect stream object, which will contain our 
//  newly modified XML string
Obj new_xml_stm = doc.CreateIndirectStream(buff);

//The Swap method exchanges the two object numbers, so every indirect reference
//  to the old stream now points to our newly created stream.
doc.GetSDFDoc().Swap(
    new_xml_stm.GetObjNum(), 
    acro_form.Get("XFA").Value().GetAt(5).GetObjNum()
); 

doc.Save(output_filename, SDFDoc.SaveOptions.e_linearized); 
doc.Close();
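
The "Modify XML String HERE" step in the sample above is left open. One possible way to fill it in, using only the .NET standard library, is sketched below. It assumes the extracted packet is a well-formed XML element on its own (for example the datasets packet; the preamble and postamble packets of an XFA array are not) and that it is UTF-8 encoded. `XfaXmlEdit`, `SetFieldValue` and the field name `FirstName` are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class XfaXmlEdit
{
    // Returns a copy of the packet with the text of the first element whose
    // local name equals fieldName replaced by newValue.
    public static byte[] SetFieldValue(byte[] xmlBytes, string fieldName, string newValue)
    {
        // Ignore trailing zero bytes left over from a fixed-size read buffer.
        int length = xmlBytes.Length;
        while (length > 0 && xmlBytes[length - 1] == 0)
        {
            length--;
        }

        XDocument doc;
        using (MemoryStream input = new MemoryStream(xmlBytes, 0, length))
        {
            // Loading from a stream lets the parser handle a BOM or XML declaration.
            doc = XDocument.Load(input);
        }

        // Match on the local name so namespace prefixes do not matter.
        XElement target = doc.Descendants()
                             .FirstOrDefault(e => e.Name.LocalName == fieldName);
        if (target == null)
        {
            throw new ArgumentException("No element named '" + fieldName + "' in the packet.");
        }
        target.Value = newValue;

        // ToString() omits the XML declaration; GetBytes() does not emit a BOM.
        return Encoding.UTF8.GetBytes(doc.ToString(SaveOptions.DisableFormatting));
    }
}
```

In the sample, the call would replace the placeholder comment, for example `buff = XfaXmlEdit.SetFieldValue(buff, "FirstName", "Bob");`, so that the stream created afterwards contains the updated XML.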
