How to extract XML from an XFA PDF form?

Suppose the XML data you want to extract is held inside the XFA array, within the AcroForm dictionary. To extract all of the XFA data, you need to iterate through this array and read each of the stream objects it contains.
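
As a rough illustration of that iteration, here is a minimal sketch. It assumes the PDFNet .NET API (`Obj.Size`, `Obj.GetAt`, `Obj.IsStream`, `Obj.IsString`, `Obj.GetAsPDFText`, `FilterReader.Read`) and that each packet is UTF-8 text. The class and method names `XfaDump`, `ReadStream` and `ReadAllPackets` are hypothetical helpers, not part of the SDK. In the PDF format the XFA entry is either a single stream or an array that alternates packet names with streams, so the sketch handles both cases.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using pdftron.Filters;
using pdftron.PDF;
using pdftron.SDF;

// Hypothetical helper class, for illustration only.
public static class XfaDump
{
    // Reads one decoded stream in chunks and returns it as text (UTF-8 assumed).
    static string ReadStream(Obj stream)
    {
        FilterReader reader = new FilterReader(stream.GetDecodedStream());
        using (MemoryStream ms = new MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = reader.Read(chunk)) > 0)
            {
                ms.Write(chunk, 0, n);
            }
            return Encoding.UTF8.GetString(ms.ToArray());
        }
    }

    // Returns (packet name, XML text) pairs in the order they appear in the XFA entry.
    public static List<KeyValuePair<string, string>> ReadAllPackets(PDFDoc doc)
    {
        var packets = new List<KeyValuePair<string, string>>();

        Obj acroForm = doc.GetAcroForm();
        Obj xfa = acroForm == null ? null : acroForm.FindObj("XFA");
        if (xfa == null)
        {
            return packets; // no XFA data in this document
        }

        if (xfa.IsStream())
        {
            // The whole XFA resource is stored as a single, unnamed stream.
            packets.Add(new KeyValuePair<string, string>("", ReadStream(xfa)));
            return packets;
        }

        string name = "";
        for (int i = 0; i < xfa.Size(); i++)
        {
            Obj item = xfa.GetAt(i);
            if (item.IsString())
            {
                name = item.GetAsPDFText(); // packet name, e.g. "datasets"
            }
            else if (item.IsStream())
            {
                packets.Add(new KeyValuePair<string, string>(name, ReadStream(item)));
                name = "";
            }
        }
        return packets;
    }
}
```

Reading in chunks avoids having to guess the size of a packet in advance.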

The following example shows how to extract the XML data at one specific index in the Array:

C#

// Example code for extracting an XML string from the XFA form
// and putting it back after an update.

PDFDoc doc = new PDFDoc(filename);

// Get the AcroForm dictionary.
Obj acro_form = doc.GetAcroForm();

// This PDF document contains XFA forms...
Obj obj = acro_form.FindObj("XFA");

// We will store the XML string in this byte array.
// The buffer is fixed at 4000 bytes, so a larger stream will be cut off;
// increase the size or read in a loop if the XML may be longer.
byte[] buff = new byte[4000];

// The XFA entry in this PDF is an Array, so in this case
// we want to read the XML string stored at index 5 of the Array.
pdftron.Filters.Filter filter = obj.GetAt(5).GetDecodedStream();
pdftron.Filters.FilterReader fr = new pdftron.Filters.FilterReader(filter);
int bytes_read = fr.Read(buff);
// At this point the first bytes_read bytes of buff hold the XML string,
// and you can make whatever modifications you want.

// Modify XML string HERE

// Copy only the bytes in use, so that unused trailing zero bytes
// of the buffer do not end up in the new stream.
byte[] data = new byte[bytes_read];
System.Array.Copy(buff, data, bytes_read);

// Create an indirect stream object that contains the
// newly modified XML string.
Obj new_xfa_stm = doc.CreateIndirectStream(data);

// The Swap method switches all indirect references to the old stream
// so that they point to the newly created stream.
doc.GetSDFDoc().Swap(
    new_xfa_stm.GetObjNum(),
    obj.GetAt(5).GetObjNum()
);

doc.Save(output_filename, SDFDoc.SaveOptions.e_linearized);
doc.Close();
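
The "Modify XML String HERE" step is left open in the example above. One possible way to fill it in is sketched below, using only the .NET base class library (`System.Xml.Linq`). It assumes the extracted packet is UTF-8 and is well-formed XML on its own, which may not hold for every packet in an XFA array. The class `XfaXmlEdit`, the method `SetFieldValue` and the field name used in the usage line are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class XfaXmlEdit
{
    // Decodes the first `count` bytes of `buff` as UTF-8 XML, sets the text of the
    // first element whose local name equals `fieldName`, and returns the updated
    // XML as UTF-8 bytes (without a byte order mark or XML declaration).
    public static byte[] SetFieldValue(byte[] buff, int count, string fieldName, string newValue)
    {
        // Drop a leading byte order mark, which XDocument.Parse would reject.
        string xml = Encoding.UTF8.GetString(buff, 0, count).TrimStart('\uFEFF');

        XDocument xdoc = XDocument.Parse(xml);

        XElement field = xdoc.Descendants()
                             .FirstOrDefault(e => e.Name.LocalName == fieldName);
        if (field == null)
        {
            throw new ArgumentException("No element named '" + fieldName + "' was found.");
        }
        field.Value = newValue;

        string updated = xdoc.ToString(SaveOptions.DisableFormatting);
        return new UTF8Encoding(false).GetBytes(updated);
    }
}
```

With the buffer from the example above, a call could look like `byte[] data = XfaXmlEdit.SetFieldValue(buff, count, "FirstName", "Ada");`, where `count` is the number of bytes returned by `fr.Read(buff)` and `FirstName` is a made-up field name. The returned array can then be passed to `doc.CreateIndirectStream`.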
