How to search and replace text in a very large XML file

Many ‘solutions’ exist that will parse a text file, and other utilities have functions designed specifically for parsing XML. After trying several applications, only one proved capable of efficiently searching and replacing text within (in this case) an XML file over 300 MB in size. The whole process took less than one minute!

While the following utilities successfully completed the search and replace on a small test XML file of only a few records, they all failed on the 300 MB file, either crashing or never completing the task. Worse, each one ran for several minutes before failing.

  • XmlSearchReplaceExecutable 1.6 – a GUI utility that crashed after a few minutes.
  • Windows PowerShell – a native utility built into some versions of Windows 7 and downloadable for other Windows versions. The following script mimics the behavior of the popular Linux sed command. It never completed.
get-content somefile.txt | %{$_ -replace "expression","replace"}

The only solution that worked for this example was SED for Windows. Sed (stream editor) isn’t really a true text editor or text processor. Instead, it is used to filter text: it takes text input, performs some operation (or set of operations) on it, and outputs the modified text. Sed is typically used for extracting part of a file using pattern matching or for substituting multiple occurrences of a string within a file. It is part of a larger package called GnuWin.
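
To make the "filter" idea concrete, here is a minimal Python sketch of the same streaming approach. It is not how sed itself is implemented; the function name sed_like_filter and the sample input are made up for illustration. The point is that only one line is held in memory at a time, which is why this style of processing copes with very large files.

```python
import io
from typing import Iterable, Iterator


def sed_like_filter(lines: Iterable[str], old: str, new: str) -> Iterator[str]:
    """Yield each input line with the first occurrence of `old` replaced by `new`."""
    for line in lines:
        # Only the current line is in memory, so the input size does not matter.
        yield line.replace(old, new, 1)


# Demonstration on an in-memory stream; a real run would pass an open file object.
source = io.StringIO(
    "<genre>Computer</genre>\n"
    "<genre>Fantasy</genre>\n"
)
result = "".join(
    sed_like_filter(source, "<genre>Computer</genre>", "<genre>Guides</genre>")
)
print(result, end="")
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly and the output written line by line, without ever loading the whole document.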

For example, here is an excerpt from the Microsoft sample XML file, books.xml:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.</description>
   </book>
</catalog>

To use SED for Windows, the following two commands will work. Both assume sed.exe is located in c:\cygwin\bin. The first command changes the genre Computer to Guides, leaves the original file intact, and creates a new file called New.xml. The second command does an in-place search and replace of Computer with Guides, overwriting the original books.xml.

"c:\cygwin\bin\sed.exe" "s/<genre>Computer<\/genre>/<genre>Guides<\/genre>/" books.xml > New.xml
"c:\cygwin\bin\sed.exe" -i "s/<genre>Computer<\/genre>/<genre>Guides<\/genre>/" books.xml
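
If sed is not available, the two commands can be approximated in Python. This is only a rough sketch under a few assumptions: the file is UTF-8, the function names replace_to_new_file and replace_in_place are hypothetical, and the in-place variant uses the common "write a temporary file, then swap it over the original" technique rather than anything taken from sed's source.

```python
import os
import re
import tempfile

# Same pattern as the sed expression; "/" needs no escaping in Python.
PATTERN = re.compile(r"<genre>Computer</genre>")
REPLACEMENT = "<genre>Guides</genre>"


def replace_to_new_file(src_path, dst_path):
    """Roughly `sed "s/.../.../" src > dst`: the source file is left untouched."""
    # newline="" keeps the original line endings (e.g. CRLF) unchanged.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            # count=1 mirrors sed's "s" command without the "g" flag.
            dst.write(PATTERN.sub(REPLACEMENT, line, count=1))


def replace_in_place(path):
    """Roughly `sed -i`: write a temporary file, then swap it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        replace_to_new_file(path, tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling replace_to_new_file("books.xml", "New.xml") corresponds to the first command and replace_in_place("books.xml") to the second. Both read one line at a time, so memory use stays flat regardless of file size.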

Resource(s)
http://www.daniweb.com/software-development/shell-scripting/threads/202018/modifying-xml-files-using-sed