bulk - How can I perform a large number of different find/replaces?

Friday, 16 February 2018

I have several times had a text document which I need to apply several hundred find/replaces on. These find/replaces do not follow a pattern which regex can be reasonably applied to, and need to be applied in order. Previously I've resorted to doing them by hand after much searching, but is there a better way?

Answer

Please correct me if I've misunderstood your question, but from your description I take it that you have a single (possibly very large) ASCII .txt document, and that by "in order" you mean you'd like to do the first search/replace on the entire document, then the second search/replace on the entire document, and so on.

Perhaps the easiest solution would be to create a file (call it sedscript) containing a sed script, one line per change. Here's an example. The g at the end of each line means "global", i.e., replace all occurrences, not just the first occurrence on any given line.

s/foo/bar/g
s/hello/world/g
s/yellow/green/g
# ...and so on, one line per change

You could then run this as follows:

sed -f sedscript infile.txt > outfile.txt

If you're satisfied with the output, simply mv it back over the top of the input:

mv outfile.txt infile.txt

If you're on a Linux machine, sed comes with it. If you're on Windows, you can get sed (and mv) with either Cygwin or my own Hamilton C shell (including the free version).
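
As a minimal sketch of the whole workflow, the session below builds a tiny script and runs it. The sample text and the two rules are made up for illustration; only the file names follow the ones used above. It also shows why order matters: each rule sees the output of the rules before it.

```shell
# Run this in a scratch directory: it overwrites infile.txt, sedscript
# and outfile.txt in the current directory.

# Hypothetical input document.
printf 'the cat sat\nthe dog ran\n' > infile.txt

# One substitution per line; sed applies them top to bottom.
cat > sedscript <<'EOF'
s/cat/dog/g
s/dog/bird/g
EOF

sed -f sedscript infile.txt > outfile.txt
cat outfile.txt
```

Both lines end up containing "bird", because the second rule also rewrites the "dog" produced by the first. Strictly speaking, sed runs the whole script on each input line in turn instead of making one pass over the document per rule, but for substitutions that stay within a line the result is the same.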

Added:

Since you would also like to consider matches that span line ends, then, yes, one way to do that is to replace all the line ends with a special character or string, do the search/replace operations you intend, then put the line ends back when you're done.

The easiest way to do the line end conversions with sed is in separate pipeline stages. In my example here, I've replaced the \r\n sequences with a #, but the replacement could be any arbitrary string (though it's easier if you can use a single character).

sed 's/\r\n/#/' infile.txt | sed -f sedscript | sed 's/#/\r\n/g' > outfile.txt
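
One caveat, at least with GNU sed on Linux: sed reads its input one line at a time and strips the trailing newline before running the script, so a first stage that looks for a line end in the pattern may never match. A sketch of the same idea using tr for the line end conversion follows. It assumes Unix (\n) line ends and that # does not occur anywhere in the input; the sample text and rule are hypothetical.

```shell
# Run this in a scratch directory: it overwrites infile.txt, sedscript
# and outfile.txt in the current directory.

# Hypothetical input where "hello world" is split across a line end.
printf 'say hello\nworld today\n' > infile.txt

# The rule matches the placeholder (#) where the line end used to be
# and keeps it in the replacement so the line end can be restored.
printf '%s\n' 's/hello#world/goodbye#planet/g' > sedscript

# tr turns every newline into #, sed edits the resulting single long line,
# and the second tr turns every # back into a newline.
tr '\n' '#' < infile.txt | sed -f sedscript | tr '#' '\n' > outfile.txt
cat outfile.txt
```

Because the whole document becomes one line in the middle stage, this suits files that fit comfortably in memory.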

Inside your sedscript file, you'd then search/replace on both variations of each multi-word phrase: one with a space between the words, and one with whatever you've replaced the line end with between them.

If you're able to use just a single character and don't need a multicharacter string to guarantee uniqueness, you can use \(...\) notation to create a tagged regular expression around a [...] list of the characters that might separate the words. Whatever it matches can be inserted into the replace string as \1.
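
A small sketch of such a rule, assuming # is the placeholder for a line end; the phrase and its replacement are made up for illustration.

```shell
# Run this in a scratch directory: it overwrites sedscript and outfile.txt.

# One rule covers both separators: a space, or the # placeholder standing
# in for a line end. \1 re-inserts whichever separator was matched.
printf '%s\n' 's/hello\([ #]\)world/goodbye\1planet/g' > sedscript

# Hypothetical one-line input, as it would look after line ends became #.
printf '%s\n' 'hello world and hello#world' | sed -f sedscript > outfile.txt
cat outfile.txt
```

The separator survives the replacement unchanged, so a phrase that was split across a line end is still split at the same point once the placeholders are converted back.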

Here's a screenshot of how this might work.

[Screenshot: Line breaks with sed]
