Tuesday, 14 August 2018

linux - Find repeated words in a text


One of the most common typos is to repeat the same word twice, as as here. I need an automatic procedure to remove all the repeated words in a text file. This should not be an unusual feature for a modern editor or spell-checker; I remember that MS Word introduced it several years ago. Apparently, the default spell-checker on my OS (Hunspell) can't do this, as it only finds words that are not in the dictionary.


It would be OK to have a solution valid for a specific text editor for Linux (pluma/gedit2 or Sublime Text) and a solution based on a bash script.



Answer



With GNU grep:


echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' |  grep -Eo '(\b.+) \1\b'

Output:



twice twice
as as
here here
123 123



Options:


-E: Interpret (\b.+) \1\b as an extended regular expression.


-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
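
To run the same search on a file rather than on an echoed string, one option is to add -n, which prefixes each match with the number of the line it was found on. The following is only a sketch: the sample text and the temporary file are made up for illustration, and it assumes GNU grep as above.

```shell
# Sample file created only for illustration; substitute your own text file.
sample=$(mktemp)
printf '%s\n' \
  'A clean line.' \
  'This one has has a repeat.' \
  'Another clean line.' \
  'Repeated twice twice, as as here.' > "$sample"

# Same regex as above; -n adds "line number:" in front of every match.
matches=$(grep -nEo '(\b.+) \1\b' "$sample")
echo "$matches"

# Clean up the illustrative file.
rm -f "$sample"
```

For this sample the matches are reported as 2:has has, 4:twice twice and 4:as as, one per output line, which makes it easy to jump to the offending line in an editor.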


Regex:


\b: A zero-width word boundary.


.+: Matches one or more characters of any kind, not only letters, which is why 123 123 is also reported.


\1: The parentheses () mark a capturing group, and \1 is a back-reference that matches the same text the first capturing group matched.
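
The grep command only finds the repeats; the question also asks to remove them. A minimal sketch of that step with GNU sed is shown below. It is not part of the original answer: the function name dedupe_words is hypothetical, the pattern uses \w+ instead of .+ so that only single words are collapsed, and it assumes the repeats are separated by exactly one space and sit on the same line.

```shell
# Hypothetical helper (not from the original answer): collapse consecutive
# repeats of the same word into a single occurrence.
# Assumes GNU sed, which supports -E together with \b, \w and back-references.
dedupe_words() {
  # \b(\w+)    capture one whole word
  # ( \1\b)+   followed by one or more copies of it, each after a single space
  # \1         keep only the first copy
  sed -E 's/\b(\w+)( \1\b)+/\1/g'
}

echo 'same word twice twice, as as here here' | dedupe_words
```

This prints "same word twice, as here". To process a file, redirect it into the function (dedupe_words < input.txt > output.txt, with both file names as placeholders) and review the result, since some repeats such as "had had" can be intentional.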




Reference: The Stack Overflow Regular Expressions FAQ

