Tuesday, January 12, 2010

Some recent nontrivial searches

Doing a lot of CSV file analysis the last couple weeks. Some things I've needed to do:

  • [^X]: Find out whether any fields don't match the typical pattern. Here's one. In this file I noticed that in most rows the 3rd field had the pattern of four digits followed by the letter A. I wanted to see if there were any that had some other character at the end. When I used s/\d\{4}[^A], lo and behold, there was indeed a nonstandard entry (the 0284B row):
102,0949,0277A,1,0.00,"
102,0282,0282A,1,0.00,"
102,0284,0284B,1,0.00,"
102,0287,0287A,1,0.00,"
102,0288,0287A,1,0.00,"
102,0289,0289A,1,0.00,"
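
For anyone who wants to try the same check outside Vim, here is a rough Python sketch of it. It is not the Vim command itself: it assumes Python's `re` module, and the names `rows` and `odd_suffix` are made up for illustration. The rows are copied from the listing above. The comma at the end of the pattern matters, because a comma is also "not A", so without it the plain four-digit field would match too.

```python
import re

# Sample rows copied from the listing above, kept exactly as they appear.
rows = [
    '102,0949,0277A,1,0.00,"',
    '102,0282,0282A,1,0.00,"',
    '102,0284,0284B,1,0.00,"',
    '102,0287,0287A,1,0.00,"',
    '102,0288,0287A,1,0.00,"',
    '102,0289,0289A,1,0.00,"',
]

# Four digits, one character that is not "A", then the comma that ends the field.
odd_suffix = re.compile(r"\d{4}[^A],")

# Keep only the rows where the pattern is found somewhere in the line.
nonstandard = [row for row in rows if odd_suffix.search(row)]

for row in nonstandard:
    print(row)
```

On the six sample rows only the 0284B row is reported.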

  • \1: Then I noticed that in most entries field 2 matched the digits of field 3. I wanted to see which ones didn't match. At first it was easier to look for the matching entries, with the following s/\v(\d{4}),\1\u,:
102,0949,0277A,1,0.00,"
102,0282,0282A,1,0.00,"
102,0284,0284B,1,0.00,"
102,0287,0287A,1,0.00,"
102,0288,0287A,1,0.00,"
102,0289,0289A,1,0.00,"
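
The backreference idea carries over to other regex engines. Below is a small Python sketch of the same search, again only an approximation of the Vim pattern: Vim's \u (an uppercase letter) is written as `[A-Z]` here, and `rows` and `same_digits` are illustrative names, not anything from the original session.

```python
import re

# Sample rows copied from the listing above, kept exactly as they appear.
rows = [
    '102,0949,0277A,1,0.00,"',
    '102,0282,0282A,1,0.00,"',
    '102,0284,0284B,1,0.00,"',
    '102,0287,0287A,1,0.00,"',
    '102,0288,0287A,1,0.00,"',
    '102,0289,0289A,1,0.00,"',
]

# (\d{4}) captures field 2; \1 demands the same four digits at the start of
# field 3, followed by one uppercase letter and the closing comma.
same_digits = re.compile(r"(\d{4}),\1[A-Z],")

# Rows whose field 2 is repeated as the digit part of field 3.
matching = [row for row in rows if same_digits.search(row)]

for row in matching:
    print(row)
```

The two rows left out of `matching` (0949,0277A and 0288,0287A) are the mismatches.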

  • \@!: Finding the entries that don't match is actually a little confusing, but it can be done. The zero-width matcher \@! specifies that the preceding atom must NOT match at that position, but because it consumes nothing you then have to spell out what should match there instead. I couldn't get \@! to work under the \v 'very magic' setting (under \v it is apparently written @!, without the backslash), so I ended up with a lot of backslashes in the end result. And this is what it looks like: \(\d\{4}\),\(\1\)\@!\d\{4}\u,:
102,0949,0277A,1,0.00,"
102,0282,0282A,1,0.00,"
102,0284,0284B,1,0.00,"
102,0287,0287A,1,0.00,"
102,0288,0287A,1,0.00,"
102,0289,0289A,1,0.00,"
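
Here is how the same "does not repeat" search might look in Python, as a sketch rather than a translation of the Vim command. It assumes Python's negative lookahead `(?!...)` as the counterpart of Vim's \@!, uses `[A-Z]` in place of \u, and the names `rows` and `mismatch` are made up for the example.

```python
import re

# Sample rows copied from the listing above, kept exactly as they appear.
rows = [
    '102,0949,0277A,1,0.00,"',
    '102,0282,0282A,1,0.00,"',
    '102,0284,0284B,1,0.00,"',
    '102,0287,0287A,1,0.00,"',
    '102,0288,0287A,1,0.00,"',
    '102,0289,0289A,1,0.00,"',
]

# (\d{4}) captures field 2. (?!\1) is zero-width: it only asserts that the
# captured digits do NOT come next. Because it consumes nothing, \d{4}[A-Z],
# still has to spell out what field 3 looks like.
mismatch = re.compile(r"(\d{4}),(?!\1)\d{4}[A-Z],")

# Rows where field 3 starts with four digits different from field 2.
differing = [row for row in rows if mismatch.search(row)]

for row in differing:
    print(row)
```

On the sample rows this picks out the 0949,0277A and 0288,0287A entries.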
