Counting the occurrences of a substring - the fastest way…
So what is the fastest way to count how often a substring occurs inside a string, e.g. how many dashes can be found in “1-4-7-8-37-5-7-8-3-42″? I identified five ways to make perl say “9″ to this challenge:
- my $size = (scalar(@{[$string =~ /-/g]}) + 1);
- my $size = scalar(split /-/, $string);
- my $size = (($string =~ s/-//g) + 1);
- my $size = (($string =~ tr/-//) + 1);
- my $size = 1; $size++ while $string =~ /-/g;
I would have assumed that number 1 would be the fastest way as it requires no modifcation of the string. A quick run with “cmpthese” from the Benchmark module revealed that my assumption was wrong:
Rate | match (1) | split (2) | while (5) | repl (3) | tr (4) | |
---|---|---|---|---|---|---|
match (1) | 51106/s | – | -90% | -93% | -94% | -95% |
split (2) | 495540/s | 870% | – | -36% | -45% | -47% |
while (5) | 771605/s | 1410% | 56% | – | -15% | -18% |
repl (3) | 908265/s | 1677% | 83% | 18% | – | -4% |
tr (4) | 941620/s | 1742% | 90% | 22% | 4% | – |
The table shows a clear advantage when counting the occurences via a while-loop (solution 5), count the replaced characters (solution 3) or truncating all occurences (solution 4). While “tr” was the fastest way during this test run, it wasn’t the best way on other benchmark runs or others systems. If you want to try it yourself you can do so with the attached test-script.
On November 26th, 2005 at 3:17 am
The fastest way to count the occurrence of a one character literal substring is using the tr///.
Why do you add one to the result when it will then calculate one more than the number of literals found?
code:
my $string=”1-4-7-8-37-5-7-8-3-42″;
my $size = $string =~ tr/-//;
print “size = $size”;
results:
size = 9
On November 26th, 2005 at 7:34 pm
Good point, I didn’t realize that error. But as far as I can see this would only make the advantage of tr/// bigger, so the overall result at least isn’t wrong.
On May 25th, 2012 at 7:56 am
Hello,
For matching a substring in a pretty long string, for me the replacement method works best.
“While” method took 40 secs to give count, while replacement method took 2 secs only:)