Finding the Index and Number of Occurrences of a Pattern in a String

It is a common task in programming to search for patterns in a given string. This is especially true in the field of Bio Informatics. For various reasons, the programmer may want to know the location, or index, that the pattern started at within the String. The following code below is an example of how to find the index and also the number of occurrences of a given pattern in Perl.

sub pattern_count {
    my ($text, $pattern) = @_;

    my $pattern_len = length($pattern);
    my $count = 0;
    my $pos;
    for ( my $i = 0; $i < length($text); $i++) {
        if ( $pattern eq substr($text, $i, $pattern_len ) ) {
            $count += 1;
            $pos = $i + 1;
            print "found $pattern at $pos\n";
        }
    }
    return $count;

}

my $num_occurrences = pattern_count('GATATATGCATATACTT', 'ATAT');

print "The number of times the pattern occurred is : " . $num_occurrences,"\n";


The String to search within is the first argument to the method, and the desired pattern is the second argument given.

The difficult part about using a regular expression for this task is that sometimes the desired pattern may overlap itself in occurrences. Since using the global flag for a regular expression will not pick up the overlapping occurrences, this implementation of walking the string is used.

Finding the Index and Number of Occurrences of a Pattern in a String

17 thoughts on “Finding the Index and Number of Occurrences of a Pattern in a String

  1. This one is probably not perfect either, but for huge strings probably a lot faster since it uses Perl’s regexp engine; the trick here is to use the pattern within a look-ahead assertion, which doesn’t actually match it, and a combination of \G and ^ to try to match again and again, moving one character ahead at a time.

    sub pattern_count
    {
    my ($text, $pattern) = @_;

    my $count = 0;
    my $pos;
    pos($text) = 0;
    while($text =~ m{\G(^.*?|.+?)(?=\Q$pattern\E)}g) {
    $count += 1;
    $pos = pos($text) + 1;
    print “found $pattern at $pos\n”;
    }
    return $count;
    }

    Liked by 1 person

    1. Hello,
      Awesome use of the look-ahead there. I compared the times for each method with the data provided in my post and here were the results:

      From my example:

      
      time perl patt.pl
      found ATAT at 2
      found ATAT at 4
      found ATAT at 10
      The number of times the pattern occurred is : 3
      
      real	0m0.003s
      user	0m0.001s
      sys	0m0.002s
      

      From yours:

      
      time perl blog.pl
      found ATAT at 2
      found ATAT at 4
      found ATAT at 10
      The number of times the pattern occurred is : 3
      
      real	0m0.006s
      user	0m0.003s
      sys	0m0.003s
      

      Also noting to myself that benchmarking on such a small set of data may be next to useless. And as they say, there are three kinds of lies: Lies, Damn Lies, and benchmarks 🙂

      Like

    2. Using lookahead, it’s not really necessary to use all those gyrations to force the match ahead one character at a time. This happens naturally, as follows:

      $ perl
      use feature "say";
      my $x = "ATATATATACATATA";
      my $s = "ATAT";
      while ($x =~ /(?=\Q$s\E)/g) {
        say pos($x);
      }
      ^D
      0
      2
      4
      10
      $
      

      Like

  2. Hi, there’s a faster way of doing this using index(), which according to my benchmarks is about 3-10 times faster (the longer the search strings, and the sparser the matches, the faster it is):

    sub pattern_count {
    my ($text, $pattern) = @_;
    my $count = 0;
    my $pos = 0;
    while (($pos = index($text, $pattern, $pos)) >= 0) {
    $count++;
    $pos++;
    print “found $pattern at $pos\n”;
    }
    return $count;
    }

    This still finds overlapping matches, but uses index() to search for the next match from a given position.

    Hope this is useful,
    Steven

    Like

  3. Oops – edit from my previous comment — I had = where I needed =~:

    You can also apply a list context to a match:
    my $string = ‘abc def abc’;
    my $find_str = ‘abc’;
    my $count = () = $string =~ /$find_str/g;
    # $count = 2

    Like

    1. Alas, the regex option does not work for overlapping matches (as mentioned in the last paragraph of the original post); try it with “ATATATATATA” and “ATA”. The string contains overlapping instances of the search patttern, hence the method of walking the index finds 5 occurrences, while the regex method finds only 3.

      Liked by 1 person

  4. John says:

    To expand (or contract) on Marek’s solution, if you don’t need the “Found at…” notifications then you can use the lookahead without the \G or the loop which should allow the regexp engine to loop through the text once.

    sub pattern_count
    {
    my ($text, $pattern) = @_;
    return scalar( () = $text =~ m{(?=\Q$pattern\E)}g );
    }

    This also avoids all of the variable declarations and initializations (and incrementations and other …ations) which can add up if you really need speed.

    Cheers!

    Liked by 1 person

  5. Joaquin Ferrero says:

    With index():

    sub pattern_count {
    my ($text, $pattern) = @_;

    my $count = 0;
    my $pos = -1;

    while (-1 != ($pos = index $text, $pattern, $pos+1)) {
    $count++;
    print “found $pattern at $pos\n”;
    }
    return $count;
    }

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s