朴素的模式匹配算法

主串为S=“abcdefgab”S=\text{“abcdefgab”},我们要匹配T=“abcdex”T=\text{“abcdex”}。定义两个指针 i,ji,j,分别指向字符串S,TS,T
最初,i,ji,j 指向两个字符串的头,将两个字符串的头对其,开始同步向后移动两个指针,当 i=j=6i=j=6 时,发现s[i]t[j]s[i]\ne t[j],这时就将 TT 相对 SS 整体向右移动一位,ii 指针不变,令 jj 回到 11 。此时发现 TT 的第一个字符就不匹配,继续进行如图的步骤。
朴素算法的最坏时间复杂度为O((nm+1)×m)O((n-m+1)\times m),这样的算法实在是不能接受的。

KMP算法概述

在计算机科学中,Knuth-Morris-Pratt\text{Knuth-Morris-Pratt} 字符串查找算法(简称为KMP\text{KMP}算法)可在一个主文本字符串 SS 内查找一个词 WW 的出现位置。此算法通过运用对这个词在不匹配时本身就包含足够的信息来确定下一个匹配将在哪里开始的发现,从而避免重新检查先前匹配的字符。
这个算法是由高德纳和沃恩·普拉特在1974年构思,同年詹姆斯·H·莫里斯也独立地设计出该算法,最终由三人于1977年联合发表。

朴素方法的浪费

在上述朴素匹配算法的例子中,TT 的首字母 ‘a’\text{‘a’} 与后面的串“bcdex”\text{“bcdex”} 中任意一个字符都不相等。也就是说,在第①步既然已经发现前五个字符相等,就意味着 TT 的首字符不可能与 SS 中的第22位到第55位的字符相等,那么②③④⑤的比较可以说是多余的。

利用已获得信息

来看下一个例子,S=“abcababca”S=\text{“abcababca”}T=“abcabx”T=\text{“abcabx”}

对于开始的判断,前55个字符相等,到 i=j=6i=j=6 时失配,由于已经知道 S[1,2,3,4,5]=T[1,2,3,4,5]S[1,2,3,4,5]=T[1,2,3,4,5]T[1]T[1]TT的第22位和第33位字符都不相等,所以步骤②③是多余的,故可以直接向右大幅度移动 TT ,即 i=4,j=1i=4,j=1,重新进行匹配。但是这时我们又可以发现,由于S[4,5]=T[4,5]=T[1,2]S[4,5]=T[4,5]=T[1,2],实际上可以直接令 i=6,j=3i=6,j=3T[1,2]S[4,5]T[1,2]和S[4,5]显然能匹配上,那么步骤④⑤也是多余的。
在朴素的模式匹配算法中,主串的指针 ii 是不断地回溯的,而在上述的匹配过程中,我们发现完全可以避免不必要的回溯。这就是KMP\text{KMP}算法的精髓,我们可以充分利用字符串前后缀的相似性,再利用已经匹配得到的信息,避免指针的来回横跳。
KMP\text{KMP}算法的时间复杂度为O(n+m)O(n+m)

定义数组 Next[j]\text{Next[j]}

给出定义

在一般情况下,Next[j]Next[j] 为前 j1j-1 个字符组成的字符串最长相等真前后缀加一。

求数组 Next[j]\text{Next[j]}

下面给一个计算 Next[j]Next[j] 的例子。

j 1 2 3 4 5 6 7
模式串T a b c a b x
Next[j] 0 1 1 1 2 3 1

就拿 j=6j=6 来说,T[1,2]=T[4,5]T[1,2]=T[4,5],故 Next[6]=3Next[6]=3。当 j=6j=6 失配时,TT 的头会移动到 S[4]S[4] ,这时 T[1,2]T[1,2]S[4,5]S[4,5] 显然能匹配上,因此我们直接将 jj 移动到 j=3j=3, 即 $j=Next[j] $,称之为回退。
GtBKzQ.gif

模板

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
void get_Next()
{
int i=1,j=0;
Next[1]=0;
while(i<=lt)
{
//t[i]表示后缀,t[j]表示前缀
if(j==0||t[i]==t[j])
{
i++;
j++;
Next[i]=j;
}
else j=Next[j];//回退
}
}

查询子串

HDU 1686 Oulipo \looparrowright

Problem Description

The French author Georges Perec (1936–1982) once wrote a book, La disparition, without the letter ‘e’. He was a member of the Oulipo group. A quote from the book:
Tout avait Pair normal, mais tout s’affirmait faux. Tout avait Fair normal, d’abord, puis surgissait l’inhumain, l’affolant. Il aurait voulu savoir où s’articulait l’association qui l’unissait au roman : stir son tapis, assaillant à tout instant son imagination, l’intuition d’un tabou, la vision d’un mal obscur, d’un quoi vacant, d’un non-dit : la vision, l’avision d’un oubli commandant tout, où s’abolissait la raison : tout avait l’air normal mais…
Perec would probably have scored high (or rather, low) in the following contest. People are asked to write a perhaps even meaningful text on some subject with as few occurrences of a given “word” as possible. Our task is to provide the jury with a program that counts these occurrences, in order to obtain a ranking of the competitors. These competitors often write very long texts with nonsense meaning; a sequence of 500000500000 consecutive 'T’s is not unusual. And they never use spaces.
So we want to quickly find out how often a word, i.e., a given string, occurs in a text. More formally: given the alphabet $\{A, B, C, …, Z\}$ and two finite strings over that alphabet, a word WW and a text TT, count the number of occurrences of WW in TT. All the consecutive characters of WW must exactly match consecutive characters of TT. Occurrences may overlap.

Input

The first line of the input file contains a single number: the number of test cases to follow. Each test case has the following format:
One line with the word W, a string over $\{A,B, C, …, Z\}$, with 1W100001 ≤ |W| ≤ 10000 (here W|W| denotes the length of the string WW).
One line with the text TT, a string over $\{A, B, C, …, Z\}$, with WT1000000|W| ≤ |T| ≤ 1000000.

Output

For every test case in the input file, the output should contain a single number, on a single line: the number of occurrences of the word WW in the text TT.

Sample Input

3
BAPC
BAPC
AZA
AZAZAZA
VERDI
AVERDXIVYERDIAN
Sample Output
1
3
0

Translation

在字符串TT中找到子串WW,输出所有WW出现的位置(以第一个字符出现的位置为准)。

Idea

如上述介绍KMP\text{KMP}算法的模拟过程,跳过不必要的回溯。

Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#include<iostream>
#include<cstring>
#include<cstdio>
#define N 10003
using namespace std;
char a[N*100],b[N];
int Next[N];
int la,lb;
/*------获取数组Next------*/
void get_Next()
{
int i=1,j=0;
Next[1]=0;
while(i<=lb)
{
if(j==0||b[i]==b[j])
{
i++;
j++;
Next[i]=j;
}
else j=Next[j];
}
}
/*------------匹配------------*/
int kmp()
{
int i=1,j=1;
int ans=0;
while(i<=la&&j<=lb)
{
if(j==0||a[i]==b[j])
{
i++;
j++;
}
else j=Next[j];
if(j>lb) //指针指向字符串末尾的后一位,说明找到子串。
{
//找到一个子串后当作匹配失败,继续找下一个
j=Next[j];
ans++;
}
}
return ans;
}
void solve()
{
int i;
scanf("%s",b+1);
lb=strlen(b+1);
scanf("%s",a+1);
la=strlen(a+1);
get_Next();
printf("%d\n",kmp());
}
int main()
{
int t;
cin>>t;
while(t--) solve();
return 0;
}

字符串最小循环节

POJ 1961 Period \looparrowright

Description

For each prefix of a given string SS with NN characters (each character has an ASCII code between 9797 and 126126, inclusive), we want to know whether the prefix is a periodic string. That is, for each ii (2iN)(2 \leqslant i \leqslant N) we want to know the largest K>1K > 1 (if there is one) such that the prefix of SS with length ii can be written as AKA^K ,that is AA concatenated KK times, for some string AA. Of course, we also want to know the period KK.

Input

The input consists of several test cases. Each test case consists of two lines. The first one contains NN (2N1000000)(2\leqslant N \leqslant 1000 000) – the size of the string SS.The second line contains the string SS. The input file ends with a line, having the number zero on it.

Output

For each test case, output “Test case #” and the consecutive test case number on a single line; then, for each prefix with length ii that has a period K>1K > 1, output the prefix size ii and the period KK separated by a single space; the prefix sizes must be in increasing order. Print a blank line after each test case.

Sample Input

3
aaa
12
aabaabaabaab
0

Sample Output

Test case #1
2 2
3 3
Test case #2
2 2
6 2
9 3
12 4

Translation

求一个字符串中前缀的最小循环节,要求输出前缀长度和它的最小循环节循环次数(次数要求至少为2)。

Idea

给出定理:S[1,2,,i]S[1,2,\cdots ,i] 具有长度为 circle<icircle<i 的循环节的充要条件是 circlecircle 能整除 ii 并且 S[circle+1,circle+2,,i]=S[1,2,,icirle]S[circle+1,circle+2,⋯,i]=S[1,2,⋯,i−cirle]

对于一个长度为 ii 的前缀,该前缀的最后一个字符位于字符串的第 ii 位,Next[i+1]Next[i+1]描述了字符串S[1,2,,i]S[1,2,\cdots ,i]的相似性,Next[i+1]1Next[i+1]-1表示既是子串 s[1,2,,i]s[1,2,⋯,i] 的前缀,同时也是子串 s[1,2,,i]s[1,2,⋯,i] 的后缀的最长真前缀长度 ,即S[i(Next[i+1]1)+1,,i]=S[1,,Next[i+1]]S[i-(Next[i+1]-1)+1,\cdots,i]=S[1,\cdots,Next[i+1]]\RightarrowS[(iNext[i+1]+1)+1,,i]=S[1,,i(iNext[i+1])]S[(i-Next[i+1]+1)+1,\cdots,i]=S[1,\cdots,i-(i-Next[i+1])],若(iNext[i+1]+1)i(i-Next[i+1]+1)\mid i,那么S[1,,iNext[i+1]+1]S[1,\cdots,i-Next[i+1]+1]就是S[1,2,,i]S[1,2,\cdots ,i] 的最小循环节。
进一步地,如果 iNext[Next[i+1]]+1i−Next[Next[i+1]]+1 能整除 ii,那么 S[1,,iNext[Next[i+1]]+1]S[1,⋯,i−Next[Next[i+1]]+1] 就是 S[1i]S[1⋯i] 的次小循环元。依次类推,我们还可以找出 S[1i]S[1⋯i] 所有可能的循环节。

Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
#include<iostream>
#include<cstdio>
#include<cstring>
#define N 1000004
using namespace std;
int len;
char s[N];
int Next[N];
void get_Next()
{
int i=1,j=0;
while(i<=len)
{
if(j==0||s[i]==s[j])
{
i++;
j++;
Next[i]=j;
}
else j=Next[j];
}
}
void solve()
{
scanf("%s",s+1);
get_Next();
int i;
for(i=2;i<=len;i++)
{
int circle=i-Next[i+1]+1;
//判定最小循环节
if(!(i%circle)&&circle<i) printf("%d %d\n",i,i/circle);
}
putchar('\n');
}
int main()
{
int times=0;
while(~scanf("%d",&len)&&len)
{
printf("Test case #%d\n",++times);
solve();
}
return 0;
}

统计每个前缀的出现次数

HDU 3336 Count the string \looparrowright

Problem Description

It is well known that AekdyCoin is good at string problems as well as number theory problems. When given a string ss, we can write down all the non-empty prefixes of this string. For example,s=“abab”s=\text{“abab”}.The prefixes are: “a”,“ab”,“aba”,“abab”\text{“a”},\text{“ab”}, \text{“aba”}, \text{“abab”}.For each prefix, we can count the times it matches in ss. So we can see that prefix “a”\text{“a”} matches twice, “ab”\text{“ab”} matches twice too, “aba”\text{“aba”} matches once, and “abab”\text{“abab”} matches once. Now you are asked to calculate the sum of the match times for all the prefixes. For “abab”, it is 2+2+1+1=62 + 2 + 1 + 1 = 6.
The answer may be very large, so output the answer mod 1000710007.

Input

The first line is a single integer TT, indicating the number of test cases.
For each case, the first line is an integer nn (1n200000)(1\leqslant n \leqslant 200000), which is the length of string ss. A line follows giving the string ss. The characters in the strings are all lower-case letters.

Output

For each case, output only one number: the sum of the match times for all the prefixes of ss mod 1000710007.

Sample Input

1
4
abab

Sample Output

6

Translation

给定一个字符串 ss(1s200000)(1\leqslant|s|\leqslant 200000 )所有前缀在字符串 ss 中出现次数的总和,答案对 1000710007 取模。

Idea

考虑位置 ii 对应的 Next[i+1]1Next[i+1]-1。根据定义,这意味着字符串 ss 一个长度为 Next[i]1Next[i]-1 的前缀在位置 ii 出现并以 ii 为右端点,同时不存在一个更长的前缀满足前述定义。与此同时,更短的前缀可能以该位置为右端点。此时需要寻找下一个更小的长度为 kkk<Next[i+1]1k<Next[i+1]-1 的前缀,该长度的前缀同时也是一个右端点为 ii 的后缀。因此以位置 ii 为右端点,有长度为 Next[i+1]1Next[i+1]-1 的前缀,有长度为 Next[Next[i+1]]1Next[Next[i+1]]-1 的前缀,有长度为 Next[Next[Next[i+1]]]1Next[Next[Next[i+1]]]-1 的前缀,等等,直到长度变为 00

Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
#include<iostream>
#include<cstdio>
#include<cstring>
#define N 200004
#define ll long long
using namespace std;
const ll mod=10007;
int len;
char s[N];
int Next[N];
void get_Next()
{
int i=1,j=0;
while(i<=len)
{
if(j==0||s[i]==s[j])
{
i++;
j++;
Next[i]=j;
}
else j=Next[j];
}
}
void solve()
{
scanf("%d%s",&len,s+1);
get_Next();
ll ans=len;
int i;
for(i=1;i<=len;i++)
{
int temp=Next[i+1]-1;
while(temp)//迭代直至为0
{
ans=(ans+1)%mod;
temp=Next[temp+1]-1;
}
}
printf("%lld\n",ans);
}
int main()
{
int t;
cin>>t;
while(t--) solve();
return 0;
}