AC自动机器算法 | 显淳碎记

AC自动机（Aho-Corasick Automaton），是由 Alfred V. Aho 和 Margaret J. Corasick 于 1975 年提出的一种多模式字符串匹配算法。它可以在线性时间内同时匹配多个模式串（pattern strings）在一个文本串（text）中出现的所有位置。

🧠 核心思想

AC自动机 = 字典树（Trie） + 失败指针（Failure Function） + 输出函数（Output Function）

字典树（Trie）：用于存储所有模式串。
失败指针（fail）：类似 KMP 的 next 数组，当匹配失败时跳转到“最长后缀匹配”的状态。
输出函数：记录从当前状态可以输出哪些完整的模式串。

🔧 构建步骤

步骤 1：构建 Trie 树

将所有模式串插入到 Trie 中。每个节点代表一个字符状态，从根节点出发。

例如，模式串集合：{"he", "she", "his", "hers"}

       (root)
      /  |   \
     h   s    h
    /    |     \
   e     h      i
  /      |       \
(he)    e         s
        /         |
     (she)       (his)
                 /
                r
               /
              s
             /
          (hers)

括号中的表示单词结束节点。

步骤 2：构建失败指针（BFS）

使用广度优先搜索（BFS） 为每个节点设置 fail 指针：

根节点的 fail 指向 null 或自己。
根的直接子节点 fail 指向根。
对于其他节点 u，设其父节点为 p，字符为 c：
- 从 p 的 fail 节点开始，沿着 fail 链查找是否存在子节点 c。
- 如果找到节点 v，则 u.fail = v；
- 否则继续跳 fail，直到根，若根也没有 c，则 u.fail = 根。

实际实现中，常使用队列完成 BFS 层次遍历。

步骤 3：构建输出函数（可选优化）

每个节点可以维护一个 output 列表，记录所有以该节点结尾或通过 fail 链能到达的模式串。

简单做法：匹配时沿 fail 链回溯，收集所有 output。
优化做法：在构建 fail 时，将 fail 节点的 output 合并到当前节点（称为“输出链接”）。

🔍 匹配过程

从文本串第一个字符开始，从 Trie 根节点出发：

当前字符 c：
- 若当前节点有子节点 c，则跳转到该子节点；
- 否则，沿 fail 指针回退，直到找到有 c 子节点的节点或回到根。
每到一个节点，检查其 output（或沿 fail 链），报告所有匹配的模式串。
继续处理下一个字符。

整个匹配过程时间复杂度为 O(n + m + z)，其中： - n：文本长度 - m：所有模式串总长度 - z：匹配结果总数

✅ 举个例子

模式串：{"abc", "bcd", "cde"}
文本串："abcde"

构建 Trie 后，加上 fail 指针。

匹配过程： - a → b → c：匹配 "abc" - c 后是 d：从 c 节点无 'd'，跳 fail，可能跳到根或其他节点 - 最终匹配到 "bcd"（位置 1-3）和 "cde"（位置 2-4）

💻 伪代码（构建 + 匹配）

class Node:
    def __init__(self):
        self.children = {}
        self.fail = None
        self.output = []  # 存储以该节点结尾的模式串
        self.word = ""    # 可选：记录完整单词

def build_ac_automaton(patterns):
    root = Node()
    # Step 1: Build Trie
    for pattern in patterns:
        node = root
        for c in pattern:
            if c not in node.children:
                node.children[c] = Node()
            node = node.children[c]
        node.output.append(pattern)
        # 或 node.word = pattern

    # Step 2: Build fail pointers (BFS)
    from collections import deque
    q = deque()
    for child in root.children.values():
        child.fail = root
        q.append(child)

    while q:
        current = q.popleft()
        for char, child in current.children.items():
            # Find fail node for child
            fail_node = current.fail
            while fail_node and char not in fail_node.children:
                fail_node = fail_node.fail
            child.fail = fail_node.children[char] if fail_node and char in fail_node.children else root
            # Optional: merge output
            child.output += child.fail.output
            q.append(child)
    return root

def ac_search(text, root):
    node = root
    matches = []
    for i, c in enumerate(text):
        while node and c not in node.children:
            node = node.fail
        if not node:
            node = root
        else:
            node = node.children[c]
        # Report all matches ending at position i
        for pattern in node.output:
            matches.append((i - len(pattern) + 1, pattern))
    return matches

🚀 应用场景

敏感词过滤（如聊天系统）
病毒特征码扫描
生物信息学中的序列匹配
日志关键词监控

📌 小结

特性	说明
时间复杂度	O(文本长度 + 模式串总长度 + 匹配数)
空间复杂度	O(模式串总长度 × 字符集大小)
优点	多模式同时匹配、高效、稳定
缺点	构建较复杂，内存占用较大（可用 Double Array Trie 优化）