Lucas Wu

(1) Talking About Chinese Word Segmentation: Maximum Matching, Bidirectional Matching, Minimal Word Count

Chinese word segmentation is the task of dividing a piece of text into words such that concatenating the words reproduces the original text. It has long been an important area of NLP, and most Chinese text-mining tasks build on it. Unlike English, where words are separated by spaces, Chinese is written without delimiters, so the word boundaries themselves must be recovered.

There has long been business demand for Chinese word segmentation, but, busy with other projects, I had never studied it in depth. Recently I started exploring segmentation algorithms. Although the field is relatively mature, performance on out-of-vocabulary words and domain-specific text is often unsatisfactory, so achieving good results usually requires combining algorithms or maintaining custom dictionaries.

Chinese word segmentation can be divided into two main approaches:

  1. Dictionary-based segmentation: Examples include Forward Maximum Matching, Backward Maximum Matching, Bidirectional Maximum Matching, and Minimal Word Count.
  2. Character-tagging segmentation: Examples include the Hidden Markov Model (HMM).

Each approach has its strengths and weaknesses. Dictionary-based segmentation is very fast but struggles to identify out-of-vocabulary words, whereas HMM-based methods and tools such as pkuseg can handle some out-of-vocabulary words at the cost of speed. In real-world applications this trade-off can be critical, and the best solutions often combine the strengths of several methods. Jieba, for instance, uses dictionary-based matching for known words and falls back to an HMM for unknown words.
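As a quick illustration of this fallback (using the real jieba library; the exact output can vary with the version and its bundled dictionary), toggling jieba's HMM switch shows the difference on an out-of-vocabulary word:

import jieba  # pip install jieba

text = '他来到了网易杭研大厦'  # '杭研' is not in the default dictionary
print('/'.join(jieba.cut(text, HMM=False)))  # dictionary matching only, e.g. 他/来到/了/网易/杭/研/大厦
print('/'.join(jieba.cut(text, HMM=True)))   # HMM enabled, e.g. 他/来到/了/网易/杭研/大厦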

Dictionary-Based Segmentation

Dictionary-based segmentation algorithms involve querying a dictionary-like data structure. These algorithms have weak recognition capabilities for out-of-vocabulary words and struggle with ambiguous phrases. For example, in the phrase “结婚的和尚未结婚的” (“those who are married and those who are not yet married”), the ideal segmentation is “结婚/的/和/尚未/结婚/的.” A dictionary-based method, however, may produce “结婚/的/和尚/未/结婚/的,” wrongly grouping the characters “和尚” (“monk”).

Nevertheless, dictionary-based segmentation is very fast, and mature, efficient solutions are available. Many tools use efficient data structures and querying methods such as tries (prefix trees) and the Aho-Corasick automaton. For simplicity, we will walk through a few basic algorithms here.
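As a rough sketch (my own illustration, not the implementation of any particular tool), a minimal trie supporting longest-prefix lookup might look like this:

class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.is_word = False  # True if the path from the root spells a dictionary word

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def longest_match(root, sentence, start):
    """Return the longest dictionary word beginning at `start`, or None."""
    node, match = root, None
    for i in range(start, len(sentence)):
        node = node.children.get(sentence[i])
        if node is None:
            break
        if node.is_word:
            match = sentence[start:i + 1]
    return match

trie = build_trie(['分词', '分词算法'])
print(longest_match(trie, '中文分词算法', 2))
# 分词算法

With a structure like this, the "shrink the segment and retry" loops in the algorithms below can be replaced by a single traversal per position.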

Forward Maximum Matching Algorithm (FMM)

The Forward Maximum Matching (FMM) algorithm mimics human reading habits, recognizing text from left to right. "Maximum" means that the length of the longest word in the dictionary is used as the matching window width: the text is sliced at this width, and each slice is checked against the dictionary.

If a segment matches the dictionary, it is recorded, and the starting position for the next segmentation is updated. If no match is found, the last character is removed from the segment, and the process repeats until a match is found. If a single character remains and still doesn’t match, that character is returned as is.

Example: For the text “中文分词算法” and a dictionary containing only the word “分词,” the algorithm works as follows:

  1. Set the maximum matching width to 2 (the length of “分词”). The first segment “中文” is checked against the dictionary. No match is found, so “中” is checked, also without a match. Thus, “中” is returned.
  2. The next segment “文分” is checked. Similarly, “文” is returned after failing to find a match.
  3. The segment “分词” matches the dictionary, so it is recorded.
  4. The segment “算法” is processed similarly, returning “算” and “法” as no matches are found.

The final segmentation result is ["中", "文", "分词", "算", "法"].

Code implementation:

sentence = '中文分词算法'  # Input sentence
cutList = ['分词']  # Dictionary

maxWidth = len(max(cutList, key=len))  # Maximum matching width: length of the longest dictionary word
cut_result = []  # To store segmentation results
start = 0  # Starting position

while start < len(sentence):
    word = sentence[start:start + maxWidth]  # Extract a segment of maximum width
    # Shrink the segment from the right until it matches the dictionary
    # or only a single character remains
    while word not in cutList and len(word) > 1:
        word = word[:-1]
    cut_result.append(word)  # A dictionary word, or a lone unmatched character
    start += len(word)  # Advance past the consumed segment

print(cut_result)
# ['中', '文', '分词', '算', '法']

Backward Maximum Matching Algorithm

The Backward Maximum Matching (BMM) algorithm works in the opposite direction of the Forward Maximum Matching (FMM) algorithm. For example, while FMM starts segmenting the text “中文分词算法” from “中文” with a matching width of 2, BMM starts from “算法.”

The handling of unmatched segments is reversed as well: where FMM removes the last character of an unmatched segment, BMM removes the first. For instance, if “中文” fails to match in FMM, “文” is removed and “中” is checked; if “算法” fails to match in BMM, “算” is removed and “法” is checked.

Compared to FMM, BMM can resolve certain ambiguous phrases better. For instance, in the sentence “他说的确实在理” (“What he said indeed makes sense”), the ideal segmentation includes the word “确实” (“indeed”). The FMM result might be “他/说/的确/实/在理,” while the BMM result is “他/说/的/确实/在理,” which is closer to the intended meaning.

However, neither approach is universally better. For example, if the desired segmentation is “的确,” BMM may fail to produce it.

Code implementation:

sentence = '他说的确实在理'  # Input sentence
cutList = ['的确', '确实']  # Dictionary of words
maxWidth = len(max(cutList, key=len))  # Maximum matching width from dictionary
cut_result = []  # To store segmentation results
end = len(sentence)  # Current position, moving from right to left

while end > 0:
    word = sentence[max(end - maxWidth, 0):end]  # Extract the segment ending at `end`
    # Shrink the segment from the left until it matches the dictionary
    # or only a single character remains
    while word not in cutList and len(word) > 1:
        word = word[1:]
    cut_result.insert(0, word)  # Prepend to keep the original word order
    end -= len(word)  # Move left past the consumed segment

print(cut_result)
# ['他', '说', '的', '确实', '在', '理']

Bidirectional Maximum Matching Algorithm

The Bidirectional Maximum Matching algorithm runs both FMM and BMM on the text and selects the better result according to predefined rules:

  1. Return the result with the fewest words.
  2. If the word count is the same, choose the result with fewer single-character words.
  3. If the single-character counts are also equal, prefer the BMM result.

Since bidirectional matching runs both FMM and BMM, its computational cost is roughly double that of either method alone.
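A minimal sketch of these rules (the function names fmm, bmm, and bi_mm are my own, wrapping the forward and backward procedures shown above):

def fmm(sentence, cutList):
    # Forward maximum matching, as above
    maxWidth = len(max(cutList, key=len))
    result, start = [], 0
    while start < len(sentence):
        word = sentence[start:start + maxWidth]
        while word not in cutList and len(word) > 1:
            word = word[:-1]
        result.append(word)
        start += len(word)
    return result

def bmm(sentence, cutList):
    # Backward maximum matching, as above
    maxWidth = len(max(cutList, key=len))
    result, end = [], len(sentence)
    while end > 0:
        word = sentence[max(end - maxWidth, 0):end]
        while word not in cutList and len(word) > 1:
            word = word[1:]
        result.insert(0, word)
        end -= len(word)
    return result

def bi_mm(sentence, cutList):
    f, b = fmm(sentence, cutList), bmm(sentence, cutList)
    if len(f) != len(b):  # Rule 1: fewer words wins
        return f if len(f) < len(b) else b
    f_single = sum(len(w) == 1 for w in f)
    b_single = sum(len(w) == 1 for w in b)
    if f_single < b_single:  # Rule 2: fewer single-character words wins
        return f
    return b  # Rule 3: a tie goes to BMM

print(bi_mm('他说的确实在理', ['的确', '确实']))
# ['他', '说', '的', '确实', '在', '理']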

Minimal Word Count Algorithm

The Minimal Word Count algorithm, also known as the Minimal Segmentation algorithm, aims to produce the segmentation with the fewest words. A simple implementation sorts the dictionary by word length in descending order and matches the longest words first, falling back to progressively shorter words when no match is found.

For example, for the sentence “独立自主和平等互利的原则” (“Principles of independence, equality, and mutual benefit”), FMM might produce the result “独立自主/和平/等/互利/的/原则,” while the Minimal Word Count algorithm would produce “独立自主/和/平等互利/的/原则.”

Code implementation:

sentence = '独立自主和平等互利的原则'  # Input sentence
cutList = ['独立自主', '平等互利', '独立', '自主', '和平', '平等', '互利', '原则']  # Dictionary of words
cutList = sorted(cutList, key=len, reverse=True)  # Sort dictionary by word length in descending order

for cut in cutList:
    # Skip candidates that fall inside an already-marked word,
    # then mark each remaining match by wrapping it in slashes
    if '/%s' % cut not in sentence and '%s/' % cut not in sentence and cut in sentence:
        sentence = sentence.replace(cut, '/%s/' % cut)

print(sentence.strip('/'))
# 独立自主/和/平等互利/的/原则

(End)