哈希算法题目：设计一个基于哈希的倒排索引系统

字数 818 2025-11-02 00:38:37

哈希算法题目：设计一个基于哈希的倒排索引系统

题目描述
设计一个倒排索引系统，用于支持文档检索功能。系统需要实现以下操作：

addDocument(docId, content)：添加一个文档，docId是文档的唯一标识符，content是文档的文本内容（由空格分隔的单词组成）
search(keyword)：搜索包含某个关键词的所有文档ID列表

要求：系统需要高效处理大量文档，并能快速返回搜索结果。

解题过程

步骤1：理解倒排索引的基本概念
倒排索引是搜索引擎的核心数据结构，它将文档中的单词映射到包含该单词的文档ID集合。与正排索引（文档ID→内容）相反，倒排索引建立了"单词→文档ID"的映射关系。

基本结构：

键（Key）：单词（经过归一化处理）
值（Value）：包含该单词的文档ID集合

步骤2：设计基本数据结构
我们需要一个哈希表来存储倒排索引：

键：字符串（单词）
值：集合（文档ID）

这样可以实现O(1)时间复杂度的单词查找，然后快速返回对应的文档ID集合。

步骤3：实现addDocument操作

class InvertedIndex:
    def __init__(self):
        # 哈希表：单词 → 文档ID集合
        self.index = {}
    
    def addDocument(self, docId, content):
        # 1. 将内容分割成单词列表
        words = content.split()
        
        # 2. 对每个单词进行处理
        for word in words:
            # 归一化处理：转为小写，确保大小写不敏感
            normalized_word = word.lower()
            
            # 3. 更新倒排索引
            if normalized_word not in self.index:
                # 如果单词第一次出现，创建新的集合
                self.index[normalized_word] = set()
            
            # 将文档ID添加到该单词对应的集合中
            self.index[normalized_word].add(docId)

步骤4：实现search操作

def search(self, keyword):
    # 1. 对搜索关键词进行同样的归一化处理
    normalized_keyword = keyword.lower()
    
    # 2. 在哈希表中查找该关键词
    if normalized_keyword in self.index:
        # 返回文档ID列表（排序后更友好）
        return sorted(list(self.index[normalized_keyword]))
    else:
        # 如果关键词不存在，返回空列表
        return []

步骤5：处理边界情况和优化

去重处理：使用set自动处理重复的文档ID
大小写不敏感：统一转为小写
标点符号处理：可以添加预处理步骤去除标点

import string

def addDocument(self, docId, content):
    # 移除标点符号
    translator = str.maketrans('', '', string.punctuation)
    content = content.translate(translator)
    
    words = content.split()
    for word in words:
        normalized_word = word.lower()
        if normalized_word not in self.index:
            self.index[normalized_word] = set()
        self.index[normalized_word].add(docId)

步骤6：完整实现代码

import string

class InvertedIndex:
    def __init__(self):
        self.index = {}
    
    def addDocument(self, docId, content):
        # 预处理：移除标点符号
        translator = str.maketrans('', '', string.punctuation)
        content = content.translate(translator)
        
        words = content.split()
        
        for word in words:
            normalized_word = word.lower()
            if normalized_word not in self.index:
                self.index[normalized_word] = set()
            self.index[normalized_word].add(docId)
    
    def search(self, keyword):
        normalized_keyword = keyword.lower()
        if normalized_keyword in self.index:
            return sorted(list(self.index[normalized_keyword]))
        return []

步骤7：测试示例

# 创建倒排索引系统
index_system = InvertedIndex()

# 添加文档
index_system.addDocument(1, "hello world this is a test")
index_system.addDocument(2, "world of programming is fun")
index_system.addDocument(3, "test your programming skills")

# 搜索测试
print("搜索 'world':", index_system.search("world"))  # [1, 2]
print("搜索 'programming':", index_system.search("programming"))  # [2, 3]
print("搜索 'python':", index_system.search("python"))  # []

步骤8：性能分析和扩展

时间复杂度：
- addDocument: O(n)，其中n是文档中的单词数
- search: O(1)平均情况
空间复杂度：O(m×d)，其中m是唯一单词数，d是平均文档数/单词

扩展功能：

多关键词搜索（AND/OR操作）
词干提取（如"running"→"run"）
停用词过滤（移除"the", "a", "is"等常见词）
短语搜索和邻近搜索

这个基于哈希的倒排索引系统展示了如何利用哈希表高效处理文本检索问题，是搜索引擎和文档检索系统的核心组件。