CMU 11642 Search Engine - Search Overview

Big Picture

Author: QIFAN  Date: 2017-01-25

Key Concepts

Heaps’ Law

Predicts the size of the vocabulary.

  • $V = KN^b$
    • V: number of distinct words (vocabulary size)
    • K: constant, typically $10 \le K \le 100$
    • b: constant; for English, $0.4 \le b \le 0.6$
    • N: total number of word occurrences
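A quick sketch of the formula. The constants K = 44 and b = 0.49 below are illustrative picks inside the typical English ranges, not measured values:

```python
# Heaps' law sketch: V = K * N^b estimates vocabulary size from token count.
# K = 44 and b = 0.49 are assumptions within the usual English ranges
# (10 <= K <= 100, 0.4 <= b <= 0.6).

def heaps_vocabulary(n_tokens: int, k: float = 44.0, b: float = 0.49) -> int:
    """Estimate the number of distinct words in a collection of n_tokens words."""
    return round(k * n_tokens ** b)

for n in (10_000, 1_000_000, 100_000_000):
    print(n, heaps_vocabulary(n))
```

The sublinear exponent b is what makes the law useful: a collection 100 times larger grows the vocabulary by only about $100^{0.49} \approx 9.5$ times.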

Zipf’s Law

Predicts the proportion of the text accounted for by a given word.

  • $P(t)=\frac{ctf_t}{N}$
    • $ctf_t$: collection term frequency, the number of times t occurs in the collection
    • N: total number of words in the collection
  • This can be simplified to $P(t_R) = \frac{A}{R}$
    • A: constant, typically 0.1 for English
    • R: the word's rank when words are sorted by frequency (highest first)
    • In other words, the most frequent word accounts for roughly 10% of all word occurrences
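The simplified form can be sketched directly (A = 0.1 is the typical English value from above):

```python
# Zipf's law sketch: P(t_R) ~= A / R, with A ~= 0.1 for English.

def zipf_probability(rank: int, a: float = 0.1) -> float:
    """Estimated fraction of all tokens taken by the word of the given rank."""
    return a / rank

def expected_count(rank: int, n_tokens: int, a: float = 0.1) -> int:
    """Estimated number of occurrences of the rank-R word in n_tokens words."""
    return round(zipf_probability(rank, a) * n_tokens)

# Rank 1 covers ~10% of all tokens, rank 2 ~5%, rank 3 ~3.3%, and so on.
print(expected_count(rank=1, n_tokens=1_000_000))  # 100000
print(expected_count(rank=2, n_tokens=1_000_000))  # 50000
```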

Information needs and queries

An information need is something like "I want to buy a vacuum cleaner that isn't too expensive";
the query is then "vacuum cleaner AND cheap".

Query Tree


Unranked Boolean

Covered briefly in the lecture notes. Like the Ranked Boolean model below, it is exact-match. The result set is often very long; "unranked" means relevance and term frequency are ignored, and a document joins the results as soon as the query keywords are found in it. Results are usually sorted by date in the end. Today only a few specialized search engines such as WestLaw and PubMed still use it; my guess is that their users genuinely have no preference among the results.

Ranked Boolean

In contrast to the unranked model, results are sorted by a score, which can be based on relevance, term frequency, and so on.
Score computation: AND takes the min of its operands' scores; OR takes the max.
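The min/max rule can be sketched on a tiny query tree. Using a document's term frequency as the leaf score is an assumption for illustration; real systems may use other evidence:

```python
# Ranked Boolean sketch: scores flow up the query tree.
# AND takes the min of its children's scores, OR takes the max.

def score_and(*scores: float) -> float:
    return min(scores)

def score_or(*scores: float) -> float:
    return max(scores)

# Query: (vacuum AND cheap) OR cleaner, scored for one document whose
# (hypothetical) term frequencies are:
tf = {"vacuum": 3, "cheap": 1, "cleaner": 2}

score = score_or(score_and(tf["vacuum"], tf["cheap"]), tf["cleaner"])
print(score)  # 2: AND yields min(3, 1) = 1, OR yields max(1, 2) = 2
```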

Inverted List

Every word has a list recording where it appears in the documents. Each item stored in the dictionary, together with its locations in the documents, is called a posting.
There are three kinds of inverted lists:

  • Binary Inverted List
    The list contains only the docIDs in which the item appears. Only Boolean operators such as AND, OR, AND-NOT, and FIELD can be used.
  • Frequency Inverted List
    The list contains the docIDs in which the item appears, plus its frequency in each.
  • Positional Inverted List
    The list contains the docIDs and the item's positions within each document. Operators such as NEAR/n, SENTENCE/n, PASSAGE/n, and WINDOW/n can be used.
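A minimal sketch of what each flavor might store for one term (all docIDs, frequencies, and positions below are made up for illustration):

```python
# Three inverted-list flavors for the hypothetical term "engine".

# Binary: only the docIDs that contain the term.
binary = {"engine": [1, 4, 7]}

# Frequency: (docID, term frequency) pairs.
frequency = {"engine": [(1, 2), (4, 1), (7, 5)]}

# Positional: (docID, list of word positions) pairs; each pair is one posting.
# Note the position list's length equals the frequency in that document.
positional = {"engine": [(1, [3, 19]), (4, [7]), (7, [0, 2, 8, 15, 30])]}

print(binary["engine"])      # which documents match an AND/OR query
print(frequency["engine"])   # enough evidence for ranked Boolean scoring
print(positional["engine"])  # enough evidence for NEAR/n and WINDOW/n
```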

Document representation

A document in a search engine is represented like a "bag of words". Most search engines use full-text indexing; some use free-text indexing to save space (presumably meaning only part of the text is indexed?).

TAAT(Term-at-a-Time)

Process the query one term at a time, updating the result set after each term.

  • Query: (a AND b) OR c
    1. Read a's inverted list
    2. Read b's inverted list
    3. AND operation: intersect a and b
    4. Read c's inverted list
    5. OR operation: union the result of step 3 with step 4 to get the final result

The time cost is size(a) + size(b) + size(c).
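The five steps above can be sketched with binary inverted lists held as sets (the postings are hypothetical); note how every intermediate result is materialized in memory before the next operator runs:

```python
# TAAT sketch for the query (a AND b) OR c over binary inverted lists.

index = {  # hypothetical postings: term -> set of docIDs
    "a": {1, 2, 5, 8},
    "b": {2, 3, 5, 9},
    "c": {4, 5, 10},
}

step_and = index["a"] & index["b"]   # steps 1-3: read a, read b, intersect
result = step_and | index["c"]       # steps 4-5: read c, union with step 3

print(sorted(result))  # [2, 4, 5, 10]
```

The intermediate set `step_and` is exactly the kind of temporary inverted list that makes TAAT memory usage hard to control on large collections.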

TAAT pros: easy to understand, easy to implement, very few wasted operations.
TAAT cons: all intermediate inverted lists must be kept in memory, so memory use during processing is hard to control and prone to overflow; it is therefore generally used in smaller systems.

DAAT(Document-at-A-Time)

Processes the query one document at a time. Usually used in large-scale systems: because only one document is evaluated at a time, memory use stays under control.
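One common way to realize document-at-a-time evaluation is a docID-ordered merge with one cursor per term; the sketch below (an assumed, simplified design, shown here for an OR of all terms) keeps only the cursors and the current document in memory:

```python
import heapq

def daat_or(postings: dict[str, list[int]]) -> list[int]:
    """Documents matching any term, found by a docID-ordered merge (DAAT style)."""
    # One (current docID, term, index-into-list) cursor per term, smallest docID first.
    heap = [(lst[0], term, 0) for term, lst in postings.items() if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        doc, term, i = heapq.heappop(heap)
        if not out or out[-1] != doc:
            out.append(doc)  # first cursor to reach this document
        if i + 1 < len(postings[term]):
            heapq.heappush(heap, (postings[term][i + 1], term, i + 1))
    return out

# Hypothetical postings, sorted by docID as inverted lists normally are.
print(daat_or({"a": [1, 5, 8], "b": [2, 5, 9]}))  # [1, 2, 5, 8, 9]
```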

Query Operators

  • AND
  • OR
  • NEAR/n
    The two items are at most n positions apart (one position usually corresponds to one word), and they must appear in the same order as in the query.
    query: search NEAR/n engine
    content: "search engine" (true)
  • WINDOW/n
    Very similar to NEAR, except the two words may appear in either order.
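The two operators can be sketched over the position lists of a single document (exact operator semantics vary by system; this assumes the "at most n word positions apart" reading above):

```python
# NEAR/n vs. WINDOW/n over positional inverted lists, within one document.

def near(pos_a: list[int], pos_b: list[int], n: int) -> bool:
    """True if some occurrence of b follows some occurrence of a by 1..n positions."""
    return any(0 < pb - pa <= n for pa in pos_a for pb in pos_b)

def window(pos_a: list[int], pos_b: list[int], n: int) -> bool:
    """Like NEAR/n, but the two terms may appear in either order."""
    return any(0 < abs(pb - pa) <= n for pa in pos_a for pb in pos_b)

# "search" at position 10, "engine" at position 11: adjacent words.
print(near([10], [11], 1))    # True: engine directly follows search
print(near([11], [10], 1))    # False: order reversed, NEAR rejects it
print(window([11], [10], 1))  # True: WINDOW ignores order
```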

Reading Notes

Textbook: Information Retrieval

CH 1 Boolean Retrieval

This chapter first introduces some key terms of information retrieval systems, such as document, dictionary, posting, and inverted index. It then defines Boolean retrieval and gives some examples of how it works.

I am always impressed by the efficiency of search engines; it feels like magic to get the information I need in such a short time. After reading chapter 1, I think the process of building inverted indexes is somewhat similar to MapReduce. It reminds me of a word-counting program.

Boolean retrieval was used in early IR systems; it is simple, yet sufficient for many situations. One of its shortcomings is that it lacks relevance information: documents are returned in unsorted order no matter how many times the query words appear. I also got to know the algorithm used to intersect the different parts of one query, which matters a lot for search efficiency. Compared to evaluating arbitrary Boolean queries, it is more efficient to intersect each new term's postings with the running result of the previous intersections.
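The intersection routine the chapter describes relies on postings being sorted by docID, so two cursors suffice:

```python
# Linear-merge intersection of two sorted docID lists, as in IIR chapter 1.
# Intersecting the shortest list first keeps intermediate results small.

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted docID lists in O(len(p1) + len(p2)) time."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1             # advance the cursor pointing at the smaller docID
        else:
            j += 1
    return out

print(intersect([1, 2, 5, 8], [2, 3, 5, 9]))  # [2, 5]
```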

Ch 5.1 Statistical properties of terms in information retrieval

This chapter discusses several statistical rules that apply to IR systems.

  1. The rule of 30 states that the 30 most common words account for about 30 percent of the tokens in written text.
  2. Heaps' law is a formula for quickly estimating the vocabulary size of a large collection.
  3. Zipf's law reflects the distribution of word frequencies in a collection.

Although these laws are not a hundred percent precise, they are good enough to serve as an idealized model for everyday use. Also, we may lose some information when compressing the dictionary; that is called lossy compression.

Ch 2.4 Positional Indexes

We learned about inverted indexes in the previous chapters, and this section raises the problem of phrase queries: searching is difficult when the query contains phrases. It then introduces three approaches.
A biword index stores vocabulary entries as pairs of consecutive words rather than single words. It can handle a phrase of any length by ANDing its component biwords, but it is not commonly applied because of its huge space requirement. Although the extended biword scheme pairs nouns with the function words between them to save some space, the index is still just too large.
Another solution is the positional index, which is quite similar to an inverted index except that for every item it stores not only the document IDs but also the positions where the item appears in each document. Based on that, we can first intersect the posting lists of two items and then check whether their positions satisfy the query. It takes much less space, but the time cost is much higher.
Finally there is a combination of the two solutions: a positional index supplemented with biword entries for some commonly queried phrases. It is interesting to decide which biwords should be included. Some phrases are queried a lot, yet adding them as biwords barely speeds up the query, because either item of the phrase on its own already retrieves valid results, just like the textbook's example "Britney Spears". So I assume it would be better to add biword entries for those phrases whose individual items are common words on their own.