# CMU 11642 Search Engine - 搜索总览

## Big Picture

CMU 11642 Search Engine - 搜索总览

## 知识点

### Heap’s Law

• $V = KN^b$
• V 不同单词数
• K 系数。通常 $10 <= K <= 100$
• b 系数。英语 $0.4 <= b <= 0.6$
• N 单词总个数。

### Zipf’s Law

• $P(t)=\frac{ctf_t}{N}$
• $ctf_t$ collection term frequency t 在文本中出现的频率
• N 文本中的总单词数
• 也可以简化成 $P(t_R) = \frac{A}{R}$
• A 为系数，英语通常为 0.1
• R 为单词频数的排名（由高到低）（rank）
• 也就是说频数最高的单词占了总单词量的 10% 作业

### Information needs and queries

query 就是 “吸尘器 AND 不贵”

### Inverted List

• Binary Inverted List
List 中只包括出现了 item 的 docID ，只能使用 AND ， OR ， AND-NOT ，FIELD 等 Boolean 运算符。
• Frequency Inverted List
List 中含出现 item 的 docID 及其出现频数。
• Positional Inverted List
List 中有 item 的 docID 及每一个 doc 中的位置。可使用 NEAR/n ， SENTENCE/n ， PASSAGE/n ， WINDOW/n 。

### TAAT(Term-at-a-Time)

• Query: (a AND b) OR c
1. 读取 a 的 inverted list
2. 读取 b 的 inverted list
3. AND 操作: 取 a 和 b 的交集
4. 读取 c 的 inverted list
5. OR 操作: 将第三步与第四步结果合并得到最后结果

TAAT 优点：易理解，易实现，多余的操作很少
TAAT 缺点：因为要把所有的临时 inverted list 存在内存，处理过程的内存使用很不好控制，容易溢出，所以一般都用于较小的系统。

### Query Operators

• AND
• OR
• NEAR / N
两个 item 间相隔小于等于 N 个长度（一个长度通常代表一个单词）。顺序必须一样。
query：search NEAR/N engine
content: “search engine” (true)
• WINDOW / N
与 NEAR 十分相似，只是两个单词间顺序可以随意。

## 阅读笔记

### CH 1 Boolean Retrieval

This chapter first introduces some key terms of the Information Retrieval system, such as Document, Dictionary, posting, inverted indexes and so on. Then it also introduces the definition of boolean retrieval and some examples of how it works.

I am always impressive of high efficiency of the search engine, it is really magic work to get information I need in such a short time. After reading chapter 1, I think the process of building inverted indexes is kind of similar to MapReduce. It reminds me of word counting program.

Boolean Retrieval is applied in the early IR systems and it is simple while is enough for many situations. One of the shortcomings is that it lacks the information of relevance. Any document would return in unsorted order no matter how many times the words appear. Also, I get to know the algorithm that used to insect the different parts in one query, it matters a lot with the searching efficiency. Compared to arbitrary boolean queries, it is more efficient to intersect the following query with current result which has been intersected with previous queries.

### Ch 5.1 Statistical properties of terms in information retrieval

This chapter discusses about several rules when applying to the IR system.

1. Rule of 30 tells a fact that around 30 common used words takes up 30 percent of the whole content.
2. Heap’s law is a theorem for quickly estimating the size of the big collection.
3. Zipf’s law reflects the word distribution in the document.

Those laws are not a hundred percent precise though, they are enough to serve as a ideal model for our daily use. And sometimes we may lose some information when compressing the dictionary, that is called lossy compression.

### Ch 2.4 Propositional Indexes

We learned about inverted indexes in the previous chapters, and this section raises a question about the phrase queries, which implies that it is difficult to do the search when there are phrases included in the query. Then it introduces three approaches.
Biword index means that the dictionary stores the vocabularies in the form of two consecutive words rather than a single word. It helps to search any length of phrase by the AND operation of different biwords. But it is not common applied because the huge space requirezation. Although biword index separates tokens with nouns and functional words to decreases some space, it is still just too large.
There is another solution called positional indexes, which is quite similar to inverted indexes except that for every single item, it stores not only the documents’ ID but also the positions the item appears in that dococument. Based on that, we can first intersect the two items and then find qualified related positions as given in the query. It takes much less space, but the time cost is unaffordable.
Then here we have the combination idea of the above two solutions: Position indexes for some of the common queried biwords. It is very interesting to decide which biwords should be positioned. Some phrases are queried a lot, but they cannot boost the query even if they are positioned since either of item in the phrases alway reflect a valid result, just like the example given in the textbook as “Britney Spears”. So I assume that it would be better to position indexes those phrases which contains simple items.