1. Factors that influence the Lucene score

  • Document boost: a boost value assigned to a document at index time
  • Field boost: a boost value assigned to a field at query time (see the sketch after this list)
  • Coordination factor (coord): a factor based on how many of the query's terms a document contains; the more query terms a document matches, the higher its score
  • Inverse document frequency (idf): a term-based factor that tells the scoring formula how rare a term is; the higher the idf, the rarer the term. The scoring formula uses it to give extra weight to documents that contain rare terms
  • Length norm: a per-field normalization factor based on the number of terms in the field (computed at index time and stored in the index). The more terms a field contains, the lower this factor, which means the Lucene scoring formula "prefers" fields with fewer terms
  • Term frequency (tf): a term-based factor equal to the number of times a term appears in a document; the higher the frequency, the higher the score
  • Query norm: a query-based normalization factor, equal to the sum of the squared weights of the query terms. The query norm makes scores from different queries comparable
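As an illustration of query-time field boost, here is a minimal sketch (hypothetical index and field names) that counts matches on title twice as heavily as matches on content:

# Query-time boost: "title" matches weigh twice as much as "content" matches
GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":   { "query": "elasticsearch", "boost": 2.0 } } },
        { "match": { "content": { "query": "elasticsearch" } } }
      ]
    }
  }
}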

2. Elasticsearch's default scoring formula

Before version 5.0, Elasticsearch's default scoring formula was TF-IDF; see the official documentation: https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

In its textbook form, the TF-IDF weight of a term t in a document d is the product of its term frequency and its inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where N is the total number of documents and df(t) is the number of documents containing t.

The scoring formula Lucene actually used is the practical scoring function described in the documentation above:

$$\text{score}(q, d) = \text{coord}(q, d) \cdot \text{queryNorm}(q) \cdot \sum_{t \in q} \Big( \text{tf}(t \in d) \cdot \text{idf}(t)^2 \cdot t.\text{getBoost}() \cdot \text{norm}(t, d) \Big)$$

Each summand in the sum is the product of the following factors: term frequency, (squared) inverse document frequency, term boost, and length norm. See the official documentation for the details of each factor.

Since version 5.0, Elasticsearch's default scoring formula has been Okapi BM25:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

In this formula:

  • q_i is the i-th query term, and f(q_i, D) is its frequency in document D
  • IDF(q_i) = ln(1 + (N - n(q_i) + 0.5) / (n(q_i) + 0.5)), where N is the total number of documents and n(q_i) is the number of documents containing q_i
  • |D| is the length of the field (in terms), and avgdl is the average field length across documents
  • k_1 and b are tuning parameters; Elasticsearch's defaults are k_1 = 1.2 and b = 0.75

(Lucene 8, used by Elasticsearch 7.x, drops the constant (k_1 + 1) factor from the numerator, as the _explain output below shows; since it scales every score equally, the ranking is unchanged.)

For a comparison of the two, see the article 文本相似度:TF-IDF与BM25 (Text Similarity: TF-IDF and BM25).

From the formula we can derive the following scoring rules:

  • The rarer a matched term is, the higher the document scores: rare terms are valued
  • The shorter the matched field (the fewer terms it contains), the higher the document scores: short documents are valued
  • The higher the boost (index-time or query-time), the higher the document scores: boosted content ranks higher
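If you want to experiment with these parameters, k_1 and b can be overridden per field through a custom similarity in the index settings. A minimal sketch, assuming a hypothetical index my_bm25_index; the values shown are simply the defaults:

# Define a custom BM25 similarity and apply it to a text field
PUT /my_bm25_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}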

3. Viewing document scores in ES

# Create a test index with one shard and no replicas
PUT /score
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1
  }
}
# Insert a document
POST /score/_doc/1
{
  "name": "zhhades yuanbo"
}
# Query
GET /score/_search
{
  "query": {
    "match": {
      "name": "yuanbo"
    }
  }
}
# View the score explanation
GET /score/_doc/1/_explain
{
  "query": {
    "match": {
      "name": "yuanbo"
    }
  }
}
# Insert another document
POST /score/_doc/2
{
  "name": "aulang lwa yuanbo"
}
# Query
GET /score/_search
{
  "query": {
    "match": {
      "name": "yuanbo"
    }
  }
}
# View the score explanation
GET /score/_doc/2/_explain
{
  "query": {
    "match": {
      "name": "yuanbo"
    }
  }
}
# Insert a third document
POST /score/_doc/3
{
  "name": "yuanbo"
}
# View the scoring breakdown
GET /score/_doc/3/_explain
{
  "query": {
    "match": {
      "name": "yuanbo"
    }
  }
}

GET /score/_search
{
  "query": {
    "match": {
      "name": "yuanbo"
    }
  }
}
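For reference, here are the _explain responses for documents 1 and 2. Note that the response for document 1 was captured while it was still the only document in the index (N = n = 1), which is why its idf is higher than document 2's (N = n = 3):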
{
  "_index": "score",
  "_type": "_doc",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 0.2876821,
    "description": "weight(name:yuanbo in 0) [PerFieldSimilarity], result of:",
    "details": [{
      "value": 0.2876821,
      "description": "score(freq=1.0), product of:",
      "details": [{
        "value": 2.2,
        "description": "boost",
        "details": []
      }, {
        "value": 0.2876821,
        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
        "details": [{
          "value": 1,
          "description": "n, number of documents containing term",
          "details": []
        }, {
          "value": 1,
          "description": "N, total number of documents with field",
          "details": []
        }]
      }, {
        "value": 0.45454544,
        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
        "details": [{
          "value": 1.0,
          "description": "freq, occurrences of term within document",
          "details": []
        }, {
          "value": 1.2,
          "description": "k1, term saturation parameter",
          "details": []
        }, {
          "value": 0.75,
          "description": "b, length normalization parameter",
          "details": []
        }, {
          "value": 2.0,
          "description": "dl, length of field",
          "details": []
        }, {
          "value": 2.0,
          "description": "avgdl, average length of field",
          "details": []
        }]
      }]
    }]
  }
}
{
  "_index": "score",
  "_type": "_doc",
  "_id": "2",
  "matched": true,
  "explanation": {
    "value": 0.11955717,
    "description": "weight(name:yuanbo in 0) [PerFieldSimilarity], result of:",
    "details": [{
      "value": 0.11955717,
      "description": "score(freq=1.0), product of:",
      "details": [{
        "value": 2.2,
        "description": "boost",
        "details": []
      }, {
        "value": 0.13353139,
        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
        "details": [{
          "value": 3,
          "description": "n, number of documents containing term",
          "details": []
        }, {
          "value": 3,
          "description": "N, total number of documents with field",
          "details": []
        }]
      }, {
        "value": 0.40697673,
        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
        "details": [{
          "value": 1.0,
          "description": "freq, occurrences of term within document",
          "details": []
        }, {
          "value": 1.2,
          "description": "k1, term saturation parameter",
          "details": []
        }, {
          "value": 0.75,
          "description": "b, length normalization parameter",
          "details": []
        }, {
          "value": 3.0,
          "description": "dl, length of field",
          "details": []
        }, {
          "value": 2.3333333,
          "description": "avgdl, average length of field",
          "details": []
        }]
      }]
    }]
  }
}
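As a sanity check, document 2's score can be reproduced by hand from the factors in its _explain output (boost 2.2, N = n = 3, freq = 1.0, k1 = 1.2, b = 0.75, dl = 3.0, avgdl ≈ 2.3333):

$$
\begin{aligned}
\text{idf} &= \ln\!\left(1 + \frac{3 - 3 + 0.5}{3 + 0.5}\right) = \ln(1.142857) \approx 0.133531 \\
\text{tf} &= \frac{1.0}{1.0 + 1.2 \cdot \left(1 - 0.75 + 0.75 \cdot \frac{3.0}{2.3333333}\right)} \approx 0.406977 \\
\text{score} &= \text{boost} \times \text{idf} \times \text{tf} = 2.2 \times 0.133531 \times 0.406977 \approx 0.119557
\end{aligned}
$$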

4. Search templates

  Elasticsearch uses the Mustache template engine to render search templates into executable query statements. Below are two demos of using templates.

# Store a search template (Mustache)
POST _scripts/caseNumber
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "term": {
          "caseNumber": {
            "value": "{{caseNumber}}"
          }
        }
      }
    }
  }
}
GET _search/template
{
  "id": "caseNumber",
  "params": {
    "caseNumber": "200715103252"
  }
}
# Template with toJson conversion
POST _scripts/caseNumber_tojson
{
  "script": {
    "lang": "mustache",
    "source": "{\"query\": { \"terms\": {{#toJson}}numberes{{/toJson}}}}"
  }
}
GET _render/template
{
  "id": "caseNumber_tojson",
  "params": {
    "numberes": {
      "caseNumber": ["200715103252", "222"]
    }
  }
}
GET _scripts/caseNumber_tojson
GET _search/template
{
  "id": "caseNumber_tojson",
  "params": {
    "numberes": {
      "caseNumber": ["200715103252", "222"]
    }
  }
}
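As a follow-up sketch, a template can also be passed inline via "source" instead of being stored under an id, which is handy while iterating on it (this reuses the caseNumber field and params from the stored-template demo above):

# Inline (non-stored) search template
GET _search/template
{
  "source": "{\"query\": {\"term\": {\"caseNumber\": {\"value\": \"{{caseNumber}}\"}}}}",
  "params": {
    "caseNumber": "200715103252"
  }
}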

5. Query rescoring

Rescoring recomputes the scores of a limited number of the documents returned by a query: ES takes the top N hits of the original query (N = window_size) and recalculates their scores with a predefined rescoring method. Let's introduce rescoring with the simplest possible example.

# Rescore the top 2 hits of the match_all query with a script score based on author_id
GET /blog/_search
{
  "query": {
    "match_all": {}
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "doc['author_id'].value/2"
            }
          }
        }
      }
    },
    "window_size": 2
  }
}
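As a follow-up sketch on the same example, the rescore score does not have to replace the original score outright; query_weight, rescore_query_weight, and score_mode control how the two are combined:

GET /blog/_search
{
  "query": {
    "match_all": {}
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "doc['author_id'].value/2"
            }
          }
        }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 1.2,
      "score_mode": "total"
    },
    "window_size": 2
  }
}

With score_mode set to total (the default), each of the top window_size documents ends up with a final score of 0.7 × original score + 1.2 × rescore score.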