Construction of a retrieval dataset of procuratorate legal documents

doi:10.6040/j.issn.1671-9352.1.2019.048

Abstract

Smart procuratorial work is an important step in the further development of procuratorial informatization; its implementation and adoption can improve the quality and efficiency of procuratorate work. In practice, prosecutors handle a large number of procuratorate legal documents in their daily workflow, and failing to organize and use the information in these documents effectively reduces their efficiency. Information retrieval technology addresses exactly this problem. In the legal field, however, the lack of Chinese retrieval datasets has to some extent restricted the development of legal information retrieval. Against this background, and based on the characteristics of procuratorate legal documents, this paper proposes a method for constructing a retrieval dataset of procuratorate legal documents and builds a small Chinese dataset that can be used for information retrieval research in the legal field. Experimental analysis verifies the performance of the dataset under different retrieval models.

Key words: smart procuratorial work, information retrieval, procuratorate legal documents, retrieval dataset

CLC number: TP391.1

WANG Jia-qi, YANG Mu-yun, ZHAO Tie-jun, ZHAO Zhen-yu. Construction of retrieval dataset of procuratorate legal documents[J]. Journal of Shandong University (Natural Science), 2020, 55(7): 81-87.





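Dataset	Queries (count / avg. characters)	Avg. most-relevant answers per query	Avg. second-tier relevant answers per query (same law articles)	Avg. second-tier relevant answers per query (same charge)	Documents to be searched (count / avg. characters)
Training set	869/164	4.5	0.05	0.85	13 678/791
Test set	2 345/189	14.0	0.14	4.28	47 424/675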

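Strictness	Composition of the ground-truth answers
Strict	Both the charge and the law articles match
LooseCharge	Law articles match; the charge may differ
LooseLaw	Charge matches; the law articles may differ
All	A document meeting any of the above conditions is a ground-truth answer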
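
The four strictness levels above can be read as predicates over a query document and a candidate document. The following is a minimal sketch, assuming each document carries one charge label and a set of cited law articles, and assuming that "law articles match" means the two sets are equal. The `Doc` type, its field names, and the sample values are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical record: one charge label plus the set of cited law articles."""
    charge: str
    articles: frozenset


def is_relevant(query: Doc, candidate: Doc, strictness: str) -> bool:
    """Return True if candidate counts as a ground-truth answer for query."""
    same_charge = query.charge == candidate.charge
    # Assumption: "law articles match" means the two article sets are equal.
    same_articles = query.articles == candidate.articles
    if strictness == "Strict":
        return same_charge and same_articles
    if strictness == "LooseCharge":  # law articles match, charge may differ
        return same_articles
    if strictness == "LooseLaw":  # charge matches, law articles may differ
        return same_charge
    if strictness == "All":  # any of the conditions above
        return same_charge or same_articles
    raise ValueError(f"unknown strictness: {strictness}")


# Illustrative values only: same charge, different article sets.
query = Doc("theft", frozenset({"264"}))
candidate = Doc("theft", frozenset({"264", "67"}))
levels = ["Strict", "LooseCharge", "LooseLaw", "All"]
judgments = {level: is_relevant(query, candidate, level) for level in levels}
```

Under these definitions every Strict answer is also an answer at the other three levels, so the answer set for a query can only grow when moving from Strict to All.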

Model	Strictness	MAP	P@5	P@10
VSM	Strict	0.044 8	0.032 8	0.026 8
	LooseCharge	0.044 7	0.033 8	0.027 6
	LooseLaw	0.050 6	0.043 1	0.037 0
	All	0.050 8	0.044 2	0.037 8
BM25	Strict	0.043 3	0.031 6	0.025 2
	LooseCharge	0.043 1	0.032 6	0.025 9
	LooseLaw	0.048 0	0.040 5	0.034 3
	All	0.048 0	0.041 5	0.035 0
BM25_PRF	Strict	0.048 1	0.033 6	0.025 4
	LooseCharge	0.048 0	0.034 8	0.026 3
	LooseLaw	0.054 1	0.04 7	0.035 3
	All	0.054 3	0.044 9	0.036 2
LM_dirichlet	Strict	0.048 5	0.036 3	0.030 3
	LooseCharge	0.043 8	0.037 6	0.031 2
	LooseLaw	0.051 3	0.044 3	0.038 3
	All	0.051 2	0.045 6	0.039 2
LM_dirichlet_PRF	Strict	0.052 4	0.038 8	0.032 3
	LooseCharge	0.052 4	0.040 2	0.033 3
	LooseLaw	0.056 9	0.048 1	0.041 1
	All	0.056 9	0.049 5	0.042 1
LM_twostage	Strict	0.048 5	0.036 3	0.030 3
	LooseCharge	0.048 3	0.037 6	0.031 2
	LooseLaw	0.051 3	0.044 3	0.038 3
	All	0.051 2	0.045 6	0.039 2
LM_twostage_PRF	Strict	0.046 5	0.036 2	0.029 2
	LooseCharge	0.046 3	0.037 5	0.030 2
	LooseLaw	0.048 3	0.043 5	0.036 6
	All	0.048 3	0.044 8	0.037 6
LM_jelinek-mercer	Strict	0.047 1	0.035 1	0.027 9
	LooseCharge	0.046 9	0.036 2	0.028 7
	LooseLaw	0.052 2	0.044 7	0.037 4
	All	0.052 2	0.045 9	0.038 2
LM_jelinek-mercer_PRF	Strict	0.048 7	0.035 0	0.028 2
	LooseCharge	0.048 6	0.036 1	0.029 1
	LooseLaw	0.054 6	0.045 1	0.038 3
	All	0.054 7	0.046 2	0.039 2
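
MAP, P@5, and P@10 are standard ranked-retrieval measures. As a reference for how such figures are typically computed, here is a minimal sketch. The function names and the toy ranking are illustrative assumptions; the evaluation tooling actually used for the table is not stated here.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two of the three relevant documents are retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d8"}
p_at_5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

In this sketch a relevant document that is never retrieved (`d8` here) still counts in the denominator of average precision, so a long list of ground-truth answers per query tends to pull MAP down when only a few of them are ranked highly.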