JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2020, Vol. 55, Issue (7): 81-87. doi: 10.6040/j.issn.1671-9352.1.2019.048


Construction of retrieval dataset of procuratorate legal documents

Jia-qi WANG, Mu-yun YANG*, Tie-jun ZHAO, Zhen-yu ZHAO

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, Heilongjiang, China
  • Received: 2019-10-12 Online: 2020-07-20 Published: 2020-07-08
  • Contact: Mu-yun YANG E-mail: 17862702130@163.com; yangmuyun@hit.edu.cn

Abstract:

Smart procuratorial work is an important step in the further informatization of procuratorates. Prosecutors routinely handle large numbers of procuratorate legal documents, and failing to organize and use the information in these documents effectively reduces the efficiency of the procuratorate. Information retrieval technology is well suited to this problem. In the legal field, however, the lack of Chinese retrieval datasets restricts the development of legal information retrieval. Against this background, and based on the characteristics of procuratorate legal documents, this paper proposes a method for constructing a retrieval dataset for such documents and builds a small Chinese dataset that can be used for information retrieval research in the legal field. Experiments with several retrieval models verify how the dataset performs in retrieval evaluation.

Key words: smart procuratorial, information retrieval, procuratorate legal documents, retrieval dataset

CLC Number: 

  • TP391.1

Fig. 1 Dataset construction process

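Table 1 Dataset description

| Dataset | Queries / avg. characters | Avg. most relevant answers per query | Avg. secondarily relevant answers per query (same law article) | Avg. secondarily relevant answers per query (same charge) | Documents in retrieval collection / avg. characters |
|---|---|---|---|---|---|
| Training set | 869/164 | 4.5 | 0.05 | 0.85 | 13 678/791 |
| Test set | 2 345/189 | 14.0 | 0.14 | 4.28 | 47 424/675 |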

Table 2 An example of a query

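被告  告人  驾驶  挂车  西超  超速  速行  行驶  十字  被害  害人  驾驶  驶电  电动  动自  自行  行车  南行  行驶  驶相  相撞  被害  害人  死亡  亡电  电动  动自  自行  行车  车损  损坏  道路  路交  交通  通事  事故  公安  安局  局交  交警  警大  大队  队认  认定  定被  被告  告人  负  事故  责任

Approximate English gloss of the underlying text: the defendant, driving a trailer truck westbound over the speed limit, collided at a crossroads with an electric bicycle ridden southbound by the victim; the victim died and the electric bicycle was damaged; the traffic police brigade of the public security bureau determined that the defendant bore responsibility for the road traffic accident.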
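The query in Table 2 appears to be represented as overlapping two-character tokens. The following is a minimal sketch of one way such tokens could be produced, assuming that bigrams are formed within each contiguous run of Chinese characters and that single characters are kept as they are. The function name `char_bigrams` is hypothetical, and the authors' actual preprocessing (for example, word segmentation or stop-word removal) is not described here and may differ.

```python
import re


def char_bigrams(text):
    """Split text into overlapping character bigrams.

    Assumption: bigrams never cross whitespace or punctuation, and a
    run consisting of a single character is emitted unchanged.
    """
    tokens = []
    # Each match is a contiguous run of CJK unified ideographs.
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


if __name__ == "__main__":
    # Reproduces the last five tokens of the query in Table 2.
    print(char_bigrams("被告人 负 事故 责任"))
    # ['被告', '告人', '负', '事故', '责任']
```

Character bigrams avoid the need for a word segmenter, at the cost of producing tokens such as 告人 that are not words on their own.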

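Table 3 Strictness

| Strictness | Composition of gold-standard answers |
|---|---|
| Strict | Both the charge and the law articles match |
| LooseCharge | The law articles match; the charge may be anything |
| LooseLaw | The charge matches; the law articles may be anything |
| All | A document meeting any of the above conditions is a gold-standard answer |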
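The four strictness levels in Table 3 can be read as a relevance predicate over a query and a candidate document. The sketch below is illustrative only: the dictionary keys `charge` and `law` and the function name `is_relevant` are hypothetical, and it assumes each document carries a single charge label and a single, directly comparable law-article label, which the source does not specify.

```python
def is_relevant(query, doc, strictness):
    """Return True if doc counts as a gold-standard answer for query.

    query and doc are dicts with hypothetical keys "charge" and "law".
    """
    same_charge = query["charge"] == doc["charge"]
    same_law = query["law"] == doc["law"]

    if strictness == "Strict":
        # Both the charge and the law articles must match.
        return same_charge and same_law
    if strictness == "LooseCharge":
        # Law articles match; the charge may be anything.
        return same_law
    if strictness == "LooseLaw":
        # Charge matches; the law articles may be anything.
        return same_charge
    if strictness == "All":
        # Any of the three conditions above is enough.
        return same_charge or same_law
    raise ValueError(f"unknown strictness: {strictness}")


if __name__ == "__main__":
    query = {"charge": "charge_A", "law": "article_1"}
    doc = {"charge": "charge_B", "law": "article_1"}
    for level in ("Strict", "LooseCharge", "LooseLaw", "All"):
        print(level, is_relevant(query, doc, level))
```

Under this reading, the set of answers for All is the union of the sets for LooseCharge and LooseLaw, and Strict is their intersection.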

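Table 4 Performance of three retrieval models on the dataset under different strictness levels

| Model | Strictness | MAP | P@5 | P@10 |
|---|---|---|---|---|
| VSM | Strict | 0.044 8 | 0.032 8 | 0.026 8 |
| VSM | LooseCharge | 0.044 7 | 0.033 8 | 0.027 6 |
| VSM | LooseLaw | 0.050 6 | 0.043 1 | 0.037 0 |
| VSM | All | 0.050 8 | 0.044 2 | 0.037 8 |
| BM25 | Strict | 0.043 3 | 0.031 6 | 0.025 2 |
| BM25 | LooseCharge | 0.043 1 | 0.032 6 | 0.025 9 |
| BM25 | LooseLaw | 0.048 0 | 0.040 5 | 0.034 3 |
| BM25 | All | 0.048 0 | 0.041 5 | 0.035 0 |
| BM25_PRF | Strict | 0.048 1 | 0.033 6 | 0.025 4 |
| BM25_PRF | LooseCharge | 0.048 0 | 0.034 8 | 0.026 3 |
| BM25_PRF | LooseLaw | 0.054 1 | 0.04 7 | 0.035 3 |
| BM25_PRF | All | 0.054 3 | 0.044 9 | 0.036 2 |
| LM_dirichlet | Strict | 0.048 5 | 0.036 3 | 0.030 3 |
| LM_dirichlet | LooseCharge | 0.043 8 | 0.037 6 | 0.031 2 |
| LM_dirichlet | LooseLaw | 0.051 3 | 0.044 3 | 0.038 3 |
| LM_dirichlet | All | 0.051 2 | 0.045 6 | 0.039 2 |
| LM_dirichlet_PRF | Strict | 0.052 4 | 0.038 8 | 0.032 3 |
| LM_dirichlet_PRF | LooseCharge | 0.052 4 | 0.040 2 | 0.033 3 |
| LM_dirichlet_PRF | LooseLaw | 0.056 9 | 0.048 1 | 0.041 1 |
| LM_dirichlet_PRF | All | 0.056 9 | 0.049 5 | 0.042 1 |
| LM_twostage | Strict | 0.048 5 | 0.036 3 | 0.030 3 |
| LM_twostage | LooseCharge | 0.048 3 | 0.037 6 | 0.031 2 |
| LM_twostage | LooseLaw | 0.051 3 | 0.044 3 | 0.038 3 |
| LM_twostage | All | 0.051 2 | 0.045 6 | 0.039 2 |
| LM_twostage_PRF | Strict | 0.046 5 | 0.036 2 | 0.029 2 |
| LM_twostage_PRF | LooseCharge | 0.046 3 | 0.037 5 | 0.030 2 |
| LM_twostage_PRF | LooseLaw | 0.048 3 | 0.043 5 | 0.036 6 |
| LM_twostage_PRF | All | 0.048 3 | 0.044 8 | 0.037 6 |
| LM_jelinek-mercer | Strict | 0.047 1 | 0.035 1 | 0.027 9 |
| LM_jelinek-mercer | LooseCharge | 0.046 9 | 0.036 2 | 0.028 7 |
| LM_jelinek-mercer | LooseLaw | 0.052 2 | 0.044 7 | 0.037 4 |
| LM_jelinek-mercer | All | 0.052 2 | 0.045 9 | 0.038 2 |
| LM_jelinek-mercer_PRF | Strict | 0.048 7 | 0.035 0 | 0.028 2 |
| LM_jelinek-mercer_PRF | LooseCharge | 0.048 6 | 0.036 1 | 0.029 1 |
| LM_jelinek-mercer_PRF | LooseLaw | 0.054 6 | 0.045 1 | 0.038 3 |
| LM_jelinek-mercer_PRF | All | 0.054 7 | 0.046 2 | 0.039 2 |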
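Table 4 reports MAP, P@5 and P@10. As a reminder of what these measures compute, here is a minimal sketch using their standard textbook definitions. The function names and the toy document identifiers are hypothetical, and the evaluation tooling actually used for Table 4 is not stated in the source.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant hit.

    Divides by the total number of relevant documents, so relevant
    documents that are never retrieved lower the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4", "d5"]  # toy ranking for one query
    relevant = {"d1", "d3", "d9"}            # toy gold-standard answers
    print(precision_at_k(ranked, relevant, 5))
    print(average_precision(ranked, relevant))
```

In this toy case two of the top five documents are relevant, so P@5 is 2/5, and the average precision is (1/1 + 2/3) / 3 because the third relevant document is never retrieved.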