山东大学学报(理学版) ›› 2014, Vol. 49 ›› Issue (12): 30-35.doi: 10.6040/j.issn.1671-9352.3.2014.230
毛福林, 瞿有利
MAO Fu-lin, QU You-li
摘要: 全文检索的效率依赖于数据结构-倒排索引,存储倒排索引需要较大的硬盘存储空间。提出了一种新的压缩算法,主要用于倒排索引中文档标识符的压缩。对于给定的文档集合使用信息检索工具Terrier,使用不同的压缩算法压缩倒排索引中的文档标识符,从而生成倒排索引文件,然后比较倒排索引文件的大小。实验结果表明,使用新的压缩算法能够节省倒排索引文件的存储空间。
中图分类号:
[1] KOBAYASHI M, TAKEDA K. Information retrieval on the web[J]. ACM Computing Surveys (CSUR), 2000, 32(2):144-173. [2] ANH V N, MOFFAT A. Improved word-aligned binary compression for text indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(6):857-861. [3] SHIEH W Y, CHUNG C P. A statistics-based approach to incrementally update inverted files[J]. Information Processing & Management, 2005, 41(2):275-288. [4] Wikipedia.Unary coding[EB/OL].[2014-03-05].http://en.wikipedia.org/wiki/Unary- coding. [5] SCHOLER F, WILLIAMS H E, YIANNIS J, et al. Compression of inverted indexes for fast query evaluation[C]//Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2002:222-229. [6] ELIAS P. Universal codeword sets and representations of the integers [J]. IEEE Transactions on Information Theory, 1975, 21(2):194-203. [7] RICE R F. Some practical universal noiseless coding techniques [D]. Pasadena: California Institute of Technology, 1979. [8] MOFFAT A, STUIVER L. Binary interpolative coding for effective index compression[J]. Information Retrieval, 2000, 3(1):25-47. [9] HEMAN S. Super-scalar database compression between RAM and CPU-cache [D]. Amsterdam: University of Amsterdam, 2005. [10] SOMASUNDARAM K, DOMNIC S. Extended golomb code for integer representation[J]. IEEE Transactions on Multimedia, 2007, 9(2):239-246. [11] Wikipedia.Truncated binary encoding[EB/OL].[2014-03-11]. http://en.wikipedia.org/wiki/Truncated-binary-encoding. [12] GLORY V, DOMNIC S. Re-ordered FEGC and block based FEGC for inverted file compression [J]. International Journal of Information Retrieval Research (IJIRR), 2013, 3(1):71-88. |
[1] | 宋省身,杨岳湘,江宇. 基于单指令级并行的快速求交算法[J]. 山东大学学报(理学版), 2018, 53(3): 54-62. |
|