您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(理学版)》

山东大学学报(理学版) ›› 2014, Vol. 49 ›› Issue (12): 30-35.doi: 10.6040/j.issn.1671-9352.3.2014.230

• 论文 • 上一篇    下一篇

一种变长编码压缩倒排索引算法

毛福林, 瞿有利   

  1. 北京交通大学计算机与信息技术学院, 北京 100044
  • 收稿日期:2014-08-28 修回日期:2014-10-27 出版日期:2014-12-20 发布日期:2014-12-20
  • 作者简介:毛福林(1990- ),男,硕士研究生,研究方向为信息检索。E-mail: fulinmao2012@gmail.com
  • 基金资助:
    中央高校基本科研业务费专项资金项目(2011JBM231)

An variable length code algorithm compression inverted index

MAO Fu-lin, QU You-li   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received:2014-08-28 Revised:2014-10-27 Online:2014-12-20 Published:2014-12-20

摘要: 全文检索的效率依赖于数据结构-倒排索引,存储倒排索引需要较大的硬盘存储空间。提出了一种新的压缩算法,主要用于倒排索引中文档标识符的压缩。对于给定的文档集合使用信息检索工具Terrier,使用不同的压缩算法压缩倒排索引中的文档标识符,从而生成倒排索引文件,然后比较倒排索引文件的大小。实验结果表明,使用新的压缩算法能够节省倒排索引文件的存储空间。

关键词: 倒排索引, 索引压缩, 整数压缩

Abstract: The efficiency of text search engines relies on data structure: inverted index. And the more large space is need to storage the inverted index. A new compression algorithm was proposed. For the given document collections. Terrier, the information retrival tool, was used to build inverted index, and the state-of-the-art compression techniques was used to compress inverted file. Then the compress ratio was confirmed by comparing the file size. Experiments show that the new compression techniques can get much better compress ratio.

Key words: integer compression, inverted index, index compression

中图分类号: 

  • TP391
[1] KOBAYASHI M, TAKEDA K. Information retrieval on the web[J]. ACM Computing Surveys (CSUR), 2000, 32(2):144-173.
[2] ANH V N, MOFFAT A. Improved word-aligned binary compression for text indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(6):857-861.
[3] SHIEH W Y, CHUNG C P. A statistics-based approach to incrementally update inverted files[J]. Information Processing & Management, 2005, 41(2):275-288.
[4] Wikipedia.Unary coding[EB/OL].[2014-03-05].http://en.wikipedia.org/wiki/Unary- coding.
[5] SCHOLER F, WILLIAMS H E, YIANNIS J, et al. Compression of inverted indexes for fast query evaluation[C]//Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2002:222-229.
[6] ELIAS P. Universal codeword sets and representations of the integers [J]. IEEE Transactions on Information Theory, 1975, 21(2):194-203.
[7] RICE R F. Some practical universal noiseless coding techniques [D]. Pasadena: California Institute of Technology, 1979.
[8] MOFFAT A, STUIVER L. Binary interpolative coding for effective index compression[J]. Information Retrieval, 2000, 3(1):25-47.
[9] HEMAN S. Super-scalar database compression between RAM and CPU-cache [D]. Amsterdam: University of Amsterdam, 2005.
[10] SOMASUNDARAM K, DOMNIC S. Extended golomb code for integer representation[J]. IEEE Transactions on Multimedia, 2007, 9(2):239-246.
[11] Wikipedia.Truncated binary encoding[EB/OL].[2014-03-11]. http://en.wikipedia.org/wiki/Truncated-binary-encoding.
[12] GLORY V, DOMNIC S. Re-ordered FEGC and block based FEGC for inverted file compression [J]. International Journal of Information Retrieval Research (IJIRR), 2013, 3(1):71-88.
[1] 宋省身,杨岳湘,江宇. 基于单指令级并行的快速求交算法[J]. 山东大学学报(理学版), 2018, 53(3): 54-62.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!