JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2024, Vol. 59, Issue (3): 95-106. doi: 10.6040/j.issn.1671-9352.7.2023.2681


Identification and statistical analysis methods of personal information disclosure in open government data

Haisu CHEN, Jiachun LIAO*, Sicheng YAO

  1. Research Center of Big Data Technology, Nanhu Laboratory, Jiaxing 314002, Zhejiang, China
  • Received: 2023-04-29 Online: 2024-03-20 Published: 2024-03-06
  • Contact: Jiachun LIAO E-mail: hschen@nanhulab.ac.cn; jliao@nanhulab.ac.cn

Abstract:

To promote the protection of personal information when government data is opened, this paper conducts an in-depth analysis of the current state of personal information disclosure in open government data. First, datasets are obtained from the relevant platforms and pre-processed, and those containing personal information are classified based on features such as field names and table names. Then, sensitive information identification methods are applied to identify and extract the various types of personal information in the data, and the extracted information is mapped back to individuals in order to count the total number of individuals involved and to detect their associated data. Data visualizations are then used to examine the current state of personal information disclosure. Although some open government data platforms may have implemented measures such as data categorization and de-identification, the published open datasets still contain a large amount of personal information. Data categorization and classification, sensitive information identification, and data desensitization therefore need to be carried out in a more standardized and accurate manner.

Key words: big data privacy, personal information, open government data, information identification, statistical analysis

CLC Number: 

  • TP391.1

Fig.1

Application framework

Fig.2

Data size and number of fields of the datasets from the test platform

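Table 1

Sample records from the Marital Conflict dataset (137 records in total)

Main request | Mediation and handling
Liu* married her husband Chen** in 2010 and they have two children…… | Contacted the ** police station to inquire about the situation and asked the station to help issue a domestic violence warning letter……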

Table 2

Sample records from the Elders Information dataset (4 619 records in total)

Door sensor ID | Elder's ID document number | Elder's name
8632170****4067 | ******1939******47 | Xu**

Table 3

Baseline identification algorithms

Columns: identification type, target information, identification method

Identification type: uniformly coded information
ID card number, regular expression:
r'^[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$'
Mobile phone number, regular expression:
r'^((\+?[0-9]{1,4})|(\(\+86\)))?(13[0-9]|14[57]|15[012356789]|17[03678]|18[0-9])\d{8}$'
License plate number, regular expression:
r'^[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼使领][A-HJ-NP-Z](?:((\d{5}[A-HJK])|([A-HJK][A-HJ-NP-Z0-9][0-9]{4}))|[A-HJ-NP-Z0-9]{4}[A-HJ-NP-Z0-9挂学警港澳])$'
Bank card number, regular expression:
r'(?<![0-9a-zA-Z\-])[1-9](?:\d{11,18})(?![0-9a-zA-Z\-])'

Identification type: name information
Name: named entity recognition with the LAC tool
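One visible difference between the ID card patterns in Table 3 and Table 4 is that the baseline pattern is anchored with `^` and `$`, while the developed pattern is not. The following minimal sketch illustrates the practical effect of that anchoring. It is not the authors' implementation: the two patterns are transcribed from the tables with plain quotes and without stray spaces, and the input strings (an illustrative sample number and a made-up sentence) are assumptions.

```python
import re

# ID card patterns transcribed from Table 3 (anchored) and Table 4 (unanchored).
BASELINE_ID = re.compile(
    r'^[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))'
    r'(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$'
)
DEVELOPED_ID = re.compile(
    r'[1-9]\d{5}(?:18|19|(?:[23]\d))\d{2}(?:(?:0[1-9])|(?:10|11|12))'
    r'(?:(?:[0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]'
)

# Hypothetical inputs: a field that holds only the number, and a free-text
# field in which the same number is embedded in a sentence.
field_value = "11010519491231002X"
free_text = "Resident ID 11010519491231002X was registered."

# The anchored pattern only matches when the whole value is the number.
baseline_on_field = BASELINE_ID.search(field_value) is not None
baseline_on_text = BASELINE_ID.search(free_text) is not None

# The unanchored pattern can extract the number from surrounding text.
developed_on_text = DEVELOPED_ID.findall(free_text)

print(baseline_on_field, baseline_on_text, developed_on_text)
```

With these inputs, the anchored pattern matches the dedicated field but not the free-text sentence, whereas the unanchored pattern extracts the number from the sentence. Whether this accounts for any of the differences reported in Tables 5 and 6 is not stated in the source.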

Table 4

Identification algorithms developed in this paper

Columns: identification type, target information, identification method

Identification type: uniformly coded information
ID card number, regular expression:
r'[1-9]\d{5}(?:18|19|(?:[23]\d))\d{2}(?:(?:0[1-9])|(?:10|11|12))(?:(?:[0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]'
ID card validation: modulo-11 checksum verification, plus region code and date validity checks
Mobile phone number, regular expression:
r'(?<![0-9a-zA-Z\-])(?:\+?86)?1(?:(?:34[0-8])|(?:8\d{2})|(?:(?:[35][0-35-9]|4[14-9]|6[567]|7[0-8]|9[12389])\d))\d{7}(?![0-9a-zA-Z\-])'
License plate number, regular expression:
r'(?<![锅容管瓶梯起索游车]\d{2}(?=[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁台琼使领军北南成广沈济空海]{1}[A-Z]{1}[A-Z0-9]{4}(?:[A-Z0-9挂领学警港澳]{1}|[A-Z0-9]{2}\(\d{2}\))))[京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁台琼使领军北南成广沈济空海]{1}[A-Z]{1}[A-Z0-9]{4}(?:[A-Z0-9挂领学警港澳]{1}|[A-Z0-9]{2})(?!\d)'
Bank card number, regular expression:
r'(?<![0-9a-zA-Z\-])[1-9](?:\d{11,18})(?![0-9a-zA-Z\-])'
Bank card validation: Luhn check and bank card number prefix matching

Identification type: name information
Name: named entity recognition with the HanLP tool
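Table 4 lists a modulo-11 checksum and a date validity check for ID card numbers, and a Luhn check for bank card numbers, as validation steps applied after the regular expressions. The sketch below shows one way these checks could look. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the weights and check characters are those of the 18-digit resident ID number scheme (GB 11643-1999), and the region code lookup and bank card prefix matching are omitted because they need reference tables that the source does not reproduce.

```python
import re
from datetime import datetime

# Position weights and check characters of the 18-digit resident ID number.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_checksum_ok(id_number: str) -> bool:
    """Modulo-11 check: the 18th character must match the weighted sum."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", id_number):
        return False
    total = sum(int(d) * w for d, w in zip(id_number[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == id_number[17].upper()


def id_birthdate_ok(id_number: str) -> bool:
    """Date validity check: characters 7 to 14 must form a real calendar date."""
    try:
        datetime.strptime(id_number[6:14], "%Y%m%d")
    except ValueError:
        return False
    return True


def luhn_ok(card_number: str) -> bool:
    """Luhn check: double every second digit from the right and sum."""
    if not re.fullmatch(r"[0-9]+", card_number):
        return False
    total = 0
    for i, ch in enumerate(reversed(card_number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical candidates, e.g. strings already matched by the Table 4 patterns.
print(id_checksum_ok("11010519491231002X"), id_birthdate_ok("11010519491231002X"))
print(luhn_ok("4111111111111111"))
```

Checks of this kind reject digit strings that merely have the right shape, such as a serial number that happens to be 18 digits long, which is the sort of false detection a regular expression alone cannot exclude.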

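Table 5

Identification results for the Marital Conflict dataset

Algorithm | Uniformly coded information (missed/false detections) | Names (missed/false detections)
Baseline identification algorithms | 0/4 | 10/20
Identification algorithms of this paper | 0/0 | 1/4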

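Table 6

Identification results for the Elders Information dataset

Algorithm | Uniformly coded information (missed/false detections) | Names (missed/false detections)
Baseline identification algorithms | 0/11 | 184/0
Identification algorithms of this paper | 0/0 | 0/0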

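Table 7

Number of individuals whose personal information is directly disclosed in individual datasets from the test platform, by domain

Values are the number of individuals involved (persons) for each type of disclosed personal information.

Domain label | Basic personal information | Personal identity information | Personal health and physiological information | Personal education and employment information | Personal property information | Other personal information
Social assistance | 42 111 | 24 | 19 413 | 0 | 8 | 0
Market supervision | 10 110 | 3 | 0 | 10 094 | 0 | 0
Science and technology innovation | 2 965 | 0 | 0 | 2 965 | 0 | 0
Meteorological services | 1 027 | 0 | 0 | 0 | 0 | 0
Ecology and environment | 1 260 | 0 | 0 | 667 | 0 | 0
Living services | 256 | 7 | 1 | 29 | 17 | 94
Urban construction and housing | 159 | 0 | 0 | 0 | 0 | 0
Education and culture | 64 162 | 2 143 | 0 | 61 138 | 0 | 0
Geospatial | 123 | 0 | 0 | 0 | 0 | 0
Transportation | 42 | 59 | 0 | 59 | 0 | 0
Credit services | 399 | 19 | 0 | 0 | 0 | 0
Institutions and organizations | 1 | 0 | 0 | 0 | 0 | 0
Industry and agriculture | 16 | 0 | 0 | 0 | 0 | 0
Other | 12 | 0 | 0 | 12 | 0 | 0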

Fig.3

Number of individuals re-identified through data linking on the test platform

Fig.4

Number of individuals re-identified through data linking on the test platform, for given degrees of confidence and quantities of information extension
