Big Data Technology

Editor's Note

In recent years, with the maturation and popularization of technologies such as the Internet of Things, cloud computing, the mobile internet, and the Internet of Vehicles, massive data in various formats, such as images, audiovisual materials, and health records, are being generated rapidly. The International Data Corporation (IDC) predicted that global data volume would reach 175 ZB (approximately 175 billion TB) by 2025, which implies that more than 99% of all data in human civilization were generated in recent years. Undoubtedly, following the mechanical and information ages, we are entering a brand-new and challenging era: the era of big data and intelligence. This special issue on big data technology reports a series of studies in the field, including big data storage, data mining algorithms, big data platforms, visual big-data processing chip architectures, and super-resolution image big-data processing frameworks.

Guest Editor

Zhibin Yu, Professor

Deputy Director, Institute of Advanced Computing and Digital Engineering, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. Prof. Yu’s research includes heterogeneous intelligent computing systems, processor architecture design, computer-architecture support for cloud computing, big data analysis, artificial intelligence systems, and the construction and optimization of edge computing platforms.

 

Peng Yin, Associate Professor

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.

Dr. Yin’s research interests include biostatistics, statistical modelling in genetics, and machine learning for big healthcare data.

Article List

  • 1  Preface
    YU Zhibin YIN Peng
    2019, 8(5):1-2.
    [Abstract](468) [HTML](0) [PDF 408.08 K](1707)
    Abstract:
    2  Non-Negative Subspace Clustering Guided Non-Negative Matrix Factorization
    CUI Guosheng LI Ye
    2019, 8(5):3-12. DOI: 10.12146/j.issn.2095-3135.20190702001
    [Abstract](751) [HTML](0) [PDF 1.70 M](2332)
    Abstract:
    As an effective data representation method, non-negative matrix factorization (NMF) has been widely used in pattern recognition and machine learning. To obtain a compact and effective data representation during dimension reduction, unsupervised NMF usually needs to discover the latent geometric structure of the data. A similarity graph, constructed by reasonably modeling the similarity relationships between data samples, is typically used to capture the spatial distribution structure of the data. Subspace learning can effectively explore the subspace structure inside the data, and the obtained self-expressive coefficient matrix can be used to construct such a similarity graph. In this paper, a non-negative subspace clustering algorithm is proposed to explore the subspace structure of the data, which is then used to guide the non-negative matrix factorization, so as to obtain an effective non-negative low-dimensional representation of the original data. At the same time, an effective iterative strategy is developed to solve the non-negative subspace clustering problem. Clustering experiments on two image datasets demonstrate that utilizing the subspace structure of the data can effectively improve the performance of non-negative matrix factorization.
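For readers unfamiliar with the base technique, the following is a minimal sketch of plain NMF with the classic multiplicative update rules; the paper's subspace-clustering guidance term is not included, so this illustrates only the factorization being guided, not the proposed algorithm itself.

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    """Factor a non-negative matrix X (m x n) into W (m x k) @ H (k x n)
    using Lee-Seung multiplicative updates. A basic sketch only; the
    paper augments this objective with a subspace-clustering term."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 1e-4   # small offset keeps factors strictly positive
    H = rng.random((k, n)) + 1e-4
    for _ in range(iters):
        # Multiplicative updates preserve non-negativity by construction.
        H *= (W.T @ X) / (W.T @ W @ H + 1e-10)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-10)
    return W, H
```

The rows of H serve as the low-dimensional representation used downstream (e.g. for clustering).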
    3  End-to-end Data Integrity for Stacked Distributed File System
    LI Shiyi GU Liang YU Zhibin
    2019, 8(5):13-25. DOI: 10.12146/j.issn.2095-3135.20190729002
    [Abstract](600) [HTML](0) [PDF 1.21 M](2196)
    Abstract:
    End-to-end checksums are an effective means of data integrity detection and can provide a basic reliability guarantee for distributed storage systems. Glusterfs is a popular stacked distributed file system, but it lacks an effective data integrity detection mechanism: user data stored in Glusterfs risk being damaged without the damage ever being discovered. Moreover, such corruption can spread in some cases, causing data loss even under the protection of multiple copies or disaster recovery. This paper proposes a cost-effective Glusterfs-based end-to-end checksum scheme called Glusterfs-E2E, which effectively eliminates the data integrity risk of Glusterfs. The proposed solution not only provides full-path protection at a performance overhead of 2% to 8%, but can also locate software bugs.
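The core idea of an end-to-end checksum can be shown with a toy key-value store: a checksum is attached where data enters the system and verified where it leaves, so corruption anywhere along the path is caught. This is an illustrative sketch, not the Glusterfs-E2E design; the `store` dict stands in for the distributed storage path.

```python
import zlib

def write_with_checksum(store, key, data: bytes):
    # Compute the checksum at the entry point ("end" #1) and store it
    # alongside the data so it travels the same path.
    store[key] = (data, zlib.crc32(data))

def read_with_checksum(store, key) -> bytes:
    # Verify at the exit point ("end" #2): any silent corruption in
    # between (cache, network, disk) makes the CRC mismatch.
    data, crc = store[key]
    if zlib.crc32(data) != crc:
        raise IOError(f"data integrity error for {key!r}")
    return data
```

In a real stacked file system the checksum is carried as extra metadata through each translator layer rather than in a Python tuple.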
    4  Spatial-Temporal Similarity-Based Data Fusion for Large-Scale Trajectories in Metro System
    XIONG Wen ZHOU Qianmei YANG Kun DAI Hao SUN Li
    2019, 8(5):26-33. DOI: 10.12146/j.issn.2095-3135.20190729001
    [Abstract](660) [HTML](0) [PDF 1.14 M](2591)
    Abstract:
    As the metro system becomes more and more important, how to utilize big data technology to support its operational and management tasks is a hot topic in academia and industry. These tasks include metro network development, service scheduling, risk response management, and public services. To address these issues, we propose a data fusion approach over two sources to rebuild a passenger’s full trip. The key idea is to leverage WiFi signal data and smart card data together. We first calculate the spatiotemporal similarity between smart card trajectories and mobile device trajectories. Then, we associate a passenger’s smart card with the corresponding mobile device via their similarity. Finally, we combine the in-station trajectory hidden in the WiFi signal records with the coarse-grained trip presented by the smart card records. We validate our approach on an extremely large dataset from the Chinese city of Shenzhen, calculating the similarity between trajectories generated by 7.28 million smart cards and trajectories generated by 40.1 million mobile devices on a Spark cluster. Experimental results show that this approach can rebuild 203 000 passenger trajectories, which is sufficient to support many important applications in the metro system.
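The matching step rests on a spatiotemporal similarity between two trajectories. The abstract does not specify the measure, so the sketch below uses a simple stand-in: the fraction of records in one trajectory that have a same-station record in the other within a time tolerance. The station names, timestamps, and the 300-second tolerance are illustrative assumptions.

```python
def st_similarity(traj_a, traj_b, time_tol=300):
    """traj_*: list of (station, unix_time) records.
    Similarity = fraction of records in traj_a matched by a record in
    traj_b at the same station within time_tol seconds. A simplified
    stand-in for the paper's (unspecified) measure."""
    if not traj_a:
        return 0.0
    hits = 0
    for sta, t in traj_a:
        if any(s == sta and abs(t - u) <= time_tol for s, u in traj_b):
            hits += 1
    return hits / len(traj_a)
```

In the paper's setting this comparison is run pairwise at scale on a Spark cluster; a smart card and a mobile device whose trajectories score highest against each other are treated as the same passenger.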
    5  Inferring Gut Microbial Interaction Network from Microbiome Data Using Network Embedding Algorithm
    LI Qianying CAI Yunpeng ZHANG Kai
    2019, 8(5):34-48. DOI: 10.12146/j.issn.2095-3135.20190704001
    [Abstract](601) [HTML](0) [PDF 1.92 M](2394)
    Abstract:
    Identifying the relationships between the gut microbial community and the host environment, as well as their driving mechanisms, is a key task in gut microbial research. Microbiome high-throughput gene sequencing and big data analysis are currently the most widely used techniques for investigating microbial communities. Existing studies on human gut microbiota data mainly focus on community diversity and composition, while methods for deeply exploring the ecological and functional relationships among bacterial species are still lacking. An urgent task is therefore to develop computational methods that explore the interaction patterns between gut microbial components, in the sense of molecular networks, from microbiome sequencing data. In this paper, we adopt the network embedding method from machine learning to remedy the drawback of traditional biological network learning techniques, which depend solely on direct correlations between nodes; the embedding approach is stronger at capturing heterogeneity, hidden variables, and imbalance in microbial network interactions. By analyzing the correlations between the functional modules created by the new approach and environmental variables as well as key metabolite components, we confirm that the derived functional modules identify biologically relevant features that are hard to recognize with previous approaches and that are helpful for further modeling of the potential coupling mechanisms among the biological systems. The method described in this article not only provides a new perspective for the analysis of gut microbial community structure, but can also be extended to other environmental microbiology research, reflecting the driving process of community structure through multi-level information in the data.
    6  An Efficient Multi-Model Super Resolution Framework
    WU Xinzhou YUAN Ninghui SHEN Li
    2019, 8(5):49-57. DOI: 10.12146/j.issn.2095-3135.20190810001
    [Abstract](568) [HTML](0) [PDF 1.51 M](2271)
    Abstract:
    Super resolution (SR) is an important technique for improving image resolution and has been widely used in remote sensing, medical image processing, and target recognition and tracking. In recent years, deep learning techniques have also been applied successfully in the SR domain. However, researchers pay most of their attention to the quality of the output images while ignoring training and reconstruction efficiency. In this paper, we find that for images with different texture features, the most appropriate models are usually different. Based on this observation, a multi-model super resolution framework (MMSR) is proposed, which chooses a suitable network model for each image for training. Experimental results on the DIV_2K image set indicate that efficiency can be improved by 66.7% without loss of image quality. Moreover, MMSR exhibits good scalability.
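The routing idea, picking a model per image based on its texture, can be sketched as follows. The texture measure (mean gradient magnitude), the three model labels, and the threshold values are all illustrative assumptions; the abstract does not state which texture features or cut points MMSR actually uses.

```python
import numpy as np

def texture_score(img: np.ndarray) -> float:
    # Mean gradient magnitude as a crude measure of texture richness.
    gy, gx = np.gradient(img.astype(float))
    return float(np.mean(np.hypot(gx, gy)))

def pick_model(img, models, thresholds):
    """Route an image to one of several SR models by texture bin.
    `models` is ordered from smooth-image models to heavy-texture
    models; `thresholds` are illustrative cut points."""
    s = texture_score(img)
    for t, m in zip(thresholds, models[:-1]):
        if s < t:
            return m
    return models[-1]
```

Training each model only on the images routed to it is what lets a framework like this cut total training time without hurting output quality.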
    7  A Hardware Architecture for Accelerating Neuromorphic Visual Recognition
    TIAN Shuo LI Shiming WANG Lei XU Shi XU Weixia
    2019, 8(5):58-71. DOI: 10.12146/j.issn.2095-3135.20190729003
    [Abstract](546) [HTML](0) [PDF 1.84 M](2393)
    Abstract:
    The widespread application of deep learning has led to the realization of many human-like cognitive tasks in visual analysis. HMAX is a bio-inspired model based on the visual cortex that has proven superior to standard computer vision methods in multi-class object recognition. However, due to the high complexity of neuromorphic algorithms, implementing HMAX models on edge devices still faces significant challenges. Previous experimental results show that the S2 phase of HMAX is the most time-consuming stage. In this paper, we propose a novel systolic array-based architecture to accelerate the S2 phase of the HMAX model. The simulation results show that compared with the baseline model, the execution time of the most time-consuming S2 phase of
    8  Genome-Wide Association Study of Cardiovascular and Cerebrovascular Diseases Based on Multi-Step Screening
    HU Yishen ZHU Muchun YIN Peng
    2019, 8(5):72-85. DOI: 10.12146/j.issn.2095-3135.20190702002
    [Abstract](536) [HTML](0) [PDF 1.64 M](1984)
    Abstract:
    Genome-wide association study (GWAS) is an effective method for studying genetic variants associated with complex diseases or traits. Marginal statistical testing is the common method in GWAS; however, it has weaknesses such as ignoring correlations between features and unstable threshold selection. In this paper, we discuss a new GWAS method based on a multi-step testing model for cardio-cerebrovascular disease. The method consists of two steps: first, the Gini index is used for feature selection to obtain a candidate subset of single-nucleotide polymorphisms (SNPs); then, random forest recursive cluster elimination (RF-RCE) filters the associated SNPs from this candidate set. Experimental results show that multi-step feature selection outperforms single-step feature selection, and the selected SNPs are more suitable for cardio-cerebrovascular disease prediction.
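The first screening step, ranking SNPs by Gini index, can be sketched as below. This implements only step one of the pipeline on toy 0/1/2 genotype data; the RF-RCE second step is omitted, and the data layout is an assumption for illustration.

```python
import numpy as np

def gini_gain(x, y):
    """Gini impurity decrease when splitting binary labels y by SNP
    genotype x (values 0/1/2). Larger gain = stronger association."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)
    total = gini(y)
    weighted = sum(len(y[x == g]) / len(y) * gini(y[x == g])
                   for g in np.unique(x))
    return total - weighted

def select_snps(X, y, k):
    """Step 1 of the pipeline: rank SNP columns of X by Gini gain and
    keep the top k as the candidate set passed on to RF-RCE."""
    gains = [gini_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(gains)[::-1][:k]
```

The point of the two-step design is that this cheap filter shrinks hundreds of thousands of SNPs to a candidate set small enough for the more expensive random-forest elimination to handle.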
    9  Big Data Platform for Clinical Scientific Research
    WANG Chi LI Chao CHEN Xu HONG Ping ZHENG Wenli SHEN Yao QI Kaiyue GUO Minyi
    2019, 8(5):86-96. DOI: 10.12146/j.issn.2095-3135.20190729004
    [Abstract](1122) [HTML](0) [PDF 1.10 M](3198)
    Abstract:
    China’s medical informatization has been under construction for years, and each hospital has accumulated a large amount of electronic clinical data, but the data structures are highly diverse. Better supporting clinical diagnosis, treatment, and research, saving medical resources, and improving medical efficiency and quality of care have become common requirements across medical institutions. This paper proposes a big data platform for clinical research that solves the inconsistency of multi-hospital information infrastructure by constructing multi-source data collection methods, unified data storage methods to cope with different data types, and distributed data computing platforms to improve efficiency and scalability. We construct a clinical research platform to assist clinical researchers in scientific research. With the proposed architecture, the special disease analysis process on the cluster was reduced to about 15 seconds. Compared with the existing platform, data importing, which previously required 5 days for 16 692 records, now takes 10 minutes for 15 217 026 records, significantly improving both speed and volume. This platform supports clinical data collection, the establishment of special disease databases, and clinical research, and assists in the closed loop of clinical diagnosis and treatment, providing efficient and integrated data platform support for clinical research.
