2019, 8(5):3-12. DOI: 10.12146/j.issn.2095-3135.20190702001
Abstract: As an effective data representation method, non-negative matrix factorization has been widely used in pattern recognition and machine learning. To obtain a compact and effective low-dimensional representation, unsupervised non-negative matrix factorization usually needs to discover the latent geometric structure of the data. A similarity graph, constructed by reasonably modeling the similarity relationships between data samples, is typically used to capture the spatial distribution structure of the samples. Subspace learning can effectively explore the subspace structure inside the data, and the resulting self-expressive coefficient matrix can be used to construct such a similarity graph. In this paper, a non-negative subspace clustering algorithm is proposed that explores the subspace structure of the data and uses it to guide non-negative matrix factorization, so as to obtain an effective non-negative low-dimensional representation of the original data. An effective iterative strategy is also developed to solve the non-negative subspace clustering problem. Clustering experiments on two image datasets demonstrate that exploiting the subspace structure of the data effectively improves the performance of non-negative matrix factorization.
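As a minimal sketch of the general idea (not this paper's exact algorithm), the following shows graph-regularized NMF with multiplicative updates, where a similarity graph `W` over the samples guides the low-dimensional factor `V`; the regularization weight `lam` and the plain Frobenius objective are illustrative assumptions:

```python
import numpy as np

def graph_regularized_nmf(X, W, k, lam=0.1, n_iter=200, seed=0):
    """Sketch of graph-regularized NMF: approximate X ~ U @ V.T while a
    similarity graph W (symmetric, non-negative) smooths the rows of V."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + 1e-4
    V = rng.random((n, k)) + 1e-4
    D = np.diag(W.sum(axis=1))  # degree matrix of the similarity graph
    for _ in range(n_iter):
        # multiplicative updates keep both factors non-negative
        U *= (X @ V) / (U @ (V.T @ V) + 1e-10)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + 1e-10)
    return U, V
```

In the paper's setting, `W` would come from the self-expressive coefficient matrix learned by subspace clustering rather than from a fixed nearest-neighbor graph.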
2019, 8(5):13-25. DOI: 10.12146/j.issn.2095-3135.20190729002
Abstract: End-to-end checksums are an effective means of data integrity detection and provide a basic reliability guarantee for distributed storage systems. Glusterfs is a popular stacked distributed file system, but it lacks an effective data integrity detection mechanism: user data stored in Glusterfs is at risk of being corrupted without the corruption being discovered. Moreover, such corruption can spread in some cases, causing data loss even under the protection of multiple replicas or disaster recovery. This paper proposes a cost-effective Glusterfs-based end-to-end checksum scheme called Glusterfs-E2E, which effectively eliminates the data integrity risk of Glusterfs. The proposed solution not only provides full-path protection at a performance overhead of 2% to 8%, but can also locate software bugs.
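To illustrate the end-to-end principle (the block size and on-disk layout here are assumptions, not Glusterfs-E2E's actual design), a checksum is computed at write time and re-verified at read time, so corruption anywhere along the I/O path is detected:

```python
import zlib

BLOCK = 4096  # assumed checksum granularity for this sketch

def write_with_checksums(data: bytes):
    """Store each block together with its CRC32 so later reads can
    verify integrity end to end."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [(b, zlib.crc32(b)) for b in blocks]

def read_with_verify(stored):
    """Recompute each block's CRC on read; a mismatch means silent
    corruption occurred somewhere on the path (network, cache, or disk)."""
    out = bytearray()
    for block, crc in stored:
        if zlib.crc32(block) != crc:
            raise IOError("checksum mismatch: data corrupted end to end")
        out += block
    return bytes(out)
```

Because the checksum is generated and checked at the endpoints rather than at a single layer, it also helps attribute a detected mismatch to the layer that introduced it.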
2019, 8(5):26-33. DOI: 10.12146/j.issn.2095-3135.20190729001
Abstract: As the metro system becomes more and more important, how to utilize big data technology to support operational and management tasks has become a hot topic in academia and industry. These tasks include metro network development, service scheduling, risk response management, and public services. To address these issues, we propose a data fusion-based approach that combines two data sources to rebuild a passenger's full trip. The key idea is to leverage WiFi signal data and smart card data together. We first calculate the spatiotemporal similarity between smart card trajectories and mobile device trajectories. Then, we associate a passenger's smart card with the corresponding mobile device via this similarity. Finally, we combine the in-station trajectory hidden in the WiFi signal records with the coarse-grained trip recorded by the smart card. We validate our approach on an extremely large dataset from the Chinese city of Shenzhen, calculating the similarity between trajectories generated by 7.28 million smart cards and trajectories generated by 40.1 million mobile devices on a Spark cluster. Experimental results show that this approach can rebuild 203 000 passenger trajectories, which is sufficient to support many important applications in the metro system.
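A toy version of the matching step might score spatiotemporal similarity as the fraction of smart-card station events that a candidate device's WiFi events confirm at the same station within a time window; the metric and the 5-minute window are illustrative assumptions, not the paper's actual formula:

```python
from datetime import timedelta

def trip_similarity(card_trips, wifi_trips, window=timedelta(minutes=5)):
    """Toy spatiotemporal similarity between a smart card's (station, time)
    events and a mobile device's WiFi (station, time) events."""
    if not card_trips:
        return 0.0
    hits = 0
    for station, t in card_trips:
        # a card event is "matched" if the device was seen at the same
        # station within the time window
        if any(s == station and abs((t - tw).total_seconds()) <= window.total_seconds()
               for s, tw in wifi_trips):
            hits += 1
    return hits / len(card_trips)
```

At city scale, such pairwise scores would be computed on a Spark cluster with blocking by station and time to avoid the full 7.28 M x 40.1 M comparison.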
2019, 8(5):34-48. DOI: 10.12146/j.issn.2095-3135.20190704001
Abstract: Identifying the relationship between the gut microbial community and the host environment, as well as its driving mechanism, is a key task in gut microbiome research. High-throughput gene sequencing of the microbiome and big data analysis are currently the most commonly used techniques for investigating microbial communities. Existing studies on human gut microbiota data mainly focus on community diversity and composition, while methods for deeply exploring the ecological and functional relationships among bacterial species are still lacking. An urgent task is therefore to develop computational methods that explore the interaction patterns between gut microbial components, in the sense of a molecular network, from microbiome sequencing data. In this paper, we adopt the network embedding method from machine learning as a remedy for the drawbacks of traditional biological network learning techniques, which depend solely on direct correlations between nodes; the embedding approach has stronger power in capturing heterogeneity, hidden variables, and imbalance in microbial network interactions. By analyzing the correlation between the functional modules created by the new approach and environmental variables as well as key metabolite components, we confirm that the derived functional modules identify biologically relevant features that previous approaches could hardly recognize, which is helpful for further modeling the potential coupling mechanisms among biological systems. The method described in this article not only provides a new perspective for analyzing gut microbial community structure, but can also be extended to other environmental microbiology research, reflecting the driving process of community structure through multi-level information in the data.
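Network embedding methods of the DeepWalk/node2vec family start by sampling truncated random walks over the graph and then embed nodes from walk co-occurrences. The sketch below shows only that first walk-sampling step over a microbial co-occurrence graph; it is an assumed stand-in, since the abstract does not specify which embedding method is used:

```python
import random

def random_walks(adj, num_walks=10, walk_len=8, seed=0):
    """Generate truncated random walks over a co-occurrence graph.
    `adj` maps each node (e.g. a bacterial taxon) to a list of
    co-occurring neighbours; the walks would then feed a skip-gram
    model to produce node embeddings."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks
```

Unlike raw pairwise correlation, embeddings derived from such walks place taxa close together when they share network context, even without a strong direct edge between them.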
2019, 8(5):49-57. DOI: 10.12146/j.issn.2095-3135.20190810001
Abstract: Super resolution (SR) is an important technique for improving image resolution and has been widely used in remote sensing, medical image processing, target recognition and tracking, etc. In recent years, deep learning techniques have also been applied successfully in the SR domain. However, researchers pay most of their attention to the quality of the output images while ignoring training and reconstruction efficiency. In this paper, we find that for images with different texture features, the most appropriate models are usually different. Based on this observation, a multi-model super resolution framework (MMSR) is proposed, which chooses a suitable network model for each image during training. Experimental results on the DIV_2K image set indicate that efficiency can be improved by 66.7% without loss of image quality. Moreover, MMSR exhibits good scalability.
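The routing idea can be sketched with a simple texture score: images with little texture go to a cheaper model, heavily textured ones to a larger model. The mean-gradient score, the threshold, and the two-model split are all illustrative assumptions, not MMSR's actual selection rule:

```python
import numpy as np

def pick_sr_model(img, threshold=0.05):
    """Route an image to a lightweight or heavyweight SR model based on a
    simple texture score (mean gradient magnitude, normalized to [0, 1]
    for 8-bit images)."""
    gy, gx = np.gradient(img.astype(float))
    texture = np.hypot(gx, gy).mean() / 255.0
    return "heavy_model" if texture > threshold else "light_model"
```

Spending training and inference time only where texture demands it is what yields the efficiency gain without sacrificing output quality.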
2019, 8(5):58-71. DOI: 10.12146/j.issn.2095-3135.20190729003
Abstract: The widespread application of deep learning has led to the realization of many human-like cognitive tasks in visual analysis. HMAX is a bio-inspired model based on the visual cortex that has proven superior to standard computer vision methods in multi-class object recognition. However, due to the high complexity of neuromorphic algorithms, implementing HMAX models on edge devices still faces significant challenges. Previous experimental results show that the S2 phase of HMAX is the most time-consuming stage. In this paper, we propose a novel systolic array-based architecture to accelerate the S2 phase of the HMAX model. The simulation results show that compared with the baseline model, the execution time of the most time-consuming S2 phase of
2019, 8(5):72-85. DOI: 10.12146/j.issn.2095-3135.20190702002
Abstract: Genome-wide association study (GWAS) is an effective method for studying genetic variants associated with complex diseases or traits. The marginal statistical test is the most common GWAS method; however, it suffers from weaknesses such as ignoring correlations between features and unstable threshold selection. In this paper, we discuss a new GWAS method for cardio-cerebrovascular disease based on a multi-step test model. The method consists of two steps: the Gini index is first used for feature selection to obtain a candidate subset of single-nucleotide polymorphisms (SNPs), and then random forest recursive cluster elimination (RF-RCE) filters the associated SNPs from this candidate set. Experimental results show that multi-step feature selection outperforms single-step feature selection, and the selected SNPs are more suitable for cardio-cerebrovascular disease prediction.
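The first-stage Gini filter can be sketched as scoring each SNP by the Gini impurity decrease obtained when splitting samples on its genotype, then keeping the top-scoring SNPs; the exact scoring and cutoff used in the paper may differ:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_importance(snp, labels):
    """Impurity decrease from partitioning samples by a SNP's
    genotype values (e.g. 0/1/2 minor-allele counts)."""
    base = gini(labels)
    weighted = 0.0
    for g in np.unique(snp):
        mask = snp == g
        weighted += mask.mean() * gini(labels[mask])
    return base - weighted

def select_snps(X, y, k):
    """First-stage filter: keep the k SNPs (columns of X) with the
    largest Gini importance; the survivors go on to RF-RCE."""
    scores = np.array([gini_importance(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```

The cheap univariate filter shrinks the genome-wide SNP set to a size the second, correlation-aware RF-RCE stage can afford to process.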
2019, 8(5):86-96. DOI: 10.12146/j.issn.2095-3135.20190729004
Abstract: China's medical informatization has been under construction for years. Each hospital has accumulated a large amount of electronic clinical data, but the data structures are highly diverse. Better assisting clinical diagnosis, treatment, and research, saving medical resources, and improving medical efficiency and quality have become common requirements across medical institutions. This paper proposes a big data platform for clinical research that resolves the inconsistency of multi-hospital information infrastructure by constructing multi-source data collection methods, unified data storage to cope with different data types, and a distributed computing platform to improve efficiency and scalability. We build a clinical research platform to assist clinical researchers in scientific research. Under the proposed architecture, the special disease analysis process on the cluster was reduced to about 15 seconds. Compared with the existing platform, data processing efficiency improved significantly: importing 16 692 data records previously took 5 days, whereas the new platform imports 15 217 026 data records in 10 minutes. This platform helps complete clinical data collection, establish special disease databases, support clinical research, and assist in closing the loop of clinical diagnosis and treatment, providing efficient and integrated data platform support for clinical research.