
Industrial AI blog

Integrated retrieval system based on medical big data

30 September 2020

Song Yu
Hitachi (China) Research & Development Corporation

This blog is not about AI/analytics per se, but about a new system architecture we developed to prepare complex real-world data for future machine learning. Until a few years ago, hospital data in China was scattered across different medical systems. In particular, hospitals hold large volumes of patient medical records in diverse formats [1], including not only structured data but also unstructured data such as dictation, handwriting, photographs and images [2]. Diverse data formats require flexible data storage and access, while large data volumes demand high scalability and availability. Although some Chinese companies have developed data integration systems, most of them still cannot integrate complicated hospital data, and in particular cannot integrate structured and unstructured data together. We decided to address this challenge and developed the system architecture described here. In this article, I outline the system design, the distributed crawler architecture and the distributed strategy we employed in an integrated retrieval system that meets the above needs and realizes the collection, storage, retrieval and visual display of such massive, diverse data.

System design

In existing hospitals, available data includes electronic medical records in the HIS (Hospital Information System), test reports in the LIS (Laboratory Information System) and DICOM (Digital Imaging and Communications in Medicine) files in the PACS (Picture Archiving and Communication System) [3]. The system consists of four modules:

  • The Data extraction and upload module extracts data from all of the above sources, constructs MetaData files and uploads them to the storage module in pairs with the original source files;
  • the Data storage module uses Perst for object-oriented storage;
  • the Index constructing module regularly uses a distributed crawler to crawl the latest data from the storage module, builds an index and adds the crawled items to the index database; and
  • the Data display module reads from the index database and displays the data to users according to the application.


Figure 1: Module design of integrated retrieval system
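As a concrete illustration of the extraction and upload step, the sketch below pairs a MetaData descriptor with its original source file. The field names, the MetaData class and the build_metadata helper are all hypothetical, invented for this example; the paper defines the actual MetaData format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class MetaData:
    """Hypothetical descriptor uploaded alongside each original source file."""
    patient_id: str
    source_system: str   # "HIS", "LIS" or "PACS"
    file_name: str
    file_format: str     # e.g. "text", "jpeg", "dicom"
    created_at: str      # ISO-8601 creation time, used later for incremental crawling

def build_metadata(patient_id: str, source_system: str,
                   file_name: str, file_format: str) -> MetaData:
    # Record the creation time so the Index constructing module can later
    # skip files that have already been crawled.
    return MetaData(patient_id, source_system, file_name, file_format,
                    datetime.now(timezone.utc).isoformat())

meta = build_metadata("P-001", "PACS", "chest_ct_0012.dcm", "dicom")
# The storage module receives (metadata, original file) as a pair.
pair = (json.dumps(asdict(meta)), "chest_ct_0012.dcm")
```

Keeping the metadata as a small, self-describing file next to the opaque source file is what lets one storage module hold text records and DICOM images uniformly.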


Figure 2: The difference between SQL and NoSQL

The system adopts NoSQL instead of a traditional SQL relational database to allow flexible data storage and access [4]. In addition, the index database is based on Solr, a high-performance full-text search server that can be scaled out simply by adding storage devices and search-engine servers [5], which ensures high scalability and availability.
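To make the Solr side concrete, here is a minimal sketch of preparing crawled items for Solr's JSON update handler. The core name, field names and suffixes are assumptions for illustration (not the fields our system actually uses); the payload is only built here, not sent.

```python
import json

# Assumed local Solr core; a real deployment would use its own URL and core.
SOLR_UPDATE_URL = "http://localhost:8983/solr/medical/update?commit=true"

def to_solr_doc(item: dict) -> dict:
    """Map a crawled item to a flat Solr document (field names are illustrative)."""
    return {
        "id": item["patient_id"] + ":" + item["file_name"],
        "patient_id_s": item["patient_id"],
        "source_system_s": item["source_system"],
        "content_txt": item.get("text", ""),
    }

def build_update_payload(items: list) -> str:
    # Solr's JSON update handler accepts a plain JSON array of documents.
    return json.dumps([to_solr_doc(i) for i in items])

payload = build_update_payload([
    {"patient_id": "P-001", "source_system": "HIS",
     "file_name": "record_42.txt", "text": "discharge summary ..."},
])
# An HTTP POST of `payload` to SOLR_UPDATE_URL with Content-Type
# application/json would index the documents (not executed here).
```

Because each document is just a flat JSON object, adding a new data format only means adding fields, which is exactly the flexibility the NoSQL choice is meant to provide.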

Distributed crawler architecture in the Index Constructing Module

Furthermore, we introduced a new distributed crawler architecture, shown in Figure 3, to the Index constructing module to crawl large volumes of data quickly and in real time. In this architecture:

  • The Scheduler accepts requests sent by the crawlers and pushes them into a queue. It can be thought of as a priority queue of URLs that decides the next metadata file to be grabbed;
  • the Downloader receives requests from the Scheduler, downloads metadata and actual data from Perst, and then returns the information to the crawler components;
  • the Crawlers define the rules for retrieving information from Perst and extracting the information needed, known as Items. Crawlers can also remove duplicate metadata files; and
  • the Item pipeline processes the extracted Items: it verifies Item effectiveness, removes unnecessary information and uploads the parsed data to the Solr search engine.


Figure 3: Distributed crawler architecture
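The Item pipeline's three responsibilities (verify, deduplicate, strip) can be sketched as a small class. The class and method names are illustrative, loosely modeled on the pipeline style of common crawler frameworks, not our exact implementation.

```python
class DropItem(Exception):
    """Raised to discard an invalid or duplicate Item."""

class ItemPipeline:
    """Illustrative pipeline: verify effectiveness, dedupe, strip extras."""
    REQUIRED = ("patient_id", "file_name", "source_system")

    def __init__(self):
        self.seen = set()  # fingerprints of already-processed metadata files

    def process_item(self, item: dict) -> dict:
        # 1. Verify Item effectiveness: all required fields must be present.
        if any(not item.get(k) for k in self.REQUIRED):
            raise DropItem("missing required field: %r" % (item,))
        # 2. Remove duplicate metadata files by fingerprint.
        fp = (item["patient_id"], item["file_name"])
        if fp in self.seen:
            raise DropItem("duplicate item: %r" % (fp,))
        self.seen.add(fp)
        # 3. Remove unnecessary information before upload to Solr.
        return {k: v for k, v in item.items()
                if k in self.REQUIRED or k == "text"}
```

Items that survive all three steps are the only ones that reach Solr, which keeps the index free of partial or repeated records.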

Distributed strategy followed by the multiple crawler devices

The cluster crawler in Figure 3 above contains multiple crawler devices and follows a distributed strategy as shown in Figure 4 below, where

  • The Master records whether a request has been crawled, and
  • the Slaves produce requests and ask the Master whether each request has already been crawled. If not, the request is added to the waiting queue in the Scheduler.


Figure 4: Distributed crawler strategy
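The Master/Slave exchange can be sketched as follows. This uses an in-memory set and queue for clarity; in a real multi-device deployment the crawled set and waiting queue would live in a shared store so that all Slaves see the same state. All names here are hypothetical.

```python
from collections import deque

class Master:
    """Records whether a request (e.g. a metadata file path) has been crawled."""
    def __init__(self):
        self._crawled = set()

    def is_crawled(self, request: str) -> bool:
        return request in self._crawled

    def mark_crawled(self, request: str) -> None:
        self._crawled.add(request)

class Slave:
    """Produces requests and consults the Master before queueing them."""
    def __init__(self, master: Master, queue: deque):
        self.master = master
        self.queue = queue

    def submit(self, request: str) -> bool:
        if self.master.is_crawled(request):
            return False            # already crawled elsewhere: skip it
        self.queue.append(request)  # waiting queue in the Scheduler
        # Mark at enqueue time so no other Slave enqueues the same request.
        self.master.mark_crawled(request)
        return True
```

Marking the request at enqueue time (rather than after download) is one possible policy; it trades a small risk of losing a failed download for a guarantee that no two devices fetch the same metadata file.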

After the system starts, the crawlers run periodically and incrementally: the Scheduler keeps a record of the creation time of the last-visited metadata files and arranges for the crawlers to crawl only data that has not been crawled before. The distributed structure of the crawler also makes it easy to scale up.
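The incremental behavior above amounts to filtering metadata files by creation time against the Scheduler's checkpoint. A minimal sketch, assuming each metadata file carries an ISO-8601 `created_at` field as in the earlier examples:

```python
from datetime import datetime

def select_new_files(metadata_files: list, last_visited: datetime) -> list:
    """Return only metadata files created after the Scheduler's checkpoint,
    oldest first, so the checkpoint can be advanced monotonically."""
    fresh = [m for m in metadata_files
             if datetime.fromisoformat(m["created_at"]) > last_visited]
    return sorted(fresh, key=lambda m: m["created_at"])

def advance_checkpoint(fresh: list, last_visited: datetime) -> datetime:
    """After a crawl round, move the checkpoint to the newest file seen."""
    if not fresh:
        return last_visited
    return max(datetime.fromisoformat(m["created_at"]) for m in fresh)
```

With this filter, each crawl round touches only the metadata files created since the previous round, which is what keeps repeated crawls over a growing store cheap.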


In this blog, I outlined a new system architecture for integrated retrieval that enables easy, fast access to massive medical data and provides high scalability and availability of service by maintaining loose coupling among modules and adopting NoSQL and a distributed strategy. In terms of application, we ran trials in several hospitals, extracting HIS, LIS and DICOM data and completing the integration. With appropriate modifications, this system architecture can also be applied to other scenarios. To find out more about our proposed system, please refer to my paper, “Integrated Retrieval System Based on Medical Big Data”, included in the Proceedings of the 3rd International Conference on Vision, Image and Signal Processing (ICVISP 2019).


I would like to thank my team members for sharing their pearls of wisdom with me.


References

[1] Zhu Rui, Peng Lu. Application of Medical Big Data. Science and Technology in Western China, 2015, 14(5): 95-97.
[2] Li Ruiqin, Zheng Jianguo. Big Data Research: Status Quo, Problems and Tendency. Network Application, Shanghai, 1994, pp. 107-108.
[3] Meng Xiaofeng, Wang Huiju, Du Xiaoyong. Big Data Analysis: Competition and Survival of RDBMS and MapReduce. Journal of Software, 2012, 23(1): 32-45.
[4] Ni Mingxuan, Zhang Qian. Intelligent Medicine - From Internet of Things to Cloud Computing. Chinese Science: Information Science, 2013(4).
[5] Chapman A, Allen M D, Blaustein B. It’s About the Data: Provenance as a Tool for Assessing Data Fitness. In Proc. of the 4th USENIX Workshop on the Theory and Practice of Provenance, Berkeley, CA: USENIX Association, 2012: 8.