IBM Releases Deep Search For Scientific Discovery
Written by Nikos Vaggalis   
Tuesday, 16 August 2022

IBM's Deep Search for Scientific Discovery (DS4SD) Toolkit has been made available to the public. It comes from the depths of IBM's research labs and uses NLP to analyze massive amounts of data.

Deep Search is a cloud-based AI research service offered as SaaS that allows researchers to load large amounts of structured or unstructured data and immediately find useful connections. The sources that Deep Search can consume range from journal articles to patents to technical reports and more. Using AI and NLP, it can ingest 20 pages per second, whereas a typical human expert takes 1–2 minutes per page just to read, and it automatically extracts the semantic units and their relationships. It then builds a searchable knowledge graph which enables its users to:

robustly explore information extracted from tens of thousands of documents without having to read a single paper.
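To put those throughput figures in perspective, here is a quick back-of-the-envelope calculation in Python. Only the 20 pages per second and the 1–2 minutes per page come from the article; the corpus size (10,000 documents of 10 pages each) is an assumption picked purely for illustration.

```python
from typing import Tuple

# Figures quoted in the article
MACHINE_PAGES_PER_SECOND = 20
HUMAN_SECONDS_PER_PAGE = (60, 120)  # 1 to 2 minutes per page


def machine_seconds(pages: int) -> float:
    """Time in seconds to ingest `pages` at the quoted machine rate."""
    return pages / MACHINE_PAGES_PER_SECOND


def human_hours(pages: int) -> Tuple[float, float]:
    """Fastest and slowest reading time in hours for a human expert."""
    fast, slow = HUMAN_SECONDS_PER_PAGE
    return pages * fast / 3600, pages * slow / 3600


def speedup_range() -> Tuple[int, int]:
    """How many times faster the machine is, at both ends of the human range."""
    fast, slow = HUMAN_SECONDS_PER_PAGE
    return fast * MACHINE_PAGES_PER_SECOND, slow * MACHINE_PAGES_PER_SECOND


# Hypothetical corpus: 10,000 documents of 10 pages each (an assumption)
pages = 10_000 * 10
print(machine_seconds(pages))  # 5000.0 seconds, roughly 83 minutes
print(human_hours(pages))      # between about 1,667 and 3,333 hours
print(speedup_range())         # (1200, 2400)
```

In other words, on the quoted figures alone the service reads somewhere between 1,200 and 2,400 times faster than a person, before counting the extraction work a human would still have to do by hand.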

As such it has been widely adopted in the scientific field, for instance in Covid research, in the search for alternative cancer treatments, where it works out the connections between individual research papers, and in the discovery of new molecules. Of course, the use cases are not constrained to the medical research sector; it can be applied anywhere there is data in the form of documents: legal briefs, financial statements, technical specifications, research papers, slide decks, you name it.


IBM has made part of the service available in the form of a toolbox, calling it Deep Search for Scientific Discovery (DS4SD). This toolbox is broken down into two parts, Deep Search Experience and Deep Search Toolkit.

The Deep Search Experience is the automatic document conversion service, which allows users to upload documents and inspect their conversion quality using a simple drag-and-drop interface that makes it very easy for non-experts to use. This part is not open sourced but has been made publicly available online for anyone to use. To work with the Deep Search Experience service, you upload your document and then let it work its magic:

  • Inspects the data that can be extracted from one of your documents. Your document is decomposed on the spot, cut into pieces of text, images, and tables. Numeric data, entities, and their relationships are then inferred from these pieces.
  • Searches and collects data from preprocessed document collections. These data include structured text, numerics, entities, and their relationships.
  • Processes data into usable information in your workspace, where you connect documents with curated knowledge from databases. The resulting knowledge graphs enable queries and analyses that span the entities and relationships that are described in both your documents and domain-specific databases.
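To make the idea of a query that spans both documents and curated databases more concrete, here is a minimal sketch of a toy knowledge graph in Python. It is not how Deep Search is implemented; the class, the entity names and the two sources (`paper_1.pdf` and `curated_db`) are all made up for illustration.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Toy directed graph of (subject, relation, object) facts tagged with a source."""

    def __init__(self):
        # entity -> set of (relation, other_entity, source)
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj, source):
        """Record one extracted fact and where it came from."""
        self._edges[subject].add((relation, obj, source))

    def find_paths(self, start, goal, max_hops=2):
        """Breadth-first search for chains of facts linking start to goal."""
        paths = []
        queue = deque([(start, [])])
        while queue:
            node, path = queue.popleft()
            if node == goal and path:
                paths.append(path)
                continue
            if len(path) == max_hops:
                continue
            # Entities already on this path, to avoid walking in circles
            visited = {start} | {step[2] for step in path}
            for relation, other, source in sorted(self._edges.get(node, ())):
                if other not in visited:
                    queue.append((other, path + [(node, relation, other, source)]))
        return paths


# Hypothetical facts: one extracted from a document, one from a curated database
kg = KnowledgeGraph()
kg.add("compound_X", "inhibits", "protein_Y", source="paper_1.pdf")
kg.add("protein_Y", "associated_with", "disease_Z", source="curated_db")

for path in kg.find_paths("compound_X", "disease_Z"):
    for subject, relation, obj, source in path:
        print(f"{subject} --{relation}--> {obj}  [{source}]")
```

The link between `compound_X` and `disease_Z` is stated in neither source on its own; it only appears once the facts from the paper and the database sit in the same graph, which is the kind of connection the service is meant to surface.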

The Deep Search Toolkit, on the other hand, is an open source Python package allowing users to interact with the Deep Search platform by programmatically uploading and converting documents in bulk. They can point to a folder and direct the toolkit to upload the documents, convert them, and ultimately analyze the contents of the text, tables, and figures. The Deep Search Toolkit is available as a PyPI package. It can be installed using standard Python package managers such as pip, poetry, etc.
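The "point to a folder" workflow can be sketched in plain Python as follows. Note that this does not use the Deep Search Toolkit's own API: `collect_documents`, `convert_folder` and `fake_convert` are hypothetical names, the PDF-only filter is an assumption, and the `convert` callable is merely a stand-in for whatever upload-and-convert call the toolkit actually provides.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumption for this sketch: only PDF files are picked up
SUPPORTED_SUFFIXES = {".pdf"}


def collect_documents(folder, suffixes=SUPPORTED_SUFFIXES) -> List[Path]:
    """Recursively list the files under `folder` with a supported extension."""
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


def convert_folder(folder, convert: Callable[[Path], dict]) -> Dict[str, dict]:
    """Run `convert` on every document found and key the results by relative path."""
    root = Path(folder)
    results = {}
    for path in collect_documents(root):
        results[path.relative_to(root).as_posix()] = convert(path)
    return results


def fake_convert(path: Path) -> dict:
    """Stand-in converter: a real one would upload the file and return its parsed content."""
    return {"filename": path.name, "size_bytes": path.stat().st_size}
```

Calling `convert_folder("./papers", fake_convert)` would return one entry per PDF found under `./papers`. Swapping `fake_convert` for a function that wraps the real service call is the only change needed to go from this dry run to an actual bulk conversion.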

The Deep Search Experience is reachable online, while you can find the Python Deep Search Toolkit on its repo.

The wider context is that we are entering an era where AI evolution and advancements in Computer Science will play a crucial role in bringing society forward. That's one ingredient necessary for success; the other is democratization, open sourcing those tools in order to make them available to as many brains as possible, multiplying the chances of making a groundbreaking discovery and so changing the world for the better.


Related Articles

Artificial Intelligence, Machine Learning and Society

Take Stanford's Natural Language Understanding For Free

Take Stanford's Natural Language Processing with Deep Learning For Free


Last Updated ( Tuesday, 16 August 2022 )