Data Mining PubMed
The National Institutes of Health provides a full programming interface for searching PubMed, called E-Utilities. You interact with the PubMed database through simple HTTP requests, and it returns the article metadata as XML. Articles in PubMed typically carry a title, authors, abstract, journal, year, volume, issue, pages, and keywords, among other metadata. Getting the metadata from PubMed, however, involves two separate queries: the first returns a list of PubMed IDs (PMIDs) for articles matching the search criteria, and the second returns the article data for a given PMID.
The workflow has two steps: first, query E-Search with your search term to get a list of PMIDs; then pass those PMIDs to E-Fetch to retrieve the article metadata.
E-Search returns a list of PMIDs:
<eSearchResult>
  <Count>157380</Count>
  <RetMax>10</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>23858010</Id>
    <Id>23856563</Id>
    <Id>23856146</Id>
    <Id>23855510</Id>
    <Id>23839460</Id>
    <Id>23839375</Id>
    <Id>23853340</Id>
    <Id>23853339</Id>
    <Id>23853324</Id>
    <Id>23853296</Id>
  </IdList>
  ...
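As a rough illustration of this first step, here is a minimal Python sketch that builds an E-Search request and pulls the count and PMIDs out of a response shaped like the one above. It is not the pmquery.py script mentioned later; the function names are made up for this example, and the endpoint URL and the db, term, retmax, and retstart parameters are assumed from the public E-Utilities documentation.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed public E-Search endpoint (not given in the text above).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10, retstart=0):
    """Build the E-Search request URL for a PubMed search term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retstart": retstart}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_esearch(xml_text):
    """Return (total match count, list of PMID strings) from E-Search XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    pmids = [id_elem.text for id_elem in root.findall("./IdList/Id")]
    return count, pmids


def esearch(term, retmax=10, retstart=0):
    """Query E-Search over HTTP (needs network access) and parse the reply."""
    url = build_esearch_url(term, retmax, retstart)
    with urllib.request.urlopen(url) as response:
        return parse_esearch(response.read())
```

Calling esearch("transcranial magnetic stimulation") would then return the total number of matches together with the first page of PMIDs. Keeping URL building and parsing in separate functions makes both easy to check without touching the network.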
Next, query E-Fetch for the article data. You can request multiple PMIDs in a single call and specify the return format (XML or plain text; some E-Utilities, such as E-Search, can also return JSON). The API also supports pagination, so you can retrieve many thousands of results iteratively.
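The batching and pagination described here might look roughly like the following Python sketch. The helper names and the batch size are hypothetical, and the endpoint URL and the db, id, and retmode parameters are assumed from the public E-Utilities documentation rather than taken from the scripts discussed in this post.

```python
import urllib.parse

# Assumed public E-Fetch endpoint (not given in the text above).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_efetch_url(pmids, retmode="xml"):
    """Build one E-Fetch URL that requests several PMIDs at once."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": retmode}
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def page_offsets(total, page_size):
    """Return the retstart offsets needed to page through `total` search hits."""
    return list(range(0, total, page_size))
```

For example, a search reporting 25 hits with a page size of 10 needs E-Search calls at offsets 0, 10, and 20, and the PMIDs collected from those pages can be passed through batches() to produce one E-Fetch URL per group.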
I've written several interfaces to the PubMed API, including ones in PHP, Python, and C#. The Python script was written specifically to data-mine PubMed: given a search term, pmquery.py queries PubMed and saves each article to a text file. For some search terms, such as "transcranial magnetic stimulation", PubMed returns over 9000 articles, so the script fetches results iteratively and can take several minutes to finish.
The PHP implementation provides a web-based search interface. For desktop applications, see the C# code and the Scholared app for a working example.
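To make the "save each article to a text file" step concrete, here is a minimal Python sketch. It is not the actual pmquery.py code: the function names, the one-file-per-PMID naming, and the output layout are all hypothetical, and the element paths (PubmedArticle, MedlineCitation, ArticleTitle, AbstractText) are assumed from the PubMed XML that E-Fetch usually returns.

```python
import os
import xml.etree.ElementTree as ET


def _text(elem):
    """Flatten an element's text, including inline markup such as <i>."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def parse_articles(xml_text):
    """Extract a few metadata fields from an E-Fetch PubmedArticleSet."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        abstract_parts = citation.findall("Article/Abstract/AbstractText")
        records.append({
            "pmid": _text(citation.find("PMID")),
            "title": _text(citation.find("Article/ArticleTitle")),
            "journal": _text(citation.find("Article/Journal/Title")),
            "abstract": " ".join(_text(part) for part in abstract_parts),
        })
    return records


def save_articles(records, out_dir):
    """Write each record to <out_dir>/<pmid>.txt (hypothetical naming scheme)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for record in records:
        path = os.path.join(out_dir, record["pmid"] + ".txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("PMID: " + record["pmid"] + "\n")
            handle.write("Title: " + record["title"] + "\n")
            handle.write("Journal: " + record["journal"] + "\n\n")
            handle.write(record["abstract"] + "\n")
        paths.append(path)
    return paths
```

Writing one small file per article keeps a long run restartable: if a download of several thousand articles is interrupted, the files already on disk show which PMIDs were fetched.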