A Content Analysis of Google Scholar: Coverage Varies by Discipline and by Database

Virginia Wilson

Abstract


Objective – To ascertain the coverage by discipline, publication date, publication language, and upload frequency of the scholarly articles found in Google Scholar.

Design – Comparative content analyses.

Setting – Electronic information resources accessible via the internet (both freely accessible and for-fee databases).

Subjects – Forty-seven online databases and Google Scholar.

Methods – The study compared the content of 47 databases (21 Internet resources freely available to the general public; 26 restricted-access databases) covering a variety of subjects with the content of Google Scholar. Each database was assigned to one of the following discipline categories: business, education, humanities, science and medicine, social science, and multidisciplinary. From April through July 2005, researchers generated random samples of 50 article titles from each of the 47 databases and searched the titles on Google Scholar to determine inclusion.

Related studies examined publication date, publication language, and Google Scholar upload frequency. For the publication date study, random samples of articles published in 1990, 2000, and 2004 were drawn from one database (PsycINFO), chosen for its highly variable Google Scholar coverage, and searched on Google Scholar. For the publication language study, Google Scholar coverage of PsycINFO articles in English was compared to coverage of PsycINFO articles published in other languages. For the upload frequency study, two databases chosen for their high degree of coverage (BioMed Central and PubMed) were monitored to determine how often new content was uploaded to Google Scholar.
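The main sampling step described above can be sketched in Python. This is a minimal illustration, not the researchers' procedure: the summary does not say how the random samples were generated, and the titles were searched on Google Scholar by the researchers rather than by a program. The names `coverage_percent` and `is_indexed` are hypothetical, and `is_indexed` stands in for the title search on Google Scholar.

```python
import random
from typing import Callable, Optional, Sequence

SAMPLE_SIZE = 50  # the study drew 50 article titles per database


def coverage_percent(
    titles: Sequence[str],
    is_indexed: Callable[[str], bool],
    sample_size: int = SAMPLE_SIZE,
    seed: Optional[int] = None,
) -> float:
    """Estimate the percentage of a database's titles found in a search tool.

    `is_indexed` is a hypothetical stand-in for searching one title on
    Google Scholar and recording whether it was found.
    """
    if not titles:
        raise ValueError("the database title list is empty")
    rng = random.Random(seed)
    # Simple random sample without replacement, capped at the list length.
    sample = rng.sample(list(titles), min(sample_size, len(titles)))
    found = sum(1 for title in sample if is_indexed(title))
    return 100.0 * found / len(sample)
```

With a sample of 50 titles, each title found or missed moves the estimate by 2 percentage points, so per-database figures from this design are coarse estimates rather than exact coverage rates.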

Main Results – This study revealed that content covered by Google Scholar varies greatly from database to database and from discipline to discipline. Across the 47 databases studied, coverage ranged from 6% to 100%. The mean and median coverage for all databases were both 60%. Mean coverage by discipline category ranged from 10% for the humanities databases, through 39% for social science and 41% for education, to 76% for science and medicine. Mean coverage was 77% for the multidisciplinary databases. Mean coverage was 95% for open access journal databases, 84% for freely accessible databases, and 83% for single publisher databases.

The publication language study found a bias towards English language publications. The publication date study likewise found a bias: coverage of older articles was less thorough than coverage of more recent publications. In the upload frequency study, there appeared to be an approximately 15-week delay before new material from BioMed Central and PubMed was uploaded to Google Scholar.
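The summary statistics reported in the Main Results (overall mean and median, plus a mean per discipline category) could be derived from per-database coverage figures along the following lines. This is a sketch under stated assumptions: the function name `summarize_coverage` and the input layout are hypothetical, and the per-database values from the study are not given in this summary, so any inputs used with it would need to come from the original article.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Tuple


def summarize_coverage(
    coverage_by_db: Dict[str, float],
    discipline_of: Dict[str, str],
) -> Tuple[float, float, Dict[str, float]]:
    """Return (overall mean, overall median, mean coverage per discipline).

    `coverage_by_db` maps a database name to its coverage percentage;
    `discipline_of` maps the same names to a discipline category.
    Both layouts are assumptions made for this illustration.
    """
    if not coverage_by_db:
        raise ValueError("no coverage figures supplied")
    grouped = defaultdict(list)
    for db, pct in coverage_by_db.items():
        grouped[discipline_of[db]].append(pct)
    per_discipline = {name: mean(values) for name, values in grouped.items()}
    overall = list(coverage_by_db.values())
    return mean(overall), median(overall), per_discipline
```

Note that each discipline mean is an unweighted average over databases, so a category containing only a few databases can be dominated by a single outlier.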

Conclusions – The results of this study alert researchers and information professionals that Google Scholar (in beta test mode at the time of the study) has poor coverage in certain areas. For those with access to commercial databases, this serves as a cautionary tale. For those with little access to commercial databases, Google Scholar is a welcome site and can provide at least some information. The researchers state that Google could make future content studies unnecessary by making its content collection methodology transparent to users. Upload frequency, Google Scholar’s linking services, the advanced search option, and the “cited by” feature could all be subjects of future studies. For its first year in operation, Google Scholar offers a broad range of discipline coverage with substantial depth in some areas. At the time of the study, Google Scholar was working with libraries and vendors to connect search results to library-licensed full text.

Evidence Based Library and Information Practice (EBLIP)