Pulling data from websites into Google Sheets
Pulling data from a public website, sometimes called web scraping, is a useful way to grab data for research when a public API is not available.
Using a spreadsheet means you don't need to look after an application, and it offers a lower barrier to entry for folks who are not as comfortable with programming.
This is not as reliable as using a dedicated API provided by a website and should not be used for production services. Be sensible and don't abuse websites by overdoing it.
Picking an element you want data from
In this example I'm going to find out how many GitHub projects use the jest-axe JavaScript library. These steps can be followed in Google Chrome.
- Visit the page with the data you want.
- Right click the data.
- Select 'Inspect'.
You will see the element in your Developer tools.
- Right click the element.
- Select 'Copy', then 'Copy XPath'.
You will have an XPath selector in your clipboard; when you paste it, you should see something similar to:
//*[@id='dependents']/div[3]/div[1]/div/div/a[1]
Note: you may have to swap out double quotes (") for single quotes (') in copied XPath selectors.
This selector is what we will use in the importXML function.
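Outside of Sheets, you can try the same structural selector idea with Python's standard library. This is just a sketch against made-up markup (not the real GitHub page), showing how a position-based selector like the one Chrome copies walks the document tree:

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for the GitHub dependents page.
html = """
<div id="dependents">
  <div>first</div>
  <div>second</div>
  <div>
    <div>
      <div>
        <div>
          <a href="#">287 Repositories</a>
        </div>
      </div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(html)

# A structural path like Chrome's copied selector: every step
# depends on the page keeping this exact layout.
element = root.find("./div[3]/div[1]/div/div/a[1]")
print(element.text.strip())  # -> 287 Repositories
```

Each `div[3]`-style step counts sibling elements, which is exactly why these selectors break when the page layout changes.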
importXML function
Google Sheets has functions; for example, =SUM() adds cells together. importXML lets you take a page URL and an XPath selector and returns data from the selected element.
For example, we can take the previous URL and XPath and use it in a cell like this:
=importXML(
"https://github.com/nickcolley/jest-axe/network/dependents",
"//*[@id='dependents']/div[3]/div[1]/div/div/a[1]"
)
When this function runs, the cell returns:
287 Repositories
Live example of this in Google Sheets:
Simple importXML example
Making it less likely to break
Web scraping is an unreliable way of getting data from a website, as the HTML can change and break an XPath selector.
We can improve the selector by selecting something unique on the element rather than relying on the structure of the page.
=importXML(
"https://github.com/nickcolley/jest-axe/network/dependents",
"//a[contains(@href,'network/dependents?dependent_type=REPOSITORY')]"
)
The contains XPath function looks for an anchor element whose href contains the data we need.
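Python's ElementTree doesn't support contains() in its limited XPath dialect, but the underlying idea, matching by a stable attribute rather than by position, can be sketched by filtering anchors on their href (again, the markup here is invented):

```python
import xml.etree.ElementTree as ET

# Invented markup: two links, only one with the href we care about.
html = """
<div>
  <a href="/nickcolley/jest-axe">jest-axe</a>
  <a href="/nickcolley/jest-axe/network/dependents?dependent_type=REPOSITORY">
    287 Repositories
  </a>
</div>
"""

root = ET.fromstring(html)

# Equivalent of //a[contains(@href, '...')]: keep anchors whose
# href contains the substring, regardless of where they sit.
matches = [
    a for a in root.iter("a")
    if "network/dependents?dependent_type=REPOSITORY" in a.get("href", "")
]
print(matches[0].text.strip())  # -> 287 Repositories
```

Because the match no longer depends on the surrounding structure, the page can be rearranged without breaking it, as long as the href stays the same.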
We can then use regexReplace to strip any characters that are not numbers, then turn the result into a number with value:
=value(
regexReplace(
importXML(
"https://github.com/nickcolley/jest-axe/network/dependents",
"//a[contains(@href,'network/dependents?dependent_type=REPOSITORY')]"
)
,"\D+", "")
)
When this function runs, the cell returns:
287
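The same clean-up step can be reproduced with Python's re module: replace every run of non-digits (the \D+ pattern) with nothing, then convert what's left to a number, mirroring what regexReplace and value do in the formula:

```python
import re

scraped = "287 Repositories"

# \D+ matches one or more non-digit characters; replacing them
# with "" leaves only the digits, which int() then converts.
count = int(re.sub(r"\D+", "", scraped))
print(count)  # -> 287
```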
Live example of this in Google Sheets:
More robust importXML example
Check out devhints.io/xpath if you want more information on how to build your own XPath selectors.
Updating the data
The result of importXML will not update unless the function's URL changes. This stops Sheets from pulling the same data every time the spreadsheet is opened.
I've noticed that it does seem to update on its own after a long period of time, but I could not find where this is documented; if you know more, let me know.
Force your cells to update by creating a cell that you update manually and appending its value to the end of your URL.
=value(
regexReplace(
importXML(
"https://github.com/nickcolley/jest-axe/network/dependents#" & $B$3,
"//a[contains(@href,'network/dependents?dependent_type=REPOSITORY')]"
)
,"\D+", "")
)
Live example of this in Google Sheets:
Refreshing data importXML example
Full examples
I've put together examples that will show you other ways you can make use of this.
Take Twitter account names and get their follower and following counts.
Open source project popularity
Take GitHub repositories and list how many people have starred, watched, and forked each project, and get npm weekly downloads.
Open source project popularity example
Ruby gem version usage in GitHub projects
List GitHub projects and check which version of the tech-docs-gem Ruby gem they're using.