Web Data Scraping Guide for SEO
What is Web data scraping?
We frequently hear the terms “data scraping” and “data mining”. It’s true that there is a wealth of information out there and every marketer can reap some benefit from it, but the question is how. I have been using a number of tools for a while now to scrape data for SEO, and I use that data for on-site audits, link analysis, and blog prospecting and outreach. Data scraping for SEO does sound complicated, but trust me, it’s not. You just need to make sure you have the right tools for the job and that you are going after the right data.
How to use scraped data for SEO?
You can use a number of crawlers to crawl the site you are auditing and scrape very useful data on on-page elements to determine whether the site is well optimised. You can use tools like Screaming Frog (paid, with a functional free version available) or Xenu (free). I would recommend Screaming Frog; you could call it the “Ferrari” of crawlers. There is a great article on the SEER Interactive blog on how to use Screaming Frog to its maximum potential.
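To make the idea concrete, here is a minimal sketch (standard library only, with illustrative sample HTML) of the kind of on-page data a crawler like Screaming Frog collects for each page: the title, meta description, and H1 headings.

```python
# Minimal on-page audit sketch: extract the on-page elements an SEO
# crawler reports (title, meta description, H1 tags) from raw HTML.
# The sample HTML below is purely illustrative.
from html.parser import HTMLParser

class OnPageAuditor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.h1s = []
        self._in_title = False
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "h1":
            self._in_h1 = True
            self.h1s.append("")
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.meta_description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_h1:
            self.h1s[-1] += data

html = """<html><head><title>Blue Widgets | Acme</title>
<meta name="description" content="Buy blue widgets online."></head>
<body><h1>Blue Widgets</h1></body></html>"""

auditor = OnPageAuditor()
auditor.feed(html)
print(auditor.title)             # the page <title>
print(auditor.meta_description)  # the meta description
print(len(auditor.h1s))          # number of H1s (ideally exactly one)
```

In a real audit you would fetch each URL from the crawl list and run this parser over the response, then flag pages with missing, duplicate, or overly long elements.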
With bad-link penalties from Google now a reality, link analysis (or link auditing) is becoming more vital by the day. Given the pre-2010 SEO tactics utilised by many business owners and agencies, it is quite obvious that many of them now have a number of links in their backlink profile they wish would disappear. Mighty Google (and Bing as well) took pity on us and allows us to inform the search engines of these bad links via the Disavow tool (both Google and Bing offer one). However, I believe disavowing is not a good solution until you know which links are good and which are bad.
This is where data scraping can help you determine external link quality. You might say there is backlink data available out there via third-party tools such as Open Site Explorer, Ahrefs, Majestic SEO etc., but none of them reports on the complete backlink profile. As Google is our primary target engine, I prefer to use their data, and it is certain that if someone receives a link penalty, the offending link(s) will be listed in the backlink profile from Google Webmaster Tools. Now you will face a problem: Google does have backlink data for you, but without any additional metrics, so it’s hard to tell which links are the bad ones! Yes, you can check them one by one in your browser, but if you have a link profile with thousands of links, I wish you all the best with that!
Data scraping to the rescue. You can gather link value metrics using a number of scraping tools, depending on which metrics you use to evaluate links. I personally use the linking domain’s PageRank, server status (404, 500 etc.), page title (e.g. look for titles in a foreign language if you have an English-language site), the number of external links on the linking page (e.g. over 100 external links), and a match against a blacklist (linking domains from known spam sites, low-quality directories etc.). You might want to use different metrics; however, the metrics I have mentioned above can be scraped using the tools below:
Server status, page title and number of external links – Screaming Frog using list mode (see under Configuration)
Blacklist – I have a list of a couple of million blacklisted domains that you can download for free.
You can also find an example of a link analysis for bad links (done by me) using this method here; it has a couple of extra metrics such as TrustFlow and CitationFlow. I will publish a blog post with the detailed process behind this analysis soon.
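The checks above can be sketched in a few lines of Python. This is a hedged, standard-library illustration, not my exact process: the blacklist entries, thresholds and sample data are hypothetical, and a real run would fetch each backlink URL before parsing it.

```python
# Sketch of the link-audit checks described above: for each backlink URL,
# record server status, page title, external link count, and whether the
# linking domain appears on a blacklist. Sample data is hypothetical.
from html.parser import HTMLParser
from urllib.parse import urlparse

BLACKLIST = {"spam-directory.example"}   # load your full blacklist here
EXTERNAL_LINK_LIMIT = 100                # threshold mentioned in the text

class LinkAuditParser(HTMLParser):
    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.title = ""
        self.external_links = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            host = urlparse(dict(attrs).get("href", "")).netloc
            if host and host != self.page_host:
                self.external_links += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def audit_backlink(url, status, html):
    """Flag a linking page that looks dead, blacklisted, or link-heavy."""
    host = urlparse(url).netloc
    parser = LinkAuditParser(host)
    parser.feed(html)
    flags = []
    if status >= 400:
        flags.append("dead (%d)" % status)
    if host in BLACKLIST:
        flags.append("blacklisted domain")
    if parser.external_links > EXTERNAL_LINK_LIMIT:
        flags.append("too many external links")
    return {"url": url, "title": parser.title, "flags": flags}

sample = ('<html><head><title>Best links</title></head><body>'
          '<a href="http://other.example/">x</a></body></html>')
result = audit_backlink("http://spam-directory.example/page", 200, sample)
print(result["flags"])   # ['blacklisted domain']
```

Running this over an exported Webmaster Tools link list gives you a first triage; anything flagged goes into the pile for manual review or disavow.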
Blog prospecting and outreach data
It is becoming more and more difficult to run a link building campaign within a set budget, mainly due to the number of human hours needed to prospect and find outreach information (email, social media etc.). Most agencies charge $150–$300 per hour for SEO work, so it is quite hard to justify investing 20–30 hours in link prospecting and collecting outreach emails. There are services out there to help you with blog outreach; however, they seem too protective of their data (no raw export etc.) and are quite costly for the service they provide. On the other hand, you cannot manually compile a massive list of blogs and potential sites to outreach to.
You can use a crawler such as Screaming Frog on sites that list lots of qualified blogs (verified by actual humans and categorised by content): run a complete site crawl and export the external links. You can also crawl a specific category to export only the sites under that category. Once you have the list of high-quality blogs, you can use ScrapeBox to generate data on blog popularity such as PageRank, Alexa Rank, number of pages indexed etc. You will then have a great list of blogs in the niche you are working on, complete with value metrics, so you know who to outreach to first. You can also use a browser-based scraper such as Multi Links for Firefox or Scraper for Chrome to scrape lists already created by known publishers, or to scrape search results. Again, I would love to write a detailed post about the process I have described above.
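The core of that prospecting step, pulling the external links out of a directory page and deduplicating them by domain, can be sketched as follows. This is an assumption-laden illustration (the directory HTML and domain names are made up), and a real crawl would fetch the directory pages first.

```python
# Hedged sketch of the prospecting step: extract every external link from
# a blog directory page's HTML and dedupe by domain, producing a prospect
# list you could then feed into a tool like ScrapeBox for value metrics.
from html.parser import HTMLParser
from urllib.parse import urlparse

class ExternalLinkExtractor(HTMLParser):
    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host      # the directory's own host, to skip
        self.domains = set()          # unique external domains found

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        host = urlparse(dict(attrs).get("href", "")).netloc
        # Relative links have no netloc, so internal links are skipped.
        if host and host != self.own_host:
            self.domains.add(host)

directory_html = """
<a href="http://craftblog.example/">Craft Blog</a>
<a href="/about">About</a>
<a href="http://foodblog.example/recipes">Food Blog</a>
"""
extractor = ExternalLinkExtractor("directory.example")
extractor.feed(directory_html)
print(sorted(extractor.domains))
# ['craftblog.example', 'foodblog.example']
```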
Once you have a nice list of blogs and sites to outreach to, the next thing you need is contact details for the identified opportunities. If you are after emails only, you can purchase Web Email Extractor from Newprosoft; it works really well for large lists. If you need social profiles as well, you can buy BuzzStream for link building, which gives you a scraper capable of gathering all social profiles along with an email scraper (however, this email scraper searches certain URLs only and returns far fewer contacts than Web Email Extractor). You can also create a custom scraper on 80legs to scrape social profiles from URLs.
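At its simplest, the email-extraction part of those tools boils down to scanning page text for address-like strings. Here is a minimal regex-based sketch (the sample text is illustrative, and commercial tools add crawling, obfuscation handling and validation on top of this):

```python
# Minimal email-extraction sketch: scan a page's text for address-like
# strings with a regex and dedupe them. Sample text is illustrative.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return unique email-like strings, keeping first-seen order."""
    seen, out = set(), []
    for match in EMAIL_RE.findall(text):
        if match.lower() not in seen:
            seen.add(match.lower())
            out.append(match)
    return out

page_text = "Contact editor@craftblog.example or tips@craftblog.example."
print(extract_emails(page_text))
# ['editor@craftblog.example', 'tips@craftblog.example']
```

In practice you would run this over the contact and about pages of each prospect rather than the whole site, which keeps the crawl cheap and the results relevant.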
Happy scraping! Let me know in the comments if you love scraping too and which tools you use to get the job done. Also let me know if you want to see more posts on data scraping like this one.