
Feb 11, 2016

GoogleSearch-R

Many web scraping tools and cloud services are available, and they vary widely in features. Here I'll show you one task that such tools handle: scraping Google search engine results (links only) using RStudio.
 

GoogleSearch-R using R Studio

Here, I'll show you how to scrape the URLs from the first few pages of Google search engine results for whatever search query you enter, and store the list in a CSV file for further use.
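The paging behind "the first few pages" is just a start offset that grows in steps of 10, as the loop in the script does. As a minimal sketch of that idea for an arbitrary query, the helper below builds the page URLs; the name buildSearchURLs and its parameters are assumptions for illustration and are not part of the original script.

```r
# Hypothetical helper (not in the original script): build result-page URLs
# for a query, one URL per page, using a 'start' offset in steps of 10.
buildSearchURLs <- function(query, pages = 5,
                            base = "http://www.google.co.in/search") {
    starts <- seq(0, by = 10, length.out = pages)        # 0, 10, 20, ...
    encoded <- utils::URLencode(query, reserved = TRUE)  # escape spaces etc.
    paste0(base, "?q=", encoded, "&start=", starts)
}

urls <- buildSearchURLs("Mayur Dighe", pages = 3)
print(urls)
```

Using paste0() avoids the spaces that paste() inserts between its arguments, so no whitespace cleanup is needed afterwards.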

Why scrape Google Search Engine Results?
The most common reason to scrape Google search engine results is keyword planning and deeper keyword analysis. Another common reason is to monitor your website's organic search ranking in Google for specific keywords.
 

The Code
#-load packages
library(RCurl)
library(XML)


#-function to remove all whitespace from string 'x' (paste() puts spaces between its arguments)
trim <- function( x ) {  gsub("([[:space:]])", "", x) }


#-function to scrape list of URLs
googleURLs <- function(u){
    ##- parse HTML
    doc <- htmlParse(getURL(u))
 
    ##-find matching node with H3 Tag, Anchor Tag and HREF attribute
    attrs <- xpathApply(doc, "//h3//a[@href]", xmlAttrs)

    ##- pull out each href value and keep only those that contain 'http'
    links <- sapply(attrs, function(x) x[["href"]])
    links <- grep("http", links, fixed = TRUE, value = TRUE)

    ##- the hrefs are expected to look like '/url?q=<target>&...', so strip the wrapper
    ##- split on '&' and keep the first piece
    links <- strsplit(links, '&')
    links <- sapply(links, function(x) x[1])

    ##- split on '=' and keep the second piece (the target URL)
    links <- strsplit(links, '=')
    links <- sapply(links, function(x) x[2])


    ##- append the list of URLs to the googleURLs.csv file
    write.table(links, file="googleURLs.csv", append=TRUE, sep=",", row.names=FALSE, col.names = FALSE)
}


#- Use a for loop to grab links from the first 5 search result pages (start = 0, 10, ..., 40)
for (i in seq(0,40,10)){
    u <- trim(paste("http://www.google.co.in/search?q=MayurDighe&start=", i,""))
    googleURLs(u)
}
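
To see what the two strsplit() steps inside googleURLs() do, here is a minimal offline sketch that runs them on made-up href values. The sample strings and the helper name extractTarget are assumptions for illustration; the markup Google actually returns may differ.

```r
# Hypothetical href values in the '/url?q=<target>&...' shape the script expects
hrefs <- c("/url?q=https://www.example.com/page&sa=U&ved=abc",
           "/url?q=http://example.org/&sa=U")

# Illustrative helper (not part of the original script)
extractTarget <- function(href) {
    # keep everything before the first '&' (drops the extra parameters)
    first <- strsplit(href, "&", fixed = TRUE)[[1]][1]
    # split '/url?q' from the target URL
    parts <- strsplit(first, "=", fixed = TRUE)[[1]]
    if (length(parts) < 2) return(NA_character_)   # no '=' found
    parts[2]
}

targets <- vapply(hrefs, extractTarget, character(1), USE.NAMES = FALSE)
print(targets)
```

One limitation of this approach: a target that itself contains a literal '=' would be cut short at that character.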


What will the result look like?
Answer: see below (googleURLs.csv)
"https://in.linkedin.com/pub/dir/Mayur/Dighe"
"https://www.facebook.com/public/Mayur-Dighe"
"https://in.linkedin.com/pub/dir/Mayur/Dighe/in-7350-Pune-Area,-India"
"https://www.quora.com/profile/Mayur-Dighe"
"https://bintray.com/mayurdighe"
"http://www.tripoto.com/mayurdighe"
"http://www.upwork.com/o/profiles/users/_~01459018039c776d0c/"
"https://plus.google.com/115474132218686530375"
"https://www.instagram.com/mayur.dighe/"
"https://www.hackerearth.com/users/MayurDighe/"
"https://twitter.com/mayurdighe739"
"https://twitter.com/mdighe10"
"https://twitter.com/mayur__dighe"
... cont.
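
For the "further use" step, the CSV can be read back into R. The sketch below assumes one quoted URL per line and no header row; a temporary file with a few of the rows above stands in for googleURLs.csv, and the helper firstRank is a hypothetical name introduced here to show a simple ranking check.

```r
# Assumption: one quoted URL per line, no header row.
# A temporary file stands in for googleURLs.csv in this sketch.
csvPath <- tempfile(fileext = ".csv")
writeLines(c('"https://www.quora.com/profile/Mayur-Dighe"',
             '"https://twitter.com/mdighe10"',
             '"https://twitter.com/mayur__dighe"'), csvPath)

# Read the single column as a character vector and drop duplicates
urls <- read.csv(csvPath, header = FALSE, stringsAsFactors = FALSE)[[1]]
urls <- unique(urls)

# Hypothetical helper: 1-based position of the first URL containing 'domain'
firstRank <- function(urls, domain) {
    hits <- grep(domain, urls, fixed = TRUE)
    if (length(hits) == 0) NA_integer_ else hits[1]
}

print(firstRank(urls, "twitter.com"))
```

Because the script appends to the file on every run, dropping duplicates with unique() before analysis is a reasonable precaution.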



Conclusion
This GoogleSearch-R code is a small effort to scrape a list of URLs from Google Search. With a few tweaks, you can also grab URLs from other well-known search engines such as Yahoo and Bing.

About The Author :

Freelancer and IT Engineer
Softwares Developed by Mayur Dighe ImmortalDotNet.WordPress.com


All Rights Reserved. 2014 Copyright SIMPLITONA
