TRP News - 8 November 2011

TRP & Google

Just a quick follow-on from the various reports during the third quarter of 2011 about how Google is progressing with crawling The Reeves Project website. It remains the case that Google will give the world at large by far the best coverage of TRP content.

Site Maps

Our site map file provides links to the pages we'd most like search engines such as Google to index. The crawlers then find and index additional pages by following links on these pages.

The first version of our site map was posted on 26 June, the second on 14 July and the current version on 28 August, as denoted by the steps in the gold line on the chart below, which is only visible to registered users.
The current site map has 3339 URLs for search engine crawlers to examine.
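
For readers curious how a figure like the URL count can be checked, here is a minimal sketch that counts the entries in a site map. It assumes the file follows the standard sitemaps.org XML layout; the two sample URLs and the function name sitemap_urls are made up for illustration and are not taken from our real file.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical two-entry site map; the real file lists 3339 URLs.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/wiki/PageOne</loc></url>
  <url><loc>https://example.org/wiki/PageTwo</loc></url>
</urlset>
"""

def sitemap_urls(xml_bytes):
    """Return the list of <loc> URLs found in a site map document."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

urls = sitemap_urls(SAMPLE_SITEMAP)
print(len(urls), "URLs listed for crawlers")
```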

Robots.txt

As well as finding the pages we'd like them to index from the site map, search engine crawlers will also notice links to other pages, many of which we don't want crawled.

The blue bars on the chart represent the pages being ignored by Google based on the explicit instructions we've coded in a file called robots.txt.

We posted our first version of the exclusion file on 31 July to stop print and edit versions of the wiki pages being indexed. It took a few days before Google started ignoring these pages and dropping those it had already indexed.
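
To give a feel for what such an exclusion file does, here is a minimal sketch using Python's standard urllib.robotparser module. The script names (view.php, print.php, edit.php) and the example.org host are hypothetical stand-ins chosen for illustration; they are not the actual rules or paths in our robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the script names are illustrative assumptions,
# not the contents of the real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print.php
Disallow: /edit.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def crawlable(url):
    """True if a crawler identifying as Googlebot may fetch the URL."""
    return parser.can_fetch("Googlebot", url)

print(crawlable("https://example.org/view.php?page=Home"))   # normal view
print(crawlable("https://example.org/print.php?page=Home"))  # print version
print(crawlable("https://example.org/edit.php?page=Home"))   # edit version
```

Under these assumed rules the normal view stays crawlable while the print and edit versions are refused.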

The second change to the exclusion file occurred on 12 September, to test the exclusion of the full screen version of a very specific subset of pages, fewer than 40 in total. The first pages so excluded were noticed on 15 September, with no adverse impacts.

However, that does not explain the large down-step in excluded pages between 16 September and 18 September. That occurs because Google only reports and summarises data for the past 45 days, and 17 September is 45 days after 3 August, when the first exclusions occurred. Notice that the number of URLs indexed, as shown by the red line, remains constant across this period.
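
The date arithmetic behind that 45-day window can be checked with a few lines of Python. This is only a sketch of the calculation described above; the variable names are our own.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=45)          # reporting window described above
first_exclusions = date(2011, 8, 3)  # first excluded pages reported

window_edge = first_exclusions + WINDOW
next_edge = window_edge + WINDOW

print(window_edge)  # 2011-09-17, where the down-step appears
print(next_edge)    # 2011-11-01, start of the following 45-day period
```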

The next significant change to the exclusion file occurred on 8 October, causing a small up-tick in the pages being ignored. This was to stop wiki pages with &fullscreen=n in their URLs being indexed. (This URL parameter is only seen after having viewed a page in fullscreen mode.) This had a small but positive effect on the total number of pages indexed, as noted by the up-tick in the red line.
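
Excluding URLs by a parameter that can appear anywhere in the query string relies on the wildcard syntax ('*' for any run of characters, '$' for end of URL) that Google's crawler understands. The sketch below shows roughly how such a rule is matched. The rule text and the view.php path are assumptions for illustration, not a copy of our actual file, and the matcher is hand-written because Python's standard urllib.robotparser only does plain prefix matching.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule with '*' and '$' wildcards to a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then rejoin them with "any characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(rule, path_and_query):
    """True if the rule matches from the start of the path-plus-query string."""
    return rule_to_regex(rule).match(path_and_query) is not None

# Hypothetical rule: block any URL carrying the fullscreen=n parameter.
RULE = "/*&fullscreen=n"

print(is_blocked(RULE, "/view.php?page=Home&fullscreen=n"))  # True
print(is_blocked(RULE, "/view.php?page=Home"))               # False
```

The same style of pattern could, in principle, cover the sort-sequence parameters as well, although the exact rules needed depend on how those URLs are formed.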

There is no significant effect as we start the third 45-day period on 1 November.

The next significant change to the exclusion file occurred on 1 November, with an additional 3770 URLs being excluded over the following two days. This exclusion prevented the crawling of various sort sequences of the "Last Changes" list. Again, there was a small but positive effect on the total number of pages indexed, as noted by the up-tick in the red line.

A further change to the exclusion file was made on 7 November to prevent the indexing of various sort sequences of any page. The impact of this most recent change has yet to be observed. Over the coming weeks there is still more to do in giving the search engines the cleanest possible view of the content of our wiki pages.

TRP & Bing

And then there's Bing, which is still proving to be very stubborn in crawling our pages. However, given Bing's current focus on social media content, it's perhaps not surprising TRP doesn't feature strongly in their crawling priorities.
Contributors to this page: @MartinB, MartinB. and system.
Page last modified on Wednesday 01 of May, 2013 04:03:35 CDT by @MartinB.