As of today (24 July 2011), Google only lists 1980 pages for The Reeves Project, down from around 2600 a week or so ago. Bing and Yahoo only show Home and About Us.
What is our situation in terms of having the search engines crawl our pages? If TRP is to be a success, it has to be available on the web.
Also, on Google some of the pages are not being displayed in the correct Wiki format. Those pages have a notation at the bottom of the page - "The original document is available at http://thereevesproject.org/data/tiki-index.php?page=Reeves_John_3282". What has caused this?
Google IS NOT reporting a large number of crawl errors (only 25, less than 1%), so it must be making conscious decisions about pages it doesn't want to index.
I know we have a number of leaf nodes generated within structures which have no unique content added. They will read simply "Table of Contents:". I don't know exactly how many of these pages there are, but I could well understand that they aren't indexed, as they have duplicate content. I don't have the time or inclination at present to wade through and find out, at an individual page level, what it's deciding to omit. I'm trying to tackle the higher-level issues first, like getting coverage by all the major search engines.
Yahoo and Microsoft work in collaboration with respect to crawling. I submitted our site map to Yahoo at about the same time as I did to Google in June. Yahoo validated the file within a few days and has ever since indicated that it is "pending submission to Microsoft". On 21 July I got bored with waiting and directly submitted the Site Map to Bing. I'll be checking on the status there later.
Excessive duplicate or near-duplicate pages are one attribute that search engines are known to penalise sites for. Because of the way the TW software generates pages, it's possible for one wiki page to have at least three discrete URLs. I'm working with Barry to tune our Robots.txt file, which should determine which versions of the URLs the bots do or don't crawl.
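For illustration, the kind of rules we're looking at would tell crawlers to skip the duplicate variants of each page. This is only a sketch: tiki-print.php is TikiWiki's standard print-view script, but the exact patterns for our install still need to be confirmed against the URLs Google has actually indexed.

```
# Sketch only - patterns need verifying against our real URLs
User-agent: *
# Print-view copies of wiki pages duplicate the regular view
Disallow: /data/tiki-print.php
```

Rules like these apply to all well-behaved crawlers; badly behaved bots ignore robots.txt entirely, but Google, Bing and Yahoo all honour it.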
How a page is displayed by Google isn't under our control. I've just searched for
"John Reeves" "The Reeves Project" terrell on Google and got three good hits. The page preview from Google for Reeves_John_3282 looks good to me. I'll contact you directly to better understand the issue here.
With Barry's help, this is my primary focus for TRP at present.
Martin 25 July 2011
The following text was initially added as comments, now promoted to page content
Sun 24 of July, 2011 10:20 CDT, by Carolyn in AR
When I googled The Reeves Project, the first link was a post by Sharland which is fine. The second hit was a bad link to the old, crashed site. The third hit was also a bad link to the Reeves Registry Forums.
on Mon 25 of July, 2011 05:04 CDT, by MartinB.
Since Google uses a variety of factors, including location, in determining which results to present to a user in response to a search, it's going to be difficult for me to reproduce any specific search test.
WRT the two bad links from The Reeves Registry, both the old, crashed site and the forum, it looks like the whole of the RR site may be down. Google has a good cached result for the Forum post I made, but attempting to load even http://www.reevesregistry.com/ gives an error page at present.
on Mon 25 of July, 2011 10:11 CDT, by Beverly
I looked at Reeves_John_3282 this morning and it looks fine now.
What I had seen before on that page and several others was the correct information but no masthead and no side margins. It didn't affect the content; it just didn't look right, and I didn't understand what was causing the change in format.
on Mon 25 of July, 2011 05:04 CDT, by MartinB.
Sounds like you're describing a wiki page in full screen mode.
We need to stop the crawlers accessing both the fullscreen version, where the URL has "&fullscreen=y" appended, and the regular version after full-screen viewing, where "&fullscreen=n" is appended.
Barry & I are trying to work out what the appropriate lines in Robots.txt are to prevent these versions being crawled, without any undesired side effects.
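The lines we're experimenting with look roughly like this. A sketch only: matching a query-string parameter needs the "*" wildcard, which Google and Bing support as an extension, but which isn't part of the basic prefix-only robots.txt standard, so behaviour with other crawlers may differ.

```
User-agent: *
# Block any URL whose query string carries the fullscreen flag
# (the * wildcard is a Google/Bing extension to robots.txt)
Disallow: /*&fullscreen=n
Disallow: /*&fullscreen=y
```

If the parameter can also appear first in a query string ("?fullscreen=" rather than "&fullscreen="), matching Disallow lines for that form would be needed as well.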
Postscript - turns out it was a print screen view. Barry & I are also working to stop those being crawled as well.
Robots.txt was additionally updated in early October 2011 to instruct crawlers not to access the "&fullscreen=n" versions of our web pages, and those pages previously indexed are now slowly being dropped from Google's indices, without adverse impact on the total number of pages indexed.
Once the bulk of the "&fullscreen=n" versions have been dropped, we'll update robots.txt again to additionally ignore the "&fullscreen=y" versions.
Today (17 October 2011) Google's Web Master tools reports 2940 indexed pages from a site map containing 3339 pages.
Bing continues to be reluctant to index our pages and whilst I'll continue to seek ways to improve this, I'm going to set the status of this issue to closed.