A few days ago I had a discussion with our managing editor for our company's Web site about how crawlers discover and index pages. He was convinced that search engines can somehow find hidden pages on a Web site even if there are no links to those pages. I, on the other hand, wouldn't be persuaded. How could search engines crawl a page if they don't know the page's name and location, i.e. its path? Turns out we were both wrong – and right, depending on how you look at it.
In order for search engines to crawl a Web page, they must first be directed to it. Discovery generally happens through a hyperlink on another page that the crawler can follow. I'm not sure if search engines also follow plain text URLs, but it is a possibility. A site that wants to publicize a new page would normally link to it from other pages, or the page would appear in a directory index that lists all files in a directory when it is accessed (though Web sites normally disable this option for security reasons). In the absence of a link to a Web page's URL, crawlers would have no idea that the page exists (such a page is referred to as hidden or orphaned). I suppose they could engage in name-guessing, but that's an expensive proposition I suspect most search engines shun.
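To make the link-following idea concrete, here is a minimal sketch of the discovery step a crawler performs on each page it fetches: extract every anchor's href and resolve it against the page's URL. The page content and base URL are made-up examples, and real crawlers do far more (robots.txt, deduplication, queuing), but this is the core of it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page's own URL,
                    # just as a crawler would do before queuing them.
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page on a hypothetical site:
page = '<html><body><a href="/about.html">About</a> <a href="news/">News</a></body></html>'
extractor = LinkExtractor("http://example.com/")
extractor.feed(page)
print(extractor.links)
# These are the only URLs a crawler can discover from this page; a file
# sitting in the same folder with no link to it stays invisible.
```

The point of the sketch: if `hidden.html` exists on the server but no fetched page contains `<a href="hidden.html">`, it never enters the crawler's queue.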
Then a few days ago I ran into an anomaly that disproved my belief about hidden pages and crawler discovery. I was working on a fairly popular page (Browser Simulator/Emulator) on my personal site. Due to the nature of the page, it has the potential to become a tool in the hands of abusers, so it is monitored for abusive activity patterns. I began to notice that the page was being accessed excessively by Googlebot with specific parameters, as if a human were commandeering the page. Out of respect for users' privacy, however, I only monitor general patterns on that page, so I didn't have detailed information about Googlebot's activity.
With my curiosity piqued, I constructed a similar but hidden page in the same folder and switched on full monitoring. Then I began hitting the page, entering various data in the form fields. Sure enough, Googlebot began accessing that page with the same data I had specified. How could Googlebot discover the hidden page so fast (if at all) and pass in the same data as I was? A glance near the top of my Internet Explorer browser revealed the culprit. It was the Google toolbar, the seemingly innocuous toolbar that many people install on their browsers while remaining oblivious to its operation.
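Spotting this kind of activity comes down to scanning the server's access log for requests whose user-agent identifies the crawler. The sketch below assumes Apache's "combined" log format; the log lines, IP addresses, and the simulator path are made-up examples, not the actual data from my site.

```python
# Hypothetical access-log lines in Apache "combined" format (real logs vary):
log_lines = [
    '66.249.66.1 - - [10/May/2005:13:55:36 -0500] '
    '"GET /tools/simulator.php?url=http://example.org HTTP/1.1" 200 5120 '
    '"-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.168.0.5 - - [10/May/2005:13:56:02 -0500] '
    '"GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

def bot_requests(lines, marker="Googlebot"):
    """Return (path, user_agent) for each request whose user-agent mentions marker."""
    hits = []
    for line in lines:
        # In the combined format the request and the user-agent are the
        # first and third double-quoted fields on the line.
        parts = line.split('"')
        request, user_agent = parts[1], parts[5]
        if marker in user_agent:
            path = request.split()[1]  # "GET /path HTTP/1.1" -> "/path"
            hits.append((path, user_agent))
    return hits

for path, agent in bot_requests(log_lines):
    print(path, "<-", agent)
```

Seeing the hidden page's URL, query parameters included, show up against a Googlebot user-agent is exactly the pattern that gave the toolbar away.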
I am certain the Google toolbar comes with a privacy disclosure detailing how and what it gleans from the user's activity. I never bothered to read it, and chances are most people ignore it as well. I am also not sure what Google does with the data. I suppose they use it for ranking purposes, but I am now certain that Googlebot crawls the pages users visit. I am, however, still unsure whether the crawled pages ever make it into Google's index to be displayed as search results. I am also unsure whether the content the browser displays to users is sent to Google along with the URLs (this could have potentially disastrous privacy repercussions).
There you have it. If you place hidden pages in your Web folders, don't be too confident about their secrecy, even if those pages are only accessed internally by you and a few trusted people. Anyone with a Google toolbar (or any other toolbar such as Alexa or A9) would unwittingly send the URLs of those hidden pages to Googlebot (or other robots/spiders), potentially exposing the location of those pages to the world.