Fravia's web-searching lore (¯`·.¸(¯`·.¸ rumst


(Courtesy of fravia's advanced searching lores)

(¯`·.¸ Getting the SIZE of the search engines ¸.·´¯)
by Rumsteack
slightly edited by fravia+
published at searchlores in July 2000

A very interesting approach... let's hope some readers will find the time to deepen this work.

As all readers of Fravia's searchlores should be aware, commercially oriented souls are now to be found everywhere on the web. Some of them are even trying (and managing) to make money with so-called search engine ranking / testing / algorithm reversing and the like.

Since I personally don't like such people at all (and since I can't punish them in other ways - at least for now : "A searcher could develop hisself into a dangerous fellow, if he needs to" - Fravia+), I have decided to write some essays about these interesting lores. I hope that as many seekers as possible will work on the tricks I have found, improve them, and eventually put the commercial bastards out of business, because the *knowledge* they are trying to sell or use will be free for anyone...

Nota bene: you have to work on these findings on your own. Unfortunately, I don't have the time (nor some other things) to work all day on them. Consider this small essay a 'starting path'.

Introduction

What are the primary tools all searchers use? The search engines. Quite logical.
How much of the Web do these search engines index? Nobody knows precisely (most estimates range between 25 and 40%). But the fact is that no search engine covers the whole Web. Consequently, provided you know what it means, selecting your search engine according to its size could be a good idea.

In fact, the size of a search engine is important for various reasons. First, the more sites there are, the more results you are supposed to get. Second, the more sites there are, the better your chances of finding interesting results with a complex query. Finally, the more sites there are, the more advanced the interface to them (that is, the search engine's query possibilities) will be (you can't index 300+ million sites with the same search options as for 50+ million).

Nevertheless, do not forget some old truths. The size of a search engine has nothing (or very little) to do with the number of unique results it gives. THIS point is one of the most important for us (it will be dealt with in another essay). Besides, the size of a search engine is directly linked to the number of 404 errors it gives: more sites to manage necessarily means more problems with the quicksand Web.

Let's go

All this must be quite interesting, but how do we find the sizes of our search engines? I mean, without relying on commercial estimates, which are most of the time completely useless for us.

The answer is: with our brains. Actually, why couldn't a searchable index of sites display all the sites it has in memory (provided it doesn't have timeout limitations or the like)? That would be illogical...

With this observation in mind, I've played a little with some big search engines and have found interesting results that I share here. Enjoy, and work on them.

Northernlight attack

As you have probably seen elsewhere on Fravia's site, it is indeed possible to get the entire size of Northernlight. Just run this query:
http://www.northernlight.com/nlquery.fcg?cb=0&qr=search+or+not+search&orl=2%3A1
and you will see how many URLs Northernlight has today.

I will give some explanations of that trick.
OR and NOT are boolean operators; they are used to apply a kind of logic to the terms they relate to.
For example, the query "search OR seek" will give you all the pages of an index which have the word search OR the word seek in them. Of course, they could also have both words.
Likewise, the query "NOT search" will give you all the pages of an index which do NOT have the word search in them.

You should be close to two conclusions.
The first: ORing a search broadens it, NOTing a search narrows it.
The second: given that the query "search" gives all the pages with the word search in them and that the query "NOT search" gives all the pages that do NOT have the word search in them, ORing these two queries will give the entire size of a database.

This second conclusion is used with Northernlight.

As soon as a search engine supports the OR and NOT operators, try this kind of query. Nevertheless, the search engine in question must fully support the NOT operator. That is, not just the "NOT between two terms", but the NOT with only one term.
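
The reasoning behind "search OR NOT search" can be checked on a toy index. This is only a minimal sketch with made-up pages and words; the helper names `pages_with` and `pages_without` are illustrative and have nothing to do with how Northernlight actually works.

```python
# Toy index: page id -> set of words on the page (made-up data).
index = {
    "page1": {"search", "engine"},
    "page2": {"seek", "lore"},
    "page3": {"search", "seek"},
    "page4": {"bots"},
}

def pages_with(word):
    """Pages that contain the word."""
    return {page for page, words in index.items() if word in words}

def pages_without(word):
    """Pages that do NOT contain the word (the single-term NOT)."""
    return set(index) - pages_with(word)

# ORing broadens: the union is at least as large as either side.
either = pages_with("search") | pages_with("seek")

# "search OR NOT search" covers every page, whatever the word is.
everything = pages_with("search") | pages_without("search")

print(len(either), len(everything), len(index))
```

Note that `pages_without` needs the whole index to compute its answer, which is probably why some engines only accept NOT between two terms.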

Infoseek attack

Infoseek is also kind enough to allow us to get its full size quickly. Just run this query:
http://infoseek.go.com/Titles?qt=url%3Ahttp&svx=home_searchbox&sv=IS&lk=noframes&nh=25

As you will notice, here I have used the URL search.
Actually, Infoseek allows searchers to search for words in the URLs of the indexed documents.
For example, if you want to find all the indexed sites with "search" in their URLs, just do a URL search on "search" and enjoy the results.

One thing should now click in your mind.
Given that most search engines index Web sites, ALL the URLs in their databases will begin with "http".
So, a query like "url:http" should give us all the sites whose address begins with "http"... In other words, this query will give us the full index of a search engine.

This trick is used with Infoseek.
Of course, it could also be used with Northernlight; see below:
http://www.northernlight.com/nlquery.fcg?dx=1004&qr=&qt=&pu=&qu=http
&si=&la=All&qc=All&d1=&d2=&rv=1&search.x=43&searchy=10#

As soon as a search engine allows a URL search, try it to see what happens to the number of sites returned. In fact, some big search engines have limited their URL search to the part AFTER the "//" of "http://". With those, the trick used here won't be of any use. But there are other possibilities, many others....
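
A short sketch of both points: how the "url:http" term ends up encoded in the Infoseek query string, and why an engine that only looks after the "//" defeats the trick. The parameter list is copied from the Infoseek URL above; the stored URLs and the two matcher functions are made-up illustrations, not a description of any engine's internals.

```python
from urllib.parse import urlencode

# Parameters copied from the Infoseek query above; only "qt" matters here.
params = [
    ("qt", "url:http"),
    ("svx", "home_searchbox"),
    ("sv", "IS"),
    ("lk", "noframes"),
    ("nh", "25"),
]
query_url = "http://infoseek.go.com/Titles?" + urlencode(params)
print(query_url)

# Toy index of stored URLs (made-up data).
stored = [
    "http://example.com/",
    "http://example.net/search.htm",
    "http://example.org/httpd/docs.htm",
]

def url_search_full(term):
    """Engine that matches the term anywhere in the stored URL."""
    return [u for u in stored if term in u]

def url_search_after_scheme(term):
    """Engine that only looks at the part after "//"."""
    return [u for u in stored if term in u.split("//", 1)[1]]

print(len(url_search_full("http")))          # matches every stored URL
print(len(url_search_after_scheme("http")))  # only URLs with "http" later on
```

With the second kind of engine, "url:http" degenerates into an ordinary word search inside host names and paths, so its count says nothing about the size of the index.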

FAST attack

FAST claims to have 300+ million sites. I'm happy to say they are right :-)
Like Infoseek or Northernlight, FAST is a good search engine for size study. Try this query, and tell me whether you like it or not:
http://www.ussc.alltheweb.com/cgi-bin/advsearch?terms=3&type=any&query=&lang=any&A1=&
B1=http&C1=url.all%3A&A2=&B2=&C2=&A3=&B3=&C3=&dincl=&dexcl=&hits=10&exec=FAST+Search

FAST is a special search engine in the sense that it doesn't support keyword searching. Nor boolean for that matter. You are compelled to use a form in order to search effectively. But we don't care: new experiences are searchers' daily bread...

One of FAST's advanced options is to put word filters on our queries. What is interesting is that you can define the elements of a page which have to contain the desired words: title, text, links, and... URLs.

Consequently, if we just specify that the pages we want must contain the word "http" in their URLs, the results should be all the sites of the database. And lo! This happens.

As with the URL search, if you want the size of a search engine that allows you to filter its results by fields, you should play a little with the URL filter "http". Note, however, that very few search engines allow this kind of filtering.

Lycos attack

Actually, I do not know if the results of this attack really mean what I believe. Just click here:
http://search.lycos.com/adv.aspsrchpro/?aloc=sb_init_field&first=1&lpv=1&
type=advwebsites&query=&t=all&qt=&qu=http&qh=&x=26&y=5
and you will notice that the number of results is EXACTLY the same as the FAST one.

Consequently, I am beginning to believe that these two share a common database (I may be wrong of course; I await your own conclusions).

Hotbot attack

Hotbot is also a special case, because it belongs to Lycos. So, I think it shares the Lycos/FAST database as well. Nevertheless, finding evidence of that is quite *difficult*, because the options in Hotbot are peculiar.

Here, neither the URL search (or refining) nor the NOT trick can be used. We have to think.
Let's peruse the help. We see that some field searches are allowed, among other oddities: depth, within, before, after,...
After trying a little, I've come to the conclusion that the query:
http://hotbot.lycos.com/text/default.asp?MT=depth%3A4+feature%3Aacrobat
+feature%3Aapplet+feature%3Aactivex+feature%3Aaudio+feature%3Aembed+feature%3Aflash
+feature%3Aform+feature%3Aframe+feature%3Aimage+feature%3Ascript+feature%3Ashockwave
+feature%3Atable+feature%3Avideo+feature%3Avrml&search=SEARCH&SM=B&DV=0&LG=any&DC=10&DE=2
is the one which gives the most results with Hotbot.

The trick is to OR all the possible search options, in order to broaden the number of results as much as possible.
The only problem is that the number returned is not the size of the entire database, but only of a big part of it. So, it's just a more or less precise estimate.
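
A small sketch of how such a query can be assembled, and of why it can only give a lower bound. The feature names are the ones in the Hotbot query above; the toy pages are made up, and the toy matching simply treats the feature terms as ORed, as described, while ignoring the depth term.

```python
from urllib.parse import quote_plus

# Feature names taken from the Hotbot query above.
features = [
    "acrobat", "applet", "activex", "audio", "embed", "flash", "form",
    "frame", "image", "script", "shockwave", "table", "video", "vrml",
]

# Build the value of the MT parameter: spaces become "+", ":" becomes "%3A".
terms = ["depth:4"] + ["feature:" + name for name in features]
mt = quote_plus(" ".join(terms))
print(mt)

# Toy index: page -> features detected on it (made-up data).
pages = {
    "a": {"image", "table"},
    "b": {"form"},
    "c": set(),  # plain text page: no feature term matches it
    "d": {"script", "image"},
}
matched = {p for p, feats in pages.items() if feats & set(features)}
print(len(matched), "of", len(pages))
```

Page "c" shows the limit of the method: a page carrying none of the listed features is never counted, however many terms are ORed together.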

By the way, you will notice that there are more sites in the Hotbot database than in the FAST/Lycos one. This shouldn't be so, because, as said, Hotbot BELONGS to Lycos.
The explanation might be the following: Hotbot was bought by Lycos, but before that it was an independent search engine, with its own index. So, I think that now Hotbot uses the FAST/Lycos database in addition to its previous one.

Yahoo attack

I won't even bother trying to list all the URLs at Yahoo, because it's already done. Actually, for each category, Yahoo displays the number of sites it has recorded. So, with some simple maths (additions, that is) you can easily find the global size of the Yahoo database.

Of course, you don't have to display all the pages each time you want to count the number of sites. Just build your own "Yahoo size fetcher" bot and let it work for you. The section about bot-building could be an interesting starting point for this. Keep in mind that the number of reported sites is always enclosed in parentheses; this will make things easy for your script.
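
The counting part of such a bot could be sketched as follows. The sample page text is hypothetical (the real Yahoo markup may well differ), and `category_counts` is just an illustrative name; the only thing taken from the essay is that the counts sit between parentheses.

```python
import re

# Hypothetical snippet of a category page; the real markup may differ.
sample_page = """
Computers and Internet (1204)
Search Engines (87) NEW!
Web Directories (310)
About this category (see help)
"""

# A count is a run of digits (optionally with thousands separators)
# and nothing else between the parentheses.
COUNT = re.compile(r"\((\d[\d,]*)\)")

def category_counts(page_text):
    """Return every parenthesised number found in the page text."""
    return [int(m.replace(",", "")) for m in COUNT.findall(page_text)]

counts = category_counts(sample_page)
print(counts, sum(counts))
```

One caveat the sketch does not handle: if a parent category's number already includes its subcategories, adding both would count the same sites twice, so a real bot would have to decide at which level of the tree to sum.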

Altavista attack

This one is hard, due to Alta's damned "timeout" limit.

Given that Altavista supports the NOT search, let's try this query:
http://www.altavista.com/cgi-bin/query?hl=on&q=%28search+OR+NOT+search%29&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=&d1=
It's impossible!! Altavista's database cannot be so small.

Given that Altavista supports the URL search, let's try this query:
http://www.altavista.com/cgi-bin/query?hl=on&q=url%3Ahttp&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=&d1=
Impossible!! Something must be wrong.

Actually, Altavista has a timeout. That is, it doesn't scan its whole database. Consequently, we can never get all the sites it has indexed this way.

But, we can play a little with the Date filtering. Remember: filtering can fetch incredible results if used correctly.

After some tries, I've discovered these two complementary queries:
http://www.altavista.com/cgi-bin/query?hl=on&q=search+OR+NOT+search&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=01%2F01%2F70&d1=01%2F01%2F00
and
http://www.altavista.com/cgi-bin/query?hl=on&q=search+OR+NOT+search&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=01%2F01%2F00&d1=

The sum of the two results still gives something strange. Provided the engineers at Altavista are not totally incompetent with their Date system, all the estimates I have seen until now of this search engine's size seem to be wrong: Altavista seems to have 430+ million sites indexed!!
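
The idea of splitting the index into complementary date windows and adding the counts can be illustrated on a toy index. Everything below is made up for the illustration; in particular, `count_between` uses a half-open window so that a page dated exactly on the split day is counted once. Whether Altavista's d0/d1 bounds are inclusive is not known here, and if both are, pages dated 01/01/00 would be counted twice.

```python
from datetime import date

# Toy index: page -> last-modified date (made-up data).
pages = {
    "p1": date(1996, 5, 17),
    "p2": date(1999, 12, 31),
    "p3": date(2000, 1, 1),
    "p4": date(2000, 6, 30),
}

def count_between(start, end):
    """Count pages with start <= date < end (half-open window)."""
    return sum(1 for d in pages.values() if start <= d < end)

split = date(2000, 1, 1)
before = count_between(date(1970, 1, 1), split)
after = count_between(split, date.max)

print(before, after, before + after)
```

If one of the two windows still runs into the timeout, the same reasoning suggests cutting it into smaller windows and summing more partial counts.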

Other search engine attacks

I hope you will find more tips and tricks on your own. Each search engine can be "reversed" differently. It's up to you to find the "magic queries"!!

Conclusion

You now have enough material to work on this "search engine size" stuff. The principle is simple: use queries that return as many results as possible.
You could also think about writing a "search engine size survey" bot. In fact, running all these queries by hand each time you want to compare the search engine sizes would be insane.
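
A possible skeleton for such a survey bot, as a sketch only. The engine names, URLs and page texts are placeholders, the `fetch` argument stands in for a real HTTP download, and taking the largest number on the page as the hit count is a crude assumption that would need tuning for each engine.

```python
import re

# Matches numbers such as "1,234,567" or "1234567".
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")

def largest_number(page_text):
    """Crude heuristic: take the largest number on the results page as the hit count."""
    values = [int(n.replace(",", "")) for n in NUMBER.findall(page_text)]
    return max(values, default=None)

def survey(queries, fetch):
    """Run each magic query through fetch(url) and collect the reported counts."""
    return {name: largest_number(fetch(url)) for name, url in queries.items()}

# Stand-in for a real HTTP fetch (placeholder URLs and page texts).
fake_pages = {
    "http://engine-a.example/q": "Results 1-10 of 1,234,567 documents found",
    "http://engine-b.example/q": "Found 345678 pages. Showing 1 to 25.",
}
queries = {
    "engine-a": "http://engine-a.example/q",
    "engine-b": "http://engine-b.example/q",
}

sizes = survey(queries, fake_pages.get)
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(name, size)
```

Passing the fetch function as an argument keeps the parsing testable offline; for a live run it could be replaced by a small wrapper around `urllib.request.urlopen`.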

Should you want to send me your additions / criticism / anything else, feel free to write to Rumsteack@operamail.com. (Note for Fravia+: please do not "protect" this email address of mine: I'm awaiting spam bots with some tools... :-)

Rumsteack, from France