[ASCII-art banner: "Steal from the rich, give to the poor" - Robin Hood]
The terms involved with searching have gotten a little confusing as their definitions have become obfuscated through the years. When most people think of "Search Engines," the ad-laden web pages of commercial engines usually come to mind, but your browser isn't the only way to search. There are many different types of search engines on the net and not all of them can be accessed through the web. Even the web-based engines can be harnessed through other means, which will be discussed in later Hunting Lessons. In this lesson we'll just concentrate our efforts on understanding various aspects of search engines in general.
Though the terms are not as clear as they once were, they are still important to understanding how everything works. A true "Search Engine" uses a program or agent (a "web spider") to scan virtually everything it comes across and builds an index of information from its findings by crawling link, after link, after link... (I think you get the idea). Quite often people mistake a "Directory" for a "Search Engine." A true "Directory" builds its index of information from user submissions. The "Meta Engines" simply query multiple search engines and directories and then sum up the results. Although all the other types of engines use databases to hold their information, a true "Database Engine" is built and maintained by some person, group or company to hold specific information.
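To make the "link, after link, after link" idea concrete, here is a minimal sketch of a spider building an index. It runs against a made-up, in-memory "web" (the page names, texts and the `TOY_WEB` and `crawl` names are all hypothetical) and makes no claim about how any real engine's spider is written.

```python
from collections import deque

# Hypothetical miniature "web": page name -> (text, list of linked pages).
TOY_WEB = {
    "home.htm": ("welcome to the hunting lessons", ["tools.htm", "links.htm"]),
    "tools.htm": ("softice and other tools", ["home.htm"]),
    "links.htm": ("links to search engines", ["tools.htm", "hidden.htm"]),
    "hidden.htm": ("nothing links back from here", []),
    "orphan.htm": ("no page links to this one", []),
}

def crawl(start, web):
    """Follow link after link from start; return {word: set of pages}."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        text, links = web[page]
        # Index every word found on the page.
        for word in text.split():
            index.setdefault(word, set()).add(page)
        # Queue each linked page once; ignore links that lead nowhere.
        for target in links:
            if target in web and target not in seen:
                seen.add(target)
                queue.append(target)
    return index

index = crawl("home.htm", TOY_WEB)
print(sorted(index["links"]))
print(any("orphan.htm" in found for found in index.values()))
```

Note that `orphan.htm` never makes it into the index: a spider only finds what something else links to, which is exactly why a Directory fed by user submissions can hold pages a Search Engine has never seen.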
The reason the terms are getting a bit blurred is that many of the "Search Engines" and "Database Engines" are now encouraging user submissions, and many "Directories" are also being fed by their own spiders and/or other search engines. Even some of the "Meta Engines" are taking on facets of the other three types and building their own indexes. Whether you're looking for the most current version of SoftICE, want to know who made a program or are just curious about some topic, a decent search should consist of a few "Search Engines" and a "Directory" or two. Anything less would really be leaving yourself in the dark, especially if the hunts you go on are somewhat obscure. The entire hunt can easily be done ad-free from YOUR OWN custom search page residing on your hard drive.
The usefulness of a search engine has nothing to do with its popularity, positioning or status. Most of this supposed fame is due to business alliances (i.e. money) rather than any technical prowess or functional utility. The idiot button on your Navigator (the one marked "Search") is of course linked to an ad-spamming page listing more ad-spamming commercial search engines. Companies had to pay Netscape five million dollars apiece for a top spot on that page. If by now you haven't redirected the idiot button to something useful like YOUR OWN search page, then you need to place your head in a vise and tighten it until you have two brain cells close enough together for a synapse.
Well, I think I've made myself clear. With all the money changing hands for ads, positioning and alliances, it's a good idea to know your history on the major commercial engines. The "who owns who" of the situation actually does affect the results of your searches to some degree.
As mentioned above, search engines use a "spider" or "crawler" agent to gather information. You can tell if your site has been visited by one of these robots by checking your web site logs and looking for the various names, which are often part of the crawler's host name. Here are the names of some of the spiders.
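Checking a log for such names can be sketched in a few lines. This is only an illustration under stated assumptions: the log lines are invented, the host name is assumed to be the first field of each line (as in the common log format with host lookups turned on), and the fragments in `SPIDER_HINTS` are hypothetical placeholders, not the names of real spiders.

```python
# Hypothetical name fragments; substitute the spider names you care about.
SPIDER_HINTS = ("spider", "crawler", "robot")

# Invented sample log; the host name is assumed to be the first field.
SAMPLE_LOG = """\
crawler3.search-example.com - - [01/Jan/1998:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 404 0
dialup42.isp-example.net - - [01/Jan/1998:10:05:00 +0000] "GET /index.htm HTTP/1.0" 200 1043
spider.engine-example.org - - [01/Jan/1998:10:09:00 +0000] "GET /index.htm HTTP/1.0" 200 1043
"""

def spider_visits(log_text, hints=SPIDER_HINTS):
    """Return the sorted host names that contain any of the hints."""
    hosts = set()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        host = line.split()[0].lower()
        if any(hint in host for hint in hints):
            hosts.add(host)
    return sorted(hosts)

print(spider_visits(SAMPLE_LOG))
```

A request for robots.txt, like the first line of the sample, is another decent hint that the visitor was a robot rather than a person.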
The main trouble with search engines is that most people really don't know how they work or how to use them effectively, and the result is a high noise-to-signal ratio. There is really no sense in rewriting all the advanced search help files for all the major search engines. You know how to find them off the advanced search pages and you can read them for yourself. If you don't bother reading a few of them, it's your loss and the rest of the hunting lessons will be a lot more difficult on you.
There are a few topics not covered by the said help files. The primary one is the various "types" of searches, mainly because explaining them would mean one engine mentioning the usefulness of another engine, a big no-no in a competitive business. A good example of misinformation is the "Subject" search. A "Subject" search of a Directory is quite a bit different from a "Subject" search of a Usenet Search Engine. The former gives results from an index of pages and topics, while the latter gives results based on the subject line of Usenet posts.
There are a number of different types of searches that can be done through the major engines, though not all engines support all types. The typical type is the "Title" search and it's fairly useless these days, namely because few people authoring web pages are smart enough to put a descriptive <Title> tag in their documents. Supposed "Natural Language" searching is another type available through many engines, but I've yet to see any public engine with good "Natural Language Processing" of either search phrases or documents. At the moment NLP searching is generally a waste of time, but the technology should improve in the years to come.
Ahhh, HREF Searching! This type of search can be quite a powerful little device when used correctly, but not all engines support it. Most search engines don't refer to it as "HREF Searching" and usually don't make it very obvious, but the phrase comes from the standard HTML attribute used in pages to create all links. There are three main facets of HREF searching: Anchor, URL and Link.
Example Link: Please DamageMicrosoft
The first HREF Search type, Anchor Search, is a search of the visible characters of the links. The example link above could be found through an Anchor Search on "please" "damage" "microsoft" "magemicro" or any combination of its letters. You've already noticed how the lack of spaces doesn't affect our search, but this may not be the case with all search engines. You can also force the Anchor Search to look for a complete phrase (with spaces) by enclosing it in quotes. The fact that the link actually sends you to Sun Microsystems doesn't affect anything in regards to our Anchor Search of its characters.
The second HREF Search type, URL Search, is extremely powerful. It's a search of the invisible part of links (i.e. host, path, page or file name). The example link above would turn up on a URL Search for "www.sun.com" along with a ton of other hits. The URL Search is more than a blessing when you're looking for specific files, because you can query by the actual file name and, unlike the specialized FTP/File Search Engines, the Web Search Engines can locate files that may only exist on a web server somewhere.
The third HREF Search type, Link Search, comes in handy for ferreting out interesting information, such as "How many other pages are currently linked to mine?" or "What pages are linking to a particular host, path or page?" Though this can be of limited value because of its specialized nature, it can still be the right tool for the job when the job is a special one. As for how it works, if a Link Search were done on "www.sun.com", somewhere in the massive list of hits would be this page because of the Example Link above.
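The three facets can be sketched over a tiny, made-up index. This is my reading, not any engine's actual code: I assume, as the example hunt later in this lesson suggests, that URL Search matches an indexed document's own address while Link Search matches the addresses its links point to. The `PAGES` data and every function name here are hypothetical, and matching is a plain case-insensitive substring test.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

def links_in(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical index: document address -> HTML source.
PAGES = {
    "http://lesson.example/hunt.htm":
        '<p><a href="http://www.sun.com/">Please DamageMicrosoft</a></p>',
    "http://www.sun.com/index.html":
        '<p><a href="http://lesson.example/">Some other page</a></p>',
}

def anchor_search(pages, term):
    """Documents with a link whose VISIBLE text contains term."""
    term = term.lower()
    return sorted(url for url, html in pages.items()
                  if any(term in text.lower() for _, text in links_in(html)))

def url_search(pages, term):
    """Documents whose own address contains term."""
    term = term.lower()
    return sorted(url for url in pages if term in url.lower())

def link_search(pages, term):
    """Documents with a link whose TARGET address contains term."""
    term = term.lower()
    return sorted(url for url, html in pages.items()
                  if any(term in href.lower() for href, _ in links_in(html)))

print(anchor_search(PAGES, "magemicro"))
print(url_search(PAGES, "www.sun.com"))
print(link_search(PAGES, "www.sun.com"))
```

The Anchor Search and the Link Search both land on the page holding the example link, one by its visible text and the other by its hidden target, while the URL Search lands on the Sun page itself.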
I'll walk you through an example hunt using the HREF Search. I did this on AltaVista through its Advanced Search page. Our prey will of course be fravia himself. The page of his we will find no longer exists, so you don't need to worry about anything.
We start with an Anchor Search on the nym "fravia" to find all the pages with links that have his name in the visible text of the links.
Search: anchor:fravia

I picked one of the results, Web Cache, at random and followed the link to it. Sure enough I found a link with the following text, "Fravia's searchlores.org." By the way, notice how the apostrophe didn't affect our search results. The link on the Web Cache page sends you to the interesting URL below.
http://www.geocities.com/Athens/5513/
I visited the URL given by the Web Cache page to find a boring page about some teacher in Minnesota named "Eric J. Ose" with a job, a wife and 2.4 kids. He supposedly used a Mac and AOLpress to make his site. Either the URL was an outright mistake or something is starting to seem a bit fishy around here. I ran another type of HREF Search, a URL Search, to see if the spiders and crawlers had even found this waste of web space and, more importantly, how deep into the pages the spider went.
Search: url:www.geocities.com/Athens/5513/

Well, we know the crawler found the basic.htm of the site but it didn't seem to go much further. Note: You can stop the robots very easily by placing a file named robots.txt in the main directory that lists which sub-directories contain no information, then hide whatever you like in those sub-directories. The single found document possibly came through the supposed link to fravia but it isn't what we're looking for... or is it? I know there is an Athens, Greece and an Athens, Georgia but Minnesota? Well, it's possible, but combined with an obvious "teacher" facade, the basic.htm could easily be a fake. So how do we find the right page name to add to the URL?
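For the curious, here is a minimal sketch of what such a robots.txt might look like and how a well-behaved robot would read it, using Python's standard `urllib.robotparser`. The directory names, the `site.example` host and the `AnyBot` agent name are all made up. Keep in mind that the file is only a request: polite robots honor it, nothing forces a robot to, and anyone can read the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, placed in the main directory of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="AnyBot"):
    """Would a robot that obeys robots.txt fetch this path?"""
    return parser.can_fetch(agent, "http://site.example" + path)

print(allowed("/index.htm"))
print(allowed("/private/secret.htm"))
```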
It's time to use our special HREF Search, the Link Search, to see how many links there are to the part of the URL we have and to see if anyone put a page name on their link.
Search: link:http://www.geocities.com/Athens/5513/

Hmmm, now this *IS* interesting. The search resulted in 60 pages with links matching the part of the URL we knew, created mainly by people who have a sTiCkY sHiFT kEy and are into cracking, hacking, phreaking, warez and other things. Obviously our teacher friend in Minnesota isn't real. I followed one of the links to a page, searched through the HTML source for the part of the URL we knew and found gold: the page name we needed.
http://www.geocities.com/Athens/5513/orc.htm

The orc.htm and other files are no longer at the site mentioned, so don't bother trying. If this were a real hunt, you would check all of the results from the Link Search for page names and write them down. If none of the page names work, the web master probably changes the page names regularly to keep people from linking into the middle of his secret site. The next step would be to send a Mapping-Bot in through an anonymous proxy to map the entire site from the basic.htm down to its last file. If the web master of the site is clever, there would be no links to the "real" pages of the site, but on some of the free web services such links are mandatory.
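Going through the HTML source of each Link Search result by hand gets old fast, so here is a minimal sketch of pulling the page names out automatically. The `SAMPLE_HTML` is an invented stand-in for one result page, `page_names` is a hypothetical helper, and the regular expression is a rough-and-ready match on href attributes rather than a full HTML parser.

```python
import re

KNOWN_PREFIX = "http://www.geocities.com/Athens/5513/"

# Invented HTML source standing in for one Link Search result.
SAMPLE_HTML = """
<a href="http://www.geocities.com/Athens/5513/">Main page</a>
<A HREF="http://www.geocities.com/Athens/5513/orc.htm">tHe GoOd sTuFf</A>
<a href="http://www.sun.com/">elsewhere</a>
"""

def page_names(html, prefix=KNOWN_PREFIX):
    """Return the sorted page names that follow prefix in any href."""
    pattern = re.compile(
        r'href\s*=\s*["\']?' + re.escape(prefix) + r'([^"\'\s>#?]+)',
        re.IGNORECASE,
    )
    # A link to the bare prefix carries no page name and is skipped.
    return sorted(set(pattern.findall(html)))

print(page_names(SAMPLE_HTML))
```

Run it over every result and merge the lists, and you have your written-down page names without the writing.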
If the Mapping-Bot fails and the secret pages are something I *really* want to access, I would run a Brute-Bot to sequentially go through page names like a.htm, b.htm, c.htm until it found a page the Mapping-Bot missed. It's slow work, very slow, but it can pay off if the hunch your Zen Approach gave you is correct. If you don't know about building, using and routing Auto-Bots, don't worry, I'll be covering it in some of my later Hunting Lessons.
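The sequential naming part is easy to picture. The sketch below only generates the names (a.htm through z.htm, then aa.htm, ab.htm and so on) and tests them against a made-up set standing in for the site; the `exists` check is a hypothetical placeholder for whatever a real bot would do, and nothing here touches the network.

```python
from itertools import count, islice, product
from string import ascii_lowercase

def candidate_names(extension=".htm"):
    """Yield a.htm, b.htm, ..., z.htm, aa.htm, ab.htm, ... without end."""
    for length in count(1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters) + extension

def first_hit(exists, limit):
    """Return the first of `limit` candidates for which exists(name) is
    true, or None if there is no hit within the limit."""
    for name in islice(candidate_names(), limit):
        if exists(name):
            return name
    return None

# Stand-in for the real site: a hypothetical set of page names.
FAKE_SITE = {"basic.htm", "orc.htm"}

print(list(islice(candidate_names(), 3)))
print(first_hit(FAKE_SITE.__contains__, 20000))
```

Even this toy needs more than ten thousand tries to reach a three-letter name, which is why I called it slow work.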
Though the search types above seem fairly straightforward, they can also benefit from using conditional statements, so before we go any further, it's a good idea to understand Boolean Operators and how to use them. The term "Boolean" honors George Boole, a 19th-century British mathematician who suggested that logical thought could be expressed as algebra. Most of the major search engines support some form of conditional operators, but the trouble is that both options and syntax will vary from engine to engine. A good example of syntax variance is AltaVista: its Standard Query allows the "+" and "-" operators, while its Advanced Query does not allow them but does allow "AND" and "NOT", which do the same thing. It's a good idea to bookmark both the advanced search pages and their corresponding help files.
These are the basic operators but the exact syntax for using them on various engines will vary. There are also a number of other operators available on different engines but their value is somewhat limited.
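Whatever the syntax, one way to picture what AND, OR and NOT do is as operations on sets of pages: each word picks out the set of pages containing it and the operators combine those sets. The sketch below shows only that idea; the `INDEX` data is made up and this is not a claim about how any engine evaluates queries internally.

```python
# Hypothetical inverted index: word -> set of pages containing it.
INDEX = {
    "robin": {"a.htm", "b.htm", "c.htm"},
    "hood": {"b.htm", "c.htm", "d.htm"},
    "sherwood": {"c.htm"},
}

def pages(word):
    """Pages containing the word; an unknown word matches nothing."""
    return INDEX.get(word, set())

and_hits = pages("robin") & pages("hood")       # robin AND hood
or_hits = pages("robin") | pages("sherwood")    # robin OR sherwood
not_hits = pages("robin") - pages("sherwood")   # robin AND NOT sherwood

print(sorted(and_hits))
print(sorted(or_hits))
print(sorted(not_hits))
```

AND narrows a search, OR widens it and NOT throws out the noise, which is the whole trick to improving that signal.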
fravia's searchlores.org
Copyright © 1998