Evaluating Spider-spotter Projects

ML Jan 28, 2006

The baby-step documentation system is working, and now it’s time to build the two spider-spotting projects from scratch. The first spider visits are about to start: this site has a little content on it, posts have been made with the blogger system, people who may have page-reporting toolbars have surfed to it, and a couple of outbound links will begin to show up in other sites’ log files. Those first visits are what we’re interested in now. Are we tracking search hits yet? No, that comes later.

So, how do we monitor spider visits? There are two projects here. The first is specifically monitoring requests for the robots.txt file. All well-behaved spiders request this file first to learn which areas of the site are supposed to be off limits. A lot of concentrated information shows up here, particularly about the variety of spiders hitting the site. You can’t always tell a spider when you see one in your log files, because there are so many user agents. But when something requests robots.txt, you know you have some sort of crawler on your hands. This gives you a nice broad overview of what’s out there, instead of myopically focusing on just GoogleBot and Yahoo Slurp.
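
One way this could be wired up, sketched as a classic ASP page. All the specifics here are hypothetical: robots.asp standing in for robots.txt (which would take an IIS mapping), the botlog.txt file, and the Disallow line. The script serves the usual directives and quietly appends the requester’s IP and user agent to its own little text file.

<%
' Hypothetical sketch: serve robots.txt content and note who asked for it.
' Assumes IIS is configured to hand robots.txt requests to this script.
Dim fso, logFile
Response.ContentType = "text/plain"
Response.Write "User-agent: *" & vbCrLf
Response.Write "Disallow: /admin/" & vbCrLf   ' placeholder directive

' Append one line per request: timestamp, IP, user agent.
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set logFile = fso.OpenTextFile(Server.MapPath("botlog.txt"), 8, True)  ' 8 = ForAppending
logFile.WriteLine Now & vbTab & _
                  Request.ServerVariables("REMOTE_ADDR") & vbTab & _
                  Request.ServerVariables("HTTP_USER_AGENT")
logFile.Close
%>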

The second project will be a simple way to view log files on a day-by-day basis. Log files are constantly being written to disk, and until the site becomes massively popular, they are relatively easy to load and look at. ASP even has dedicated objects for parsing and browsing log files. I’m not sure I’ll use them, because I might prefer to load the log as a text file and do regular-expression matches to pull out the information I want to see. In fact, it could be tied back to the first project. I also think the idea of time-surfing is important. Most of the time I will want to pull up “today’s” data, but often I will want to surf back in time, or pull up the entire history of GoogleBot visits.
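
As a rough sketch of what I have in mind (the W3SVC1 log folder, the exYYMMDD.log naming, and the ?bot= querystring parameter are all assumptions), the viewer could be as simple as opening the day’s log as plain text and echoing only the lines that match a pattern:

<%
' Hypothetical sketch: open today's IIS log as plain text and show only
' the lines matching a pattern (e.g. ?bot=Googlebot).
Dim logPath, fso, logFile, line, re

logPath = "C:\WINDOWS\system32\LogFiles\W3SVC1\ex" & _
          Right(Year(Date), 2) & Right("0" & Month(Date), 2) & _
          Right("0" & Day(Date), 2) & ".log"

Set re = New RegExp
re.Pattern = Request.QueryString("bot")   ' e.g. "Googlebot" or "Slurp"
re.IgnoreCase = True

Response.ContentType = "text/plain"
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set logFile = fso.OpenTextFile(logPath, 1)  ' 1 = ForReading
Do While Not logFile.AtEndOfStream
    line = logFile.ReadLine
    If re.Test(line) Then Response.Write line & vbCrLf
Loop
logFile.Close
%>

Passed “Googlebot”, that becomes a GoogleBot-only view of the day; left blank, it just dumps the whole log.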

It’s worth noting that you can make your log files go directly into a database, in my case SQL Server. But you don’t always want to do that, and I don’t want to program a chatty app. Chattiness is a concept that will come up over and over in the apps I make for HitTail, and exactly what is and isn’t chatty is one of those issues. Making a call to a database for every single page load IS a chatty app. So, I will stick with text-based log files. They have the additional advantage that text compresses really well when you archive it. Also, setting the webserver to start a new log file daily lays the groundwork for date-surfing: for each change of day, you simply connect to a different log file.
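
The date-surfing piece then falls out almost for free. A sketch of the idea, with the folder path and the ?d= parameter name again being assumptions: take a date from the URL, default to today, and turn it into the name of that day’s log file.

<%
' Hypothetical sketch: turn a surfable date (?d=2006-01-28) into the name
' of that day's log file, falling back to today when no date is given.
Function LogFileFor(d)
    LogFileFor = "C:\WINDOWS\system32\LogFiles\W3SVC1\ex" & _
                 Right(Year(d), 2) & Right("0" & Month(d), 2) & _
                 Right("0" & Day(d), 2) & ".log"
End Function

Dim theDay
If Request.QueryString("d") <> "" And IsDate(Request.QueryString("d")) Then
    theDay = CDate(Request.QueryString("d"))
Else
    theDay = Date
End If
Response.Write "Reading: " & LogFileFor(theDay)
%>

That one little function is basically the whole “connect to a different log file” trick.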

It will always be a question whether thought-work like this ends up in the blog or in the baby-step tutorials themselves. I think it will come down to the length and quality of the thought-work: if it shows the overall direction the HitTail site is going, it goes into the blog. So, this one makes it there. Time to post and start the tutorial. Which project comes first? And am I going to slow myself down with copious screenshots? They can be quite important for an effective tutorial, but they can cut the pace of a project almost in half. So, I’ll probably skip screenshots for now.

So, the robots.txt project or the log-file reading project? There is definitely data in the log files, even if it’s just my own page loads. But there’s not necessarily any data if we go straight for the robots.txt requests, which would make that app difficult even to test. Except that I could simulate requests for robots.txt, so that really shouldn’t stop me. So, I’m going to go for the easiest possible way to load and view the text files.
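
Simulating those robots.txt requests wouldn’t be hard, by the way. A throwaway sketch (the URL and the made-up user-agent string are placeholders): fire a GET at robots.txt with a spoofed User-Agent header and watch it show up.

<%
' Hypothetical sketch: generate fake crawler traffic against robots.txt
' so there's something to look at before real spiders arrive.
Dim http
Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
http.Open "GET", "http://www.hittail.com/robots.txt", False
http.setRequestHeader "User-Agent", "PretendBot/0.1 (just testing)"
http.Send
Response.Write "Status: " & http.Status
%>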
