MSWC.IISLog or the TextStream Object to Parse Logfiles

ML Jan 28, 2006

OK, the first step in the first spider spotter project is choosing which technology to use to open and manipulate log files. There are basically two choices: the TextStream object and the MSWC.IISLog object. Both would be perfectly capable, but they bring up different issues. The power of manipulating the log files as raw text comes from using regular expression matching (RegEx). But doing RegEx manipulation directly within Active Server Pages means pulling the contents of the log file into memory and running RegEx on it there. And log files can grow to be VERY large. One way to control how much goes into memory is to wrap the ReadLine method of the TextStream object in logic that acts as a first-pass filter. So, if you were looking for GoogleBot, you could pull in only the lines of the logfile that mention GoogleBot. Then, you could use RegEx to further filter the results.
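Something like this little sketch is what I have in mind (WSH flavor, so CreateObject and WScript.Echo; in ASP it would be Server.CreateObject and Response.Write, and the log file name is just a placeholder):

    Const ForReading = 1
    Dim fso, ts, strLine, re
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile("ex060128.log", ForReading)
    Set re = New RegExp
    re.Pattern = "Googlebot/[\d.]+"   ' the fussier second-pass pattern
    re.IgnoreCase = True
    Do While Not ts.AtEndOfStream
        strLine = ts.ReadLine
        ' Cheap first pass: only lines that mention GoogleBot at all
        If InStr(1, strLine, "Googlebot", vbTextCompare) > 0 Then
            ' Second pass: the real RegEx match
            If re.Test(strLine) Then WScript.Echo strLine
        End If
    Loop
    ts.Close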

The other approach is to use MSWC.IISLog. I learned about this from the O’Reilly ASP book. It essentially parses the log file into fields. And I’m sure it takes care of a lot of the memory issues that come up if you try using the TextStream object. One problem is that it’s really a Windows 2000 Server technology, and I don’t even know if it’s in Server 2003. It uses a DLL called logscrpt.dll. So, first, to see if it’s even still included, I’m going to go search for that on a 2003 server. OK, found it in the inetsrv directory. So, it’s still a choice. The next thing is to really think about the objectives of this app. It’s going to have a clever aspect to it, so the more you use it, the less demanding it is on memory. And I’ll probably give this program a dual ASP/Windows Script Host (WSH) existence. One side will be real-time on page loads. The other will be for scheduled daily processing.
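From what I can tell in the O’Reilly book, reading a log with it looks roughly like this (treat the OpenLogFile arguments, especially the service name and instance, as my best guess rather than gospel):

    Const ForReading = 1
    Dim objLog
    Set objLog = Server.CreateObject("MSWC.IISLog")
    ' Arguments: file name, I/O mode, service name, server instance, log format
    objLog.OpenLogFile "ex060128.log", ForReading, "W3SVC", 1, 0
    Do While Not objLog.AtEndOfLog
        objLog.ReadLogRecord    ' loads the next record into the field properties
        Response.Write objLog.DateTime & " " & objLog.ClientIP & " " & objLog.UriStem & "<br>"
    Loop
    objLog.CloseLogFiles ForReading

No splitting lines on spaces by hand, no worrying about the #Fields header. That’s the appeal.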

Even though it’s really not worth pulling the entire logfile into a SQL database, it probably is worth pulling in the entire spider history. Even a popular site only gets a few thousand hits per day from GoogleBot, and from a SQL table perspective, that’s nothing. So, why write an app that loads the log files directly? It’s the enormous real-time nature of the thing, and the fact that you’ll usually be looking at the same day’s logfiles for up-to-the-second information. So, the first criterion for the project is to work as if it were just wired to the daily log files. But lurking in the background will be a task that, after the day’s log file has cycled, spins through it, moving information like GoogleBot visits into a SQL table. It will use the time and IP (or UserAgent) as the primary key, so it will never record the same event twice. You could even run it over and over without doing any damage, except maybe littering your SQL logs with primary key violation error messages.
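In ADO terms, that nightly catch-up might look something like this (the table, its fields, the connection string, and the sample values are all placeholders for whatever I end up with; the point is that the primary key makes the whole thing safe to re-run):

    Dim conn
    Set conn = CreateObject("ADODB.Connection")
    conn.Open "Provider=SQLOLEDB;Data Source=(local);" & _
        "Initial Catalog=SpiderSpotter;Integrated Security=SSPI;"
    On Error Resume Next    ' a key violation just means we already logged this visit
    conn.Execute "INSERT INTO SpiderVisits (VisitTime, ClientIP, UserAgent) " & _
        "VALUES ('2006-01-28 14:07:33', '66.249.66.1', 'Googlebot/2.1')"
    If Err.Number <> 0 Then Err.Clear    ' duplicate row; skip it and keep going
    On Error GoTo 0
    conn.Close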

MSWC.IISLog has another advantage. Because it automatically parses the log file into fields, I will be able to hide the IP addresses on the public-facing version of this app if I deem it necessary. Generally, it will only be showing GoogleBot and Yahoo Slurp visits, but you never know. I’d like the quick ability to turn off the display of the IP field, so I don’t violate anyone’s privacy by accidentally giving out their IP address. OK, it sounds like I’ve made my decision. I don’t really need the power of RegEx for spotting spiders. IISLog has a ReadFilter method, but it only takes a start and end time. It doesn’t let you filter based on field contents. OK, I can do that manually, even with RegEx at this point. If a record matches the pattern, show it. Something else may be quicker, though.
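So the loop becomes something like this (same guesses about OpenLogFile’s arguments as above, and plain InStr may well be the quicker something):

    Const ForReading = 1
    Dim objLog, showIP
    showIP = False    ' flip to True on the internal version to show IPs
    Set objLog = Server.CreateObject("MSWC.IISLog")
    objLog.OpenLogFile "ex060128.log", ForReading, "W3SVC", 1, 0
    Do While Not objLog.AtEndOfLog
        objLog.ReadLogRecord
        ' Manual field-level filter, since ReadFilter only narrows by time
        If InStr(1, objLog.UserAgent, "Googlebot", vbTextCompare) > 0 _
            Or InStr(1, objLog.UserAgent, "Slurp", vbTextCompare) > 0 Then
            Response.Write objLog.DateTime & " " & objLog.UriStem
            If showIP Then Response.Write " from " & objLog.ClientIP
            Response.Write "<br>"
        End If
    Loop
    objLog.CloseLogFiles ForReading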

OK, it’s decided. This first spider spotter app will use MSWC.IISLog. I’m also going to do this entire project tonight (yes, I’m starting at 11:00PM). But it doesn’t have nearly the issues of the marker-upper project. And it is a perfect time to use the baby-step markup system. I do see one issue.

There are two nested sub-projects lurking that are going to tempt me. The first is a way to make the baby-step markup able to get the previous babystep code post no matter how far back it occurred in the discussion. That’s probably a recursive little bit of code. I think I’m going to get that out of the way right away. It won’t be too difficult, and will make the tutorial-making process even more natural. I don’t want to force babystep code into every post. If I want to stop and think about something, post it, and move on, I want to feel free to do that.

The other nested project is actually putting the tutorial out on the site. I’ve got an internal blogging system where I actually make the tutorials. But deciding which ones to put out, how, and onto which sites is something that happens in the content management system. Yes, the CMS can assemble Web content for sites by pulling it out of blogging systems. In short, the CMS can take XML feeds from any source, map them into the CMS’s own data structure, apply the site’s style sheet, and move the content out to the website. But the steps to do this are a little convoluted, and I have the itch to simplify it. But I’ll avoid this nested sub-project. It’s full of others.
