In this issue, Michael Smith, in the cunning disguise of Judith Dinowitz, jumped on Michael Dinowitz and wrestled from him an interview the like of which has never been seen.
Michael Smith: I am here talking with Michael Dinowitz of House of Fusion fame about the presentation he will be giving at the upcoming CFUN on "Working with remote data". What kind of data are we talking about here, Michael?
Michael Dinowitz: The short answer is: Text data.
MS: You mean that's it?
MD: Well, honestly, when dealing with the Internet, the only data we really care about is text data. Email is text data. Web pages are HTML-formatted text data. RSS feeds are XML packets, which are just text data. Everything is text, and that is the data we're dealing with. So I'm waiting for you to ask the real question.
MS: So what is the real question?
MD: The real question is: Where will we be retrieving the data from and how will we be making use of it? Let's take my latest "fun" site, JewsLikeNews.com. This site gets updated every half hour with news, if there is any. The question is: Where does this data come from? The answer is quite simple. Google emails it to me. This means that some location (Google) is sending me text in an email of a specific format that I can programmatically retrieve and enter into a database.
MS: So the format really makes a difference ...
MD: Exactly. Format is everything when dealing with remote data. Format allows you to separate data from markup from garbage. Take the Google example that I just gave. Google sends different types of emails with different formats -- some with more useful data than others. My job is to decide what data is important so that I can write a parsing program that always retrieves that data. I should never have to touch the JewsLikeNews site. It should all run automatically. The only time I'll ever have to rewrite my code is if Google changes their format.
MS: What if you don't know the format? And does the format change often?
MD: The ability to know the format is rather simple: You look at it and try to decide where the real data is. Is it the first line? Is it the first line after a blank line? Is it right after the letter P? All you have to do is get enough examples of the data (five emails from Google is enough for me) to see what the format is and where it changes. Once you identify what goes where, you can then write a program to deal with it. It's just pattern analysis. All you have to do is intuit what their pattern is.
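To illustrate one of the positional rules mentioned above ("the first line after a blank line"), here is a minimal sketch in Python. It is not code from the presentation; the function name and the sample message are hypothetical, since the interview does not show the actual layout of the Google emails.

```python
from typing import Optional


def first_line_after_blank(text: str) -> Optional[str]:
    """Return the first non-empty line that follows a blank line, or None."""
    seen_blank = False
    for line in text.splitlines():
        if not line.strip():
            seen_blank = True          # remember that a blank line has passed
        elif seen_blank:
            return line.strip()        # first real content after the blank line
    return None


# Hypothetical alert message; the real email format is an assumption.
sample = (
    "Subject: News alert\n"
    "From: alerts@example.com\n"
    "\n"
    "City council approves new library\n"
    "http://example.com/story/1\n"
)

print(first_line_after_blank(sample))
```

A rule like this only holds as long as the sender keeps the same layout, which is why collecting several sample messages before writing the parser matters.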
As for how often the format changes, that all depends on who you're dealing with. I have a cute little agent that actually scans Macromedia's forums and emails me all the new posts. It would be simple to do if Fusetalk had an RSS feed or something similar. But as things stand, I have to parse through multiple pages to get the data that I need. The output is HTML here and CSS there, and they've changed the format at least twice in the time that I've been running the agent. (And no, I won't actually be showing off this agent at the conference. What I will be showing off, on the other hand, is an Ebay agent that I've been using for a while that will retrieve all the items under a specific keyword. Now that's fun, because Ebay changes their format at least once a month, and the format is different if you go in with or without cookies.)
MS: So agents are one way of dealing with remote data. What other types of ways are there?
MD: When you say agents are a way of dealing with remote data, it's really important to define what an agent is, and what "dealing" means. An agent is a program whose job is to go out, get data from somewhere, parse through it for something that's important, and then do something with the parsed data. An agent can be anything from a simple program that sends an email any time a news article is found with the keyword "Macromedia" to an entire content management system which will retrieve data, parse through it, check if the data already exists in a database, store it if it doesn't, and do some other operation if it does. "Agent" is a general term.
What you're really asking is how we deal with the data, as in, how do we parse it.
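The retrieve, parse, check, store cycle described above can be sketched roughly as follows. This is an assumed illustration in Python, not the presentation's code: `run_agent`, `parse_headlines`, `fake_fetch`, and the `items` table are all hypothetical names, and the fetch step is a stub so the example runs without a network connection.

```python
import re
import sqlite3
from typing import Callable, List


def parse_headlines(raw: str, keyword: str) -> List[str]:
    """Keep the lines that mention the keyword (case-insensitive)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [line.strip() for line in raw.splitlines() if pattern.search(line)]


def run_agent(fetch: Callable[[], str], db: sqlite3.Connection, keyword: str) -> List[str]:
    """Fetch text, parse it, store unseen items, and return the new ones."""
    db.execute("CREATE TABLE IF NOT EXISTS items (headline TEXT PRIMARY KEY)")
    new_items = []
    for headline in parse_headlines(fetch(), keyword):
        exists = db.execute(
            "SELECT 1 FROM items WHERE headline = ?", (headline,)
        ).fetchone()
        if exists:
            continue  # already stored; a fuller agent might do some other operation here
        db.execute("INSERT INTO items (headline) VALUES (?)", (headline,))
        new_items.append(headline)
    db.commit()
    return new_items


def fake_fetch() -> str:
    """Stand-in for the remote source (email, web page, feed); hypothetical data."""
    return "Macromedia ships update\nWeather report\nmacromedia forum post\n"


db = sqlite3.connect(":memory:")
print(run_agent(fake_fetch, db, "Macromedia"))  # first run stores both matching lines
print(run_agent(fake_fetch, db, "Macromedia"))  # second run finds nothing new
```

Passing the fetch step in as a function keeps the parsing and storage logic independent of where the text comes from, so the same loop could sit behind a mailbox reader or an HTTP request.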
MS: So how do we parse the data?
MD: In reality, there's only one "true" way to parse data where you are not sure of the exact format that the data will be in, and that is Regular Expressions.
Regular Expressions are basically a sublanguage for defining text patterns. Using them, we can easily say what the patterns of data are, what comes before the important data, what comes after it, and what differentiates the important data from the formatting. That might sound a little confusing, so let's take a concrete example. Let's say we want some text that is within an anchor tag. We don't know what the attributes of that anchor are. We don't know whether it has a target. We don't know the HREF. All we know is that we want the text that is going to be displayed as the link. Using Regular Expressions, we can say, "Look for the beginning of an anchor. Include anything that is within the tag itself. Look for the closing tag that ends the anchor. And then, using these two defined boundaries, get all the text that is being used as the link in that anchor."
MS: Sounds complicated.
MD: Actually, it's not. It's very simple once you know the basics of Regular Expressions. The problem is that Regular Expressions look like someone threw up across your screen. (Yes, Perl uses Regular Expressions all the time.)
I've given classes on Regular Expressions before, and people always come out of them nodding their heads with a look of understanding in their eyes. Anyone can learn them. It just takes a few simple rules, and then the ability to string those rules together. Take the anchor example from above. All you have to do is define the beginning of an anchor tag, define the end of the tag (which is a closing angle bracket) and grab everything between the opening tag and the closing tag. You don't care what's inside the opening tag -- you're just defining the pattern that says an anchor tag starts with <a and ends with >. Following that will be the link text, which will go on until you get a closing anchor, which will always be </a>. It's all about patterns and definitions.
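The anchor pattern described above can be written out as a short regular expression. The following is a minimal sketch in Python rather than the ColdFusion code from the presentation; the name `link_texts` and the sample HTML are hypothetical.

```python
import re

# <a        start of an anchor tag (\b keeps it from matching tags such as <abbr>)
# [^>]*     any attributes, whatever they are, up to the closing bracket
# >         end of the opening tag
# (.*?)     the link text, captured, as short as possible
# </a>      the closing anchor
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def link_texts(html: str) -> list:
    """Return the displayed text of every anchor found in the HTML."""
    return [text.strip() for text in ANCHOR_TEXT.findall(html)]


sample = (
    '<p><a href="/one" target="_blank">First link</a> and '
    '<A HREF="/two">Second link</A></p>'
)

print(link_texts(sample))
```

This sketch assumes that attribute values never contain a `>` character and that any markup nested inside the link text should be kept as-is; a pattern for a real site would need to be checked against samples of that site's output.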
MS: Now you've made a complex subject sound very interesting.
MD: It is. It's actually more interesting than you think, because by using some of the agents which I'll be talking about, you can save time that would otherwise be spent searching through Ebay or searching through news ... You can actually write something that will do the work for you and deliver the exact content that you want. The time taken to write an agent vs. the time saved when searching for news that you find interesting makes this presentation definitely worth your while.
MS: Well, I'm someone who never has time to do anything, so I'll definitely try to find some time to stop in on the presentation. Will you be giving out code to save us time when building our own agents?
MD: Yep. All the code from my presentation is on the conference CD. Just put it in a computer, load it up and read through it. It's all commented. And I can tell you a story about comments ... but that's for another time.
MS: Well, thank you very much, Michael. I look forward to seeing you at CFUN.
MD: Thank you for your time.