The Concept of a Markup Language
Thumb through any general history of computing, and you'll read that mass data storage was one of the primary breakthroughs of the computer. But mass storage has its limitations, at least for humans: we deal best with information, not data. Information is, in a basic sense, structured and contextualized data.
For example, each of the names, addresses, and phone numbers of everyone in the United States is data; taken as a whole, this collection could be referred to as a data set. Unfortunately, a stored data set is not smart; neither are computers that store data sets. The data is there, but it's useless to human beings (and, to a certain extent, useless to computers, too) unless a structure and context transform it into information. Pick up a copy of any local phone directory, and you'll see everyone listed alphabetically by last name in columns that visually associate each person's address and phone number. Phone companies add context to data by publishing different phone directories for different locations across the country, and by publishing different kinds of phone listings (individual people, businesses, government offices, etc.) in separate sections in the same phone book.
Visual organization is the oldest form of data structuring. We can see it in tables used by ancient traders on the Mediterranean Sea, whose merchant cataloging activities were, by many accounts, among the first acts of writing. Visual organization is analog; i.e., it offers a physical representation of something (like a column). The power (and in certain cases, the drawback) of computers is that they store digital data, which is a numeric representation. Computers are used, of course, to produce visual organization all the time, as when you use tables or tabs in your word processor. But those visual relationships are not necessarily meaningful to the computer, nor to other computers. When a digital representation of data is not meaningful, a computer loses many of its abilities to transform and manipulate data.
For example, let's say that a student, Sarah, stores a simple text file on her computer that contains the names, locations, and times of her courses for a given semester. It might look something like this:
Fall 2004
Biology 110 Hall of Science 10:30AM MWF
French 102 Humanities Building 12:30PM MWF
English 220 Humanities Building 1:30PM TR
Calculus 310 Engineering Hall 3:30PM TR
Using her computer, the student can refer to this file again and again. But an interesting situation emerges here. To the student, this is information; to the computer, it is just data. But even to other people this might be just data. How is this possible? Well, Sarah obviously has employed a system for representing days of the week. By context, most people could tell that MWF is short for Monday, Wednesday, and Friday and that T is Tuesday; but it's purely by convention that R is short for Thursday. Why not use TH? The other weekdays are easily represented by a single letter, not two. By looking at just TR, it's more obvious that the class meets two days a week. TTH would not provide that kind of at-a-glance information.
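The one-letter weekday convention can be written down explicitly as a lookup table. The Python sketch below is only an illustration of that idea; the names `DAY_CODES` and `expand_days` are hypothetical and are not part of Sarah's file.

```python
# Hypothetical lookup table for the one-letter weekday convention.
DAY_CODES = {
    "M": "Monday",
    "T": "Tuesday",
    "W": "Wednesday",
    "R": "Thursday",  # by convention, since T already stands for Tuesday
    "F": "Friday",
}


def expand_days(code):
    """Turn a compact code such as 'TR' into a list of weekday names."""
    return [DAY_CODES[letter] for letter in code]


print(expand_days("MWF"))
print(expand_days("TR"))
```

Because every day is exactly one letter, the number of meeting days is simply the length of the code, which is the at-a-glance property that TTH would lose.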
What's at work in Sarah's schedule, in the representation of days of the week, and even times of the day and course numbers, is a sort of code. People use codes all the time, not only to abbreviate things but also to make them easier to compare. For example, Sarah could have written a line of her schedule as "Biology One Ten in the Hall of Science at Ten-Thirty in the Morning on Mondays, Wednesdays, and Fridays." That might be exactly how Sarah would read her schedule out loud, but a schedule written out in that form would make it much harder to compare the start times of her Biology and French classes.
We're all pretty comfortable with many kinds of codes, especially if we encounter them often enough that the code's conventions are familiar.
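A compact time code is also easy for a program to compare. As a minimal sketch, assuming the times are always written like `10:30AM` and that the program runs in an English-language locale, Python's standard `datetime` module can parse the two start times and order them. The helper name `parse_start` is hypothetical.

```python
from datetime import datetime


def parse_start(text):
    """Parse a compact time code such as '10:30AM' into a time object."""
    # %I is the 12-hour clock hour, %M the minutes, %p the AM/PM marker.
    return datetime.strptime(text, "%I:%M%p").time()


biology = parse_start("10:30AM")
french = parse_start("12:30PM")

# Once parsed, the two start times can be compared directly.
print("Biology starts before French:", biology < french)
```

The written-out sentence form would need far more work to reach the same comparison, which is the practical payoff of agreeing on a code.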
Just like humans, computers also use codes to compare and manipulate data. If Sarah wanted her schedule to be shared easily with other computers in such a way that even the computer could identify her classes, she might mark up her schedule like this:
<schedule>
  <semester>Fall 2004</semester>
  <class>Biology 110 Hall of Science 10:30AM MWF</class>
  <class>French 102 Humanities Building 12:30PM MWF</class>
</schedule>
Not too hard to follow, is it? Of course not! And what you're looking at is a markup language in action. A markup language is simply a set of short words for describing the structure of things; by computer convention, we enclose those words in angle brackets: < and >.
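To see what the markup buys a program, here is a rough Python sketch that reads the schedule with the standard library's `xml.etree.ElementTree` module. The variable names are illustrative assumptions; only the markup itself comes from Sarah's example.

```python
import xml.etree.ElementTree as ET

# Sarah's marked-up schedule, copied from the example above.
SCHEDULE = """
<schedule>
  <semester>Fall 2004</semester>
  <class>Biology 110 Hall of Science 10:30AM MWF</class>
  <class>French 102 Humanities Building 12:30PM MWF</class>
</schedule>
"""

root = ET.fromstring(SCHEDULE)

# The tags let the program tell the semester apart from the classes.
semester = root.findtext("semester")
classes = [entry.text for entry in root.findall("class")]

print("Semester:", semester)
for entry in classes:
    print("Class:", entry)
```

The program can now distinguish the semester from a class, but each class is still a single undivided string: nothing yet tells the computer where the subject ends and the location begins.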
Sarah could offer even more information about the structure of her data by marking up each class line in her schedule, perhaps like this:
<class>
  <subject>Biology</subject>
  <number>110</number>
  <location>Hall of Science</location>
  <time>10:30AM</time>
  <days>MWF</days>
</class>
All Sarah has done here is tell the computer explicitly what humans know implicitly: Biology is a subject in school; Hall of Science is where the class meets. In a twist, then, the very codes that one might intend for computers to read can also be read by humans.
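With each piece of the class labeled, a program can pick out individual facts by name. The following is a small Python sketch using the standard `xml.etree.ElementTree` module; the `record` dictionary and the questions asked of it are illustrative assumptions rather than anything in Sarah's file.

```python
import xml.etree.ElementTree as ET

# The fully marked-up Biology class from the example above.
CLASS_MARKUP = """
<class>
  <subject>Biology</subject>
  <number>110</number>
  <location>Hall of Science</location>
  <time>10:30AM</time>
  <days>MWF</days>
</class>
"""

course = ET.fromstring(CLASS_MARKUP)

# Collect each child element into a dictionary keyed by its tag name.
record = {child.tag: child.text for child in course}

print(record["subject"], record["number"])
print("Meets in:", record["location"])
print("Meets on Mondays:", "M" in record["days"])
```

Questions such as "where does this class meet?" no longer require guessing at word boundaries; the tag names carry the structure that a human reader would otherwise infer.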
Updated on Mon. Jan. 7 2008 at 11:23AM