Storage and Retrieval in the Physical World
We can own a book or a hammer without giving it a name or a permanent place of residence in our houses. A book can be identified by characteristics other than a name — a color or a shape, for example. However, after we accumulate a large number of items that we need to find and use, it helps to be a bit more organized.
Everything in its place: Storage and retrieval by location
It is important that there be a proper place for our books and hammers, because that is how we find them when we need them. We can’t just whistle and expect them to find us; we must know where they are and then go there and fetch them. In the
Chapter 15: Searching and Finding: Improving Data Retrieval
physical world, the actual location of a thing is the means to finding it. Remembering where we put something — its address — is vital both to finding it and to putting it away so it can be found again. When we want to find a spoon, for example, we go to the place where we keep our spoons. We don’t find the spoon by referring to any inherent characteristic of the spoon itself. Similarly, when we look for a book, we either go to where we left the book, or we guess that it is stored with other books. We don’t find the book by association. That is, we don’t find the book by referring to its contents.
In this model, the storage system is the same as the retrieval system: Both are based on remembering locations. They are coupled storage and retrieval systems.
Indexed retrieval
This system of everything in its place sounds pretty good, but it has a flaw: It’s limited in scale by human memory. Although it works for the books, hammers, and spoons in your house, it doesn’t work for all the volumes stored in the Library of Congress, for example.
In the world of books and paper on library shelves, we make use of a classification system to help us find things. Using the Dewey Decimal System (or its international offshoot, the Universal Decimal Classification system), every book is assigned a unique “call number” based upon its subject. Books are then arranged numerically (and then alphabetically by author’s last name), resulting in a library organized by subject.
The only remaining issue is how to discover the number for a given book. Certainly nobody could be expected to remember every number. The solution is an index, or a collection of records that allows you to find the location of an item by looking up an attribute of the item, such as its name.
Traditional library card catalogs provide lookup by three attributes: author, subject, and title. When the book is entered into the library system and assigned a number, three index cards are created for the book, including all particulars and the Dewey Decimal number. Each card is headed by the author’s name, the subject, or the title. These cards are then placed in their respective indices in alphabetical order.
When you want to find a book, you look it up in one of the indices and find its number. You then find the row of shelves that contains books with numbers in the same range as your target by examining signs. You search those particular shelves, narrowing your view by the lexical order of the numbers until you find the one you want.
Part III: Designing Interaction Details
You physically retrieve the book by participating in the system of storage, but you logically find the book you want by participating in a system of retrieval. The shelves and numbers are the storage system. The card indices are the retrieval system. You identify the desired book with one and fetch it with the other. In a typical university or professional library, customers are not allowed into the stacks. As a customer, you identify the book you want by using only the retrieval system. The librarian then fetches the book for you by participating only in the storage system.
The unique serial number is the bridge between these two interdependent systems.
In the physical world, both the retrieval system and the storage system may be very labor-intensive. Particularly in older, noncomputerized libraries, they are both inflexible. Adding a fourth index based on acquisition date, for example, would be prohibitively difficult for the library.
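The division of labor between the card indices and the shelves can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real library system: the call numbers, the book records, and the function names (`shelve`, `find_call_numbers`, `fetch`) are all hypothetical.

```python
from collections import defaultdict

# Storage system: call number -> the book sitting on the shelf.
shelves = {}

# Retrieval system: one index per attribute, each mapping a value to call numbers.
catalog = {
    "author": defaultdict(list),
    "subject": defaultdict(list),
    "title": defaultdict(list),
}


def shelve(call_number, title, author, subject):
    """Store a book and file one 'card' in each of the three indices."""
    shelves[call_number] = {"title": title, "author": author, "subject": subject}
    catalog["title"][title.lower()].append(call_number)
    catalog["author"][author.lower()].append(call_number)
    catalog["subject"][subject.lower()].append(call_number)


def find_call_numbers(attribute, value):
    """Retrieval: look up an attribute value and get call numbers back."""
    return list(catalog[attribute].get(value.lower(), []))


def fetch(call_number):
    """Storage: go to the shelf location and take the book."""
    return shelves[call_number]


# Hypothetical call numbers and books, for illustration only.
shelve("385.1 EXA", "Steam Locomotives", "A. Example", "Railroads")
shelve("385.2 SAM", "Light Rail Transit", "B. Sample", "Railroads")

numbers = find_call_numbers("subject", "railroads")
books = [fetch(number) for number in numbers]
```

The call number is the only thing the two halves share. In this sketch, the fourth index that would be prohibitive for a paper catalog is one more dictionary entry and one more line in `shelve`.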
Storage and Retrieval in the Digital World
Unlike in the physical world of books, stacks, and cards, it’s not very hard to add an index in the computer. Ironically, in a system where easily implementing dynamic, associative retrieval mechanisms is at last possible, we often don’t implement any retrieval system other than the storage system. If you want to find a file on disk, you need to know its name and its place. It’s as if we went into the library, burned the card catalog, and told the patrons that they could easily find what they want by just remembering the little numbers painted on the spines of the books. We have put 100% of the burden of file retrieval on the user’s memory while the CPU just sits there idling, executing billions of NOP instructions.
Although our desktop computers can handle hundreds of different indices, we ignore this capability and frequently have no indices at all pointing into the files stored on our disks. Instead, we have to remember where we put our files and what we called them in order to find them again. This omission is one of the most destructive, backward steps in modern software design. This failure can be attributed to the interdependence of files and the organizational systems in which they exist, an interdependence that doesn’t exist in the mechanical world.
There is nothing wrong with the disk file storage systems that we have created for ourselves. The only problem is that we have failed to create adequate disk file retrieval systems. Instead, we hand users the storage system and call it a retrieval system. This is like handing them a bag of groceries and calling it a gourmet dinner. There is no reason to change our file storage systems. The Unix model is fine. Our applications can easily remember the names and locations of the files they have worked on, so they aren’t the ones who need a retrieval system: It’s for us human users.
Digital retrieval methods
There are three fundamental ways to find a document on a digital system. You can find it by remembering where you left it in the file structure: positional retrieval. You can also find it by remembering its identifying name: identity retrieval. (These two methods are typically used in conjunction with each other.) The third method, associative or attribute-based retrieval, is based on the ability to search for a document based on some inherent quality of the document itself. For example, if you want to find a book with a red cover, or one that discusses light rail transit systems, or one that contains photographs of steam locomotives, or one that mentions Theodore Judah, you must use an associative method.
The combination of position and identity provides the basis for most digital storage systems. However, most digital systems do not provide an associative method for storage. By ignoring associative methods, we deny ourselves any attribute-based searching and we must depend on human memory to recall the position and identity of our documents. Users must know the title of the document they want and where it is stored in order to find it. For example, to find a spreadsheet in which you calculated the amortization of your home loan, you need to remember that you stored it in the directory called “Home” and that the file was named “amort1.” If you can’t remember either of these facts, finding the document will be quite difficult.
Attribute-based retrieval systems
For early GUI systems like the original Macintosh, a positional retrieval system almost made sense: The desktop metaphor dictated it (you don’t use an index to look up papers on your desk), and there were precious few documents that could be stored on a 400K floppy disk. However, our current desktop systems can easily hold 500,000 times as many documents (and that’s not to mention what even a meager local network can provide access to)! Yet, we still use the same old metaphors and retrieval models to manage our data. We continue to render our software’s retrieval systems in strict adherence to the implementation model of the storage system, ignoring the power and ease of use of a system for finding files that is distinct from the system for keeping files.
An attribute-based retrieval system enables users to find documents by their contents and meaningful properties (such as when they were last edited). The purpose of such a system is to provide a mechanism for users to express what they’re looking for according to the way they think about it. For example, a saleswoman looking for a proposal she recently sent to a client named “Widgetco” could effectively express herself by saying “Show me the Word documents related to ‘Widgetco’ that I modified yesterday and also printed.”
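To make the saleswoman’s request concrete, here is a minimal sketch of such a query in Python. It assumes the system has already recorded a few attributes for each document; the `Document` fields, the `find` helper, the file names, and the fixed date are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    name: str
    kind: str          # for example "word" or "spreadsheet"
    text: str          # the document's contents
    modified: date     # date of the last edit
    printed: bool = False


def find(documents, *predicates):
    """Return the documents that satisfy every predicate."""
    return [d for d in documents if all(p(d) for p in predicates)]


today = date(2007, 4, 3)   # arbitrary fixed date so the example is reproducible
yesterday = today - timedelta(days=1)

documents = [
    Document("proposal.doc", "word", "Proposal for Widgetco", yesterday, printed=True),
    Document("notes.doc", "word", "Widgetco call notes", yesterday),
    Document("amort1.xls", "spreadsheet", "Home loan amortization", today),
]

# "Show me the Word documents related to 'Widgetco'
#  that I modified yesterday and also printed."
matches = find(
    documents,
    lambda d: d.kind == "word",
    lambda d: "widgetco" in d.text.lower(),
    lambda d: d.modified == yesterday,
    lambda d: d.printed,
)
```

Nothing in the query mentions a directory or a file name. Each predicate corresponds to one clause of the sentence the user would say aloud.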
A well-crafted attribute-based retrieval system also enables users to find what they’re looking for by synonyms or related topics or by assigning attributes or “tags” to individual documents. A user can then dynamically define sets of documents having these overlapping attributes. Returning to our saleswoman example, each potential client is sent a proposal letter. Each of these letters is different and is naturally grouped with the files pertinent to that client. However, there is a definite relationship between each of these letters because they all serve the same function: proposing a business relationship. It would be very convenient if the saleswoman could find and gather up all such proposal letters, while allowing each one to retain its uniqueness and association with its particular client. A file system based on place — on its single storage location — must necessarily store each document by a single attribute (client or document type) rather than by multiple characteristics.
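The contrast with a single storage location can be sketched as follows. In this hypothetical example, each document carries several tags at once, and a set of documents is defined on demand by the tags it must have. The tag names, file names, and the second client (“acme”) are invented for illustration.

```python
from collections import defaultdict

tags_by_document = defaultdict(set)   # document name -> the tags it carries


def tag(document, *tags):
    """Attach one or more tags to a document."""
    tags_by_document[document].update(tags)


def documents_with(*tags):
    """Return the names of documents carrying all of the given tags, sorted."""
    wanted = set(tags)
    return sorted(doc for doc, have in tags_by_document.items() if wanted <= have)


tag("widgetco-proposal.doc", "client:widgetco", "proposal")
tag("widgetco-invoice.xls", "client:widgetco", "invoice")
tag("acme-proposal.doc", "client:acme", "proposal")

all_proposals = documents_with("proposal")           # grouped by function
widgetco_files = documents_with("client:widgetco")   # grouped by client
```

The Widgetco proposal appears in both sets without being copied or moved, which a folder hierarchy built on one attribute cannot offer.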
A retrieval system can learn a lot about each document just by keeping its eyes and ears open. If it remembers some of this information, much of the burden on users disappears. For example, it can easily remember such things as:
The application that created the document
Contents and format of the document
The application that last opened the document
The size of the document, and if the document is exceptionally large or small
If the document has been untouched for a long time
The length of time the document was last open
The amount of information that was added or deleted during the last edit
If the document was created from scratch or cloned from another
If the document is frequently edited
If the document is frequently viewed but rarely edited
Whether the document has been printed and where
How often the document has been printed, and whether changes were made to it each time immediately before printing
Whether the document has been faxed and to whom
Whether the document has been e-mailed and to whom
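A few of the items in this list can be sketched as a small record that the system updates as a side effect of ordinary use. This is an assumption-laden illustration: the `UsageRecord` class, its event names, and the threshold used for “frequently viewed but rarely edited” are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class UsageRecord:
    """Facts a retrieval system could note without asking the user anything."""
    created_by: str                                   # application that created the document
    views: int = 0
    edits: int = 0
    printed_on: list = field(default_factory=list)    # printer names
    emailed_to: list = field(default_factory=list)    # recipients
    last_touched: Optional[datetime] = None

    def note(self, event, when, detail=None):
        """Record one event: 'view', 'edit', 'print', or 'email'."""
        if event == "view":
            self.views += 1
        elif event == "edit":
            self.edits += 1
        elif event == "print":
            self.printed_on.append(detail)
        elif event == "email":
            self.emailed_to.append(detail)
        else:
            raise ValueError(f"unknown event: {event}")
        self.last_touched = when

    def mostly_read(self, ratio=5):
        """True if viewed at least `ratio` times as often as edited (arbitrary threshold)."""
        return self.views >= ratio * max(self.edits, 1)


record = UsageRecord(created_by="Word")
for day in range(1, 7):
    record.note("view", datetime(2007, 4, day))
record.note("edit", datetime(2007, 4, 7))
record.note("print", datetime(2007, 4, 8), detail="Office printer")
```

The user never fills in a form here. Every field is a by-product of opening, editing, printing, or sending the document.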
Spotlight, the search function in Apple’s OS X, provides effective attribute-based retrieval (see Figure 15-1). Not only can a user look for documents according to meaningful properties, but they can save these searches as “Smart Folders,” which enables them to see documents related to a given client in one place, and all proposals in a different place (though a user would have to put some effort into identifying each proposal as such, because Spotlight can’t recognize this). It should be noted that one of the most important factors contributing to the usefulness of Spotlight is the speed at which results are returned. This is a significant differentiating factor between it and the Windows search functionality, and was achieved through purposeful technical design that indexes content during idle time.
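The general idea of indexing ahead of time can be illustrated with a tiny inverted index. This is a generic sketch and makes no claim about how Spotlight itself is implemented; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

index = defaultdict(set)   # word -> names of the documents that contain it


def index_document(name, text):
    """Done ahead of time, for example while the machine is otherwise idle."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(name)


def search(query):
    """Return the documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index_document("proposal.doc", "Proposal for Widgetco: light rail signage")
index_document("history.txt", "Theodore Judah and the steam locomotives")
index_document("memo.txt", "Widgetco steam cleaning contract")
```

Because the expensive work of reading every document happens in `index_document`, a call such as `search("widgetco steam")` only has to intersect a few small sets, which is why results can appear as quickly as the user types.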
Figure 15-1 Spotlight, the search capability in Apple’s OS X, allows users to find a document based upon meaningful attributes such as the name, type of document, and when it was last opened.
While an attribute-based retrieval system can find documents for users without users ever having to explicitly organize documents in advance, there is also considerable value in allowing users to tag or manually specify attributes about documents. Not only does this allow users to fill in the gaps where technology can’t identify all the meaningful attributes, but it allows people to define de facto organizational schemes based upon how they discuss and use information. The retrieval mechanism achieved by such tagging is often referred to as a “folksonomy,” a term credited to information architect Thomas Vander Wal. Folksonomies can be especially useful in social and collaborative situations, where they can provide an alternative to a globally defined taxonomy if it isn’t desirable or practical to force everyone to adhere to and think in terms of a controlled vocabulary. Good examples of the use of tagging to facilitate information retrieval include Flickr, del.icio.us, and LibraryThing (see Figure 15-2), where people are able to browse and find documents (photos, links, and books, respectively) based upon user-defined attributes.
Figure 15-2 LibraryThing is a Web application that allows users to catalog their own book collections online with a tag-based system. The universe of tags applied to all the books in all the collections has become a democratic organizational scheme based upon the way the user community describes things.
Relational Databases versus Digital Soup
Software that uses database technology typically makes two simple demands of its users: First, users must define the form of the data in advance; second, users must then conform to that definition. There are also two facts about human users of software: First, they rarely can express what they are going to want in advance, and second, even if they could express their specific needs, more often than not they change their minds.
Organizing the unorganizable
Living in the Internet age, we find ourselves more and more frequently confronting information systems that fail the relational database litmus test: We can neither define information in advance, nor can we reliably stick to any definition we might conjure up. In particular, the two most common components of the Internet exemplify this dilemma.
The first is electronic mail. Whereas a record in a database has a specific identity, and thus belongs in a table of objects of the same type, an e-mail message doesn’t fit this paradigm very well. We can divide our e-mail into incoming and outgoing, but that doesn’t help us much. For example, if you receive a piece of e-mail from Jerry about Sally, regarding the Ajax Project and how it relates to Jones Consulting and your joint presentation at the board meeting, you can file this away in the “Jerry” folder, or the “Sally” folder, or the “Ajax” folder, but what you really want is to file it in all of them. In six months, you might try to find this message for any number of unpredictable reasons, and you’ll want to be able to find it, regardless of your reason.
Second, consider the Web. Like an infinite, chaotic, redundant, unsupervised hard drive, the Web defies structure. Enormous quantities of information are available on the Internet, but its sheer quantity and heterogeneity almost guarantee that no regular system could ever be imposed on it. Even if the Web could be organized, the method would likely have to exist on the outside, because its contents are owned by millions of individuals, none of whom are subject to any authority. Unlike a database record, a record on the Internet cannot be expected to carry a predictable identifying mark.
Problems with databases
There’s a further problem with databases: All database records are of a single, pre-defined type, and all instances of a record type are grouped together. A record may represent an invoice or a customer, but it never represents an invoice and a customer. Similarly, a field within a record may be a name or a social security number, but it is never a name and a social security number. This is the fundamental concept underlying all databases; it serves the vital purpose of allowing us to impose order on our storage system. Unfortunately, it fails miserably to address the realities of retrieval for our e-mail problem: It is not enough that the e-mail from Jerry is a record of type “e-mail.” Somehow, we must also identify it as a record of type “Jerry,” type “Sally,” type “Ajax,” type “Jones Consulting,” and type “Board Meeting.” We must also be able to add and change its identity at will, even after the record has been stored away. What’s more, a record of type “Ajax” may refer to documents other than e-mail messages, such as a project plan. Because the record format is unpredictable, the value that identifies the record as pertaining to Ajax cannot be stored reliably within the record itself. This is in direct contradiction to the way databases work.
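One way to picture the requirement is to keep the identifying values outside the records altogether. The sketch below is a hypothetical illustration, not a recommendation of a particular database design: records of different types stay in their own stores, while a separate mapping associates any number of labels with any record and can be changed after the fact. The store names and the `label`, `unlabel`, and `everything_about` functions are invented for this example.

```python
from collections import defaultdict

# Records of different types live in their own stores, as in a conventional database.
emails = {1: {"from": "Jerry", "subject": "Board presentation"}}
plans = {1: {"title": "Ajax project plan"}}

# The associations live outside the records: label -> set of (store name, record id).
labels = defaultdict(set)


def label(store, record_id, *names):
    """Give a stored record any number of additional identities."""
    for name in names:
        labels[name].add((store, record_id))


def unlabel(store, record_id, name):
    """Remove an identity later without touching the record itself."""
    labels[name].discard((store, record_id))


def everything_about(name):
    """Return every record carrying the label, regardless of its type."""
    return sorted(labels.get(name, set()))


label("emails", 1, "Jerry", "Sally", "Ajax", "Jones Consulting", "Board Meeting")
label("plans", 1, "Ajax")

ajax_items = everything_about("Ajax")
```

Here `ajax_items` contains both the e-mail and the project plan, even though neither record has an “Ajax” field and the two records share no common format.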
Alan Cooper, Robert Reimann, David Cronin - About Face 3: The Essentials of Interaction Design