October 29, 2003
The eXtensible Business Reporting Language
My most recent column at Transform magazine discusses XBRL, the eXtensible Business Reporting Language. This is a follow-on to another column where I discussed how XBRL is used at the FDIC. The really big fish in the business reporting world is, of course, the Securities and Exchange Commission. The column speculates on how XBRL can be made to work at the SEC.
Posted by Bill Trippe at 7:56 AM
October 28, 2003
Long-term Archiving of Digital Content
I had never given much thought to long-term storage and archiving of content. Well, I did to the extent that I typically advocated XML and other standards-based technology--mainly so content and data would outlast any particular software system or application.
But I had never thought much about truly long-term storage—what some in the field like to refer to as the "100 year digital object"—until I was engaged by the State of Washington to help study the feasibility of creating a digital archive. Part of my analysis was to interview other organizations that had undertaken similar initiatives. The most interesting interview was with Harvard University. I have included a summary of the case study here.
The central administration of Harvard University is four years into a project, The Library Digital Initiative, to bring the handling, archiving, and preservation of digital material up to par with the handling of analog material. Harvard has a centralized library services organization that supports the university archives and over 20 independent research and specialized libraries on the Harvard campus and elsewhere in the world. This centralized library services organization, known as the Office for Information Systems, has been developing the physical infrastructure, storage, software, tools, and professional services for digital archiving.
To date, the project has focused on the core infrastructure and on "challenge grants" that have funded archiving efforts in various libraries. The infrastructure in place includes servers, storage, and a wide array of supporting hardware and software falling into three general categories:
• Hardware and software to support the collection of digital material. This ranges from hardware and software for digitizing and converting analog materials, and software for cataloging the digital materials with metadata, to hardware and software to support the data repository and software for indexing the digital text and metadata.
• Hardware and software to support the access to digital material. This includes access tools such as portals, catalogs, and other finding aids, as well as delivery tools allowing users to download and view textual, image-based, multimedia, and cartographic data.
• Core software for functions such as authentication and authorization, name administration, and name resolution.
The projects completed and in development under the challenge grant program represent a wide variety of data and content types, reflective of the many research specialties at an organization such as Harvard. They include:
• A project to digitize and catalog biomedical images such as CAT scans and MRIs that have lasting value to researchers and educators. Images have been converted to high-resolution Tagged Image File Format (TIFF), with lower-resolution Joint Photographic Experts Group (JPEG) images used for Web distribution. Metadata is being captured in a specialized eXtensible Markup Language (XML) vocabulary that was designed by Harvard.
• A project to digitize, catalog, and index the annual reports of the University, dating back to the 1800s, when it began producing formal annual reports. Over 100,000 pages have been scanned to high-resolution TIFF (again using JPEG for Web distribution), with structural metadata being captured using the METS (Metadata Encoding and Transmission Standard) XML vocabulary. The resulting reports can be viewed in a Web-based page turning application that uses the METS metadata for navigation. Sections or entire reports can then be automatically collected into Portable Document Format (PDF) files for printing and distribution.
• A project to digitize and index older directories of United States and United Kingdom businesses. Printed directories are being re-keyed into an XML vocabulary that captures elements such as company name and address. The XML is then rendered into HTML for viewing through the use of an XSLT (Extensible Stylesheet Language Transformations) style sheet.
• Projects to digitize, catalog, and index primary field research materials in such diverse fields as botany, Slavic languages, and music. In most of these cases, image data is captured as TIFF, audio data as AIFF (Audio Interchange File Format) and metadata in an XML vocabulary such as METS.
While the projects represent a wide variety of material, there are a number of common attributes to the projects when it comes to issues of digitization, data conversion, and application of metadata. These attributes include:
• There is not a single, monolithic approach to conversion. Researchers and archivists have a number of options for content conversion methods and target formats. In some cases, document material is captured as page images only (always TIFF), sometimes as page images with full text captured through OCR for search and retrieval, and sometimes in XML as a neutral format for later reprocessing. The archivists make the decision on approach, often in consultation with technical staff from the centralized Office for Information Systems.
• The greater "intellectual" emphasis seems to be on metadata. Whatever format the data is captured in, Harvard is promulgating standards for metadata. As mentioned above, they favor the METS Document Type Definition (DTD) for structural, administrative, and technical metadata. For finding aids, they have based their tools on the EAD (Encoded Archival Description) DTD; all content contributed to the central repository and catalog is submitted with an EAD-based finding aid.
• Their approach is pragmatic. While they do have substantial resources for digital archiving, they are faced with practical decisions on every project. For example, the annual reports project uses TIFF pages supported by OCR-captured text for indexing. They use the OCR text "as is," without correction, knowing that the accuracy is somewhere around 95 percent. However, because the OCR text is used for search and retrieval of pages, the results of using it are better than the specific accuracy of the text. That is, OCR-scanned text with some errors will still successfully return 99 percent of pages because the hits are based on page boundaries. (For example, a word appearing more than once on a page is incorrectly captured once and correctly captured once. The page is still correctly returned on a full-text search.)
• They have relied heavily on commercial applications for the repository and indexing applications themselves, while the submission, discovery, and delivery applications have been largely custom Java applets and Java servlets accessed through browser interfaces. Thus, the full text indexes are managed in Oracle 9i's Text feature, and the XML metadata repository is managed in Software AG's Tamino.
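The point above about uncorrected OCR text can be made concrete with a little arithmetic. The sketch below is my own illustration, not Harvard's calculation. It assumes each occurrence of a word is misread independently, and the function name page_hit_probability is hypothetical. Under that assumption, a page is found whenever at least one occurrence of the search term was captured correctly, so a term that appears twice on a page is found about 99.75 percent of the time even at 95 percent word accuracy.

```python
def page_hit_probability(word_accuracy: float, occurrences: int) -> float:
    """Chance that a full-text search returns a page containing the term.

    Assumption (mine, not from the case study): OCR errors on each
    occurrence of the term are independent of one another.
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("the term must appear at least once on the page")
    # The page is missed only if every occurrence was misread.
    miss_all = (1.0 - word_accuracy) ** occurrences
    return 1.0 - miss_all


if __name__ == "__main__":
    # 95 percent is the word accuracy quoted in the case study.
    for k in (1, 2, 3):
        print(k, "occurrence(s):", round(page_hit_probability(0.95, k), 6))
```

Real OCR errors are probably not independent (a smudged page hurts every word on it), so treat this as an upper-bound intuition rather than a measurement.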
It is worth noting that, while Harvard's efforts to date have been substantial, they are largely based on the conversion of non-digital legacy data. That is, there is very little "born digital" data in Harvard's Library Digital Initiative. In fact, the university archives still have an official policy of not accepting digital material, so the challenge grant projects have fallen outside this official purview of the archives.
Despite this, Harvard's efforts are still very instructive, and have positioned the institution very well to continue with digital archiving projects. To date, they have 350,000 digital objects under management, totaling some three terabytes of content. They also have impressive infrastructure for collecting, digitizing, cataloging, storing, and distributing digital material.
Harvard also looked at costs associated with fee-for-service archives, something other projects did not do. In particular, they looked at what the Online Computer Library Center, Inc. (OCLC) charges for storage. OCLC stores on a "bag of bytes" basis; they do not provide viewing technology. The onus for that is on the customer. OCLC's charges:
• $60 per GB if < 100 GB.
• $32 per GB if 101-1000 GB.
• $15 per GB if > 1000 GB.
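To get a feel for what these tiers mean at Harvard's scale, here is a rough sketch of the rate table. Several things in it are my assumptions and not from OCLC: that the per-GB rate applies to the whole collection based on its total size (not marginally per tier), that exactly 100 GB falls in the first tier (the list above leaves that boundary unspecified), and that one terabyte is 1,000 GB. The billing period is not stated above, so the function returns a charge without saying per what. The names rate_per_gb and storage_charge are hypothetical.

```python
def rate_per_gb(size_gb: float) -> int:
    """Dollar rate per GB for a collection of the given total size."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    if size_gb <= 100:      # assumption: exactly 100 GB is billed at $60
        return 60
    if size_gb <= 1000:
        return 32
    return 15


def storage_charge(size_gb: float) -> float:
    """Total charge, assuming one flat rate applies to the whole collection."""
    return rate_per_gb(size_gb) * size_gb


if __name__ == "__main__":
    # Three terabytes is the collection size quoted in the case study;
    # 1 TB = 1,000 GB is an assumption here.
    for size in (50, 500, 3000):
        print(size, "GB ->", storage_charge(size), "dollars")
```

Under those assumptions, three terabytes at the top-tier rate comes to $45,000.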
Harvard also figured the cost per year of physical storage versus digital storage, based on a collection of 2,202 volumes (729,000 pages, 322 pages per volume).
The conclusion is that, wherever possible, store only the ASCII data (including metadata markup). The metadata capture and/or creation will cost more up front, but the payoff in access and cost savings over the long run will be enormous.
Posted by Bill Trippe at 4:42 PM
October 27, 2003
Does Personalization Do Anything Useful?
I interviewed user interface expert Jared Spool a couple of years ago for a now defunct Web site. I really like Jared's ideas on all things technical, so it was a pleasure to discuss personalization with him. The full text of the original article follows.
For Jared Spool, Purpose, not Personalization, is All
Poorly conceived efforts to personalize can result in clumsy, off-putting sites
Jared Spool is a long-time--and highly sought after--advocate of taking time and care in designing human-computer interfaces. His Haverhill, MA-based company, User Interface Engineering, has been advising and training development teams since 1988 and is now 20 people strong. Spool and his team provide primary research, publications, and training for software developers involved in interface design. With the massive build-out of the Web, User Interface Engineering has turned its substantial resources to understanding how users interact with Web sites.
Spool himself conducts some of the training classes, and has a vast and ready arsenal of examples, ideas, and anecdotes of things done well and ill. We talked recently about the many efforts sites have made to personalize the visitor's experience.
One Fact Does Not Tell Much
"The thought of personalizing based on one or two known facts..." Spool begins, and then leaps to an example. "A person is walking down the mall and slows down to look in the window of one store. Will they slow down at the next similar store? If they happen to go into one store that sells dishes, will they go into every store that sells dishes? Will they ever buy dishes?" In other words, Spool is advising, one fact does not tell us much, if anything, about that person's future behavior.
UIE has researched shopping behavior both in malls and online. They've learned that even what people expressly declare they are interested in does not have a determinative effect on what they will do. "We observed one woman go to Crate and Barrel specifically to buy napkins. She never bought napkins. Instead, she happened upon a pitcher she liked, and bought it and the set of matching glasses. Had Crate and Barrel shown her only napkins, they may have ended up with no sale."
These two examples go directly to the two predominant types of information that Web sites use to personalize:
1. Information that can be inferred about a user based on, for example, click stream behavior, and
2. Explicit information that the user might agree to share with the web site.
We don't, Spool would contend, end up with enough information to fully understand the user's needs. And even if we did, we probably wouldn't know what to do with it. Web developers are coming face to face with a fundamentally complex problem--trying to determine what specific information a user might want at a specific point in time.
"Take Amazon.com, for example," says Spool. "A friend was pregnant two years ago, and purchased a book there related to pregnancy. Two years later, every time she goes to Amazon.com, they suggest baby naming books to her." Spool then lets the thought sink in for a minute. Of course, the baby is now a toddler and indeed already has a name. It's a single example, but a telling one for Spool, one that illustrates how the inference made from one piece of data has clearly led to an off-putting result.
Watch out for the Daily Special
Amazon.com is not alone in clumsily offering recommendations and specials. Spool's examples include the gardening site that won't sell you certain plants based on your zip code, apparently out of fear the climate may kill them. Another favorite example is the pharmacy site that will recommend specials on a certain medication, despite the fact that you've completed a lengthy on-line questionnaire and included information that you're allergic to that very medication. "Apparently," Spool deadpans, "It's ok to kill yourself but not your plants."
Spool's examples are catchy and often very funny, but they also make the point about personalization. "I'm not sure we know enough how to do it," says Spool. "And when we do it wrong, it's at best an annoyance."
The problem is what Spool calls the "indiscriminate attention" a site may end up paying to certain information. "It's nice at times for a third party to pay attention to certain interests and make recommendations. For example, if you go to a restaurant regularly, and a waiter knows you like a certain salmon dish that is sometimes available as a special, it's nice for the waiter to point that out. But you probably don't want that same waiter to start commenting on your recent choice in friends."
Other sites have adopted the approach of always fronting certain information, presenting the user with daily specials and offers, regardless of whether the user is there for such information. In fact, UIE's research shows that users almost always bypass that kind of information in favor of the things that interest them. "It's like going running into a restaurant needing the restroom," says Spool. "And having the waiter insist on reading you the daily specials."
The lesson? Specific information about a user could yield some useful functionality. For example, it might be helpful for a floral web site to track the types of flowers you have sent, and then warn you not to send the same ones twice (or to suggest one that you've indicated was well received). But one or two facts, Spool suggests, don't merit redesigning the whole Web site. This is probably most true in complex applications involving knowledge workers. "You would have to predict what they need at any given time," says Spool. "And the odds of you being psychic are slim."
Ask the Right Question
For Spool, the way to tackle personalization isn't to start with the question, "What can we personalize?" The right questions to ask are, "What does the user need to see right now? What information does the user need?"
Spool offers another example. "Take a user coming to eBay. If that user has some bids on some items pending, it would be handy for the first screen they see to be a summary of their current bids. Which ones are being outbid? How long do the auctions have? If there are certain items that they've indicated they are always on the lookout for (personally, I'm a big fan of the 'Elvis Pezley PEZ dispenser'), eBay could display new items and have an easy way for them to place opening bids on those items."
Spool's final advice is that the ultimate solution may not even have to be expensive or particularly complex. "Watching users come to your site frequently would give you some ideas on the patterns that show up in their regular visits," says Spool. "Sometimes it will require personalization technology to optimize those visits, but often it can just be done with cleverly changing the content, without any real sophisticated tools."
Posted by Bill Trippe at 3:08 PM
October 26, 2003
More SVG Support, and a Thought
Software company Beatware announced their latest version of e-Picture Pro with additional support for SVG, including both SVG Tiny and SVG Basic. What's interesting about this announcement is that Beatware is emphasizing the need to put more control over SVG in the hands of the graphic arts professional. For example, e-Picture Pro has built-in constraints that enable you to create illustrations while honoring the limitations of Basic and Tiny. This keeps the graphic artist in the driver's seat, and eliminates the need for hand-coding. This is the kind of product feature that will help expand the use of SVG.
My thought is that there should be a source of concentrated news on SVG. It is very hard to tease SVG news out of some product announcements (e.g., Adobe's recent release of its new creative suite). In other cases, smaller companies who have dedicated themselves to SVG have trouble getting the word out. Could the market use a newsletter or news source specifically to cover SVG? I have been thinking about it for a while. Please get in touch if you have some ideas on this.
Bill Trippe
btrippe@nmpub.com
Posted by Bill Trippe at 10:04 AM
October 21, 2003
Enter InfoPath
InfoPath launched today, to quite a bit of fanfare from Microsoft and its many partners. Despite Microsoft's size and success, they don't often create the best buzz at a product launch. To some, they seem stingy about the details, and with a product line as ubiquitous as Microsoft Office, it is all about the details. But with InfoPath, they do seem to have done a good job of getting the word out. My inbox has been flooded with InfoPath-related press releases, especially from the partner companies. Moreover, technical and marketing folks have been very available to discuss the launch.
I am tempted to say something tongue-in-cheek about InfoPath (e.g., InfoPath is Latin for "Microsoft Office everywhere, damnit!"). But, in fairness to Microsoft, I have not really looked that closely at it yet. I have been using the beta version of Office 11 for several months, but I never installed the InfoPath componentry. I have spent some time working with Microsoft Word output to XML, and have a pretty good idea about that.
There are at least two interesting things about InfoPath. First, on the one hand, it seems to be very true to XML, but it still looks like a complex and heavy client installation. Second, it is pretty clear that InfoPath is positioned as Microsoft's entree to the electronic forms market; however, Microsoft has been "doing" forms for a long time through products such as Visual Basic, Access, Excel, and even Word. Is InfoPath more than the sum of those parts? Less? And what about XForms? InfoPath specifically does not support XForms. Does this set up InfoPath to be its own, unique vocabulary for forms development, despite the fact that it is based on XML?
With the public announcement of InfoPath today, Microsoft published a great deal more material about the product on its Web site. See, for example, the FAQ and some customer case studies.
I would love to hear from people in the field who have started working with InfoPath, especially in content applications, about their experiences thus far.
Bill Trippe
btrippe@nmpub.com
Posted by Bill Trippe at 4:15 PM
October 17, 2003
WorX Studio Reviewed
EContent Magazine has published a review I wrote about WorX Studio from HyperVision, Ltd. As noted in the capsule summary of the review, "WorX for Word and its companion tool, WorX Studio, provide a novel way to bring Microsoft Word into an XML-based editorial workflow. WorX for Word acts as a plug-in to Word to provide structured authoring of XML. WorX Studio gives users a means to interactively convert unstructured documents into structured XML, and can be used in concert with WorX for Word. The product suite can be very useful for a group of writers who work on a small number of structured document types."
Posted by Bill Trippe at 10:49 PM | Comments (2)
Will XForms Count?
The W3C has released XForms 1.0 as a W3C Recommendation. While I do not yet have intimate familiarity with the XForms syntax, I am convinced already that XForms will have an impact. The established e-forms companies (Pureedge, Cardiff, et al.) are all positioning their products as being compatible with XForms, and a number of newer, smaller companies have emerged with targeted offerings.
There are two 1000-pound gorillas in this market, maybe three. Microsoft will be releasing InfoPath next week, and Adobe will be releasing its new Forms Designer in beta sometime in November. But I wouldn't underestimate IBM. They have been very active in the development of the XForms recommendation, and have a long list of core products that can take advantage of XForms (beginning with WebSphere and continuing through more targeted applications such as IBM Content Manager). The next several months should be very interesting.
Posted by Bill Trippe at 10:11 AM
October 14, 2003
EMC and Documentum? Bolt from the Blue?
I honestly cannot say I would have predicted that EMC would acquire Documentum, but I wasn't entirely floored either. There are several ways of looking at this as sensible from EMC's perspective at least:
- If you buy the argument that enterprise content management really does mean all significant assets in the enterprise--from content to digital assets to documents to forms to email, etc., etc.--then the storage and addressability of those assets becomes paramount.
- We are beginning an era when all content will be "born digital." This hasn't always been true. Many organizations still have legacy content and data that have not been digitized, and many more have legacy content and data that are digitized but are in a proprietary, often binary format. If all content is born digital, and all content is heading toward neutral formats such as XML, the management of these data structures over their entire lifecycle becomes key. EMC's tagline, "Information Lifecycle Management," is a useful one.
- EMC has been interested in the content management problem for a while now. They labeled their Centera product line as the "first content addressed storage" solution (coining the acronym CAS in the process, I believe). Documentum, Artesia, Enigma, FileNet, and INSCI were among the early Centera partners.
Perhaps most significantly, the combined EMC-Documentum will make sense to a significant number of prospective customers who see the enterprise content management problem as a storage problem first. This is especially true of those organizations and individuals who come to enterprise content management with an orientation toward records management and archive management. Such prospective customers will understand the combined offering more quickly than will some others.
Posted by Bill Trippe at 12:34 PM
October 13, 2003
Applications of Internet Publishing
At the request of Mark Cummings, VP and Publisher at Scholastic Library Publishing, I was a guest lecturer at a class he is teaching at NYU, Principles and Applications of Publishing on the Internet. The class has been delving into some real nuts and bolts--how a reference publisher, for example, goes about digitizing and structuring their content for effective publishing on the Web.
It was interesting to speak to a group of graduate students, some of whom are already working in the field and some of whom hope to. As I said to them, I spend so much time speaking with other technical people in the field, I am guilty of speaking too much in the jargon of the industry.
They are using The Columbia Guide to Digital Publishing as a text. Mark Walter and I co-wrote the chapter on content management.
If you would like to see the slide presentation from the NYU talk, you can download it here. My thanks to reader Brian Casey for taking the PowerPoint and creating the PDF.
Posted by Bill Trippe at 1:45 PM
October 8, 2003
The Sarbanes-Oxley Boom?
Is it me, or does every fourth email I receive mention how Sarbanes-Oxley compliance can be reached with technology from (enter the name of a CMS platform vendor here)?
Sarbanes-Oxley is about transparency, accuracy, and completeness of record keeping. Isn't this exactly what enterprise content management (in its broadest sense) is supposed to do? Implemented correctly, an enterprise CMS can precisely manage important documents and related data, establish and enforce a workflow, and accurately report on the entire lifecycle of the document. So, yes, enterprise CMS technology can support these aspects of Sarbanes-Oxley.
Of course, the question remains, did we need a law to tell us this was a good idea? This suggests to me that the people who have historically championed technologies such as document management and content management have been ahead of the curve. It also suggests to me that, Sarbanes-Oxley aside, well established content management practices are a good thing.
Posted by Bill Trippe at 11:12 AM | Comments (3)
October 6, 2003
Random Entries I Could Have Written
I have been very busy with some client work recently, so the blog has lagged behind some thinking I have been doing. Among the topics that I have been considering lately:
- There is a growth in the number of tools for converting content in and out of XML. I got an update from Rizwan Virk at Cambridge Docs. I liked their xDoc Converter when I reviewed it earlier this year for eContent magazine. They have since expanded on a product line that they bill as an "XML Document Backbone."
- I keep thinking that Digital Rights Management has a place in the homeland security mix. I am sure that I am not the only person that has thoughts about this, and I know the security agencies are beginning to use it. To me, the real power of DRM in a security apparatus is not in protecting content from access; rather, it is in the ability of DRM technology to let people freely access it while keeping track of how it is being used.
- I was playing with the MovableType editing interface on my Palm wireless (I use a Treo 270). The forms worked great. The challenge is the text entry and editing itself. I am now using a combination of the micro-keyboard and graffiti, but it is less than efficient.
- I have been looking at Adobe forms as part of a potential project. I have never been more than a practical user of Acrobat, and have typically favored XML-based approaches to content and document management. But forms are a different thing. Design is intrinsic to forms. Isn't it?
Bill Trippe
btrippe@nmpub.com
Posted by Bill Trippe at 4:55 PM
October 2, 2003
Content Technology Works
The Gilbane Report has begun a new initiative, Content Technology Works. The goal is to identify best practices, and to document these best practices in a set of case studies. This is being spearheaded by Gilbane Report senior editor Sebastian Holst, and the work will be supported by a growing list of vendors. The idea is that all vendors will contribute to the effort, while the case studies will document real-world experiences regardless of the vendors involved. The content itself will be distributed free via the Gilbane Report web site. This way, successes can be documented and shared by a wide audience.
Posted by Bill Trippe at 9:55 AM





