A study of scanning habits
Wed 15 Aug 2007, 11:32 pm
I have been thinking a bit this past week or so about books–books as objects, things made of paper; and books as concepts, as long-form written works that might be on paper or a computer screen, or a yet-to-be-invented beautiful electronic reading machine.
I was prompted in part to think about this by the comment that David Lee King wrote on my post Writing and talking about library 2.0.* Here’s the part I’m thinking about:
I don’t denigrate books. I denigrate the container, not the content – two very different things. Books as a format I think will stick around for a very long time. The paper they are printed on? Well… I have a Sony E-Reader in my office right now for staff to play with.
If prodded, I’ll bet David would admit that most paper books on our shelves today will outlast the Sony E-Reader on the order of a few hundred years at least. But that’s being overly specific, and not the real problem. The problem, I believe, is that a book isn’t really a “container.” A book is a book.
That doesn’t mean that electronic books aren’t a worthwhile endeavor, or that it is impossible to make e-books that are worth using. I have read books on my Palm Pilot, and expect to read many more e-books on more usable devices in my lifetime. But if we fail to take into account the “bookishness” of books, we run the risk of making some terrible errors.
I had already been trying to get my thoughts together along these lines when I found Paul Duguid’s recent article for First Monday, “Inheritance and loss? A brief survey of Google Books.” In the article, Duguid uses the Google Books results for Tristram Shandy to see how the project handles a problematic text like Laurence Sterne’s novel.
One would think that scanning the pages would be enough to create a usable e-book, but in the cases that Duguid examines, it just isn’t. Some of the reasons Duguid covers:
- some scans are simply bad, missing parts of the page or illegible for significant parts of the page, or completely blank;
- there are mistakes or omissions in the metadata, such as mistaking the list of illustrations for the book’s table of contents, or not clearly identifying the parts of a multi-volume set;
- Google’s ranking algorithm seems to prefer odd, substandard versions of the work due to copyright or other restrictions.
Some of these things would be less problematic for a less complicated work than Tristram Shandy, but I expect the problems with Shandy are by no means unique.
From Duguid’s conclusion:
Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don’t submit equally to a standard shelf, a standard scanner, or a standard ontology….Even with some of the best search and scanning technology in the world behind you, it is unwise to ignore the bookish character of books. More generally, transferring any complex communicative artifacts between generations of technology is always likely to be more problematic than automatic.
Not incidentally, I believe I first came across a link to Duguid’s article in Dorothea Salo’s del.icio.us stream. Dorothea, as ever, is way ahead of me on this, having been “ranting” about similar topics since 2003 (and possibly before). In her post from that year, No, it really is that hard–a response to a person who thinks that encoding books digitally is a simple, straightforward process–she writes,
Nine times out of ten, these yahoos have utterly forgotten that there’s any book in the world more complicated than, say, a Robert Ludlum novel. (I don’t think these yahoos actually set foot in libraries, though I suppose I could be wrong—they could merely suffer from acute tunnel vision.) The rest of us don’t have that luxury…. We have to sweat over math, art, indexes, tables, links, complex layouts, production workflows, metadata, non-Roman alphabets, digital preservation issues, and all that fun stuff.
(Several of those e-books I mentioned reading on my Palm Pilot were Cory Doctorow’s novels. Dorothea did the HTML markup on them. Which is random, and cool, and sorta beside the point, but I thought I’d mention it here anyway.)
If you find the Duguid article interesting, you might also try these links:
- Limits of self–organization: Peer production and “laws of quality,” an earlier First Monday article by Duguid on–among other things–the Project Gutenberg version of Tristram Shandy, which is itself quite problematic.
- “the bookish character of books”: how google’s romanticism falls short, from if:book
- a book is not a text: the noise made by people, an earlier post on if:book
- Alas, poor book, Gentleman, from O’Reilly Radar
* If it seems like David Lee King is my new bête noire, I don’t really think that’s the case. I believe he and I have a lot of views in common, so the areas where we differ stand out to me, and I find them worth investigating. It’s also evident that he welcomes the discussion.

Great post, Steve. I’ve said it before and I’ll say it again, books are a great technology. There may come a day when libraries have bookless shelves, but I doubt it will be tomorrow, next week, or next year.
Comment by joshua m. neff — August 16, 2007 @ 6:58 am
Hey – any time someone can use my name and “bête noire” in the same sentence, I’m cool with it :-)
But back to your topic – I still see your whole post as basically a discussion of content vs container.
“a book is a book” – I sorta agree. The content in a book – the actual words… that’s the book. The paper it’s printed on? That’s the container.
The whole Google discussion above? A discussion about how the Google “container” for books and the automated process to transfer the content from one container to the next is still in heavy pilot project mode – it isn’t there yet.
The Tristram Shandy arguments and discussion seem to boil down to this – “but there’s words AND pictures. How do we do that?” OK – again, the new container still needs work.
That’s how I see it, anyway…
Comment by david lee king — August 16, 2007 @ 7:10 am
David, what about “the medium is the message?” As I say, I expect we will have massive amounts of usable e-books some day, maybe even soon, but the experience of reading a paper book and that of reading an e-book won’t necessarily be equivalent.
And if it was just a matter of “words AND pictures,” why does the scanned Google copy fail on other accounts, like worthwhile metadata and the like? Pictures are the least of the problems here, no?
I think my problem with the word “container” is similar to Dorothea’s problem with the word “just” in the post I linked to above. It makes it all sound very simple, very easy, when it isn’t. I can pour my ginger ale from an aluminum can container into a glass tumbler container, and it is still ginger ale. “Pour” Tristram Shandy from a paper book container to a Google scan container, and you no longer have Shandy, you have something else.
Comment by Steve Lawson — August 16, 2007 @ 1:03 pm
I think you’re right that the point is that it’s not simple. The point is not so much the words in a row. The point is not so much the placement of pictures, type face, or white space (which is also important). The point is that “containers” are not entirely benign. If they were entirely benign, people wouldn’t pour ginger ale from the can to the glass. There’s something about a glass that’s more comfortable to drink from. The edge feels different on the lips, the spray tickles your nose… the experience isn’t the same. And yet, the ginger ale is still ginger ale.
Ok, so the container matters in this case, but doesn’t change the substance. But what about the case of a Van Gogh painting? Seeing digital reproductions of his work is nothing like seeing the real thing. The colors aren’t as vibrant. The textures of the brushwork are simply shadows rather than spaces. In this case, the “container” changes the work absolutely and fundamentally.
So does the “container” change the substance of a book in the same way that it changes a painting? I’d say, “It depends.”
Ludlum might be a “book” no matter its container. Shandy, maybe, not so much. Complex texts are not generally read in as linear a fashion as they are written. The words march forward, the same as ever, but my eyes jump back to the top of the facing page, the previous paragraph, the next sentence, almost without breaking the flow of my reading. Complex texts require this type of reading-while-reading as you make sense of them.
Some day, technology might be able to simulate the act of putting a finger on a page in order to mark a point you’re trying to interpret by reading forward. Some day readers might allow the kind of non-linear reading that’s necessary for sense-making. And some day there might be a fully automatic process by which complex text, these books that have never been anything but Books. Just like some day we may have digitally reproduced paintings where the experience of the painting isn’t fundamentally changed.
Comment by Iris — August 16, 2007 @ 3:27 pm
Hmmm, please excuse my sloppy writing above. That second-to-last sentence is supposed to read:
“And some day there might be a fully automatic process by which complex texts, these books that have never been anything but Books, can be converted to digital formats.”
Comment by Iris — August 16, 2007 @ 4:08 pm
I think David is free to re-appropriate and re-purpose the word “book” if he likes. That is how language works.
But from current usage he is simply incorrect. A book is not the contents at all; it is a specific form of container.
“The paper it’s printed on? That’s the container.” Exactly! A key component of the definition of book.
Just as there were clay tablets, scrolls, codexes, and manuscripts there are books. The content is pretty much irrelevant. It is the form that matters.
Again, our usage may change and book may come to mean the content. But there are already many words for that–poetry, plays, novels, short stories, non-fiction, essays, etc.
What we need is a better word than e-book; perhaps one that would also cover digitized books as in Google Books and elsewise on the web.
“Book” is a form/container word; not a content word. Unless David can get more people to start using the word in the way that he defines it. Then that’s what it’ll mean. But good luck keeping it from changing out from under you. ;)
Comment by Mark — August 16, 2007 @ 5:27 pm
I agree that a book as a physical object is very different from any electronic or digital version. As the article and comments imply, we’re nowhere near the perfect e-book reader/format.
The post also raises the issue of how we teach kids about the beauty and value of the printed word.
As a high school librarian I’m afraid I don’t take a lot of time doing that and get distracted by the need for information over the need to interact with the physical thing which is the book. Maybe this year I can figure out better ways to help my students and staff really appreciate books for their own sake as well as for the information and entertainment they contain.
Comment by Tom Kaun — August 16, 2007 @ 10:58 pm
Gee whiz, Mark… ok. You’re correct. Instead of “book” in my comment above, please insert: novel, novella, large work of gathered poetry, large multi-chapter work of non-fiction, etc.
Does that clarify my comment any?
Comment by david lee king — August 17, 2007 @ 7:11 am
Another way to think about it… a story. Used to be spoken only (oral tradition before people started writing things down), then the story appeared in drawings, then the story appeared on clay tablets, then on scrolls, and now, finally, on books. No, wait… also on blogs, in PDF, in many different emerging forms of e-book…
But it’s all the same story.
Comment by david lee king — August 17, 2007 @ 7:15 am
David, I think the word that academics tend to use in that context is “text.”
FRBR-ized librarians might prefer to say that David is talking about a “work” or “expression” while Mark is talking about the “manifestation” or “item.” (I’m not a FRBR expert, though. The distinction between “work” and “expression” seems very slippery to me.)
Comment by Steve Lawson — August 17, 2007 @ 7:24 am
You know, I just skipped over something in this fascinating multiblog conversation–probably because I only read it on the screen.
To wit, “I don’t denigrate books. I denigrate the container,”
I’ll avoid the issue of the definition of “book” (which, I submit, is ambiguous, but there are much better minds in this collective and I’m learning from the various perspectives) and just wonder why there’s any point in “denigrating the container.”
I mean, unless (David) you really are saying “digital is always better,” then why denigrate? If a different container works better for some people for some uses, fine–point out the virtues of that container. But unless there’s something actually defective about a bound collection of printed pages as a way to read long linear text–and the, um, overwhelming popularity of the Rocket Ebook and various others (including Sony’s device) seems to suggest that most people aren’t anxiously looking for a replacement–then denigrating physical books just seems silly.
I suppose this comment belongs on DLK’s blog…but the emphasis on those two sentences in this post broke through my poor on-screen retention. Thanks.
Comment by walt crawford — August 17, 2007 @ 9:23 am
No Walt, I think you’re correct. “Denigrate” is much too strong of a word… and I don’t actually “denigrate” paper books. In fact, I have a whole stack of them to read right now!
I think what I’m trying to get at is this: I see some public libraries build whole websites and huge programs around paper books, and actually get sorta mad when presented with the possibility that people at some future point in time might not equate paper and books (or texts).
I even see librarians that don’t understand that the concept of “reading” means much more than holding a paper book in one’s hands.
So I’m just trying to explain that I think there’s a huge difference between where you are reading the words (on a page, on a screen, etc) and the words themselves.
I think “denigrate” came from a comment or post Steve made, and I commented on that and used his comment on my words as a starting point… and this is probably where conversations in comment boxes start to break down. Kind of like how USENET conversations got lopsided pretty fast :-)
Comment by david lee king — August 17, 2007 @ 9:39 am
Heh. “Reading” being more than holding a book… that needs to be explained to all the people who say kids don’t spend any time “reading” these days because they’re too busy texting and IMing. :)
Sorry to go off topic a little… I now return you to your regularly scheduled discussion.
Comment by Iris — August 17, 2007 @ 10:27 am
Yay, I agree with David! It is the “story” that matters and it can come in all sorts of containers.
And, often, we do speak loosely of a book as the content.
If we want to FRBR-ize the discussion then in my stricter definition it is nonsensical to talk about a book as any FRBR entity. A book as container would be part of the thing being described as a manifestation (well, not really, but I’ll leave this philosophical issue aside) and certainly part of the item (if it is a book) being described.
In the loose sense of speaking, we could say that a book is the item. But in this case, and actually especially (hmm?), in this case the book is only the container. The work (the story) has already been cataloged at the work/expression level.
Yes! In fact, in FRBR there isn’t a single thing about the story that is part of the item. It is only individual carrier info–bar code, provenance, condition of the item, etc. Manifestation is pretty much the same.
E.g., see the RDA-FRBR mappings:
http://www.collectionscanada.ca/jsc/docs/5rda-frbrmapping.pdf
So, David, if you were only talking about the story all along then, on one hand I apologize for butting in. But on the other, I could not tell exactly what you meant because of the collapse of a distinction that I felt important (in many cases; again, not all).
David, I fully agree with you (and Iris and others) that, except in the ways Walt suggests matter, container should not matter.
Recorded story is recorded story, and reading is reading.
Certainly there are things that can be teased apart between reading a quality print item versus reading from a screen, and reading some kinds of materials may be better than others…. I guess I just feel these differences reside at a finer grain, perhaps, and thus won’t need to be trotted out as often.
Anyway, thanks for the conversation all; in all the various places we have spread it.
Comment by Mark — August 17, 2007 @ 12:57 pm
i am the person dorothea lambasted in her blog entry of 4 years ago. (and i’m amazed that she’s still pointing to that hatchet-job she did.) at any rate, i still maintain that digitization is _not_ a difficult process.
the fact that google can’t seem to get it right doesn’t mean it’s _hard_.
duguid (whose earlier article i admired) was very slipshod with this one.
he points to blurry scans. and they’re entirely unforgiveable, of course. but does anyone here really think it’s _difficult_ to get nonblurry scans?
likewise, duguid discovers google has lost track of the volume numbers. but does anyone believe that noting volume numbers is inherently hard?
or take the fact that google confused the list of illustrations in “shandy” as a table of contents. was this because it’s _hard_ to tell ‘em apart? no, it’s because there was no table of contents, so it found the closest thing.
***
why is google making such stupid mistakes? probably because they hire people at minimum wage for a job that requires very close concentration. and — even more importantly — because their quality control _sucks_…
but to pretend this job is _difficult_ is laughable, in the extreme.
***
dorothea’s post is quoted here:
> We have to sweat over math, art, indexes, tables, links,
> complex layouts, production workflows, metadata,
> non-Roman alphabets, digital preservation issues,
> and all that fun stuff.
math = graphics (or latex, if you’re nasty).
art = graphics.
indexes = indexes, hotlinked please.
tables = tables.
links = links.
complex layouts = oh, poor dear, yes, that layout _is_ complex. sheesh!
production workflows = gobbledygook.
metadata = shmetadata.
non-roman alphabets = unicode.
digital preservation issues = mirror stuff to new formats constantly.
all that fun stuff = make up your mind; is it hard or is it fun?
none of this is hard. i repeat, _none_ of it.
the vast majority of books can be digitized in a straightforward manner.
dorothea would have you think that i believe this because i’m “stupid or clueless” and have no experience digitizing books. but she declined to identify me, so you couldn’t check up on her little hatchet-job.
-bowerbird
Comment by bowerbird — August 17, 2007 @ 2:46 pm
While I fully agree with that, I am less convinced by the discussion about “container” and “story”:
Jean-Michel Salaün, commenting (in French) about Paul Duguid’s article reminds us that documents could (should?) be considered as having three dimensions:
(my translation)
Anthropological : Form (Document = Format + Inscription)
Intellectual : Text (Document = Code + representation)
Social : Medium (Document = Memory + transaction)
In order to render the original (book) document, the digitization process should consider the three different aspects, which goes — most of the time — further than the mere distinction between “container” and “story”.
Comment by Alain Pierrot — August 20, 2007 @ 8:40 am