These are my notes from the session 'Aggregating Museum Data – Use Issues' at Museums and the Web, Montreal, April 2008.
These notes are pretty rough so apologies for any mistakes; I hope they're a bit useful to people, even though it's so late after the event. I've tried to include most of what was covered but it's taken me a while to catch up on some of my notes and recollection is fading. Any comments or corrections are welcome, and the comments in [square brackets] below are me. All the Museums and the Web conference papers and notes I've blogged have been tagged with 'MW2008'.
This session was introduced by David Bearman, and included two papers:
Exploring museum collections online: the quantitative method by Frankie Roberto and Uniting the shanty towns – data combining across multiple institutions by Seb Chan.
David Bearman: the intentionality of the data production process is interesting, i.e. the data Frankie and Seb used wasn't designed for integration.
Frankie Roberto, Exploring museum collections online: the quantitative method (slides)
He didn't give a crap about the quality of the data, it was all about numbers – get as much as possible to see what he could do with it.
The project wasn't entirely authorised or part of his daily routine. It came in part from debates after the museum mash-up day.
Three problems with mashing museum data: getting it, (getting the right) structure, (dealing with) dodgy data
Traditional solutions:
Getting it – APIs
Structure – metadata standards
Dodgy data – hard work (get curators to fix it)
But it doesn't have to be perfect, it just has to be "good enough". Or "assez bon" (and he hopes that translation is good enough).
Options for getting it – screen scrapers, or Freedom of Information (FOI) requests.
FOI request – simple set of fields in machine-readable format.
Structure – some logic in the mapping into simple format.
Dodgy data – go for 'good enough'.
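The "logic in the mapping into simple format" step could look something like the sketch below. It is only an illustration, not Frankie's actual code: the two museum sources, their raw field names and the simple target fields are all hypothetical, and the clean-up is deliberately minimal in the spirit of 'good enough'.

```python
# Hypothetical raw field names for two imaginary museum exports,
# each mapped onto the same small set of simple fields.
FIELD_MAPS = {
    "museum_a": {"ObjectName": "what", "PlaceMade": "where",
                 "DateMade": "when", "AcquisitionMethod": "how"},
    "museum_b": {"title": "what", "place": "where",
                 "date": "when", "acquired": "how"},
}

SIMPLE_FIELDS = ("what", "where", "when", "how")


def to_simple(record, source):
    """Map one raw record onto the simple format, keeping values as-is
    apart from collapsing whitespace. Unmapped raw fields are dropped."""
    mapping = FIELD_MAPS[source]
    simple = {field: None for field in SIMPLE_FIELDS}
    for raw_key, value in record.items():
        target = mapping.get(raw_key)
        if target is None:
            continue  # no equivalent in the simple format
        text = " ".join(str(value).split())
        simple[target] = text or None  # treat empty strings as missing
    simple["source"] = source
    return simple


if __name__ == "__main__":
    print(to_simple({"ObjectName": "  Tea   cup ", "PlaceMade": ""}, "museum_a"))
```

Nothing here tries to fix the dodgy data; it only gets records from different places into one shape so they can be counted together.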
Presenting objects online: existing model – doesn't give you a sense of the archive, the collection, as it's about the individual pages.
So what was he hoping for?
Who, what, where, when, how. ['Why' is the other traditional journalist's question, but it's too difficult to capture in structured information.]
And what did he get?
Who: hoping for collection/curator – no data.
What: hoping for 'this is an x'. Instead got categories (based on museum internal structures).
Where: lots of variation – 1496 unique strings. The specificity of terms varies on geographic and historical dimensions.
When: lots of variation
How: hoping for donation/purchase/loan. Got a long list of varied stuff.
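A count like "1496 unique strings" is the kind of thing you get from a quick profile of each field. A minimal sketch, assuming the records are already plain dictionaries with hypothetical keys like 'where' (this is not the code used for the talk):

```python
from collections import Counter


def field_profile(records, field):
    """Summarise one field across all records: how many are missing,
    how many distinct strings there are, and the most frequent ones."""
    values = [record.get(field) for record in records]
    present = [value for value in values if value]
    counts = Counter(present)
    return {
        "records": len(values),
        "missing": len(values) - len(present),
        "unique": len(counts),
        "top": counts.most_common(3),
    }


if __name__ == "__main__":
    # Made-up sample values, showing how small spelling differences
    # inflate the number of unique strings.
    sample = [
        {"where": "London"},
        {"where": "London, England"},
        {"where": "london"},
        {"where": None},
        {"where": "London"},
    ]
    print(field_profile(sample, "where"))
```

The strings are compared exactly as they arrive, so 'London' and 'london' count as different values, which is part of why the numbers get so big.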
[There were lots of bits about whacking the data together that made people around me (and me, at times) wince. But it took me a while to realise it was a collection-level view, not an individual object view – I guess that's just a reflection of how I think about digital collections – so that doesn't matter as much as if you were reading actual object records. And I'm a bit daft cos the clue ('quantitative') was in the title.
A big part of the museum publication process is making crappy date and location and classification data correct, pretty and human-readable, so the variation Frankie found in data isn't surprising. Catalogues are designed for managing collections, not for publication (though might curators also over-state the case because they'd always rather everything was tidied than published in a possibly incorrect or messy state?).
It would have been interesting to hear how the chosen fields related to the intended audience, but it might also have been just a reasonable place to start – somewhere 'good enough' – I'm sure Frankie will correct me if I'm wrong.]
It will be on museum-collections.org. Frankie showed some stuff with Google graph APIs.
Prior art – Pitt Rivers Museum – analysis of collections, 'a picture of Englishness'.
Lessons from politics: TheyWorkForYou for curators.
Issues: visualisations count all objects equally. e.g. lots of coins vs bigger objects. [Probably just as well no natural history collections then. Damn ants!]
Interactions – present user comments/data back to museums?
Whose role is it anyway, to analyse collections data? And what about private collections?
Sebastian Chan, Uniting the shanty towns – data combining across multiple institutions (slides)
[A paraphrase from the introduction: Seb's team are artists who are also nerds (?)]
Paper is about dealing with the reality of mixing data.
Mess is good, but… mess makes smooshing things together hard. Trying to agree on standards takes a long time, you'll never get anything built.
Combination of methods – scraping + trust-o-meter to mediate 'risk' of taking in data from multiple sources.
Semantic web in practice – dbpedia.
Open Calais – from ClearForest, which was bought by Reuters. It dynamically generates metadata tags about 'entities', e.g. possible authority records. There are problems with automatically generated data, e.g. guesses at people, organisations or whatever might not be right. 'But it's good enough'. You can then build on it so users can browse by people, then link to other sites with more information about them, or to records about them in other datasets.
[But can museums generally cope with 'good enough'? What does that do to ideas of 'authority'? If it's machine-generated because there's not enough time for a person in the museum to do it, is there enough time for a person in the museum to clean it? OTOH, the Powerhouse model shows you can crowdsource the cleaning of tags, so why not entities? And imagine if we could connect Powerhouse objects in Sydney with data about locations or people in London held at the Museum of London – authority versus utility?
Do we need to critically examine and change the environment in which catalogue data is viewed so that the reputation of our curators/finds specialists in some of the more critical (bitchy) or competitive fields isn't affected by this kind of exposure? I know it's a problem in archaeology too.]
They've published an OpenSearch feed as GeoRSS.
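For anyone wondering what consuming a feed like that involves, here is a rough sketch of pulling located items out of GeoRSS-Simple markup with Python's standard library. The sample feed, titles and links are made up for illustration and are not the Powerhouse's actual output; only the GeoRSS namespace and the 'latitude longitude' format of georss:point come from the GeoRSS-Simple convention.

```python
import xml.etree.ElementTree as ET

# Namespace used by GeoRSS-Simple elements such as <georss:point>.
GEORSS_NS = {"georss": "http://www.georss.org/georss"}

# Made-up feed for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
  <channel>
    <title>Example collection search</title>
    <item>
      <title>Example object one</title>
      <link>http://example.org/object/1</link>
      <georss:point>-33.8785 151.1995</georss:point>
    </item>
    <item>
      <title>Example object without a location</title>
      <link>http://example.org/object/2</link>
    </item>
  </channel>
</rss>"""


def located_items(feed_xml):
    """Return (title, latitude, longitude) for each item that has a point."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        point = item.findtext("georss:point", namespaces=GEORSS_NS)
        if not point:
            continue  # skip items with no location
        lat, lon = (float(part) for part in point.split())
        results.append((item.findtext("title"), lat, lon))
    return results


if __name__ == "__main__":
    for title, lat, lon in located_items(SAMPLE_FEED):
        print(title, lat, lon)
```

Once items are reduced to a title and a pair of coordinates, they can be dropped onto a map or matched against other located data.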
Fire Eagle, a Yahoo beta product. Link it to other data sets so you can see what's near you. [If you can get on the beta.]
I think that was the end, and the next bits were questions and discussion.
David Bearman: regarding linked authority files… if we wait until everything is perfect before getting it out there, then "all curators have to die before we can put anything on the web", "just bloody experiment".
Nate (Walker): is 'good enough' good enough? What about involving museums in creating better data and correcting it? [I think, correct me if not]
Seb: no reason why a museum community shouldn't create an OpenCalais equivalent. David: Calais knows what Reuters knows about data. [So we should get together as a sector, nationally or internationally, or as art, science, history museums, and teach it about museum data.]
David – almost saying 'make the uncertainty an opportunity' in museum data – open it up to the public as you may find the answers. Crowdsource the data quality processes in cataloguing! "we find out more by admitting we know less".
Seb – geo-location is critical to allowing communities to engage with this material.
Frankie – doing a big database dump every few months could be enough of an API.
Location sensitive devices are going to be huge.
Seb – we think of search in a very particular way, but we don't know how people want to search i.e. what they want to search for, how they find stuff. [This is one of the sessions that made me think about faceted browsing.]
"Selling a virtual museum to a director is easier than saying 'put all our stuff there and let people take it'".
Tim Hart (Museum Victoria) – is the data from the public going back into the collection management system? Seb – yep. There's no field in EMu for some of the stuff that OpenCalais has, but the use of it from OpenCalais makes a really good business case for putting it into EMu.
Seb – we need tools to create metadata for us, we don't and won't have resources to do it with humans.
Seb – Commons on Flickr is a good experiment in giving stuff away. Freebase – not sure whether to go to that level.
Overall, this was a great session – lots of ideas for small and large things museums can do with digital collections, and it generated lots of interesting and engaged discussion.
[It's interesting, we opened up the dataset from Çatalhöyük for download so that people could make their own interpretations and/or remix the data, but we never got around to implementing interfaces so people could contribute or upload the knowledge they created back to the project, or working out how to use the queries they'd run.]