The University of Queensland Homepage
Matthew Smith Library Systems Programmer

Archive for the 'Fez' Category

Collections considered …

Peter Sefton of the AANRO project (and USQ) has raised the question of whether Fez can support dynamic collections and do away with the fixed community / collection model completely.

The AANRO project at USQ is evaluating repositories for use with a project for Land and Water Australia.

In Fez, the communities and collection model is hardwired for the use of the authorisation framework. The model is used so that an authorisation profile can be applied to a large group of objects. This is needed in Fez because our UQ eSpace repository contains some objects which can’t be published for copyright reasons so are only available to certain groups at UQ.

One of the problems with the current authorisation system is that rules can be applied on collections or communities or both and it can be confusing as to which rules are being applied. Furthermore, objects can belong to more than one collection. It would be good to see a central place where all of the authorisation settings can be seen.

If Fez were to go down the dynamic collections road, then each dynamic collection would be defined by specific metadata items that its objects have in common. This is already implemented with the browse by author and browse by date links. Dynamic collections would extend these options to include configuring any metadata to be used as the criteria for browsing.

However, without the communities and collections hierarchy, Fez would need a new way to apply the authorisation rules to groups of objects.

My proposal is to achieve this by implementing a central authorisation rules table which would map FezACML rulesets to a set of search criteria. The search criteria define the set of objects which will have the FezACML rules applied to them.
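As a rough illustration of the rules-table idea, here is a minimal sketch in Python rather than Fez's PHP. Every name in it (Rule, matches, rulesets_for, the ruleset names and the metadata fields) is hypothetical and not part of Fez. Each row pairs a ruleset name with search criteria, and an object picks up every ruleset whose criteria it matches.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """One row of a hypothetical central rules table."""
    ruleset: str    # name of a FezACML ruleset (illustrative values only)
    criteria: dict  # metadata field -> required value


def matches(criteria, metadata):
    """An object matches when every criterion equals its metadata value."""
    return all(metadata.get(field) == value for field, value in criteria.items())


def rulesets_for(metadata, rules):
    """Return, in table order, the rulesets that apply to one object."""
    return [rule.ruleset for rule in rules if matches(rule.criteria, metadata)]


# Made-up rows: the whole authorisation picture lives in this one list.
RULES = [
    Rule("uq_staff_only", {"copyright": "restricted"}),
    Rule("public_read", {"status": "published"}),
]

thesis = {"status": "published", "copyright": "restricted", "type": "thesis"}
print(rulesets_for(thesis, RULES))
```

Because an object can match several rows, a real implementation would also need a policy for combining or ordering the matched rulesets. The sketch simply returns them in table order.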

Using the concept of a rules table would allow Fez to break out of the communities / collections mindset and centralise the authorisation rules, making them easier to manage.

Delete and Undelete in Fez

I’ve just committed changes to Fez that support the fedora ‘D’ object state, which is the way to delete items in fedora without purging them. In previous versions of Fez, deleting objects meant they were purged from fedora. In the next version, deleting an object will just set the object state to ‘D’ in fedora and remove the object from the Fez index so that it doesn’t show up in searches etc.

I’ve also written an interface for finding and undeleting fedora objects that have the ‘D’ state set. This process is similar to the rewritten ‘Discover new fedora objects’ management function.
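The delete and undelete behaviour described above can be sketched roughly as follows. This is an in-memory Python model, not Fez or fedora code: the Repository class, its method names and the PIDs are all made up for illustration. The only details taken from fedora are the state codes ‘A’ (active) and ‘D’ (deleted).

```python
class Repository:
    """In-memory stand-in for fedora object states plus the Fez index."""

    def __init__(self):
        self.states = {}    # pid -> 'A' (active) or 'D' (deleted)
        self.index = set()  # pids visible in listings and searches

    def add(self, pid):
        self.states[pid] = "A"
        self.index.add(pid)

    def delete(self, pid):
        """Soft delete: flag the object and drop it from the index."""
        self.states[pid] = "D"
        self.index.discard(pid)

    def find_deleted(self):
        """List the objects an undelete screen could offer."""
        return sorted(pid for pid, state in self.states.items() if state == "D")

    def undelete(self, pid):
        """Restore a soft-deleted object and index it again."""
        if self.states.get(pid) != "D":
            raise ValueError(f"{pid} is not in the deleted state")
        self.states[pid] = "A"
        self.index.add(pid)


repo = Repository()
repo.add("UQ:1")
repo.delete("UQ:1")
print(repo.find_deleted(), "UQ:1" in repo.index)
```

The point of the design is that the object itself is never removed, so undeleting only has to flip the state back and re-index.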

To get this undelete functionality, do an svn update or checkout, hit the /upgrade url and import the /upgrade/workflows_delete.xml workflow. (NOTE: svn trunk is by no means a stable or well tested branch but until we do a release, it’s the way to go if you need bleeding edge features)

I’m not going to be working on Fez for a couple of weeks as I need to play catchup on a few other software projects but I’ll still be on the mailing lists and putting in a few bugfixes in my spare time… (and I’d like to write a LOT more PHPUnit tests. If anyone out there feels like they want to learn Fez a lot more, I recommend writing unit tests and sending your work in)

Also drop me a line here if you are going to the OSDC http://www.osdc.com.au/

Changes to sublooping elements

There have been some big changes committed to the Fez trunk recently: some may have seen Rev 923, labelled ‘MAJOR COMMIT! FEZ 2!’ by a uqckorte. That was a change to how Fez indexes the records stored in fedora. The Fez index is used throughout Fez whenever it produces a listing of records or searches for information in fedora. Rewriting it touched almost every file in Fez, which has prompted us to consider incrementing the major version number on the next release. The new index has made Fez a lot faster and paved the way for adding better search tools later but I’ve already blogged about that.

My changes today were much less exciting but have come with a bit of documentation which might help others who have tried to figure out these things in the Fez XML mappings called sublooping elements.

We came across a problem recently where we realised that we couldn’t put a sublooping element on mods:identifier. The problem was that sublooping elements assume that the actual element that is the base of the loop can’t have a value stored in the XML.

In order to fix this, I decided to document the sublooping elements to some extent in order to get clear in my head how they work. The results of that effort can be read here on the FezWiki.

Once I felt I’d fully gotten my head around these sublooping elements, I dived into the code and tried to get the mods:identifier values mapped. My first attempt was heading down a pretty radical departure from the current code so I decided to rewind and keep the changes within the current framework. The result is Rev 1032. In order to show how the changes work, I’ve also recorded a quick screencast of the mappings I made for mods:identifier.

I hope that anyone mapping their own document types in Fez will find these resources of help.

Fez 1.4 ??

It’s about time Fez had a new release but as you probably know from following the mailing lists, we are just really busy supporting the RQF project which is to enable UQ to meet DEST requirements for all Australian Universities.

There is no release date for Fez 1.4 and when we do release it, the changes will all be purely driven by the RQF work. A more community driven release will probably then be on the cards. The most common requests popping up on the mailing lists are for Multilingual support and fulltext searching.

It is also apparent that the interface for mapping XSDs to input forms is far from intuitive. Unfortunately, the problem of allowing an arbitrary mapping from a complex nested tree structure to a flat input form and then back again is just plain difficult so I don’t think it will ever be solved using simple wizard-like click throughs. The future probably involves a scripting language or an API for writing the mappings as code fragments which are run in a sanitised environment.

What will Fez 1.4 look like?

We are debating whether to release Fez as 1.4 or as 2.0 because Christiaan has rewritten a huge chunk of code that does the indexing for the repository. Instead of storing our metadata index in one big table of PID / mapping Id / value, he has broken it out into a table for each search key in Fez (a search key groups semantically similar mapping ids together which allows us to search title fields or authors across multiple document types even though they may map to slightly different places in the metadata). The new design results in faster Fez searching and browsing. It is simpler to code with so the queries we write are less buggy. It will also pave the way for postgres support in Fez and the addition of bolt-on search engines such as lucene.
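To make the difference between the two index layouts concrete, here is a toy Python sketch. It uses plain lists and dictionaries instead of database tables, and the mapping ids, search key names and sample values are all invented. It is not how the Fez index is actually coded.

```python
from collections import defaultdict

# Old layout: one big table of (pid, mapping_id, value) rows.
OLD_INDEX = [
    ("UQ:1", 101, "Dynamic collections"),  # title in one document type
    ("UQ:2", 205, "Soft deletes"),         # title in another document type
    ("UQ:1", 102, "Smith, M."),            # author
]

# Hypothetical grouping of mapping ids under search keys.
SEARCH_KEYS = {101: "title", 205: "title", 102: "author"}


def split_by_search_key(rows, search_keys):
    """Build one table (pid -> list of values) per search key."""
    tables = defaultdict(lambda: defaultdict(list))
    for pid, mapping_id, value in rows:
        tables[search_keys[mapping_id]][pid].append(value)
    return tables


def search(tables, key, text):
    """Case-insensitive substring search within a single search-key table."""
    return sorted(
        pid
        for pid, values in tables[key].items()
        if any(text.lower() in value.lower() for value in values)
    )


tables = split_by_search_key(OLD_INDEX, SEARCH_KEYS)
print(search(tables, "title", "soft"))
```

A title search now only touches the title table, and it finds titles from both document types even though they came from different mapping ids.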

As I may have mentioned, I have been working on an interface which allows duplicate objects in Fez to be located, merged and ‘retired’ using an interactive report object. This tool is essential as we are populating UQ eSpace with data from multiple sources which may have overlap.

The third member of our team, Lachlan, has been busy scripting up bulk ingest processes for a number of data sources to go into UQ eSpace. While these scripts will be of little use to the Fez community (or will they?), a benefit is that Fez is being used at UQ with a much larger set of records and by a larger number of users which has allowed us to find and fix a greater number of bugs. The process for indexing new fedora objects into Fez has been improved: it is speedier, less buggy and easier to use. We also have experience of the many pitfalls of transferring data between systems and can give lots of advice to organisations that have large collections.

In a perfect world, I would expect to see a Fez release in late October at the earliest. Now I hear some people expressing shock and disgust but the thing is, whenever I make a prediction about a timeline, my first estimate is usually pretty right, then I get pressured to change it to be more in line with what people are wishing for, then we try to do it, the timelines blow out and we end up delivering on my original estimate. So there. (reminder: the latest (ranging from slightly to completely untested) Fez code is always available in the Fez subversion repository)

The Fez XSD Problem

I’ve been thinking about a little problem to do with our XSD mappings in Fez. In a nutshell, we need a way of copying XSD mappings between different Fez instances.

The first iteration of this was to make an XSD exporter which reads the XSD mapping database and exports the structure to an XML file.

However, the complication is how to merge conflicts when importing XSDs? And also how to know what is in the new XSD XML files?

Part of the complication is that each XSD Display Mapping has an xdis_id which is stored in the record to tell Fez which content model to use to handle the record. But what if a Fez user ‘out there’ creates a new XSD Display and uses an xdis_id that we have used here at UQ? When they get our Fez upgrade, they will no doubt want to upgrade to the new UQ XSDs but not want to lose the ones they’ve made. So I’ve made the Fez XSD importer try to detect when the new XSD doesn’t match the existing one and save the new one to a new slot.
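The collision handling described above might look something like this minimal Python sketch. It is not the actual Fez importer: the import_display function, the dictionary-based store and the sample display titles other than Generic MODS are assumptions made for illustration.

```python
def import_display(existing, xdis_id, display):
    """Store an incoming display and return the id it ended up under.

    existing maps xdis_id -> display definition (plain dicts here).
    """
    current = existing.get(xdis_id)
    if current is None or current == display:
        existing[xdis_id] = display  # free slot, or identical content
        return xdis_id
    new_id = max(existing) + 1       # conflict: keep both, use a new slot
    existing[new_id] = display
    return new_id


# A local install that has already used id 2 for its own display.
local = {1: {"title": "Generic MODS"}, 2: {"title": "My local thesis"}}
print(import_display(local, 2, {"title": "UQ Journal Article"}))
print(local[2]["title"])
```

Note that the comparison is all or nothing: any difference at all, even a renamed title, counts as a conflict and sends the incoming display to a new slot.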

However, this collision detection is unhelpful in some ways. For example, what if the user has just changed the name of the XSD from English to Spanish? The XSD importer will not upgrade their XSD because it will see a difference. How is the user supposed to get the updates? How are they to know what is in the updates? In many cases, XSD mappings need to be changed due to changes in the mapping API.

I’ve also had issues of users sending me exported XSDs that have used different xdis_ids for some of the core Displays like Generic MODS resulting in a mess of unresolved XSD Relationship links.

I had thought about reserving a block of xdis_ids for core Fez stuff and marking some XSDs as core Fez so that they will have a kind of ‘brute force’ power that makes them just overwrite whatever’s there and always work. However, this still excludes the possibility of users being able to send each other XSDs that are safe from xdis_id conflicts.

I think the solution to this problem is to have namespaces for the xdis_ids. We could prefix ours with fez_core. Perhaps for our special UQ eSpace, we could use UQeSpace as the prefix. In the fedora records FezMD, we would store the xdis_id_prefix as well as the xdis_id. For records that are in the wild lacking the prefix, we could default to fez_core until the records are updated. I guess there might be performance issues having to query the xdis_id and the prefix every time but something along these lines is what I’m thinking.
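A minimal sketch of the prefix idea, in Python and with made-up data: the display_key function, the DISPLAYS table and the numeric id are hypothetical. The prefixes fez_core and UQeSpace and the field name xdis_id_prefix are the ones floated above.

```python
DEFAULT_PREFIX = "fez_core"


def display_key(fezmd):
    """Return the (prefix, xdis_id) pair for one record's FezMD values."""
    # Records in the wild without a prefix fall back to the core namespace.
    prefix = fezmd.get("xdis_id_prefix") or DEFAULT_PREFIX
    return (prefix, fezmd["xdis_id"])


# The same numeric id can now mean different displays in different namespaces.
DISPLAYS = {
    ("fez_core", 5): "Generic MODS",
    ("UQeSpace", 5): "UQ thesis",
}

legacy = {"xdis_id": 5}
local = {"xdis_id": 5, "xdis_id_prefix": "UQeSpace"}
print(DISPLAYS[display_key(legacy)], "/", DISPLAYS[display_key(local)])
```

Looking up a display by the pair rather than the bare number is what makes exported XSDs safe to swap between installs, at the cost of carrying the prefix through every lookup.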

This doesn’t solve the problem of being able to compare old and new XSDs but takes us some of the way towards a sustainable solution.

Deduping in Fez

For those curious about the new record duplicates workflow I’ve been developing, here is a screencast of what it vaguely looks like at this stage. It is still a proof of concepty thing but will be tested with fire in the near future.

I used Wink to make this as I find it pretty easy to use and it is open source. We also have a licence for Adobe Captivate so maybe I will pitch them head to head and put a review here in case anyone is interested.

Patterns I Hate #2: Template Method via Pure Danger Tech

A friend emailed me this blog post about the template design pattern. Some good points made here and I wish I’d read this before I implemented class BackgroundProcess in Fez.

Usually, the best way to address the addition of functionality in orthogonal domains under a template class is to define an interface for each kind of functionality and inject an instance for each.

Oh well, I’ll have to remember this for Fez 2.0.
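For anyone wondering what the quoted advice looks like in practice, here is a small Python sketch of injecting behaviour instead of inheriting it. It is not the Fez BackgroundProcess class: BackgroundJob, ProgressReporter and ListReporter are invented names, and progress reporting is just one example of an orthogonal concern.

```python
from typing import Protocol


class ProgressReporter(Protocol):
    """Interface for one orthogonal concern: reporting progress."""

    def report(self, done: int, total: int) -> None: ...


class ListReporter:
    """Collects progress updates in memory; another might write to a database."""

    def __init__(self):
        self.updates = []

    def report(self, done, total):
        self.updates.append((done, total))


class BackgroundJob:
    """Runs a list of tasks; reporting is injected rather than inherited."""

    def __init__(self, tasks, reporter: ProgressReporter):
        self.tasks = tasks
        self.reporter = reporter

    def run(self):
        results = []
        for done, task in enumerate(self.tasks, start=1):
            results.append(task())
            self.reporter.report(done, len(self.tasks))
        return results


reporter = ListReporter()
job = BackgroundJob([lambda: "indexed", lambda: "thumbnailed"], reporter)
print(job.run(), reporter.updates)
```

With a template method, each new way of reporting would mean another subclass. Here the job class stays closed and a different reporter is simply passed in.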

Warning of data ticking time bomb via BBC

Is digital preservation the next Y2K? It is certainly a problem for more and more ‘normal’ people, not just libraries. The other half of the problem is for John and Joan user to be able to deal with the Gigs of digital images and videos that they are now able to produce.

BBC News Online | Technology | Warning of data ticking time bomb

“If you put paper on shelves, it’s pretty certain it is going to be there in a hundred years.
“If you stored something on a floppy disc just three or four years ago, you’d have a hard time finding a modern computer capable of opening it.”
“Digital information is in fact inherently far more ephemeral than paper,” warned Ms Ceeney.
She added: “The pace of software and hardware developments means we are living in the world of a ticking time bomb when it comes to digital preservation.
“We cannot afford to let digital assets being created today disappear. We need to make information created in the digital age to be as resilient as paper.”

Digital repositories to the rescue…

Apple Announces iTunes U on the iTunes Store

This is pretty interesting:

Apple Announces iTunes U on the iTunes Store

Apple® today announced the launch of iTunes® U, a dedicated area within the iTunes Store (www.itunes.com) featuring free content such as course lectures, language lessons, lab demonstrations, sports highlights and campus tours provided by top US colleges and universities including Stanford University, UC Berkeley, Duke University and MIT.

Of course, universities with digital repositories can make the same data discoverable through services such as OAIster and RSS feeds directly from the collections but maybe doing it through iTunes narrows the type of content down in a useful way.

We have been having some discussions with external Fez developers who are planning features which will allow repository items to be easily showcased in podcasts and eNewsletters e.g. an ‘Add to podcast’ button. Being able to centralise an organisation’s research data (even if it is not all stored in the one place but at least all indexed in a meaningful way in one place) and construct meaningful and structured metadata opens the door to a lot of cool stuff like this.

APSR Market Day

I went to the APSR Market Day on Friday to spruik Fez generally but also in relation to RQF. As well as giving my talk, it was good to see what the other repository projects are offering and what they had to say about RQF.

The morning started with a talk from Sandra Fox from DEST: Not much different to the information given in February though I wasn’t there for the infamous DEST-bashing in Feb and also arrived late due to my flight so didn’t get all of Sandra’s talk (so this whole paragraph is a bit second hand – sorry about that). The general comments throughout the day were that Sandra is making the best of a tough job and bearing the brunt of the stress caused by RQF on some fronts.

The Dspace guys vented a bit of frustration at Dspace over a few things they’d really like to see change but then came back after morning tea and assured us that Dspace is their repository of choice. I guess their point was that no repository can be the “do everything” software, especially when it comes to the RQF requirements. Andrew (i.e. my boss here at UQL) also mentioned that Fez will only be part of the RQF solution at UQ for the same reasons: a repository isn’t a content management and business workflows system (though Fez can do a little bit of those things).

One of the limitations of repositories in general when it comes to RQF is that we’d want to be able to select articles for the RQF without everyone knowing publicly or even without authors knowing. The requirement for freezing objects in time throughout the assessment and the general workflows of selecting the four best papers and generating the output packages for DEST are also challenges for repositories.

I didn’t get a look at the new ePrints 3 but heard that one innovation is that when a user uploads a document, they are not confronted with a big metadata entry form but rather just enter a title and upload the document. Once the document is uploaded, the user can come back and enter the complete metadata in chunks. Apparently, once the document is in the system, the owner is a bit more committed to entering its metadata. In Fez, a user might get to that big metadata entry screen and bail out because it looks too hard, especially with all the required fields. The Fez workflows could be improved so that they chunk up the metadata entry screens and allow incomplete records to be saved in the unpublished state at least.

We were given a quick look at the new Vital. I noticed that they indicate if the fulltext is stored in the record when they do a listing. Fez currently shows thumbnails in the listings for image records. It would be nice to see maybe some icons for PDFs and things for our records with fulltext or even indicate the database for records that link to external fulltext (like the icon or name of the database perhaps).

There was a lot of interest in the ProQuest Digital Commons software, probably because they have been the unknown quantity for the group at the market day. ProQuest had the most questions from the floor whereas I got the impression that most people there already knew all about Fez, DSpace and ePrints. Most of the questions were about pricing and the relationship between a repository host and its customers rather than the technicalities of RQF requirements though.

Something that was also noted early on was that a disadvantage of open source is that you need a fairly strong IT capability to modify the software. The advantage, of course, is that if you can get your hands on some programmers, you can modify the software to do exactly what you need (within reason I guess) and have a lot of control over that. On the other hand, the turn-key solutions from ProQuest, VTLS and even ePrints hosting mean that you can set things up without the hassle of trying to communicate with software geeks (you get to talk to sales people instead). While the hosted solutions are happy to have input from their customers, the level of responsiveness will vary according to how much money you have and/or whether what you want has a business case.

With all the interest in hosting and turn-key repositories, it would have been good to have the Fez hosting up and running to show off. Stay tuned for more updates in that area.

UPDATE: Presentation slideshows