Some Background on the 1.5.0 Release
This was supposed to be a status update, but I got a bit carried away, so here’s something completely different: the tale of how I became involved as a developer for MyDMS.
My employer, Sun Microsystems, generously allows me to develop MyDMS as part of my day-to-day work. The improvements I have been making are directly related to features required for a project management platform that I maintain. My involvement with MyDMS is quite accidental: while evaluating options for a document storage and management platform, I stumbled across a small internal repository that embeds MyDMS. The person who had set it up had wanted a simple tool with a small footprint to manage project documentation. Unfortunately, usage of the system had spiralled and performance had degraded to the point where it was extremely painful to use.
I offered to look into the problem.
An initial examination highlighted two main issues. First, the explorer pane was very cumbersome and took a very long time to render. Second, searching was so slow that it usually timed out. I got around the first problem by simply disabling the explorer view; it’s not needed, as there is a clickable folder path displayed at the top of every page. The second problem, searching, was solved by looking at the structure of the data and examining how it was being used by the software.
What I discovered was quite interesting. Overall, the code is very well organised and highly structured. There is a sensible object model in place. Also, the database is largely sound. But there was a flaw: the application was attempting to emulate a traditional file system hierarchy using data structures that resemble inodes, complete with pointers to sub-folders. So, in order to identify all the descendants of a particular folder, it was necessary to recursively query the database: one query for the top-level folder, then one query for each of the folders found during that query, and so on. On our internal system, the system load would spike to 100% for minutes at a time when trying to resolve a folder’s hierarchy.
To make matters worse, most of the searching was being done in the PHP front-end, not in the database. The code would query the database for information about the folder where the search commenced and retrieve the list of folders and documents found. Each document record found was then compared against the search criteria and a decision made about whether to store the record or discard it before moving to the next one. This was repeated, recursively, for each folder found. In our structure, this could result in up to 200 queries. It sort of makes sense within the object model used, but it means that you are losing all the features and benefits of using an optimised database management system.
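To make the cost of that pattern concrete, here is a minimal sketch in Python with SQLite. It is not the MyDMS code: the `folders` and `documents` tables, their columns, and the `recursive_search` function are all hypothetical stand-ins, chosen only to show how the query count grows with the number of folders.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory tree. Table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE folders (id INTEGER PRIMARY KEY, parent INTEGER, name TEXT);
        CREATE TABLE documents (id INTEGER PRIMARY KEY, folder INTEGER,
                                name TEXT, keywords TEXT);
        INSERT INTO folders VALUES
            (1, NULL, 'root'), (2, 1, 'specs'), (3, 1, 'plans'), (4, 2, 'archive');
        INSERT INTO documents VALUES
            (1, 1, 'readme.txt', 'intro'),
            (2, 2, 'api.doc', 'design api'),
            (3, 4, 'old-api.doc', 'design legacy'),
            (4, 3, 'schedule.xls', 'dates');
    """)
    return conn


def recursive_search(conn, folder_id, keyword, stats):
    """Walk the tree one folder at a time, filtering in application code."""
    hits = []

    # Query 1 for this folder: fetch every document, matching or not.
    stats["queries"] += 1
    docs = conn.execute(
        "SELECT name, keywords FROM documents WHERE folder = ? ORDER BY id",
        (folder_id,),
    ).fetchall()
    for name, keywords in docs:
        # The comparison happens here, not in the database.
        if keyword in keywords:
            hits.append(name)

    # Query 2 for this folder: find the sub-folders, then recurse into each.
    stats["queries"] += 1
    children = conn.execute(
        "SELECT id FROM folders WHERE parent = ? ORDER BY id", (folder_id,)
    ).fetchall()
    for (child_id,) in children:
        hits.extend(recursive_search(conn, child_id, keyword, stats))
    return hits


if __name__ == "__main__":
    stats = {"queries": 0}
    print(recursive_search(build_demo_db(), 1, "design", stats), stats)
```

In this sketch every folder costs two queries regardless of whether it contains a match, so the total scales with the size of the tree rather than with the number of results.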
Now, all the top-level document information is held within a single table. If one discards the folder structure for the time being, a search against a keyword, the document’s original name or the comment field can be done using a single query, irrespective of the size of the database. Brilliant. This forms the basis of what went into MyDMS 1.5.0.
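As a rough illustration of the single-query idea, the sketch below again uses Python and SQLite. The `documents` table and its `name`, `keywords` and `comment` columns are assumptions made for the example, not the actual MyDMS schema, and `search_documents` is a hypothetical helper.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory table. Column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, folder INTEGER,
                                name TEXT, keywords TEXT, comment TEXT);
        INSERT INTO documents VALUES
            (1, 1, 'readme.txt', 'intro', 'start here'),
            (2, 2, 'api.doc', 'design api', ''),
            (3, 4, 'old-api.doc', 'legacy', 'superseded design'),
            (4, 3, 'schedule.xls', 'dates', 'budget review');
    """)
    return conn


def search_documents(conn, term):
    """Match the term against three fields in one query, ignoring folders."""
    # Note: '%' or '_' inside term act as LIKE wildcards; this sketch
    # does not escape them.
    pattern = f"%{term}%"
    rows = conn.execute(
        """SELECT name FROM documents
           WHERE keywords LIKE ? OR name LIKE ? OR comment LIKE ?
           ORDER BY id""",
        (pattern, pattern, pattern),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(search_documents(build_demo_db(), "design"))
```

However many folders exist, this is one round trip to the database, and the filtering is done by the database engine rather than by the front-end.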
I presented this result to the users, and they were delighted. I still had to solve for folders, though. A sub-folder search is, after all, very useful. I never really found an answer for this that I was entirely happy with, partly because I cannot just change the database structure in a way that breaks compatibility. In order to keep the query count down, what I have done is to store the hierarchy as a colon-separated string in the document table. This allows the folder identifier to be part of the query against the document table. It does introduce some overhead when moving folders about, but it is much, much lower than anything we had before.
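Here is a minimal sketch of how a colon-separated hierarchy string can serve both the sub-folder search and the folder move. Everything specific in it is an assumption for illustration: the `folder_path` column name, the leading and trailing colons (used so that folder 2 does not match folder 12), and the `search_in_folder` and `move_folder` helpers.

```python
import sqlite3


def build_demo_db():
    """Each document stores its full chain of ancestor folder ids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT,
                                keywords TEXT, folder_path TEXT);
        INSERT INTO documents VALUES
            (1, 'readme.txt',   'intro',         ':1:'),
            (2, 'api.doc',      'design api',    ':1:2:'),
            (3, 'old-api.doc',  'design legacy', ':1:2:4:'),
            (4, 'schedule.xls', 'design dates',  ':1:3:'),
            (5, 'notes.txt',    'design',        ':1:12:');
    """)
    return conn


def search_in_folder(conn, folder_id, term):
    """Search a folder and all its descendants with a single query."""
    rows = conn.execute(
        """SELECT name FROM documents
           WHERE folder_path LIKE ? AND keywords LIKE ?
           ORDER BY id""",
        (f"%:{folder_id}:%", f"%{term}%"),
    ).fetchall()
    return [name for (name,) in rows]


def move_folder(conn, old_prefix, new_prefix):
    """Rewrite the stored path of every document under a moved folder."""
    # substr() is 1-indexed, so this keeps whatever follows old_prefix.
    conn.execute(
        """UPDATE documents
           SET folder_path = ? || substr(folder_path, ?)
           WHERE folder_path LIKE ?""",
        (new_prefix, len(old_prefix) + 1, old_prefix + "%"),
    )


if __name__ == "__main__":
    conn = build_demo_db()
    print(search_in_folder(conn, 2, "design"))
    move_folder(conn, ":1:2:4:", ":1:3:4:")  # folder 4 moves from 2 to 3
    print(search_in_folder(conn, 2, "design"))
```

The trade-off is visible in `move_folder`: moving a folder means touching every document beneath it, which is the overhead mentioned above, but it is a single statement rather than a walk of the tree.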
I’d like to examine the use of a metadata-driven file system in MyDMS instead of the hierarchical file system, but this represents a significant change to the underlying structure and will have to wait until I have time to experiment. Plus, breaking compatibility requires a major version number change and a substantial flak jacket.
The rest of the changes that were made for 1.5.0 are in the changelog. I just expanded on the more interesting bits. Plus, right now, I can’t remember what else I changed and I certainly don’t remember why…