(How) Is your Alfresco repository being used?

In a recent post I just discovered this “Alfresco generate reports with JasperReports” project. It seemed to fit my needs for reporting against Alfresco and Share, and I was very happy and exited. However… It appeared to be too complex to use.

I try to be as little indispensible as possible. If someone from ‘the business’ is in need of reporting, I try to give access to a tool that suits his/her needs. If they need to create or modify a report, they should not be dependent of me, or anyone else capable of doing Alfresco Magic. Being build op Open Standards and proven technologies does not mean everybody is capable of doing everything with it

My view on reporting is:

  1. ‘The business’ should be able to create/maintain reports.
  2. The reporting environment should support Aspects. If one adds a type with some additional properties, the reporting environment should be able to deal with it, without manual reconfiguring.

I was charmed by the basic approach of the “Alfresco generate reports with JasperReports” project; the scripted access to the Java logic to push the data into the reporting, and the use of JavaScript controlled JasperReports/iReport to define the report definition (can be used in Rules, Scheduled jobs, Actions, Workflow…). That I would keep.

I basically investigated 3 approaches:

Go XML. This is the approach of the “Alfresco generate reports with JasperReports” project. It is nice, but has same disadvantages. First of all the way to configure the data feed. XPath and XQuery is not something I trust the business with. You need more knowledge to construct a query than iReport will bring. If the business should be able to construct/maintain reports, it is a no-go. Next to that, the xml feed needs to be cached somewhere, real time generation is a pain when there is a lot of content.

Go NOSQL. Considering the Aspect-nature of Alfresco, a flexible approach would be one that accepts all kind of metadata fields without reconfiguring the reporting environment. NOSQL should be able to cater for that. MongoDB would be a nice candidate, kind of mature, JSON to interact with the database and multi platform availability. Downside is that the query language is not ‘really common’. Considering my customers, I would be the query wizzard again. This seems technically the most fancy solution, but it does not solve my problem; the business will not be able to operate the reporting by themselves (although iReport would ‘support’ it).

Go SQL. The big plus would be that some SQL knowledge is out there. There would be a lot more resources other than me to create and maintain reports. there also is enough tooling of choice to define and generate these reports. Downside is that a SQL database has a rigid set of columsn and column definitions. Manual extending a set of table definitions if someone happened to extend some aspects in the repository would be ‘unacceptable’. If this could be auto-configured, I have a nice solution…

But how do I get the Aspect flexibility in the reporting tables? Can I use Alfresco’s own repository? In my Filenet days I created a view over the database to allow (real time) reporting against the production database. Feasible if there is hardly any load (on reporting nor document repository). For my intended use a separate DB would be required. I started creating a view on top of the DB, but stopped developing it. I do not like to be depending on Alfresco’s changes in the DB, they own the database schema, not me. Even ignoring any performance penalties.

My current implementation is based on a separate reporting database, JavaScript controlling a Freemarker search and some Java code to fill the reporting database, and a JavaScript method that creates any report against the external database.

Filling the reporting database
I fire a Freemarker Lucene query against the database. This is controlled by JavaScript, that targets at a specific TYPE query (excluding some derived types), and optionally includes a limitation in creation date (e.g. only look for stuff created since last query has run).  From the result set it creates a text based list of token separated key/values, and a super set of the list of token-separated  property name + type’s, given the TYPE configured. The bulk of the properties are just iterated over the array of properties of an object. Some properties are specifically though (like Site name, and the separate year/month of creation/modification of an object. I think these make reporting more easy). Currently, a range of reporting tables is provided, like Site, Content, Folder, Link, DatalistItem, DataList, Forum, Topic, Post.

The solution ran nice, but when tested against a decent-size repository I ran into the problem that a Lucene query never returns more than 1000 objects. A particular Alfresco instance out there can be configured to a higher number, but it is limited anyway. I solved this by doing a search, sort on sys:node-dbid, and search again with the restriction that the sys:node-dbid should be greater that the last one I spotted in the previous result set. Until there is no result anymore, then I got them all. This can of course be extended with the limitation to find objects only with a created/modified timestamp bigger than the last successful search query.

The result is pushed into a Java class that opens a JDBC connection to a configurable database of choice. The super set of result parameters and their type are matched against the description of the existing table. If a property is not existing in the database, I alter the table and add a column.  If all columns are validated, I add the batch of results into the reporting database.

Generating reports against the database
Generating a report is a JavaScript method (living on a new root object). The parameters are the noderef of the JasperReport report definition, the noderef of the resulting report-document, and the type of output (pdf, doc, xls, html for now). It is up to the JavaScript how the resulting document is processed; will it be a brand new object? Will it be a new content element in an existing (previously created) report? Or will it be a new version of an existing document? All combinations are possible and scriptable.  The same for the report definition. Do you execute all reports in a given Space? Are output names in any way connected to names of report definitions? Maybe your reports have an Aspect carrying the last execution date of the report (can be used to limit your result set to documents created/modified after last query only). All is possible in your own scripts.

So far, so good. The reporting database is filled, and I can generate a report against this database, both JavaScript triggered. There still is some work to do. The mechanism to fill the reporting DB is okay, but the implementation needs some rework. The generation of the reports against the DB works nice, but is currently applicable only if the report does not contain sub-reports. (The fix already is work in progress.)  If it works out well I am considering adding my code to the“Alfresco generate reports with JasperReports” project. But not before I fixed the stuff noted below. If you got any enhancements or remarks, please post those here!

The possibilities
With this tool anyone with some imagination and SQL skills can get information from their Alfresco repository.  Some queries I would be interested in:

  • What sites are heavily used?
  • What sites are used by many users? What sites are not? Are there sites without content (–> pollution) or with just one author?
  • Are there Sites where content actually is modified? (Are there back doors in use?)
  • Who are my champion users, who hardly contribute? (–> who to compliment, who to encourage?)
  • What Sites have not had any activity for a period of time? (–> should action be taken?)
  • What is the main use of the repository/Site? Documents, Wiki, Datalists etc? (–> and target training to a set of users)
  • Measure a particular parameter before and after any (organizational) change, and have an indication of success.
  • ….

Enhancements before any release

  • Allow sub-reports in JasperReport report generation.
  • Re-do & refactor some parts of the Java Implementation.
  • Sanitizing the JavaScripts, refactor into correct names and sharing lib’s
  • Externalize namespace translation into property file or alfresco-global.properties to also deal with custom name spaces without Java coding.
  • Create some sample reports
  • Create some sample scripts for frequently filling the reporting DB
  • Create some sample scripts for generating reports (create PDF/DOC/XSL/HTML as new object (unique name with timestamp, as new version, or just replacing the content element of the object.
  • Include report generation using Pentaho.

1 Response to “(How) Is your Alfresco repository being used?”

  1. 1 Bhavana July 13, 2011 at 05:42

    i am also working on alfresco reports generation ……. i checked your blog and it say this works fine with alfresco 3.3g where as i am working on alfesco 3.4.1 are there any changes to be made to make the reports work ..

    thanks in advance

Comments are currently closed.