Open Source scanning with Ephesoft and Alfresco

A document management solution is good at managing ‘content’: controlling access, processing the flow of content, performing transformations, and giving overview and control. But how does the content enter the system? One stream into the system can be fully digital: integrations with other IT systems, email, or the office environment. However, there is a world of paper to manage as well. And how does the paper end up in a DMS? Scanning.

Wouldn’t it be great to have a full blown open source stack from scanning, through validation and indexing, pushed into the DMS and managed until it can be destroyed? Now you can, and Ephesoft is the entrance!

Ephesoft is an Open Source tool that does just that, and more. Ephesoft was founded by former Kofax employees and has quite some domain knowledge on board. They have the capacity to quickly build this product and make it do what it should do. This approach has some similarity with Alfresco: if you can build a product from the ground up, do it the best way one can imagine given current technology and methodology (open source). Learn from the mistakes and lessons of the past, and improve.

Back to the product. Your scanner feeds batches of TIFF files into Ephesoft. In this (web-based) tool people can validate and index the scans. The system is able to recognise types of documents, and from there extract entities like customer numbers, name and address, or other details. These extracted values can be used as metadata in the DMS later on. After processing and validation/correction, the documents can be pushed into the DMS using the CMIS standard. This standard enables the system to push the content (TIFF or PDF) together with the metadata to any CMIS-compliant DMS. For me that is Alfresco, of course.

Having all components Open Source, and having all components web-based, gives a lot of freedom to implement according to your organization’s architecture. Since the mailroom automation tool (Ephesoft) is web-based, one can easily scan at one location, but have indexing performed by people scattered around the world. The people responsible for validation/indexing do not have to install additional software; a web browser is all that is needed!

Ephesoft functionally

Ephesoft is the tool for ‘intelligent document capture’. It has a ‘modular’ flow:

Document Ingestion
Ephesoft receives the content from some source. Obviously this can be batches of (TIFF) images originating from a scanner. Any folder that contains a set of TIFF images is considered a batch. Alternatively, other sources of content can deliver input as well: think of email, a DMS, or any other system that can provide content. In the current release the system processes TIFF only, but the upcoming 1.9 release is designed to process (index, classify and convert to PDF) MS-Office (Word, Excel) and PDF as well. This has the important benefit that no matter where content originates from, there is one route to classify content, extract entities and produce full text searchable PDF output.
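To make the "any folder with TIFF images is a batch" rule concrete, here is a minimal Python sketch. It is not Ephesoft's own code; the function name find_batches and the idea of a single watch directory with one subfolder per batch are assumptions made for illustration only.

```python
from pathlib import Path

# File extensions treated as TIFF images (an assumption for this sketch).
TIFF_SUFFIXES = {".tif", ".tiff"}


def find_batches(watch_dir):
    """Return {folder name: sorted TIFF file names} for every subfolder
    of watch_dir that contains at least one TIFF image."""
    batches = {}
    for folder in sorted(Path(watch_dir).iterdir()):
        if not folder.is_dir():
            continue
        pages = sorted(
            p.name
            for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in TIFF_SUFFIXES
        )
        if pages:  # folders without TIFF images are not batches
            batches[folder.name] = pages
    return batches
```

Calling find_batches on a scanner's drop folder would give one entry per batch, with the pages in file-name order.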

OCR & Barcode
The next step in the process is to get the actual text from the image or document. Considering we are working against an image, this means OCR (Optical Character Recognition). Ephesoft is built using a modular approach, and there is a choice of OCR engines that can do the job. The community version is shipped pre-configured with Tesseract; the Enterprise version is shipped with Recostar Professional, a closed source OCR tool. Alternatively one can use OCRopus as well, if this fulfills the business needs.

Documents that originate from a controlled environment (e.g. sent out by the organisation itself) can already be printed with some sort of bar code. The Ephesoft system is shipped with tooling to recognize Code39 bar codes (one-dimensional bar codes, similar to the ‘regular’ ones found on products in the grocery store), QR codes (2 dimensional codes), and datamatrix codes (also 2 dimensional). The one dimensional code can typically contain some sort of document/customer number, and the 2 dimensional versions can contain a lot more. (QR codes are also used to contain links to internet pages.)

Classification (and Separation)
The Ephesoft software is able to learn what a document’s start page is, and what the document’s last page looks like. Using this information the system is able to import a batch of single page scans and determine when a new document starts. In other words, it determines which single pages make up a document; these pages will be treated as a single document from that point on. The system can deal with all kinds of different document types having different start and end pages, which are recognized by layout and by the occurrence of other clues like keywords, bar codes and the like.

The benefit is that the mailroom does not have to bother with separator pages and the like. A batch of paper originals can be put into the scanner and off you go.
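The separation step can be pictured with a small Python sketch. Ephesoft learns what start pages look like; here that learned classifier is replaced by a hand-written stand-in passed in as is_start_page, so the function names and the keyword rule are purely illustrative assumptions.

```python
def split_into_documents(pages, is_start_page):
    """Group a flat list of scanned pages into documents.

    A new document begins whenever is_start_page(page) is true.
    The very first page always opens a document, even if it is
    not recognized as a start page.
    """
    documents = []
    for page in pages:
        if is_start_page(page) or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def looks_like_start(page_text):
    """Hypothetical keyword rule standing in for a learned classifier."""
    return page_text.startswith(("Invoice", "Dear"))
```

With pages given as OCR text, split_into_documents(pages, looks_like_start) returns one list of pages per detected document.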

One of the cool things about Ephesoft is that it is able to recognize entities inside a document. Things like names, zip codes, document numbers, you name it, can be recognized and stored in metadata fields. The additional value of having these types of information in metadata fields is enormous. Any DMS-like system can use these fields to store the resulting document in a vault of choice, ordered by customer number, business process, or however your DMS is ordered, and to support whatever reporting you need to execute against it.
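As a rough illustration of entity extraction, the Python sketch below pulls two fields out of OCR text with regular expressions. The field names, the "Customer number:" label and the Dutch-style zip code pattern are assumptions chosen for the example; they do not describe how Ephesoft defines or matches its fields.

```python
import re

# Hypothetical field definitions: one regular expression per metadata field.
FIELD_PATTERNS = {
    "customer_number": re.compile(r"Customer number:\s*(\d+)"),
    "zip_code": re.compile(r"\b(\d{4}\s?[A-Z]{2})\b"),  # e.g. "1234 AB"
}


def extract_fields(text):
    """Return a dict with the first match found for each known field."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields
```

Fields that are not found are simply left out of the result, which is the situation a human validator would then have to correct.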

Ephesoft community is shipped with two end points for the finished PDFs. These PDFs contain a text layer, and can therefore be indexed by any search engine (often included in a DMS) to allow for full text searching. One of these end points is a folder on some network, for example a CIFS folder in Alfresco. The disadvantage is that storing a plain file means losing the extracted metadata. There are of course business uses where one doesn’t care about these additional metadata fields. However, I often do.

The alternative is to use CMIS as a mechanism to store the PDF. CMIS (Content Management Interoperability Services) is an open standard maintained by the OASIS consortium, which I described earlier. Using CMIS, applications can interchange documents and related metadata. Ephesoft can use CMIS to export the PDF and the extracted metadata to any CMIS-compliant repository. (See more on “Ephesoft and Alfresco integration using CMIS”).

User interface for managing Batches
Ephesoft comes with two web based interfaces. One UI is meant for people guarding the quality of the indexing process. The Ephesoft engine tries hard to recognize the entities as defined. However, recognition will never be 100% accurate. This interface gives people an overview of the scanned batches, in terms of priority and status. Per batch one can inspect all documents, and correct the metadata fields that Ephesoft was unable to determine. In this interface one can rotate pages, move pages from one document to another, etc. This interface has nice keyboard shortcuts to increase efficiency. (For all details, see the User Manual)

User interface for configuring the system (i.e. admin interface)
This system needs to be administered of course. The Admin interface allows one to configure how batches behave. Modules can be enabled and configured per batch class. One batch class can have a CMIS endpoint into system A, another into system B, and a third can point to some file system folder in the organization’s network. (For all details, see the Admin Manual).

Here the admin can configure the current Batch Class. Each of the modules can be selected and configured.

A downside in the current release is that user management is delegated to the application server. From a business point of view this can be a good thing. For a demo, however, it means one has to edit Tomcat’s tomcat-users.xml file.

Ephesoft technically
Ephesoft is a Java based tool that runs as a web application on a server. This means that the end users only need a web browser! The server uses jBPM as the basis for its flow of actions. Each module can be configured, and easily switched on or off, or even replaced by a different module providing the same functionality.

The server is built for clustering (avoiding a single point of failure, high availability) and usage in the cloud. This allows various robust and up-to-date architectures. Why run on-premise if ‘in the cloud’ is sufficient? Parameters like scan volume, available bandwidth and organizational distribution all need to be considered when drawing your ideal architecture. The tool can adapt to many of the possibilities; it will not get in your way.

The Ephesoft application is a Java application and runs in an Application Server. Community is shipped inside Tomcat, but specific versions of JBoss, WebLogic and WebSphere are supported as well. The application uses ImageMagick for all graphical transformations (mainly scaling), which it assumes to exist on the server. The OCR tool is a module in Ephesoft that can be replaced by any other. Enterprise comes with Recostar Professional, Community comes with Tesseract by default. Especially on Linux I can see it easily being replaced by OCRopus as well. hOCR is the tool that can join a multi page TIFF with the hOCR files (these files contain the actual text on the TIFF, and the location of this text on the page). This tool is available for Linux as well as Windows. The good news about the current Windows-only Community distribution is that it is fully pre-configured. The database is currently MySQL, but in the near future one can expect MSSQL, Oracle, PostgreSQL and DB2 as well. One can find all the details online of course.

The business model
Ephesoft has a subscription model similar to Alfresco’s. The Community Edition is shipped under the GPL v2 license, thus free to use, but with no support or guarantees whatsoever. The Enterprise Edition is better QA tested, and comes with a commercial OCR solution. The subscription is CPU based, not based on the number of scans or users. One can get various levels of support, from logging calls to 24×7 and more. See their website for more details.

The proof
I have been playing a lot with Ephesoft to make it technically work. I like to learn the boundaries of a system by breaking them, and I did. Up front I had a few goals:

  1. Make Ephesoft store the end result PDFs and related metadata into Alfresco using CMIS. This worked, although not to the level of detail I hoped for. See “Configuring Ephesoft and Alfresco using CMIS integration” for the details. The conclusion is that it works for the simple types of attributes. In the current community version, CMIS transport fails for data types like Long and Double (CMIS only knows Integer and Decimal). To be worked out later.
  2. Have Ephesoft and Alfresco share a single Tomcat install. This scenario is not real-life at all; in a production environment these will be split ‘by design’. However, it does give an insight into how Ephesoft is configured, what it needs and how to tweak it. It is also nice for demo purposes (better to have 1 VM running than 2). I tried this on the Windows platform, and the conclusion is: “Almost there”. I was able to put them both into one Tomcat, but the final step fails (merging the TIFF and the extracted text layer into the (multi page) PDF).
    [update 03 jan 2011] This blog is discontinued  due to time restrictions and priorities-in-life in general.
  3. Being an Open Source adept, a server runs on Linux of course. ‘Unfortunately’, the current customer base of Ephesoft runs Windows, and they have not had requests for Linux support (until the Alfresco DevCon they were presenting at in New York, November 2010). Since it is a Java Application running in Tomcat, this should not be a big problem. In order to do so I first made Tesseract OCR work on Linux (See “Alfresco using Tesseract OCR on Ubuntu Linux“), since Ephesoft Community comes with a pre-compiled Tesseract for Windows only. My next step would be to make the web application work on Linux (see “DRAFT: Install Ephesoft on Ubuntu Linux” for the details – mind: ‘draft’). The latter did not succeed, because it appeared there are some hard coded Windows paths in the Java logic, and I did not have the time to go to SVN and fix it myself (this is an after-work-hours project after all).
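The data-type mismatch from the first goal (Long and Double on the capture side, versus Integer and Decimal in CMIS) boils down to a type mapping. The Python sketch below shows what such a mapping could look like. The table and the function name to_cmis_type are my own assumptions for illustration and not how Ephesoft implements its export; only the CMIS property type names on the right-hand side come from the CMIS standard.

```python
# Hypothetical mapping from Java-style field types to CMIS property types.
CMIS_TYPE_FOR = {
    "String": "string",
    "Long": "integer",
    "Integer": "integer",
    "Double": "decimal",
    "Float": "decimal",
    "Boolean": "boolean",
    "Date": "datetime",
}


def to_cmis_type(source_type):
    """Translate a source field type into a CMIS property type.

    Raises ValueError for types that have no known CMIS equivalent,
    so a bad mapping is noticed before the export is attempted.
    """
    try:
        return CMIS_TYPE_FOR[source_type]
    except KeyError:
        raise ValueError(
            f"No CMIS property type known for {source_type!r}"
        ) from None
```

Failing early on an unknown type is a deliberate choice here: it is easier to fix a mapping than to debug a document that arrived in the repository without its metadata.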

The near future
[updated 03 jan 2011] Last week version 1.9.x was released (29th of December 2010). Among other things, it contains:

  • Email import with conversion of attachments to TIFF. It will convert doc, xls, pdf, etc…
  • Form processing, zone OCR/ICR/OMRPDF/
  • Multi page tiff import
  • Table extraction
  • much more…

I am really looking forward to this! In addition, user groups will be introduced in the near future, so batches can be assigned to particular groups (instead of every member being able to see/process all batches, which is the only possibility right now).

My conclusion
First of all I love the product, and I love the company for their prompt responses, and the time they take to answer my questions and help me out with errors I ran into (often meaning mis-configurations of mine).

From an architecture point of view I like the fact that it is a web-application with just a server component, and a web browser as clients. The architectural possibilities are great, I can see this working.

I really love the concept of ‘document inception’. Of course it doesn’t matter where documents originate from, but they should be equally classified and entity-extracted. The source can be a scan, email or a DMS-like application. The ‘pipe’ to get the metadata is the same. And Ephesoft can provide it.

A downside is that the product is relatively new. This means there is no lively community yet. I noticed the issue management solution is under construction, but then again, the software needs to be developed simultaneously. The same is true for the documentation: the user guide and admin guide are there, but the wiki could use some active members. The website communicates clearly what is in the next release and when to expect it. Up till now they have pretty much delivered according to plan.

A point of attention can be the user management. From one point of view, not every application should have its own user management. On the other hand, if there is no decent application server, it would be nice if there were a bare minimum of user management that can be configured ‘in use’, or delegated to some Application Server/LDAP/AD-like service.

I was a bit disappointed by the CMIS integration. Mapping my PDF output onto the default cmis:document worked well, but the fun starts only when all my captured, valuable metadata comes to my DMS as well. It appears some details are a bit buggy, but the bottom line is that using the current Community Edition, it is impossible to get the ‘demo documents’ with all their metadata into Alfresco.

I like the tool, I am surprised by the breadth of functionality, and I have a warm feeling about the company. Contacts are great, and the staff is very helpful.

Stuff I have to do

  • I still have to figure out how the actual quality of the OCR and the entity extraction is, compared to the closed source competition.
  • I need to validate if system calls are hard coded indeed. Possibly fix this to properties that are configurable. If this is fixed, I can continue my Linux experiment.

[05 jan 2011: removed recommendations]
[16 jan 2011: added link to draft blog Ephesoft on Ubuntu linux]

5 Responses to “Open Source scanning with Ephesoft and Alfresco”

  1. 1 RPM January 19, 2011 at 06:43

    Thanks for documenting your journey. I really appreciate your sharing what you have discovered and the tips for using ephesoft, I had never heard of it before.

  2. 2 Don Field March 14, 2011 at 18:18

    Very insightful comments. Thanks for sharing.

  3. 3 Thomas Brandner October 3, 2011 at 13:24

    Never heard before – tnx!

  4. 4 Will April 23, 2013 at 00:30

    Successful test when using Ephesoft Enterprise Edition exporting to Alfresco. There is also a CMIS config for using or testing Ephesoft’s Cloud option and exporting to Alfresco.. Welcome to Ephesoft 2013!