Alfresco using Tesseract OCR on Ubuntu Linux

In this post I will describe what to download and install to get Tesseract OCR onto an Ubuntu box, and how to integrate it into Alfresco. The goal is to have Alfresco with a custom transformer that converts TIFF to PDF, where the PDF also has a text layer.

This post sets up the next one: how to combine Ephesoft and Alfresco on one Linux box. Ephesoft needs Tesseract for its OCR functionality.

The Alfresco Community Edition installer comes with a JRE these days. If you don't want to use the JRE provided (or need the JDK for Ephesoft as described in “Ephesoft and Alfresco on one Linux box”) you have to install Java yourself. First add a repository where Sun (Oracle?) Java found its home (since Ubuntu appears to be featuring OpenJDK). Open the sources list in an editor:

sudoedit /etc/apt/sources.list

Add this line to the end of the file

deb http://archive.canonical.com/ lucid partner

Save the file, then refresh the package lists and upgrade:

sudo apt-get update
sudo apt-get upgrade
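If you script this step, it helps to add the repository line only when it is not there yet. A minimal sketch, assuming a hypothetical helper named add_source_line (it is not part of apt):

```shell
# Append a line to a file only if that exact line is not already present.
# add_source_line is a hypothetical helper name, not an apt command.
add_source_line() {
  local file=$1 line=$2
  # -x: match the whole line, -F: fixed string, -q: no output
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

Run it as root against /etc/apt/sources.list with the deb line above as the second argument. Calling it twice leaves a single copy of the line.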

Next, install all these packages:

sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev libungif4-dev imagemagick exactimage pdftk build-essential
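A quick way to see whether the tools from these packages actually landed on the PATH. This is only a sketch; missing_commands is a hypothetical helper name, and the command list is my reading of what the packages above provide:

```shell
# Print the names of the given commands that are not found on the PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# gcc and make come from build-essential, convert from imagemagick,
# hocr2pdf from exactimage, pdftk from pdftk
missing_commands gcc make convert hocr2pdf pdftk
```

No output means everything was found.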
Let's get Leptonica from Google Code. This package provides all kinds of ‘helper operations’ related to image recognition. (Version 1.67 is the latest at time of writing; feel free to get a more recent one…)
cd /opt
sudo wget http://leptonica.googlecode.com/files/leptonlib-1.67.tar.gz
sudo tar -xvf leptonlib-1.67.tar.gz

And configure and compile the package. All dependencies should be met.

cd leptonlib-1.67
./configure
make
sudo make install
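Next, get Tesseract, the actual OCR engine (version 3.00 at time of writing).
cd /opt
sudo wget http://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz
sudo tar -xvf tesseract-3.00.tar.gz
Change into the extracted directory (assumed here to be tesseract-3.00, following the Leptonica pattern), then configure and compile it:
cd tesseract-3.00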
./configure
make
sudo make install
You do need training data, and I need English and Dutch. The files need to end up in the tessdata folder.
sudo wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz
sudo wget http://tesseract-ocr.googlecode.com/files/nld.traineddata.gz
gunzip eng.traineddata.gz
gunzip nld.traineddata.gz

Find out where your tessdata folder is located (I bet in /usr/local/share)

whereis tessdata
sudo mv nld.traineddata /usr/local/share/tessdata/nl.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata/en.traineddata
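To confirm the language files ended up where expected, something like the following can help. It is a sketch only: has_traineddata and the TESSDATA_DIR variable are hypothetical names for this example (not Tesseract settings), and the default path is the /usr/local/share/tessdata location guessed above.

```shell
# Succeeds when <lang>.traineddata exists in the given tessdata directory.
has_traineddata() {
  local dir=$1 lang=$2
  [ -f "$dir/$lang.traineddata" ]
}

# TESSDATA_DIR is a variable of this sketch only; override it if whereis says otherwise
TESSDATA_DIR=${TESSDATA_DIR:-/usr/local/share/tessdata}
for lang in en nl; do
  if has_traineddata "$TESSDATA_DIR" "$lang"; then
    echo "$lang: ok"
  else
    echo "$lang: missing in $TESSDATA_DIR"
  fi
done
```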
That’s it, you are set and done. Let's test if it is actually working… find yourself a nice TIF file that you can use to test Tesseract. Then enter on the command line (from the directory containing the file source.tif):
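tesseract source.tif outputname -l en hocr
‘outputname’ is the base name of the text-based file Tesseract will create for you. If you append ‘hocr’ to the command, it will create an output file in the hOCR format (outputname.html), including the location of the sentences on the page. You need this to allow text selection in a PDF (or to perform better quality entity extraction using Ephesoft later on). If you leave it out, you will get a plain text file. Tesseract 3.00 appears to require the -l option directly after the output name, before config names such as hocr; putting hocr first is what produces the "can't open" errors reported in the comments. Tesseract's built-in default language code is eng, so with the training files renamed as above it is safest to pass -l en explicitly. I am interested in the Dutch language, so I will need: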
-l nl
Now, how to get a pdf file from the tiff and the hocr output…
hocr2pdf -i source.tif -o output.pdf <outputname.html

Tada!
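The two steps can be wrapped in one function. This is a sketch under assumptions: tiff2pdf_ocr, run, and the DRY_RUN switch are hypothetical names for this example, and -l is placed directly after the output name, the order used in the working script from the comments.

```shell
# Print a command instead of running it when DRY_RUN is set (hypothetical switch).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# tiff2pdf_ocr SOURCE [LANG]: OCR a TIFF and write <basename>.pdf next to it.
tiff2pdf_ocr() {
  local src=$1 lang=${2:-en}
  local base=${src%.*}              # scan.tif -> scan
  # Step 1: Tesseract writes the hOCR output to $base.html
  run tesseract "$src" "$base" -l "$lang" hocr || return 1
  # Step 2: hocr2pdf reads the hOCR from stdin and writes the PDF
  if [ -n "${DRY_RUN:-}" ]; then
    echo "hocr2pdf -i $src -o $base.pdf < $base.html"
  else
    hocr2pdf -i "$src" -o "$base.pdf" < "$base.html"
  fi
}
```

Calling DRY_RUN=1 tiff2pdf_ocr source.tif nl only prints the two commands, which is handy for checking the file names before running the real thing.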

Stuff to do:

  • Validate how to deal with multi-page TIFFs
  • How to decently add a text layer to an existing graphical pdf

That's one. Now, how to integrate this into Alfresco? First, let's create a command line script that automatically transforms a TIFF into a PDF and does some housekeeping by cleaning up the mess…

#!/bin/bash
###################################### convert2pdf.sh #####################################
# Number of parameters
PARAM=$#
# Test parameters
if [[ $PARAM -lt 2 ]]; then
  echo "Usage: $0 out.pdf input*.png"
  echo "out.pdf is the desired output file"
  echo "input*.png or input*.tif is a list of files to be converted"
  exit 1
fi
# Output file
OUTFILE=$1
shift
# PDF output? basename only strips .pdf when it is present
ATEST=$(basename "$OUTFILE")
BTEST=$(basename "$ATEST" .pdf)
if [ "$ATEST" = "$BTEST" ]; then
  echo "File $OUTFILE is not a pdf-File."
  exit 2
fi

# Do not overwrite existing output files
#if [ -e "$OUTFILE" ]; then
#  echo "File $OUTFILE exists - not overwritten."
#  exit 3
#fi
LIST=
# Convert input files
for (( I=1; I < PARAM; I++ )); do
  FILE=$1
  shift
  echo "Working on: $FILE"
  # Convert: OCR to hOCR ($BTEST.html), then merge image and text layer into the PDF
  # /home/user/tesseract-ocr/api/tesseract $FILE $BTEST nobatch hocr
  # -l en matches en.traineddata as renamed earlier; change it to nl for Dutch
  tesseract "$FILE" "$BTEST" -l en nobatch hocr
  /usr/bin/hocr2pdf -i "$FILE" -o "$OUTFILE" -s < "$BTEST.html"
  # Build the input list of PDFs
  LIST="$LIST$BTEST.pdf "
done
# Create the combined PDF
#echo "Concatenating PDF as $OUTFILE"
#/usr/bin/pdftk $LIST output $OUTFILE
# Clean up the intermediate hOCR files
#echo "Cleaning up"
for FILE in $LIST
do
  F=$(basename "$FILE" .pdf)
  rm -f "$F.html"
done
echo "Finished."
I have put the file in /opt/ocr/convert2pdf.sh. Remember to make the file executable:
chmod ug+x /opt/ocr/convert2pdf.sh
Test your script to make sure it works!
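The script decides whether the output name is a PDF by comparing two basename results. That check can be tried in isolation; the sketch below uses a hypothetical function name, is_pdf_name, for the same comparison:

```shell
# Succeeds when the file name ends in .pdf, using the same basename
# comparison as convert2pdf.sh.
is_pdf_name() {
  local name stripped
  name=$(basename "$1")
  stripped=$(basename "$name" .pdf)
  # basename only strips the suffix when it is present,
  # so identical results mean there was no .pdf to strip
  [ "$name" != "$stripped" ]
}

is_pdf_name /tmp/out.pdf && echo "out.pdf accepted"
is_pdf_name /tmp/out.tif || echo "out.tif rejected"
```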
I have downloaded Alfresco Community Edition and installed it in /opt/alfresco. Personally I used the MySQL from the package manager, not the one shipped with the CE installer:
sudo apt-get install mysql-server mysql-client
If you install without the all-in-one installer, you also need:
sudo apt-get install sun-java6-jdk imagemagick swftools  openoffice.org-headless
But the installer will provide all these things for you.
The description in this blog will work for the Community Edition as well as for the Enterprise Edition.
Now run the installer. It will suggest that the only option is to use the bundled MySQL. This is not true: the second round of questions allows you to reuse the existing MySQL install…
sudo ./alfresco-community-3.4.b-installer-linux-x32.bin

On Windows I had the problem that the installer did not create my database correctly. If that happens to you, create your database and user with the following statements (I love HeidiSQL as a tool):

create database alfresco default character set utf8 collate utf8_bin;
grant all on alfresco.* to 'alfresco'@'localhost' identified by 'alfresco' with grant option;
grant all on alfresco.* to 'alfresco'@'localhost.localdomain' identified by 'alfresco' with grant option;
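If you prefer the command line over a GUI tool, the same statements can be generated and piped into the mysql client. A sketch, assuming a hypothetical helper alfresco_db_sql; the statements are the ones above, and newer MySQL versions may need a separate CREATE USER instead of GRANT … IDENTIFIED BY:

```shell
# Emit the database setup statements for a given database, user, and password.
# alfresco_db_sql is a hypothetical helper name for this sketch.
alfresco_db_sql() {
  local db=$1 user=$2 pass=$3
  cat <<EOF
create database $db default character set utf8 collate utf8_bin;
grant all on $db.* to '$user'@'localhost' identified by '$pass' with grant option;
grant all on $db.* to '$user'@'localhost.localdomain' identified by '$pass' with grant option;
EOF
}

# Usage (prompts for the MySQL root password):
#   alfresco_db_sql alfresco alfresco alfresco | mysql -u root -p
```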

If you cannot log in using admin/admin (I couldn’t, or I did not read the installer screens well enough), find yourself an MD4 hash generator, create a password of choice, and paste its hash into tomcat/shared/classes/alfresco-global.properties as the value of alfresco_user_store.adminpassword

http://gtools.org/tool/md4-hash-generator/
Validate that your system is set up correctly. It starts as a service these days; remember to restart it if you modify these property or XML files. You should be able to log in using admin/admin:
http://localhost:8080/alfresco

Next, increase the amount of memory assigned to Java. Although Alfresco is happy with 512MB of memory in a Linux image, I noticed you need at least 1GB if you play with this OCR script. Open /opt/alfresco/alfresco.sh; I changed this parameter to 1024MB: -Xmx1024m
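The same change can be made with sed instead of an editor. A sketch only: set_xmx is a hypothetical helper, and it assumes the file contains a -Xmx setting of the usual form (for example -Xmx512m).

```shell
# Replace every -Xmx<size> token in a file with the given size, e.g. 1024m.
# set_xmx is a hypothetical helper name for this sketch.
set_xmx() {
  local file=$1 size=$2 tmp
  tmp=$(mktemp) || return 1
  if sed -E "s/-Xmx[0-9]+[kKmMgG]?/-Xmx$size/g" "$file" > "$tmp"; then
    # Write back through the original file so its permissions are kept
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Usage (as root): set_xmx /opt/alfresco/alfresco.sh 1024m
```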

Create the file /opt/alfresco/tomcat/shared/classes/alfresco/extension/img2ocr-transform-context.xml and use this content:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
  <bean id="transformer.worker.img2ocrpdf" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
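      <!-- RuntimeExec is the class Alfresco's stock transformer contexts use for commands; verify for your version -->
      <bean class="org.alfresco.util.exec.RuntimeExec">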
        <property name="commandsAndArguments">
          <map>
            <entry key=".*">
              <list>
                <value>ls</value>
                <value>/opt/ocr/convert2pdf.sh</value>
              </list>
            </entry>
          </map>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
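      <!-- RuntimeExec is the class Alfresco's stock transformer contexts use for commands; verify for your version -->
      <bean class="org.alfresco.util.exec.RuntimeExec">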
        <property name="commandsAndArguments">
          <map>
            <entry key=".*">
              <list>
                <value>/opt/ocr/convert2pdf.sh</value>
                <value>${target}</value>
                <value>${source}</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2,3</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
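        <!-- The class name really is spelled "Explict" in Alfresco 3.x; verify for your version -->
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">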
          <property name="sourceMimetype"><value>image/tiff</value></property>
          <property name="targetMimetype"><value>application/pdf</value></property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.img2ocrpdf" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
    <property name="worker">
      <ref bean="transformer.worker.img2ocrpdf" />
    </property>
  </bean>
</beans>
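The errorCodes value 1,2,3 lines up with the exit codes used in convert2pdf.sh, so presumably any of those exit statuses marks the transformation as failed. To see what the script returns before Alfresco gets involved, you can run it by hand and classify the result the same way. A sketch with hypothetical helper names (is_error_code, check_transform):

```shell
# Succeeds when CODE appears in the comma-separated LIST (e.g. "1,2,3").
is_error_code() {
  local code=$1 list=$2
  case ",$list," in
    *",$code,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run the transform script with target and source in the same order as
# the transformCommand above, then classify its exit status.
check_transform() {
  /opt/ocr/convert2pdf.sh "$1" "$2"
  local status=$?
  if is_error_code "$status" "1,2,3"; then
    echo "transform failed with exit code $status"
    return 1
  fi
  echo "transform ok"
}

# Usage: check_transform /tmp/out.pdf /tmp/source.tif
```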
To check that your transformer is registered, restart Alfresco and point a web browser to the following URL (valid for Alfresco version 3.4 and above only!):
http://192.168.30.128:8080/alfresco/service/mimetypes
Start Alfresco, upload a tiff file into a space, and invoke “Run action” in the action menu on the right on the details page. Select “Transform and copy content” (NOT image!). Select a space where the PDF should end up, and off you go!

With kudos to my colleague Kees van Bemmel for triggering this and for bits and pieces.

[update feb 01 2012: added name of the transformer bean: img2ocr-transform-context.xml]


15 Responses to “Alfresco using Tesseract OCR on Ubuntu Linux”


  1. 1 arnaud December 23, 2010 at 15:06

    Thank you for this very interesting article.

    I followed it but I can’t get tesseract working for hocr output -> read_varaibles_files : can’t open

    do you have any clue about that ?
    how did you get it work ?

    I’m very impatient to see the following article about ephesoft

    • 2 tpeelen December 23, 2010 at 16:49

      I hope your impatience is satisfied already 😉
      Can you share a bit more detail about your issue? What os are you on? What did you do?

      • 3 arnaud December 24, 2010 at 10:17

        plenty satisfied about the followings articles

        sorry for the lack of information; I wasn’t sure this was the right place to post a technical issue.
        but I guess it can be interesting for other people who want to follow your setup guidelines.

        I’m on ubuntu (vmware) 10.10 server edition

        I followed your steps until :
        tesseract source.tif outputname hocr -l en
        which is not working although
        tesseract source.tif outputname -l en
        is working

        don’t bother too much in helping me, I’ll try to handle it myself

        moreover it seems google is aware about this problem

  2. 4 Christophe January 28, 2011 at 16:00

    I can’t say more than Arnaud !

    Same interest…and same problem with the “hocr” option ;o)

    I tried to create an hocr config file in /usr/local/share/tessdata/configs or /usr/local/share/tessdata/tessconfigs or /usr/share/tesseract/tessdata/tessconfigs/ or /usr/share/tesseract/tessdata/configs/ with the option “tessedit_create_hocr 1″…no luck.

    I also have a problem with “/usr/bin/hocr2pdf -i $FILE -o $OUTFILE -s < $BTEST.html" :
    $BTEST.html: no such file

    I would be happy if you could tell me more about this.

    Regards

    • 5 pwizzle August 4, 2011 at 16:06

      I think hocr2pdf was specifically designed for cuneiform in mind. This is a snippet from their website

      hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr

      As for Tesseract: the default installation of 3.0 doesn't include the config file called hocr (on Windows, anyway). You'll need to go to your Tesseract/tessdata/configs/ folder, create a file named hocr (no extension), and put tessedit_create_hocr 1 on line 1 of its contents.

  3. 6 Richard December 27, 2011 at 23:46

    Hello, I’m trying to just generate the txt file but I always get a blank file, I’m using a simple thing like “tesseract $1 $2”, could you point out what I’m doing wrong? I think I have to copy the contents from one file to another but I’m lost here. Thank you.

  4. 8 gekookteworst January 25, 2012 at 20:12

    /opt/alfresco/tomcat/shared/classes/alfresco/extension is a directory, not a file. In it there are multiple configurations one can use to override specific settings. So question, where do you put the “beans” xml and with what title.

    • 9 Tjarda Peelen February 1, 2012 at 00:34

      Good point, this file name appeared to be missing indeed. Just added it (img2ocr-transform-context.xml) to the text. Thanks!
      (Actually, any name will do, as long as it ends on -context.xml)

  5. 10 acurs April 12, 2012 at 01:48

    Thanks a lot for this post really useful, however for the installation in Ubuntu 10.10 I had to install automake libtool, and couple more things.
    For alf community 3.4.d I had to change the transformer context file to make it work since it didn’t startup due to errors in the definitions of the beans. In the main bean declaration I had to add class=”org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker”> and in the properties bean
    and
    and in the very last bean –>> class=”org.alfresco.repo.content.transform.ProxyContentTransformer”
    parent=”baseContentTransformer”>

    Hope it helps someone…
    Cheers

    • 11 Tjarda Peelen April 15, 2012 at 20:58

      Hhmm, interesting. I am on the Enterprise versions usually. Haven’t seen this one, but thanks for adding the solution in the comments!!!

      Cheers,

      Tjarda

  6. 12 Enrique April 23, 2012 at 12:23

    Hello, I’m testing your *.sh script, and it appears a bunch of errors when I execute it

    basename: missing operand
    Try `basename --help' for more information.
    ./convert2pdf.sh: line 18: [: =: unary operator expected
    Working on: /home/bitnami/Desktop/test.tif
    ./convert2pdf.sh: line 36: .pdf.html: No such file or directory
    ./convert2pdf.sh: line 45: syntax error near unexpected token `do'
    ./convert2pdf.sh: line 45: `do'

    Is there a solution?
    Thanks for your attention

  7. 13 Saif Eddine Romdhane June 29, 2012 at 00:34

    Thanks a lot it was really useful, i had problems with the script : ./convert2pdf.sh: line 18: [: =: unary operator expected
    Working on: File.TIF
    ./convert2pdf.sh: line 36: TestFile.pdf.html: No such file or directory
    ./convert2pdf.sh: line 45: syntax error near unexpected token `do'
    ./convert2pdf.sh: line 45: `do'
    so i have made some modifications and it work well.

    #!/bin/bash
    ###################################### convert2pdf.sh #####################################
    #Number of Parameters
    PARAM=$#
    #Test parameters
    if [[ $PARAM -lt 2 ]];
    then
    echo "Usage: $0 out.pdf input*.png"
    echo "out.pdf is the desired output file"
    echo "input*.png or input*.tif is a list of files to be converted"
    exit 1
    fi
    #Outputfile
    OUTFILE=$1
    shift
    #PDF output?
    ATEST=$(basename $OUTFILE)
    BTEST=$(basename $ATEST .pdf)
    if [[ ${OUTFILE: -4} != ".pdf" ]]
    then
    echo "FILE $OUTFILE is not a pdf-file."
    exit 1
    fi

    #Do not overwrite existing output files
    if [ -e $OUTFILE ]; then
    echo "File $OUTFILE exists - not overwritten."
    exit 3
    fi
    LIST=
    #Convert input files
    for (( I=1; $I < $PARAM; I++ ))do
    FILE=$1
    shift
    echo "Working on: $FILE"
    #Convert
    # /home/user/tesseract-ocr/api/tesseract $FILE $BTEST nobatch hocr
    tesseract $FILE $BTEST$I -l en hocr
    /usr/bin/hocr2pdf -i $FILE -o $I.pdf < $BTEST$I.html
    #Eingangsliste der PDFS anlegen
    LIST="$LIST$I.pdf "
    done

    #Concatination
    echo "Concatenating PDF as $OUTFILE"
    /usr/bin/pdftk $LIST output $OUTFILE

    #cleaning up
    echo "Cleaning up"
    rm *html
    for (( J=1;J<PARAM;J++ ))do
    rm $J.pdf
    done
    echo "Finished."

