In this post I will describe what to download and install to get Tesseract OCR onto an Ubuntu box, and how to integrate it into Alfresco. The goal of this blog is to have Alfresco and a custom transformer that can transform tiff to pdf, where the PDF also has a text layer.
This post sets up the next one, on how to combine Ephesoft and Alfresco on one Linux box. Ephesoft needs Tesseract for its OCR functionality.
The Alfresco Community Edition installer comes with a JRE these days. If you don't want to use the JRE provided (or need the JDK for Ephesoft as described in “Ephesoft and Alfresco on one Linux box”) you have to install Java yourself. First add a repository where Sun (Oracle?) Java found its home (since Ubuntu appears to be featuring OpenJDK):
sudo nano /etc/apt/sources.list
Add this line to the end of the file
deb http://archive.canonical.com/ lucid partner
sudo apt-get update
sudo apt-get upgrade
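If you prefer not to edit the file by hand, the line can also be appended from the shell. The sketch below is only an illustration: `add_repo_line` is a helper name made up for this post, and it assumes the target file already ends with a newline. It appends the line only when that exact line is not in the file yet, so running it twice does no harm.

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already present.
# add_repo_line is a hypothetical helper name, not part of apt.
add_repo_line() {
  local file=$1 line=$2
  # -x: match the whole line, -F: fixed string (no regex), -q: quiet
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

For the real file you would run this as root, for example `add_repo_line /etc/apt/sources.list "deb http://archive.canonical.com/ lucid partner"`, and then run the two apt-get commands.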
Next, install all these packages:
sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev libungif4-dev imagemagick exactimage pdftk build-essential
cd /opt
sudo wget http://leptonica.googlecode.com/files/leptonlib-1.67.tar.gz
sudo tar -xvf leptonlib-1.67.tar.gz
And configure and compile the package. All dependencies should be met.
cd leptonlib-1.67
./configure
make
sudo make install
cd /opt
sudo wget http://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz
sudo tar -xvf tesseract-3.00.tar.gz
cd tesseract-3.00
./configure
make
sudo make install
sudo wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz
sudo wget http://tesseract-ocr.googlecode.com/files/nld.traineddata.gz
gunzip eng.traineddata.gz
gunzip nld.traineddata.gz
Find out where your tessdata folder is located (I bet in /usr/local/share)
whereis tessdata
mv nld.traineddata /usr/local/share/tessdata/nl.traineddata
mv eng.traineddata /usr/local/share/tessdata/en.traineddata
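If whereis comes back empty, a small loop over the likely parent folders does the same job. This is a sketch, not something from my original setup: `find_tessdata` is a hypothetical helper, and the candidate folders you pass in are guesses that depend on how Tesseract was installed.

```shell
#!/bin/bash
# Print the first existing tessdata folder below the given parent folders.
# find_tessdata is a hypothetical helper; the candidates are assumptions.
find_tessdata() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir/tessdata" ]; then
      printf '%s\n' "$dir/tessdata"
      return 0
    fi
  done
  # Nothing found: signal failure so the caller can stop before moving files
  return 1
}
```

For example, `TESSDATA=$(find_tessdata /usr/local/share /usr/share/tesseract /usr/share)` and then move the traineddata files into "$TESSDATA".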
tesseract source.tif outputname hocr -l en
For Dutch, use -l nl instead of -l en.
hocr2pdf -i source.tif -o output.pdf <outputname.html
Tada!
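The two commands can be wrapped in one small function. Treat this as a sketch under assumptions: `tiff_to_pdf` is a name invented here, tesseract and hocr2pdf are expected on the PATH, and the language option is placed before the hocr config name because that is the order I remember from the Tesseract 3 usage text (imagename outputbase [-l lang] [configfile]). Swap the order back if your build behaves differently.

```shell
#!/bin/bash
# OCR one TIFF into a PDF with a text layer.
# tiff_to_pdf is a hypothetical wrapper; usage: tiff_to_pdf source.tif output.pdf [lang]
tiff_to_pdf() {
  local src=$1 out=$2 lang=${3:-en}
  # Output base for tesseract: the PDF name without its .pdf suffix
  local base=${out%.pdf}
  # With the hocr config, tesseract writes its result to <base>.html
  tesseract "$src" "$base" -l "$lang" hocr || return 1
  # hocr2pdf reads the hOCR from stdin and combines it with the image
  hocr2pdf -i "$src" -o "$out" < "$base.html" || return 1
  # Remove the intermediate hOCR file
  rm -f "$base.html"
}
```

Calling `tiff_to_pdf source.tif output.pdf nl` would then produce output.pdf and leave no .html file behind.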
Stuff to do:
- Validate how to deal with multi-page TIFFs
- How to decently add a text layer to an existing graphical pdf
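For the multi-page item, one possible approach is to split the TIFF into single pages with ImageMagick, OCR each page, and glue the page PDFs together with pdftk (both were installed earlier). I have not validated this against real multi-page scans, so take it as a sketch: `ocr_multipage_tiff` is a hypothetical name, the page file naming is my own choice, and the temporary folder is not cleaned up when a step fails.

```shell
#!/bin/bash
# Split a multi-page TIFF, OCR every page, and merge the results into one PDF.
# ocr_multipage_tiff is a hypothetical helper; usage: ocr_multipage_tiff in.tif out.pdf [lang]
ocr_multipage_tiff() {
  local src=$1 out=$2 lang=${3:-en}
  local work page
  work=$(mktemp -d) || return 1
  # ImageMagick writes one file per page when the output name contains %d
  convert "$src" "$work/page-%03d.tif" || return 1
  for page in "$work"/page-*.tif; do
    # Each page gets its own hOCR file and its own single-page PDF
    tesseract "$page" "${page%.tif}" -l "$lang" hocr || return 1
    hocr2pdf -i "$page" -o "${page%.tif}.pdf" < "${page%.tif}.html" || return 1
  done
  # Zero-padded names keep the glob in page order (up to 999 pages)
  pdftk "$work"/page-*.pdf cat output "$out" || return 1
  rm -rf "$work"
}
```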
That's one. Now, how to integrate this into Alfresco? First, let's create a command line script that automatically transforms a TIFF into a PDF and does some housekeeping by cleaning up the mess…
#!/bin/bash
######################################
# convert2pdf.sh
######################################

# Number of parameters
PARAM=$#

# Test parameters
if [[ $PARAM -lt 2 ]]; then
  echo "Usage: $0 out.pdf input*.png"
  echo "out.pdf is the desired output file"
  echo "input*.png or input*.tif is a list of files to be converted"
  exit 1
fi

# Output file
OUTFILE=$1
shift

# PDF output?
ATEST=$(basename "$OUTFILE")
BTEST=$(basename "$ATEST" .pdf)
if [ "$ATEST" = "$BTEST" ]; then
  echo "File $OUTFILE is not a pdf-File."
  exit 2
fi

# Do not overwrite existing output files
#if [ -e $OUTFILE ]; then
#  echo "File $OUTFILE exists - not overwritten."
#  exit 3
#fi

LIST=

# Convert input files
for (( I=1; I < PARAM; I++ ))
do
  FILE=$1
  shift
  echo "Working on: $FILE"
  # Convert
  # /home/user/tesseract-ocr/api/tesseract $FILE $BTEST nobatch hocr
  tesseract "$FILE" "$BTEST" nobatch hocr
  /usr/bin/hocr2pdf -i "$FILE" -o "$OUTFILE" -s < "$BTEST.html"
  # Build the list of intermediate PDFs
  LIST="$LIST$BTEST.pdf "
done

# Create the combined PDF
#echo "Concatenating PDF as $OUTFILE"
#/usr/bin/pdftk $LIST output $OUTFILE

# Clean up
#echo "Cleaning up"
for FILE in $LIST
do
  F=$(basename "$FILE" .pdf)
  rm -f "$F.html"
done
echo "Finished."
chmod ug+x /opt/ocr/convert2pdf.sh
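The "PDF output?" test in the script may look odd at first: it strips a .pdf suffix with basename and compares the result with the unstripped name. If nothing changed, the name did not end in .pdf. Isolated in a function (the name `is_pdf_name` is made up for this illustration), the idea looks roughly like this:

```shell
#!/bin/bash
# Return success when the file name ends in .pdf, using the same basename trick
# as convert2pdf.sh. is_pdf_name is a hypothetical helper name.
is_pdf_name() {
  local name base
  name=$(basename -- "$1")          # drop any leading folders
  base=$(basename -- "$name" .pdf)  # drop a trailing .pdf, if there is one
  # The name ended in .pdf only if stripping the suffix changed it
  [ "$name" != "$base" ]
}
```

This is also why the output file must be the first argument: `/opt/ocr/convert2pdf.sh out.pdf scan.tif` passes the check, while swapping the two arguments makes the script exit with code 2.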
sudo apt-get install mysql-server mysql-client
sudo apt-get install sun-java6-jdk imagemagick swftools openoffice.org-headless
sudo ./alfresco-community-3.4.b-installer-linux-x32.bin
On Windows I had the problem that the installer did not create my database correctly. If that happens to you, create your database and user with the following statements (I love HeidiSQL as a tool):
create database alfresco default character set utf8 collate utf8_bin;
grant all on alfresco.* to 'alfresco'@'localhost' identified by 'alfresco' with grant option;
grant all on alfresco.* to 'alfresco'@'localhost.localdomain' identified by 'alfresco' with grant option;
If you cannot log in using admin/admin (I couldn’t, or I did not read the installer screens carefully enough), find yourself an MD4 hash generator, create a password of choice, and paste its hash into tomcat/shared/classes/alfresco-global.properties as the value of alfresco_user_store.adminpassword
http://gtools.org/tool/md4-hash-generator/
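The hash can also be generated locally instead of through a website. As far as I know Alfresco expects the MD4 hash of the password encoded as UTF-16LE (the same scheme as an NTLM hash), so the sketch below converts the text first and then hashes it. Both that encoding and the helper name `alfresco_md4` are assumptions on my side, and newer OpenSSL versions may have MD4 disabled by default.

```shell
#!/bin/bash
# Print the MD4 hash of a password after converting it to UTF-16LE.
# alfresco_md4 is a hypothetical helper; iconv and openssl must be installed.
alfresco_md4() {
  # printf avoids the trailing newline that echo would add to the password
  printf '%s' "$1" \
    | iconv -f UTF-8 -t UTF-16LE \
    | openssl dgst -md4 \
    | awk '{print $NF}'   # keep only the hex digest, drop the "(stdin)=" label
}
```

Run `alfresco_md4 'my new password'` and paste the printed value into alfresco-global.properties.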
http://localhost:8080/alfresco
Next, increase the amount of memory assigned to Java. Although Alfresco is happy with 512MB of memory in a Linux image, I noticed that you need at least 1GB if you play with this OCR script. In /opt/alfresco/alfresco.sh I changed this parameter to 1024MB: -Xmx1024m
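The same change can be scripted with sed. This is a sketch that assumes GNU sed (as shipped with Ubuntu) and assumes the start script already contains a -Xmx setting; `set_xmx` is a hypothetical helper name. Make a backup of the file first.

```shell
#!/bin/bash
# Replace every -Xmx<size> setting in a start script with a new size.
# set_xmx is a hypothetical helper; usage: set_xmx /path/to/script 1024m
set_xmx() {
  local file=$1 size=$2
  # -i edits the file in place, -E enables extended regular expressions
  sed -i -E "s/-Xmx[0-9]+[kKmMgG]?/-Xmx${size}/g" "$file"
}
```

For this setup that would be `sudo bash -c` around a call such as `set_xmx /opt/alfresco/alfresco.sh 1024m`, followed by a restart of Alfresco.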
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
<bean id="transformer.worker.img2ocrpdf" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>ls</value>
<value>/opt/ocr/convert2pdf.sh</value>
</list>
</entry>
</map>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>/opt/ocr/convert2pdf.sh</value>
<value>${target}</value>
<value>${source}</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2,3</value>
</property>
</bean>
</property>
<property name="explicitTransformations">
<list>
<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
<property name="sourceMimetype"><value>image/tiff</value></property>
<property name="targetMimetype"><value>application/pdf</value></property>
</bean>
</list>
</property>
</bean>
<bean id="transformer.img2ocrpdf" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.worker.img2ocrpdf" />
</property>
</bean>
</beans>
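The checkCommand in this file simply runs ls on the script. As I understand it, Alfresco only offers the transformer when that command succeeds, so it pays to run the same check by hand before restarting. The sketch below does that and also tests the execute bit, which ls alone would not catch; `check_transformer_script` is a hypothetical helper name.

```shell
#!/bin/bash
# Mimic the checkCommand (ls <script>) and also verify the script is executable.
# check_transformer_script is a hypothetical helper name.
check_transformer_script() {
  local script=$1
  # Same test as the checkCommand: does ls find the file?
  if ! ls "$script" > /dev/null 2>&1; then
    echo "missing: $script"
    return 1
  fi
  # Extra test: the transformCommand needs to be able to execute it
  if [ ! -x "$script" ]; then
    echo "not executable: $script"
    return 2
  fi
  echo "ok: $script"
}
```

Run it as the user that starts Alfresco, for example `check_transformer_script /opt/ocr/convert2pdf.sh`.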
http://192.168.30.128:8080/alfresco/service/mimetypes
With kudos to my colleague Kees van Bemmel for triggering this and for contributing bits and pieces
[update feb 01 2012: added name of the transformer bean: img2ocr-transform-context.xml]

Thank you for this very interesting article.
I followed it, but I can’t get tesseract working for hocr output -> read_varaibles_files : can’t open
do you have any clue about that ?
how did you get it work ?
I’m very impatient to see the following article about ephesoft
I hope your impatience is satisfied already
Can you share a bit more detail about your issue? What os are you on? What did you do?
Plenty satisfied with the following articles.
Sorry for the lack of information; I wasn’t sure this was the right place to post a technical issue.
But I guess it can be interesting for other people who want to follow your setup guidelines.
I’m on ubuntu (vmware) 10.10 server edition
I followed your steps until :
tesseract source.tif outputname hocr -l en
which is not working although
tesseract source.tif outputname -l en
is working
don’t bother too much in helping me, I’ll try to handle it myself
moreover it seems google is aware about this problem
I can’t say more than Arnaud !
Same interest…and same problem with the “hocr” option ;o)
I tried to create an hocr config file in /usr/local/share/tessdata/configs or /usr/local/share/tessdata/tessconfigs or /usr/share/tesseract/tessdata/tessconfigs/ or /usr/share/tesseract/tessdata/configs/ with the option “tessedit_create_hocr 1”…no luck.
I also have a problem with “/usr/bin/hocr2pdf -i $FILE -o $OUTFILE -s < $BTEST.html" :
$BTEST.html: no such file
I would be happy if you could tell me more about this.
Regards
I think hocr2pdf was designed specifically with cuneiform in mind. This is a snippet from their website:
hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
As for Tesseract: the default installation of 3.0 doesn't include the config file called hocr (on Windows, anyway). You'll need to go to your Tesseract/tessdata/configs/ folder, create a file named hocr (no extension), and put tessedit_create_hocr 1 on line 1 of its contents.
Hello, I’m trying to just generate the txt file but I always get a blank file. I’m using a simple thing like “tesseract $1 $2”. Could you point out what I’m doing wrong? I think I have to copy the contents from one file to another but I’m lost here. Thank you.
Sorry, this is located on an old VM. Since I am quite busy these days I have not had the time to get back to you…
/opt/alfresco/tomcat/shared/classes/alfresco/extension is a directory, not a file. In it there are multiple configurations one can use to override specific settings. So question, where do you put the “beans” xml and with what title.
Good point, this file name appeared to be missing indeed. Just added it (img2ocr-transform-context.xml) to the text. Thanks!
(Actually, any name will do, as long as it ends on -context.xml)
Thanks a lot for this post really useful, however for the installation in Ubuntu 10.10 I had to install automake libtool, and couple more things.
For alf community 3.4.d I had to change the transformer context file to make it work since it didn’t startup due to errors in the definitions of the beans. In the main bean declaration I had to add class=”org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker”> and in the properties bean
and
and in the very last bean –>> class=”org.alfresco.repo.content.transform.ProxyContentTransformer”
parent=”baseContentTransformer”>
Hope it helps someone…
Cheers
Hhmm, interesting. I am on the Enterprise versions usually. Haven’t seen this one, but thanks for adding the solution in the comments!!!
Cheers,
Tjarda
Hello, I’m testing your *.sh script, and it appears a bunch of errors when I execute it
basename: missing operand
Try `basename --help' for more information.
./convert2pdf.sh: line 18: [: =: unary operator expected
Working on: /home/bitnami/Desktop/test.tif
./convert2pdf.sh: line 36: .pdf.html: No such file or directory
./convert2pdf.sh: line 45: syntax error near unexpected token `do’
./convert2pdf.sh: line 45: `do’
Is there a solution?
Thanks for your attention