The main libraries and document archives are digitizing their collections. Most of them scan the documents and publish the resulting images without the corresponding transcriptions, which seriously limits how the documents can be exploited. The problem is even more serious for ancient documents, both printed (especially from the 15th to 17th centuries) and handwritten. Commercial OCR systems perform poorly on these kinds of documents. When a transcription is absolutely necessary, it is produced manually by experts, which is a very expensive and error-prone task. Appropriate tools, such as specialized OCR systems and handwriting recognition engines, would be very helpful to preserve, study and publish this cultural legacy.
We intend to improve existing printed and handwritten text recognition systems with engines that allow specialized recognition of ancient documents while enabling adaptation to the particularities of each book or period and learning new glyphs, new lexicons, new language models… Even with these expected improvements, the results of fully automatic recognition will not be perfect. Obtaining transcriptions of the required quality demands the intervention of human experts to review and correct the resulting text. It is therefore extremely useful to provide interactive tools to obtain and edit the transcription. The expert user may introduce text by writing it directly on the screen, at the point where the error is seen, with the help of a stylus-sensitive device. The on-line handwritten text and gesture input can feed an interactive loop between the user and the handwriting recognition system, and can be exploited in a multimodal, combined recognition system.
The HITITA project seeks to reduce the problem on two fronts: by developing unimodal and bimodal recognition engines that can be adapted to the peculiarities of these documents (both printed and handwritten texts), and by extending an existing tool with a multimodal interface that reduces the effort needed to correct transcription errors. The project involves the construction of a working system that extends one already available. To achieve these goals, we rely on close collaboration with historians and document archive institutions. Since the current best recognition systems for on-line handwritten characters and continuous off-line handwritten text have been developed by our group, we expect to maintain leadership in this field and to ease technology transfer actions.