UbiEst
© François PESSAUX 1999 - 2007
francois.pessaux@inria.fr , francois.pessaux@free.fr

Table of contents

Forewords and concepts
Licence
Usage
UbiEst query algebra
Caveat
Distribution content
Contact me



Forewords and concepts

UbiEst is intended to be an indexer, allowing structural and compound searches in a set of documents. What is an indexer? An indexer is a program that records the occurrences of words in documents and makes it possible to locate them (quickly, if possible).
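
As a rough illustration of the idea (not UbiEst's actual implementation, which is written in Objective Caml), a minimal indexer could be sketched in Python as follows. The function name build_index, the dictionary layout and the use of "\w+" to delimit words are all assumptions made for this example.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to the list of (document name, character offset) where it occurs.

    `documents` is a dict {name: text}; this layout is hypothetical and only
    serves the illustration.
    """
    index = defaultdict(list)
    for name, text in documents.items():
        for match in re.finditer(r"\w+", text):
            # Offsets are 0-based positions of the first character of the word.
            index[match.group().lower()].append((name, match.start()))
    return index

docs = {"a.txt": "man formats and displays", "b.txt": "the manual pages of man"}
index = build_index(docs)
print(index["man"])
```

Looking up a word is then a single dictionary access, which is what makes an index faster than re-reading every document for each request.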

Structural

By structural indexing, we mean that UbiEst maps not only words onto their locations, but also the structures of the indexed documents. Hence, it is possible to locate:
For this purpose, documents are mapped onto a generic structure, according to their format. For example, in an HTML document, <TITLE> is mapped onto "title" and <H1> onto a first-level section. In the same way, LaTeX documents map \title and \section respectively onto these same structures.

Of course, for some document types, such a structure is fuzzier, and sometimes does not even exist. For example, a plain text file will hardly map onto a real structure. In this case, structural information will simply be empty, and you will send requests with no structural specifiers.
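
The mapping described above can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the table names, the function names and the regular expressions are hypothetical, and only the <TITLE>, <H1>, \title and \section correspondences come from the text.

```python
import re

# Hypothetical mapping tables from format-specific markup to generic names.
HTML_TO_GENERIC = {"title": "title", "h1": "section"}
LATEX_TO_GENERIC = {"title": "title", "section": "section"}

def html_structures(text):
    """Return (generic structure, content) pairs for the HTML tags listed above."""
    found = []
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", text, re.IGNORECASE | re.DOTALL):
        generic = HTML_TO_GENERIC.get(m.group(1).lower())
        if generic is not None:
            found.append((generic, m.group(2)))
    return found

def latex_structures(text):
    """Return (generic structure, content) pairs for the LaTeX commands listed above."""
    found = []
    for m in re.finditer(r"\\(\w+)\{([^{}]*)\}", text):
        generic = LATEX_TO_GENERIC.get(m.group(1))
        if generic is not None:
            found.append((generic, m.group(2)))
    return found

html = html_structures("<TITLE>UbiEst</TITLE><H1>Usage</H1>")
latex = latex_structures(r"\title{UbiEst}\section{Usage}")
plain = html_structures("just some words")
```

Both formats end up with the same generic description, while plain text yields no structural information at all.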

Request specifiers related to structural annotations are summarised in the query algebra section.

Compound

By compound indexing, we mean that UbiEst does not only allow requests for occurrences of single words in the indexed documents: requests can also be combined by means of operators. By combining requests, you get not just simple occurrences of words, but extends (i.e. pieces of text) containing information that fits your request. From now on, results provided by UbiEst will be referred to as extends, not as words. Available operators are of three kinds:
  1. Boolean operators,
  2. Sequence operator,
  3. Inclusion operators.
With the first kind, you may ask UbiEst to report data containing some extends AND / OR some other extends.

The second kind of operator specifies an order on the extends. It makes it possible to search for extends followed by other extends. It is much like performing an OR, but the order is significant. For example, looking for "Foo" followed by "Bar" won't give the same result as looking for "Bar" followed by "Foo". As a consequence of the semantics of followed-by, its results are included in those that an or-request would provide.

Last are the containment operators, which take nesting information into account. They make it possible to look for extends containing / contained in / not containing / not contained in other extends.
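
The four containment operators can be read as simple tests on intervals. The Python sketch below is one plausible reading, not UbiEst's code: extends are represented as (start, end) pairs, the function names are hypothetical, and treating two equal extends as containing each other is an assumption.

```python
def contains(outer, inner):
    """True if extend `outer` covers extend `inner` (both are (start, end) pairs)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def contained_in(r1, r2):
    """Extends of r1 that lie inside some extend of r2."""
    return [a for a in r1 if any(contains(b, a) for b in r2)]

def containing(r1, r2):
    """Extends of r1 that cover some extend of r2."""
    return [a for a in r1 if any(contains(a, b) for b in r2)]

def not_contained_in(r1, r2):
    """Extends of r1 that lie inside no extend of r2."""
    return [a for a in r1 if not any(contains(b, a) for b in r2)]

def not_containing(r1, r2):
    """Extends of r1 that cover no extend of r2."""
    return [a for a in r1 if not any(contains(a, b) for b in r2)]

words = [(3, 7), (20, 24)]   # two occurrences of some word
titles = [(0, 10)]           # one title spanning characters 0 to 10
inside = contained_in(words, titles)
outside = not_contained_in(words, titles)
```

Here the first occurrence of the word lies inside the title and the second one does not.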

As seen previously, an extend is a kind of interval of data (mostly text for human reading :). UbiEst will always attempt to report the smallest extends matching your request. Of course, in a text like:

       man formats and displays the on-line manual pages.  If you specify sec-
       tion,  man  only looks in that section of the manual.  name is normally
       the name of the manual page, which is typically the name of a  command,
       function,  or  file.   However,  if  name contains a slash (/) then man
       interprets it as a file specification, so that you can do  man  ./foo.5
       or even man /cd/foo/bar.1.gz.

the most trivial extend matching "which" followed by "typically" would be the whole text! But this is not interesting, because it does not reduce the amount of data to read and does not accurately pinpoint the location of the expected result. Indeed, the most accurate result would be "which is typically" in the middle of the text. This behaviour sometimes leads to correct but complex results, especially when requests are complex. These results can be surprising mostly because, when reading a text quickly, humans often miss some occurrences of words, especially when the same words are repeated several times.
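
The "smallest extend" rule for followed-by can be sketched as follows in Python. This is an illustration of the principle only, under the assumption that extends are (start, end) character intervals sorted by position; the function names are hypothetical and the real engine may proceed differently.

```python
import re
from bisect import bisect_right

def word_spans(text, word):
    """(start, end) offsets (0-based, end inclusive) of each whole-word occurrence."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern, text)]

def followed_by(first, second):
    """Smallest extends starting with an extend of `first` and ending with a later extend of `second`."""
    starts = [s for s, _ in second]  # `second` is assumed sorted by start offset
    candidates = []
    for a_start, a_end in first:
        i = bisect_right(starts, a_end)  # first extend of `second` starting after a_end
        if i < len(second):
            candidates.append((a_start, second[i][1]))
    # Keep only candidates that do not contain another, smaller candidate.
    return [c for c in candidates
            if not any(o != c and c[0] <= o[0] and o[1] <= c[1] for o in candidates)]

text = "the name of the manual page, which is typically the name of a command"
hits = followed_by(word_spans(text, "which"), word_spans(text, "typically"))
found = [text[s:e + 1] for s, e in hits]

repeated = "foo foo bar"
small = followed_by(word_spans(repeated, "foo"), word_spans(repeated, "bar"))
```

In the second example, the extend starting at the first "foo" is discarded because it contains the smaller extend starting at the second "foo".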

Operators and their syntax are summarised in the query algebra section.

Stoplist

Indexing files means keeping track of words. Right! The more words there are, and the more occurrences of each word, the bigger the index database. Moreover, some words are not relevant because they appear too often and do not help to discriminate efficiently between documents. UbiEst lets the user address both points by accepting stoplist files. A stoplist is simply a list of words to ignore during indexing. A standard stoplist contains the most common words, such as (in English) "a, the, he, but, since, who"... or (in French) "un, le, la, les, il, elle, est, a, à"... Of course, the user is free to provide any list of words.

Stoplist files can be accumulated: it is possible to provide several files. The words of all the files will then be ignored.
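
The effect of accumulating stoplists can be sketched like this in Python. The file format (whitespace-separated words), the lowercase comparison and the function names are assumptions made for the example; they are not taken from the UbiEst sources.

```python
def load_stoplists(*sources):
    """Merge several stoplists into one set of lowercase words.

    Each source is an iterable of lines (an open file works); one or more
    whitespace-separated words per line is an assumed format.
    """
    stop = set()
    for lines in sources:
        for line in lines:
            stop.update(word.lower() for word in line.split())
    return stop

def indexable_words(words, stop):
    """Keep only the words that are not in the merged stoplist."""
    return [w for w in words if w.lower() not in stop]

english = ["a", "the", "he", "but"]   # stands in for an English stoplist file
french = ["un", "le", "la"]           # stands in for a French stoplist file
stop = load_stoplists(english, french)
kept = indexable_words("the man reads le manual".split(), stop)
```

Only the words absent from every stoplist would be recorded in the index.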

Licence

UbiEst is free software. Here is a copy of the LICENCE file from the distribution.

This software as a whole, binary parts of it or source code parts of it can be freely used in free or commercial products, as long as an explicit mention of this use is made (name of this software and name of its author) in both:
Thank you in advance for respecting this wish.

The kernel of this indexer is an extension and an implementation of the framework initially described by Charles L. A. Clarke in his PhD thesis. His work was done in collaboration with G. V. Cormack and F. J. Burkowski.

UbiEst was written in Objective Caml, using Glade to design the user interface and MlGlade to translate the GUI sources into Ocaml sources. Icons are either home-made, or built from parts of icons available in the standard Linux RedHat distribution (thanks to Gimp for rendering them).

The author.

Usage

UbiEst can be used in two ways: either through a set of shell commands, or via a graphical user interface.

Command line interface

This package contains 4 command-line executables:

createindex

Usage
This command creates an index database from a set of files. The builder walks through all the files, recording the information needed to perform requests later. Hence, this is a batch process, producing a database that can be used afterwards. Note that database creation is currently not incremental: you cannot extend an existing database. Indexed files are not stored inside the database; it records only the information needed to locate extends in the original files. Hence, if files are modified after indexing, the extends returned by a request may be wrong, since they refer to the state of the files at the time they were indexed. In the same way, if a file is deleted after an index database was created, requests can still be performed, but any attempt to access that file will fail because it no longer physically exists. Note that the options described below can appear several times.
Options
     createindex [-s | -r | -R | -g | -F | -f | -o <file>] <files>
-i ".*\.cm.?" -i ".*\.cmx.?" -i ".*\.\(o\|a\|bmp\|xpm\|xbm\|gif\|jpg\|jpeg\|rpm\|zip\|tar\|gz\|tgz\|lha\|lzh\|arj\|tiff\)" -i "s\..*"
will ignore compiled Ocaml files, standard image files, common archive files, common compressed files and rpm files.
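
To get a feeling for what these patterns select, here is a Python sketch that applies equivalent expressions to file names. Several points are assumptions: the patterns are translated by hand into Python's regular expression syntax (where groups and alternatives are written without backslashes), and each pattern is assumed to be matched against the whole file name. The function name is_ignored is hypothetical.

```python
import re

# Hand-translated equivalents of the patterns shown above (assumed semantics).
IGNORE_PATTERNS = [
    r".*\.cm.?",
    r".*\.cmx.?",
    r".*\.(o|a|bmp|xpm|xbm|gif|jpg|jpeg|rpm|zip|tar|gz|tgz|lha|lzh|arj|tiff)",
    r"s\..*",
]

def is_ignored(filename, patterns=IGNORE_PATTERNS):
    """True if the whole file name matches at least one ignore pattern."""
    return any(re.fullmatch(p, filename) for p in patterns)

names = ["index.ml", "index.cmo", "index.cmxa", "logo.gif", "dist.tgz", "notes.txt"]
kept = [n for n in names if not is_ignored(n)]
```

With this reading, only the source and text files remain candidates for indexing.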

dumpindex

Usage
Options

monkeytest

Usage
Options

queryindex

Usage
Options

Graphic user interface

To run UbiEst with the interface, just run the command:

The main window

[index-file] between brackets means that you can specify directly on the command line the name of the file containing the index database you want to use. If none is specified, you won't be able to send requests until you have loaded an index database file from the "Project" menu (see the Menus section). You will then get to the main window, which looks like:

Main window screenshot

Before addressing the menus, let's examine the different areas of the display.
At the top is the "queries entry zone". Here you may enter queries according to the syntax described in the query algebra section.
Then, clicking on the "Submit" button below will actually send the query to the engine.
The result will be displayed in the middle, in the "found occurrence(s) list". In this list, each line shows one occurrence of an extend matching your request among the files contained in your index. The list summarises the file location and the starting and ending positions of each occurrence.
Clicking on one of these lines first displays a minimal context in the "context display". The context is the extend matching your request. Selecting an occurrence line in the "found occurrence(s) list" also enables the toolbar buttons.
These buttons are used for the following actions:
Last, the status bar in the bottom part of the window will keep you informed about the current state of the program with various messages.

Menus

Let's now explore the 2 menus located in the upper part of the main window.
The first menu, "Project" (left side), contains 4 entries:
The second and last menu, "?" (right side), contains 2 entries:

Index database creation window

Index creation window screenshot

Preferences window

UbiEst query algebra

Basic request

A basic request, written R in the following, consists of a simple sequence of letters to search for (i.e. a word). For instance:

Logical combinations

Meaning                      Syntax
R1 and R2                    R1 & R2
R1 or R2                     R1 | R2
R1 followed by R2            R1 <> R2
R1 contained in R2           R1 < R2
R1 containing R2             R1 > R2
R1 not contained in R2       R1 ~< R2
R1 not containing R2         R1 ~> R2

Grouping and associating


Structural annotations

Meaning                                        Syntax
R in a comment                                 @comment (R)
R in a title                                   @title (R)
R in a first level section                     @section (R)
R in a second level section                    @ssection (R)
R in a third level section                     @sssection (R)
R in a table                                   @table (R)
R emphasized (bold, italic, underline, ...)    @relief (R)
R in an enumeration                            @enum (R)
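
To make the concrete syntax easier to picture, here is a small Python sketch that splits a request into the symbols listed in the tables above. It is not UbiEst's lexer (the distribution uses its own lexer and parser written in OCaml); the function name tokenize and the choice of "\w+" for words are assumptions, and operator precedence is deliberately left out.

```python
import re

# Two-character operators must be tried before the one-character ones.
TOKEN = re.compile(r"~<|~>|<>|[&|<>()]|@[a-z]+|\w+")

def tokenize(query):
    """Split a request into words, operators, annotations and parentheses."""
    tokens = []
    pos = 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(query, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group())
        pos = match.end()
    return tokens

simple = tokenize("foo & bar | baz")
nested = tokenize("@title (foo <> bar) ~< baz")
```

The second request reads: "foo" followed by "bar" within a title, not contained in an extend matching "baz".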

Caveat

MsDos files and newline characters

Some files (MsDos files for example) encode newlines with 2 characters (CR + LF). In this case, Emacs counts these 2 characters as only 1. Hence, when consulting the results in Emacs by jumping to character offsets, it may look as if something is wrong. In fact, this is only due to the way Emacs counts these newlines.

Emacs and characters numbers

Emacs numbers the characters in a file from 1 to file-length, and not from 0 to file-length - 1. Conversely, UbiEst reports results (i.e. extends) with characters numbered from 0 to file-length - 1. So take care of this shift.
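
Both caveats amount to a small offset correction, which can be sketched in Python as follows. This is only an illustration: the function name emacs_position is hypothetical, and it assumes that UbiEst offsets count bytes of the raw file and that Emacs displays each CR + LF pair as a single character.

```python
def emacs_position(data, offset):
    """Convert a 0-based UbiEst offset into a 1-based Emacs position.

    `data` is the raw file content as bytes.
    """
    # Count the CR + LF pairs that end at or before `offset`:
    # each of them takes one character less in Emacs.
    crlf_before = data.count(b"\r\n", 0, offset + 1)
    return offset + 1 - crlf_before

dos_file = b"ab\r\ncd"
unix_file = b"ab\ncd"
dos_pos = emacs_position(dos_file, 4)    # offset of "c" in the MsDos file
unix_pos = emacs_position(unix_file, 3)  # offset of "c" in the Unix file
```

In both files the letter "c" ends up at the same Emacs position, although its UbiEst offset differs by one.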

Distribution content

In the current state, the distribution is organised as follows:
ubiest
|-- Common                    : Contains common source files
|   |-- Makefile
|   |-- filemap.ml
|   |-- filemap.mli
|   |-- gclist.ml
|   |-- gclist.mli
|   |-- index.ml
|   |-- index.mli
|   |-- infinity.ml
|   |-- infinity.mli
|   |-- markup.ml
|   |-- markup.mli
|   |-- request.ml
|   |-- request.mli
|   |-- stoplist.ml
|   |-- stoplist.mli
|   |-- token.ml
|   `-- token.mli
|-- Create                    : Contains source files for building stuff
|   |-- Makefile
|   |-- create.ml
|   |-- create.mli
|   |-- fformat.ml
|   |-- fformat.mli
|   |-- fileformat.mll
|   |-- globalStoplist.ml
|   |-- globalStoplist.mli
|   |-- lexcfile.mll
|   |-- lexhtml.mll
|   |-- lexlatex.mll
|   |-- lexplaintext.mll
|   `-- main.ml
|-- Dump                      : Contains source files for dump stuff
|   |-- Makefile
|   `-- dump.ml
|-- Gui                       : Contains source files for the GUI-based version
|   |-- Makefile
|   |-- Makefile.am
|   |-- convenient.ml
|   |-- convenient.mli
|   |-- gui_glade_main.ml
|   |-- gui_mine.ml
|   |-- guiglobal.ml
|   |-- guiglobal.mli
|   |-- indexing.glade
|   |-- main.ml
|   |-- pixmaps              : Contains icons for GUI
|   |   |-- error.xpm
|   |   |-- increase_quote.xpm
|   |   |-- info.xpm
|   |   |-- info_file.xpm
|   |   |-- info_okay.xpm
|   |   |-- panel-folder.xpm
|   |   |-- save_context.xpm
|   |   |-- send_to.xpm
|   |   |-- stop.xpm
|   |   |-- ubiest_text.xpm
|   |   `-- warning.xpm
|   |-- post_mlgade.sed
|   |-- preferences.ml
|   `-- preferences.mli
|-- HOWTORUN
|-- INSTALL                  : Concise "how to build and install"
|-- LICENCE                  : Licence information
|-- Makefile                 : Root Makefile
|-- Monkeytest
|   |-- Makefile
|   `-- main.ml
|-- Query                     : Contains source files for queries stuff
|   |-- Makefile
|   |-- lexrequest.mll
|   |-- main.ml
|   |-- parserequest.mli
|   `-- parserequest.mly
|-- TODO
|-- bin                      : Binaries destination after built
|-- doc                      : Documentation
|   `-- UbiEst_Documentation.html
|-- en-stoplist              : Sample stoplist file for English data
`-- fr-stoplist              : Sample stoplist file for French data


Contact me

Simply mail me at: francois_pessaux***HAT***yahoo.fr (replace "***HAT***" by @).