
 

LEXICAL ANALYSIS OF TEXTS

C. Poli and G. Carboni, June 1998
Translation: Jennifer Spears

 




PRESENTATION

In this article we deal with a topic different from those customary to Fun Science Gallery: the analysis of texts. I must confess that linguistics is not a field I move in easily. You may be asking yourself why I have ventured into it. In fact, there is an explanation for this imprudent trespassing. Apart from a natural interest in language, at least to the level of curiosity we all more or less share, what pushed me in this direction was a professional reason: for my job I have had to do a great deal of bibliographical research.

The main problem with querying databases is adjusting the queries so that you obtain only the documents you want. What usually happens instead is that you are buried in an avalanche of articles which have little or no connection with what you are looking for. If your query is not completely off the mark, the articles you want may be among them, but locating them manually is practically impossible. Finding the keywords needed to retrieve only the documents that interest you from the millions stored in databases is therefore anything but easy! To succeed in these searches, you must prepare an effective query. Internet searches are not easy either, and with the rapid growth of cyberspace this problem is likely to get worse.

From these bibliographical searches came the need for programs to help me find more adequate keywords. Normally, when doing a bibliographical search, you do not start from scratch; you already have some articles, the fruit of a manual search in a library or obtained from colleagues. These articles can be the starting point for finding keywords. To do this with the aid of programs, you need to acquire the texts in electronic format, for example by means of a scanner. How do you find keywords with computer programs? Generally, a keyword is a term that has a high frequency in the text under examination and a low one in everyday language. Given the frequencies of the terms of everyday language, it is easy for a computer to extract the specialized terms of a text. Another method for isolating keywords is to find repeated expressions. In fact, many specialized terms are formed by two or more words. Isolating these locutions and pointing out the most frequent ones can supply important keywords for your search. Through these automatic methods, terms emerge which you had not thought of and which can contribute much to the success of your query.
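As a concrete illustration of this idea, here is a minimal sketch in Python (our own programs, described below, work inside Access rather than Python); the sample text and the reference frequencies of everyday language are invented for the example. Terms whose frequency in the text is much higher than in everyday language surface as candidate keywords.

import re
from collections import Counter

def frequencies(text):
    words = re.findall(r"[a-z]+", text.lower())    # keep alphabetic words only
    total = len(words)
    return {term: n / total for term, n in Counter(words).items()}

# Reference frequencies of everyday language (invented values; in practice
# they would come from a large general corpus).
everyday_freq = {"the": 0.060, "of": 0.033, "and": 0.028, "on": 0.010,
                 "rain": 0.0002, "acid": 0.0001}

text = ("Acid rain damages forests and lakes. The effects of acid rain "
        "on forests have been studied for decades.")
text_freq = frequencies(text)

# Terms far more frequent in the text than in everyday language are
# candidate keywords; unknown terms get a small default frequency.
ratio = {t: f / everyday_freq.get(t, 1e-5) for t, f in text_freq.items()}
for term, r in sorted(ratio.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(term, round(r))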

With these considerations in mind, I spoke with a colleague who is a computer science expert and proposed that he create the necessary programs. We worked together for about two months and now we have simple but powerful programs. We created them in the ACCESS environment, taking advantage of its database-management routines (electronic archives) and of its graphical window management. Access is the database module of Microsoft Office. It is probably also possible to write programs of this type with other software.

I have spoken about these programs to a high school teacher in Bologna. This teacher found the work interesting and agreed to use it for an analysis of literary texts to be presented during the VIII Week of Scientific and Technological Culture (March 23-29, 1998) promoted by the Italian Ministry of University and Scientific and Technological Research (MURST). This engagement spurred us to improve the programs and to adapt them to more general use, particularly in the linguistic field. The interaction with students and teachers has been useful in testing the programs and refining them. The result is a set of programs which can be used, in addition to the bibliographical uses, in a variety of text analyses and comparisons and in linguistic research.

PAY ATTENTION: you can download the program needed for these analyses by clicking here: Analysis.zip.

But what do these programs consist of and what do they do exactly? Here is a short description of each:

Normalize inserts a text file into a table
Frequency determines the recurrence and frequency of a text's terms
A - B extracts the terms of text A not present in text B
A x B extracts the terms common to both texts A and B
A + B adds the terms of texts A and B
A => Thesaurus adds the terms of text A to the thesaurus you have selected
A <= Restore removes the terms of text A from the selected thesaurus
Text - GrammEn removes grammatical terms from a text
Locutions joins the words of a text into sequences
Readability determines the readability of a text
Sentences produces a table of sentences
Paragraphs produces a table of paragraphs
Punctuation produces a table of punctuation marks

You may want to know what can be done with these programs before moving on to the handbook. As the high school example above shows, these programs can be used in various ways, even ways we had not considered, from bibliographic research to the analysis and comparison of literary texts.

I hope our work can provide useful information to people, schools or lovers of linguistic studies.
(g.c.)

REQUIREMENTS AND LIMITATIONS

REQUIREMENTS
These programs require the MS ACCESS database management system of MS Office (version 97 or later).

LANGUAGE LIMITATIONS
With these programs you can treat texts written in Roman characters, more precisely with the 256 characters of the extended ASCII set. You can therefore examine texts in English, German, French, Spanish, Italian, etc. However, our tests have been limited to texts in English and Italian. If you want to use these programs to analyze texts in Italian, please use the database of the Italian version of this article. If you want to use them for other languages, you will have to make a new database for each of them. To do this, copy our database and adapt it to the new language. The code of the programs is accessible and you can modify it.

PRELIMINARY KNOWLEDGE
For more effective use of these programs, it is necessary to be quite familiar with Access:

window management: opening, closing, moving, and resizing windows
table management: copying, moving, and deleting tables (similar to file management); creating tables; sorting data; entering data; deleting rows and columns; modifying the table structure
other: executing a query and transforming it into a table; obtaining diagrams.

PREPARATION OF THE COMPUTER

Install the program: Access (Office 97)  
Download the file: Analysis.zip  
Create the directories: C:\Lexicon\  
  C:\Lexicon\Handbook\ (to keep this page)
  C:\Lexicon\Texts\ (for the texts to be analyzed)
  C:\Lexicon\Reports\ (for your reports)
Unzip the file: Analysis.zip  
Insert Analysis.mdb in: C:\Lexicon\  
Download this page into: C:\Lexicon\Handbook\ (to keep a copy of our handbook on your computer; Internet Explorer will put the diagrams in a subdirectory)
Insert the original texts (.zip .doc .html .txt) in: C:\Lexicon\Texts\

NAMES OF THE FILES

Name.zip compressed text
Name.doc text in Word format
Name.html text in hypertext (HTML) format
Name.txt text in "text" format, with punctuation marks
Analysis.mdb database. Contains all tables and programs

NAMES OF THE TABLES

Name table of all the words
Name_freq table of the frequencies
NameA-NameB_freq table of the terms present in A but not in B (frequencies)
NameAxNameB_rats table of the terms common to A and B (frequency ratios)
NameA+NameB_freq table sum of the terms of A and B (frequencies)
GrammEn table of the grammatical terms, auxiliary verbs, etc.
Name-G table with the terms present in GrammEn subtracted
Name_n table with the words taken in sequences of n words
Th_Name_freq thesaurus of terms and frequencies
Th_Name_list list of the texts included in the thesaurus
Statistics table produced by Readability. Holds the name of the documents examined and the data obtained
ZZ_... system tables (model of tables, lists, etc.)

TYPES OF TABLES

We can distinguish tables as:

simple tables
tables of frequencies
other tables

SIMPLE TABLES

Name
Name-G
Name_n
Name_sent
Name_para
Name_punc
GrammEn

By simple tables we mean those that have the fields ID, Words, Field1.
Simple tables are those produced by the programs Normalize, Text-GrammEn, Locutions, Sentences, Paragraphs, and Punctuation. These tables are suited to be used with the program Frequencies. The tables produced by Normalize are also suited to be used with Text-GrammEn.

TABLES OF FREQUENCIES

Name_freq
Name_rats
Th_Name_freq

Tables of frequencies may have several fields (or columns), but among them must be the following: Terms, Recurrence, Frequencies. The tables produced by the programs Frequencies, A-B, AxB, A+B, and A=>Thesaurus are tables of frequencies. They are suited to be used by the programs A-B, AxB, A+B, and A=>Thesaurus.

OTHER TABLES
The tables Statistics and Th_Name_list contain summary data. The tables of type ZZ_... belong to the system. Unlike the simple tables and the frequency tables, these tables are not meant to be used with the programs described above.

WARNINGS
Never erase the tables GrammEn and Statistics, or any of the tables of type ZZ_... You may modify the size of the columns or rows of all the tables. You can delete or add records to all the tables except those of type ZZ_...

NAMES OF THE PROGRAMS

Normalize normalizes a file of text and inserts it in a table
Frequencies calculates the recurrence and the frequency of the terms of a table
A - B obtains the terms of table A which are not present in table B
A x B obtains the terms which are common to tables A and B and calculates the ratio of the frequencies
A + B adds the terms of tables A and B and recalculates their frequencies
A => Thesaurus adds the terms of table A to the thesaurus you have chosen, recalculates the frequencies, and adds the name of the table to the list
A <= Restore removes the terms of table A from the thesaurus you have indicated and corrects the list (to use in case of errors)
Text - GrammEn removes the terms present in GrammEn from a table of normalized text
Locutions associates the words of a text in sequences
Readability determines the readability of a text
Sentences produces a table of sentences
Paragraphs produces a table of paragraphs
Punctuation produces a table of punctuation marks

TERMINOLOGY

Words words of a text, including repetitions
Terms distinct words of a text, each counted only once
Tw total of the words
Tt total of the terms
Rec recurrence: number of times a term is present in a text
Freq frequency: ratio between the recurrence and the total of the words. Freq = Rec/Tw
Rats Ratio of the frequency that the same term has in two different texts. Rats = Freq(A)/Freq(B)
Thesaurus list of terms. In our case, the thesaurus also holds the frequencies of the terms
Mask dialog window of a program
Texts/Tables since texts are inserted into tables, performing operations on the tables means performing them on the texts as well

SEARCHING TEXTS

Get some texts in electronic format. You can get them through the following methods:
- searching the web
- using a scanner + OCR
- typing them on the computer

SITES WHERE YOU CAN DOWNLOAD TEXTS OR ARTICLES
ATHENA Texts in electronic format in many languages (click on "Books") (http://un2sg4.unige.ch/athena/html/athome.html) (Univ. of Geneva)
Project Gutenberg Texts in electronic format (click on "Etext Listings")
(http://www.promo.net/pg/)
OUSIA Many things for the Bibliophile (http://www.freshnet.de/user/will/ousia/Bibliophile.html)
NlightN service Up to 300 DB of books, magazines, newspapers (http://www.nln.com/)
Books and newspapers Books and newspapers (gopher://gopher.tc.umn.edu:70/11/Libraries/)
News and Media Newspapers and magazines (http://www.yahoo.com/News/)
News and Media, Indices Indexes of newspapers and magazines
(http://www.yahoo.com/News_and_Media/Indices/)
Yahoo Electronic Literature, Indices Yahoo's indexes of books in electronic format
(http://www.yahoo.com/Arts/Humanities/Literature/Electronic_Literature/Indices/)

In order to learn how to use these programs, use texts such as the following, which we employed in our own tests:

W. S. Maugham The Moon and Sixpence 1919
H. G. Wells The Time Machine 1895
J. K. Jerome Three Men in a Boat 1889
M. Twain Tom Sawyer 1876
L. Carroll Alice in Wonderland 1865
C. Darwin The Origin of Species 1859
C. Brontë Jane Eyre 1847
W. Scott Ivanhoe 1819
W. Shakespeare Romeo and Juliet 1594

Save every document in text format: Name.txt
On the web you will find documents in HTML format. HTML texts contain many commands or tags (e.g. <b> and </b>). To remove them, open the HTML document with a browser and save it in text format.

STARTING THE DATABASE

A database is a set of tables, queries, programs, etc. The programs you need are already in the database: Analysis.mdb. To launch the DB follow these steps:

From File Manager
go to: C:\Lexicon\
Click on: Analysis.mdb

NORMALIZATION

The purpose of normalization is to insert the words of a text into a table so that it can be examined by the programs we have designed. To do this, use the program Normalize, which removes all non-alphabetic characters (punctuation marks, mathematical symbols, etc.) and transforms uppercase letters into lowercase ones. As said before, at the end the program creates a table of all the words of the text. In this table the words keep their reading order (figure 2). A document to be used with Normalize has to be in text format.
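For readers who prefer other tools, the normalization step amounts to roughly the following minimal Python sketch (the real Normalize program works inside Access and writes the result into a table; the sample sentence is invented):

import re

def normalize(text):
    # Keep only alphabetic sequences (extend the character class if you need
    # accented letters), convert them to lowercase, preserve the reading order.
    words = re.findall(r"[A-Za-z]+", text)
    return [(i + 1, w.lower()) for i, w in enumerate(words)]   # (ID, word)

sample = "It was the BEST of times; it was the worst of times."
for row in normalize(sample):
    print(row)    # (1, 'it'), (2, 'was'), (3, 'the'), ...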

With Access, open Analysis.mdb
- go to the folder Masks
- launch the program Normalize
- choose the text file to be normalized
- click on the Start button
(The program will make a table which contains all the words of the text keeping the original order. The name of this table will not have a file extension.)

- Close the program Normalize
- open the table to check it
- sort it in alphabetical order
- remove any empty records
- resort the table in the reading order (col. ID)
- save and close the table
 

CALCULATION OF RECURRENCE AND FREQUENCIES

When applied to a table of normalized text, the Frequencies program determines how many times each term occurs in the text and calculates its frequency (figures 3 and 4). A small sketch of the calculation is given after the steps below.

Launch the program Frequencies
Choose the table to be used
Click on the Start button
(The program generates a table with the extension _freq)
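Here is the small sketch announced above, in Python; the short word list stands in for a normalized table, while the real Frequencies program works on Access tables.

from collections import Counter

# Output of the normalization step, in reading order (invented example).
words = ["it", "was", "the", "best", "of", "times", "it", "was", "the",
         "worst", "of", "times"]

tw = len(words)                      # Tw: total of the words
recurrence = Counter(words)          # Rec: recurrence of every term
for term, rec in recurrence.most_common():
    print(f"{term:8s} Rec={rec:2d} Freq={rec / tw:.4f}")   # Freq = Rec / Tw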

 

DETERMINATIONS: Determine the recurrence and the frequency of terms in some of the following documents:

- a composition of your own
- a literary text
- a scientific or technical text

EVALUATION OF DATA OBTAINED: Putting the table in alphabetical order, you can observe which terms were used and their related forms. You can verify if they were used as synonyms.

Putting the table in order of frequency, you can see which terms are used the most and those which are used only once. Take note of the most frequent terms and make observations of them. Do the same with the least used terms. In examining the frequency of terms, it is useful to distinguish between function words (articles, prepositions, pronouns, etc.) and content words (nouns, adjectives, verbs).

If you are interested, you can determine the lexical wealth (LW), i.e. the ratio between the number of terms (Tt) and the number of words (Tw):

LW = Tt/Tw

The value of this ratio goes from zero to 1. Keep in mind that LW tends to diminish as the size of a text increases, so you must compare texts of similar size, for example 2000 words. If a text is longer, keep only 2000 words of it for this determination.
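In Python the determination could be sketched as follows (name.txt stands for a hypothetical text file; with Access you could instead use the number of records of the Name and Name_freq tables as Tw and Tt):

# Read the words of a hypothetical text file and keep only the first 2000,
# so that texts of different sizes remain comparable.
words = open("name.txt", encoding="latin-1").read().lower().split()
sample = words[:2000]

tw = len(sample)          # Tw: total of the words
tt = len(set(sample))     # Tt: number of distinct terms
print("LW =", tt / tw)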

LOGICAL OPERATIONS

By means of logical operations computed between two texts (figure 5), we can obtain:

- the terms present only in text A A - B
- the terms present only in text B B - A
- the terms common to A and B A x B
- the sum of the terms present in A and B A + B

TERMS PRESENT IN ONLY ONE OF THE DOCUMENTS (A - B)

Examining the terms present in only one of the documents and not in the other yields interesting results.
If you are comparing two documents, one contemporary and the other ancient, the modern or the ancient terms can be pointed out.
If you are comparing two documents, one literary and the other scientific or technical, the specialized technical terms can be pointed out. However, even a literary text can possess its own peculiar terms (figure 6).

 

Launch the program A-B
Choose table A (must be a table of frequencies)
Choose table B (must be a table of frequencies)
Hit the Start button
(the program creates a new table with the name: NameA-NameB_freq)

The table created by the program A-B contains the terms present only in A. The values of the recurrences and frequencies are the same as in table A.
This program can also be used to subtract the terms of a thesaurus from a table of frequencies, for example when you want to isolate the ancient terms or the specialized terms of a document (figures 7 and 8).
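The operation is essentially a set difference; the following minimal Python sketch, with invented values, shows the idea:

# Frequency tables as dictionaries: term -> (Recurrence, Frequency).
freq_a = {"whale": (42, 0.0021), "ship": (30, 0.0015), "the": (900, 0.0450)}
freq_b = {"the": (800, 0.0500), "ship": (12, 0.0008)}

# A - B: terms of A that are absent from B, with A's values unchanged.
a_minus_b = {term: vals for term, vals in freq_a.items() if term not in freq_b}
print(a_minus_b)    # {'whale': (42, 0.0021)}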

DETERMINATIONS:
Compare two literary documents of which one is ancient and the other modern (figure 7).
Compare two documents of which one is scientific or technical and the other is literature (figure 8).

EVALUATION OF THE DATA:
Observe which terms are present only in the ancient document or in the modern.
Observe which terms are present only in the scientific document or in the literary.

Distinguish between two types of terms:
   - grammatical (these give indications on style)
   - content (these give indications on content)

Later on, when we deal with thesauruses, we will describe a more effective method for extracting specific terms of a text.

TERMS COMMON TO BOTH DOCUMENTS (A x B)


The terms common to both documents give indications on the common characteristics of the two authors. If you are comparing two writers belonging to different eras, the ancient terms which have been preserved will be highlighted. In each document these terms have different frequencies. The ratio between the frequencies of each term indicates the tendency to use a term more in past times or, on the contrary, in recent times, and therefore the tendency to abandon or retain the term.

This test can complete the one made on the terms present in only one of the two documents following the scheme:

A     AxB     B

Meaning the evaluation of terms present only in A, then those common to both, then those present only in B.

To determine the terms common to both documents, use the program AxB, which will also calculate the ratio of the frequencies that the same term has in the two documents.

Launch the program AxB
Choose table A (must be a table of frequencies)
Choose table B (must be a table of frequencies)
Hit the Start button
(the program will create a new table with the name: NameAxNameB_rats
This is also a table of frequencies)

The table created by the program AxB contains the terms present in both A and B, but not those present only in A or only in B. This is the result of the Boolean operation of intersection, also indicated by the logical operator AND.

This program also determines the ratio of frequency with which the common terms are used in the two documents under examination:

Ratio (A/B) = freq (A) / freq (B)

The ratio of frequencies is very useful in highlighting the common terms used more in A than in B, or vice versa. It also lets you observe the common terms used with equal frequency (ratio = 1). To facilitate these observations, sort the column of frequency ratios, then observe the terms with the highest ratio, those with the lowest ratio, and those with a ratio close to 1.
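In other words, the operation is an intersection followed by a division of frequencies; here is a minimal Python sketch with invented values:

# Frequencies of the terms in the two tables (invented values).
freq_a = {"time": 0.0040, "machine": 0.0025, "the": 0.0600}
freq_b = {"time": 0.0010, "love": 0.0030, "the": 0.0580}

# A x B: terms present in both tables, with Ratio(A/B) = freq(A) / freq(B).
common = {t: freq_a[t] / freq_b[t] for t in freq_a if t in freq_b}
for term, ratio in sorted(common.items(), key=lambda x: x[1], reverse=True):
    print(f"{term:10s} ratio = {ratio:.2f}")   # high ratio: used more in A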

Examining two texts, a scientific or technical one (A) and a literary one (B), the most frequent terms of A will primarily be specialized terms while those of B will be literary.

Examining two texts, a modern one (A) and an ancient one (B), the most frequent terms of A will be terms used more in recent times while those of B will be terms used more in antiquity. In the intermediate position will be terms whose frequencies are equal and whose use has remained constant throughout time (figure 9). Furthermore, these terms give some indication on the analogy of style and content.

DETERMINATIONS:
Compare two literary documents of which one is modern and the other ancient. Compare two documents of which one is literary and the other scientific.

EVALUATION OF THE DATA:
Observe which of the common terms are most frequent or least frequent in both documents. The differences of frequency give indications of the use of different terms.

Examine:
   - the most frequent terms in A (most typical of A)
   - the most frequent terms in B (most typical of B)
   - the terms with equal frequency (ratio of about 1)

Distinguish between two types of terms:
   - grammatical (these give indications on style)
   - content (these give indications on content)

NB: the examination of the recurrence of each term can lead to erroneous conclusions if the two documents have different lengths. For this reason, you should compare the frequencies, expressed by their ratio.

Besides being arranged by frequency ratio, the terms common to two documents can also be arranged by frequency. Normally, the most frequent ones are of the grammatical type; the less frequent ones are terms of ordinary use. As the frequency gradually diminishes, the common terms tend to acquire a content function.

THE SUM OF TWO TABLES (A + B)

Given two tables A and B of terms, recurrences, and frequencies, it is possible to add them into a single table C. To do so, use the program A+B, which recalculates the values of the recurrences and frequencies as if the terms came from one document (a small sketch is given after the steps below).

Launch the program A+B
Choose table A (must be a table of frequencies)
Choose table B (must be a table of frequencies)
Hit the Start button
(the program will create a new table with the name: NameA+NameB_freq)
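Here is the small sketch announced above, in Python and with invented values: the recalculation works on the recurrences, not on the frequencies.

# Recurrences and total word counts of the two tables (invented values).
rec_a = {"the": 900, "whale": 42}
tw_a = 20000
rec_b = {"the": 800, "time": 55}
tw_b = 16000

# A + B: add the recurrences and recompute every frequency over the combined
# word count, as if the terms came from a single document.
tw = tw_a + tw_b
rec_sum = {t: rec_a.get(t, 0) + rec_b.get(t, 0) for t in set(rec_a) | set(rec_b)}
for term, rec in sorted(rec_sum.items()):
    print(f"{term:8s} Rec={rec:4d} Freq={rec / tw:.5f}")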

If you add more tables in succession, the name of the resulting table becomes ever longer; you can rename the table. This program can be used to create thesauruses, but for that there is a more suitable program.

CREATING THESAURUSES

Thesauruses are simple lists of terms without definitions. In our case, the thesauruses also carry the frequencies of the terms, so they are tables of frequencies.

A => THESAURUS

For building thesauruses, use the program A=>Thesaurus, which is more suitable than A+B. This program adds the terms of table A to the indicated thesaurus (e.g. Th_Literature_freq) and keeps up to date a table (e.g. Th_Literature_list) containing the names of the texts added to the thesaurus (figure 10). The recurrences of the source table are added to those of the thesaurus and, at the end, the frequencies of all the terms are recalculated.
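Conceptually the program behaves like the following Python sketch (names and numbers are invented; the real program operates on Access tables):

# A thesaurus kept as recurrences plus a total word count, with a companion
# list of the texts already added (names and values are invented).
thesaurus_rec = {"the": 15000, "love": 300}
thesaurus_tw = 300000
thesaurus_list = ["JaneEyre", "Ivanhoe"]

def add_to_thesaurus(name, rec_a, tw_a):
    """Add the recurrences of text A, update the word count and the list,
    then recompute the frequency of every term."""
    global thesaurus_tw
    for term, rec in rec_a.items():
        thesaurus_rec[term] = thesaurus_rec.get(term, 0) + rec
    thesaurus_tw += tw_a
    thesaurus_list.append(name)
    return {t: r / thesaurus_tw for t, r in thesaurus_rec.items()}

new_freq = add_to_thesaurus("TimeMachine", {"the": 900, "time": 55}, 20000)
print(thesaurus_list, round(new_freq["time"], 6))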

Here are some examples of thesauruses you can create:

- Thesaurus of grammatical terms
- Thesaurus of common terms (terms of common use, not grammatical)
- Thesaurus of cultivated terms (terms not of common use but of general use: e.g. abstract and philosophical terms)
- Thesauruses of specialized terms (e.g. a thesaurus of biology)
- Thesauruses of jargon (e.g. bureaucracy, politics)
- Thesauruses of literary terms (e.g. limited to the 19th century, the 20th century, or contemporary works)
- General thesaurus of the English language (obtained by adding several texts of every type)
- Thesauruses of other languages

A <= RESTORE

If, by mistake, a text has been added to a thesaurus, this program allows you to remove it, restoring the preceding state.

THESAURUSES OF THE ENGLISH LANGUAGE

To single out specialized terms and keywords, you need a reference thesaurus at your disposal, for example a thesaurus of literary terms (figures 11 and 12). To create this thesaurus you should gather only literary texts; you will thus have at hand the terms of common and traditional use. Comparing a technical text to a thesaurus compiled in this way (with the program A-B) points out the non-literary terms contained in the document: in general, specialized terms (figures 7 and 8). Operating in this manner has the disadvantage that the specialized terms which for some reason are also present in the thesaurus will not show up in the result. We will soon see how this problem can be avoided.

Another way to single out specialized terms and perform other determinations such as the readability index of a text requires comparison with a thesaurus that does not include just literary terms but also all others. We can define this as the thesaurus of frequencies of the terms of the written English language.

Such a thesaurus would be a table obtained by adding all documents written in English up to the present day and calculating the frequencies of all the terms. For evident practical reasons, we cannot obtain such a table. Nevertheless, with reasonable effort, by adding a certain number of texts it is possible to get a table of the frequencies of a large number of terms. The frequencies of the terms of this table do not depart significantly from the ideal ones, so the thesaurus obtained in this way is still useful for several practical purposes, and we can successfully use it to single out the specialized terms of a text. We will call this real thesaurus the general thesaurus, to indicate that it can contain all terms without distinction.

During the building of this thesaurus, avoid inserting an excessive number of texts which deal with the same topic. If, for example, mostly architectural texts are added, the terms of that discipline will be represented with a higher frequency than in general use. On the contrary, if heterogeneous texts are added, the specialized terms will assume a low frequency, closer to the "true" value, while the terms of common use will keep a high frequency.

SINGLING OUT SPECIALIZED TERMS BY MEANS OF THESAURUSES

METHOD 1
With the program A-B, subtract the thesaurus of literary terms (B) from the text under examination (A)
Sort the column of frequencies of the resulting table

METHOD 2 (advised)
With the program A=>Thesaurus, add the text to the General Thesaurus
With the program AxB, calculate the frequency ratio between the terms of the text and those of the General Thesaurus
Sort in decreasing order the column of the frequency ratios of the resulting table

With a general thesaurus, too, you can isolate the specialized terms of a document. In this case, you have to add the document to the general thesaurus before beginning your analyses (program A => Thesaurus). Then, to point out the specialized terms, use the program AxB, which searches for common terms. But if all the terms of your document are common to those of the thesaurus (you added them to the thesaurus before beginning), how will you point out the specialized terms? You rely on the fact that specialized terms have a relatively high frequency in the document but a low one in the thesaurus. The program AxB calculates the ratio of the frequencies that each term has in the two documents. Now, sorting the column of frequency ratios in decreasing order, the specialized terms of the document will surface.

After sorting the table in decreasing order, the terms will fall into two categories:
1) the terms which, before text A was added to the thesaurus, were exclusive to A. All these terms will turn up with the same ratio value and they will be located at the top of the table
2) the terms which, before text A was added to the thesaurus, were common to both. These terms fall into three groups:
2-1) the common terms which are more frequent in A (ratio > 1)
2-2) the common terms that have a similar frequency in both documents (ratio of about 1)
2-3) the common terms which are more frequent in the thesaurus (ratio < 1).

Building the two thesauruses
We advise building both of the thesauruses described. To do this, begin composing the thesaurus of literary terms by inserting literary texts which you can find on the Net or scan from a book. When this thesaurus has become substantial enough (circa 300,000 words for at least 15,000 terms), make a copy of it. Now you have two identical thesauruses at your disposal. Reserve the first for literary texts only (thesaurus of literary terms), while to the second also add scientific, technical, and journalistic texts: you will thus have a general thesaurus. The database we have provided already contains a thesaurus of literary terms called Th_Literature_freq.

TABLE OF GRAMMATICAL TERMS

Collecting the terms common to several documents (program AxB), you can compose a table of the most common terms. This table will mainly contain articles, prepositions, pronouns, adverbs, etc., terms which have a grammatical function rather than one of content (figure 13). This table will be most useful in operations which we will look at shortly. In building it you should not employ very large texts, because otherwise content terms will also appear among the common terms. You must also not use very small texts; otherwise you will leave out many grammatical terms. In any case, after creating such a table, you need to examine it, manually eliminating the content terms which may be present and inserting grammatical terms not yet included.

We have prepared a table of grammatical terms, including also certain terms of very common use such as auxiliary verbs. This table has the name GrammEn and it is a simple list of terms. You can remove or add words to this table, so you can adapt it to your needs.

The program Text - GrammEn subtracts the table GrammEn from a table of words (obtained with the Normalize program). This operation serves to prepare for the search for repeated expressions, which will be performed later.
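The subtraction is a simple filtering of the word list, as in this minimal Python sketch (a tiny invented excerpt stands in for the GrammEn table):

# Excerpt of a table of grammatical terms and a normalized word list.
gramm_en = {"the", "of", "and", "a", "to", "is", "in"}
words = ["the", "rain", "in", "spain", "is", "acid", "rain"]

# Text - GrammEn: drop every word that appears in the grammatical table,
# keeping the remaining words in reading order.
content_words = [w for w in words if w not in gramm_en]
print(content_words)    # ['rain', 'spain', 'acid', 'rain']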

The table GrammEn is a simple table and as such it cannot be used with the programs A-B, AxB, A+B. If you need to do this, make a table of frequencies out of it through the Frequencies program.

REPEATED EXPRESSIONS, OR LOCUTIONS

Specialized terms are often formed from expressions (terms composed of several words). These expressions are very useful for describing a document and for aiding bibliographical research because they are normally very specific. Taken separately, the individual words of which these expressions are composed are not as specific. For example, the term "acid rain" belongs to the field of environmental protection research, while the words "acid" and "rain" taken alone are not so specific to this topic.
The purpose of this procedure is to single out the repeated expressions in a document and point out those used most frequently.

Used appropriately, the program we are about to describe can also single out repeated expressions in the literary field. This application can prove useful for linguistic analyses.

To do these analyses, use the program Locutions. This program joins words into sequences. These sequences can be 2 to 5 words long, according to what you have indicated. As you see in figure 14, the words of a sequence are joined by the sign "_" to form a single string. At this point, the program Frequencies treats these strings as words and calculates their recurrences and frequencies. Sorting the column of recurrences or of frequencies in decreasing order, the repeated strings gather at the top.
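The construction of the sequences can be sketched in a few lines of Python (the word list is invented and already stripped of grammatical terms):

def locutions(words, n):
    """Join every sequence of n consecutive words with "_" so that a later
    frequency count treats each sequence as a single word."""
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Word list already stripped of grammatical terms (invented example).
words = ["acid", "rain", "damages", "forests", "acid", "rain", "pollutes",
         "lakes"]
print(locutions(words, 2))
# ['acid_rain', 'rain_damages', 'damages_forests', 'forests_acid',
#  'acid_rain', 'rain_pollutes', 'pollutes_lakes']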

As said, when using the program you need to indicate the length of the sequences to be singled out. This value can be from two to five words. The texts to be examined with this program can be complete or stripped of grammatical terms.

If you want to single out repeated expressions in a literary text, you need to examine a complete text. In this case, sequences of two words have little significance, since nouns would be combined with their articles and the expressions would be unrecognizable. Here, it is better to search for sequences of four words (figure 14).

On the contrary, in the search for specialized expressions, the presence of grammatical terms is useless and even harmful. The program Text-GrammEn serves precisely to remove the grammatical terms from a table before submitting it to the Locutions program. When you examine a text stripped of grammatical terms, it is best that the sequences to be searched be only two words long (figure 15).

As was said, to single out the repeated expressions, apply the Frequencies program to the table of locutions and sort the resulting table in decreasing order of recurrence: the repeated expressions will gather at the top. You can eliminate the non-repeated sequences (recurrence = 1).

Case 1: search for literary expressions (complete text)
Prepare a normalized table for the text to be treated
Open the program Locutions
Choose the table to be treated (the words must be in reading order)
Indicate the number of words to be conjoined (3-5)
Start the program
(you will get the table: Name_n, where n is the number of words in each expression)

Apply the program Frequencies to the obtained table
(you will get the table: Name_n_freq)
Arrange the table according to recurrence
Erase table Name_n

Case 2: search for specialized expressions (text stripped of grammatical terms)
Prepare a normalized table of the text to be examined
With the program Text-GrammEn, subtract the grammatical terms
Open the program Locutions
Choose the table to be treated (the words must be in reading order)
Indicate the number of words to be conjoined (2)
Hit the Start button
(you will get the table: Name-G_2)

Apply the program Frequencies to the obtained table
(you will get the table: Name-G_2_freq)
Arrange the table according to recurrence
Erase the tables Name-G and Name-G_2

EVALUATION OF THE DATA:
Examine the repeated expressions (2 words long) of a specialized or scientific text and consider the usefulness of the most repeated ones.
Search for the repeated expressions (4 words long) of a newspaper article (distinguish the literary expressions from those relative to the news).

READABILITY INDEX

Many authors have searched for methods to determine how difficult a text is to read. For this purpose, parameters such as the length of the words and the length of the sentences have been taken into consideration.

Our program Readability takes these parameters into account:

- sentence length (from period to period)
- paragraph length (to new line)
- familiarity of terms employed (average value of frequency)

To establish the length of a document's sentences and paragraphs, the program examines a text file (not a table) that still includes punctuation. It counts the number of alphabetic characters, the number of non-repeated sentence-closing marks (. ! ? new line), and the number of non-repeated paragraph-closing marks (new line), and from these it calculates the average sentence length and the average paragraph length in characters. To calculate the familiarity, the program needs a table of frequencies made from this text (which words are present in the document and how many times they are used) and a thesaurus of frequencies (the frequency of the terms in English).

The Readability program determines the index of readability according to a formula which takes into account the average length of sentences, the average length of paragraphs and the familiarity of the terms employed.

At the end, the program shows its own determinations in the state window of the mask and adds a line to the Statistics table with the name of the document treated, the thesaurus used, and the obtained data (figure 16).

To do this determination, you can use the thesaurus of literary terms. In this case, when the program finds a term which it does not have, it assigns it a frequency of 0.00001.

Another way to perform the same determination is to add the text to be examined to the thesaurus before starting the Readability program. In this case, use the general thesaurus, which also includes non-literary terms. With these two methods you may get slightly different results: using the thesaurus of literary terms gives a slightly lower readability index.

Before beginning this analysis, verify that the text goes to a new line as laid out by the author. In fact, during the conversion of the text from its initial format (.html, .doc, .wpd) to text (.txt), the original layout often becomes altered. If necessary, arrange a part of the text by hand, at least 8,000 characters (around 3 pages), and copy this part into a new document. The determination of the readability index on this part of the text will give a result very similar to the one that would have been obtained using the entire document. If the text is a book or a report, it is helpful to remove the index and all titles.

Verify the "go to new line" of the text file
Remove the possible index and save the file as text
The Readability program requires the presence of:
- the text (Name.txt)
- its table of frequencies (Name_freq)
- a thesaurus

you may add the text to the thesaurus
Open the Readability program
Indicate the name of the document to be examined
Indicate the reference thesaurus for the calculation of familiarity
Hit the Start button

Examine the data obtained
This same data gets inserted in the Statistics table

DETERMINATIONS: Determine the readability of a few texts, one of which is your own.

EVALUATING THE DATA OBTAINED: Compare the data obtained with different documents. In particular, compare the sentence length and paragraph length. Compare the familiarity of the terms used. Discuss the formula.

FORMULA FOR THE DETERMINATION OF THE READABILITY INDEX

IndexL = FFa + FSe + FPa

Where:

FFa = CFa * (MFa - Fmin) function of familiarity
FSe = CSe * MSe * Exp(-(MSe ^ 2) / (2 * Sei ^ 2)) function of sentences
FPa = CPa * MPa * Exp(-(MPa ^ 2) / (2 * Pai ^ 2)) function of paragraphs

MFa = average value of the frequency of the terms
MSe = average length of a sentence
MPa = average length of a paragraph

Fmin = 0.0058 minimum familiarity
Sei = 60 ideal sentence length
Pai = 160 ideal paragraph length
CFa = 16000 coefficient of familiarity
CSe = 0.82 coefficient of the sentence (such that FSe max = 30)
CPa = 0.103 coefficient of the paragraph (such that FPa max = 10)

In this formula, three parameters appear.
The first is proportional to the average value of the familiarity of the terms used (figure 17).
The second and third are the result of a function which is maximal when the variable equals a certain value (the ideal length of a sentence or paragraph) and tends to zero when the variable moves far away from this value (figure 18). This function is a little complicated, but it has the advantage of avoiding the drawback of a linear function (a straight line), which would penalize the readability index heavily when the average sentence or paragraph is very long.
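For readers who want to experiment with the formula outside Access, here is a minimal Python sketch of it; the constants are those listed above, while the input values in the example are invented.

from math import exp

# Constants from the formula above.
Fmin, Sei, Pai = 0.0058, 60, 160
CFa, CSe, CPa = 16000, 0.82, 0.103

def readability(MFa, MSe, MPa):
    FFa = CFa * (MFa - Fmin)                               # familiarity
    FSe = CSe * MSe * exp(-(MSe ** 2) / (2 * Sei ** 2))    # sentences (max of about 30 at MSe = 60)
    FPa = CPa * MPa * exp(-(MPa ** 2) / (2 * Pai ** 2))    # paragraphs (max of about 10 at MPa = 160)
    return FFa + FSe + FPa

# Example values for MFa, MSe, MPa (in a real run they come from the
# Frequencies program and from the sentence and paragraph statistics).
print(round(readability(MFa=0.009, MSe=75, MPa=200), 1))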

The readability index determined in this manner takes into account only a few parameters, even if they are more numerous than those considered by the other indexes we have become acquainted with. Nevertheless, real readability also depends on other components, which involve higher intellectual functions such as intelligence and aesthetic sensibility. For this reason, at present, no program can define the readability of a text in a precise manner. In any case, even though limited to a few parameters, our readability index is not devoid of meaning, so much so that it generally agrees with our judgment.

EVALUATION OF TEXTS (Sentences, Paragraphs, Punctuation)

We have prepared some programs to help the evaluation of texts. They can also be used to improve the readability of a text in the course of editing. These programs are called:

Sentences
Paragraphs
Punctuation

The Sentences program examines the text, extracts the sentences from it (definition of sentence = .!? new line), and produces a table (Name_sent). In this table there are the following fields:

ID index of position of the first word of a sentence
Words length of a sentence in words
ASCII content of a sentence (including punctuation and spaces)
Len length of a sentence in characters (including punctuation and spaces)
Field1 necessary to the system
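As an illustration, here is a minimal Python sketch of the sentence extraction (the text is invented, and the first printed column is a simple sequence number rather than the position of the first word):

import re

text = ("Call me Ishmael. Some years ago I went to sea!\n"
        "A short paragraph on its own line")

# Sentence = text up to . ! ? or a new line; for each sentence print an
# index, the length in words, the length in characters, and the content.
sentences = [s.strip() for s in re.split(r"[.!?\n]+", text) if s.strip()]
for i, s in enumerate(sentences, 1):
    print(i, len(s.split()), len(s), repr(s))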

The Paragraphs program examines the indicated text, extracts the paragraphs from it (definition of paragraph = new line), and produces a table Name_para. In this table there are the same fields as in the preceding table with "paragraph" in place of "sentence."

The Punctuation program examines the indicated text and produces a table Name_punc. In this table there are the following fields:

ID index of position of the word that precedes the character or sequence
Words character or sequence, including the signs < >
ASCII ASCII value of the character or sequence
Len length of the sequence in characters
Field1 necessary to the system

The sentences table can be sorted in decreasing order of sentence length (field Words). In this way the longer sentences are highlighted and can be examined separately (figure 19). The same can be done with the paragraphs table. As was said, this can be useful when examining a document you are writing, in order to improve its readability.

You may notice that it is not always true that a long sentence is hard to read. If such a sentence describes a list of things, it may be long, but that does not make it difficult. Nor is a short sentence always preferable to a long one. In fact, examining the longer sentences, you will find some which are logically well constructed, composed of well-coordinated phrases. It is true that these sentences require closer attention from the reader, but they may be nicer to read. At this point, above all if you are editing a literary text, you may decide to sacrifice readability for beauty. On the contrary, if you are working on a scientific document, you should give priority to readability. There are also poorly constructed long sentences in which many subjects or complements contend for the same verb and whose logical structure is not easily traceable. These sentences bring to mind battle scenes and compel us to reread, reconstruct, and interpret. These are the sentences to rearrange so as to improve their aesthetics or their readability. In conclusion, readability and literary beauty are not the same thing and do not always agree. You can decide to sacrifice one in favor of the other, but you cannot renounce both. Clearly, the best results are obtained when we manage to combine beauty in writing and ease of reading.

The three tables _sent, _para, _punc can be examined as they are produced by their respective programs, but they can also be examined with the Frequencies program to obtain tables with:

- the number of sentences of equal length
- the number of paragraphs of equal length
- the number of punctuation marks of each type

CONCLUSION

At the end of these tests, we advise you to erase the tables which are no longer useful, to avoid their proliferation and the resulting confusion. Immediately afterwards, compact the database to reduce its size (Tools/Database Utilities/Compact Database).

The present edition of this work should be considered provisional, a sort of beta version. You are invited to send us your observations and suggestions. If you have followed along up to this point, doing the tests we have described, you have learned to use the programs we have provided. Now you can use them for your own interests and needs.

These programs can be useful to schools (especially those which offer a humanistic education), in linguistic research, to cybernauts doing bibliographic searches on the Internet, and to compilers of hypertexts in identifying specific terms for search engines. As we said at the beginning, these programs were created to help prepare queries for databases during bibliographic searches; they can therefore also help researchers extract descriptors for their own articles or compose the analytical index of a book. Finally, they can be of service to librarians in the work of indexing and classifying documents. In any case, whether used for personal or professional reasons, we hope these programs are to your liking.



