Wednesday, January 7, 2009

HMT504: TextSTAT - guide

TextSTAT - Simple Text Analysis Tool

© 2001/2002 - Matthias Hüning

Version 1.52 (01.05.2002)




1. INTRODUCTION

TextSTAT is a concordance program which was designed to be user friendly and to provide simple Internet functionality. Texts can be combined to form corpora (which can also be stored as such). The program analyses these text corpora and displays word frequency lists and concordances for search terms. The program is written in Python and offered here as a Windows program. TextSTAT is freeware.

With TextSTAT you can search any amount of text you like (well, not quite: the amount is limited by your RAM). You can find out how often a certain word occurs and in what contexts it is used. Word combinations can also be examined.

TextSTAT is at the moment available in three languages – German, English and French. You can select the language via the menu entry 'Sprache ändern / Change language' (under 'Options'). To activate the new setting you will have to restart the program.

The program has been tested under Windows 98 (SE) and Windows XP. It should, however, also work on other Win32 versions.



2. CREATING YOUR OWN CORPORA

When you open TextSTAT, you will see a window with a menu bar and several 'tab sheets'. In the foreground is the tab sheet 'Corpus'. Here you can add files and so put together a corpus. You have the following options:
- 'Add file' (via the menu entry or button)
- 'Add HTML page' (via the menu entry or button)

In both cases, it is essential that the file to be added consists of plain text without any formatting. MS Word files, for example, cannot be read in directly; such texts have to be converted to plain text first. As a rule, such plain text will be available as ASCII text (encoded as 'Latin-1', which is also the default encoding for HTML pages). Latin-1 is the default setting for TextSTAT. A different encoding can also be used, but it has to be set before the file is read in (via the menu entry 'Options > File encoding'). Internally, TextSTAT processes the texts in Unicode.
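To make the encoding step more concrete, here is a minimal Python sketch of reading a Latin-1 file into a Unicode string. It is not taken from TextSTAT itself; the function name read_text and the sample file are assumptions made for illustration only.

```python
from pathlib import Path
import tempfile

def read_text(path, encoding="latin-1"):
    """Read a plain-text file and return its content as a Unicode string.

    'latin-1' mirrors the default described above; pass another
    encoding name (e.g. 'utf-8') for files stored differently.
    """
    return Path(path).read_bytes().decode(encoding)

# Demonstration with a temporary file that contains Latin-1 bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"          # hypothetical file name
    sample.write_bytes("Matthias Hüning".encode("latin-1"))
    text = read_text(sample)                   # decoded with the default
```

If the wrong encoding is chosen, accented characters such as 'ü' will be garbled or cause a decoding error, which is why the encoding has to be set before the file is read.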

HTML files can be read in directly from the Internet or from your own hard disk. In the first case, the complete WWW address (= URL) has to be entered, including 'http://', and you need access to the Internet. In the second case, you simply enter the file name including the path (e.g. 'c:\directory\file.htm'). By default, the HTML codes are removed from the files. This can, however, be deactivated.
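Removing the HTML codes can be pictured roughly as follows. This is only a sketch using Python's standard html.parser module; how TextSTAT actually strips the markup is not documented here, and the names TextExtractor and strip_html are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring script and style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0          # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(markup):
    """Return the plain text of an HTML string with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join with spaces so that text from adjacent elements does not run together.
    return " ".join(" ".join(parser.parts).split())

page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
plain = strip_html(page)
```

Joining the pieces with spaces is a simplification: a word that is split by an inline tag would be separated into two words.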



3. SAVE CORPUS / OPEN CORPUS

You can save the opened files so that you can use them again as a corpus at a later stage (via the appropriate button or menu entry). You can choose the name of the file that is then created. We recommend storing the corpora in a separate folder.



4. INTERNET CORPUS TOOL

If you want to put files from the Internet together to form a corpus, then the TextSTAT Corpus Tool can be useful. It enables any number of WWW pages from a website or any number of postings in a newsgroup to be downloaded. You start this helper program via the menu entry 'Corpus > New corpus (web/news)'.

If you enter a URL at 'Web', this will be taken as the starting point: the links will be followed and the pages found will also be added to the corpus. You should make sure that the search area is limited to the server or to the appropriate subdirectory. For example, if you take as the starting point, then you will receive for the first option pages beginning with . For the second option, however, you will only receive pages beginning with . You can also specify the file encoding of the website(s) here. As a rule, the default 'Latin-1' will be the correct setting.
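The difference between limiting the search to the server and limiting it to a subdirectory can be illustrated with a small URL check. This is a sketch under assumptions: the function in_scope, its limit parameter and the example.org addresses are hypothetical and are not part of TextSTAT.

```python
from urllib.parse import urlsplit

def in_scope(url, start_url, limit="server"):
    """Decide whether a linked page should be followed.

    limit="server":       same host as the starting point
    limit="subdirectory": same host and below the starting directory
    """
    start = urlsplit(start_url)
    target = urlsplit(url)
    if target.netloc.lower() != start.netloc.lower():
        return False
    if limit == "server":
        return True
    # Directory part of the starting path, e.g. '/docs/' for '/docs/index.html'.
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return (target.path or "/").startswith(base_dir)

start = "http://www.example.org/docs/index.html"      # hypothetical start page
same_dir = "http://www.example.org/docs/a/b.html"
other_dir = "http://www.example.org/other/x.html"
other_host = "http://elsewhere.example.com/docs/a.html"
```

With these addresses, other_dir is accepted when the limit is the server but rejected when the limit is the subdirectory, and other_host is rejected in both cases.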

With 'news', you first have to specify a news server to which you have access. Then you have to specify the name of a newsgroup (available on this server) and the number of postings that should be read in (e.g. 500). By default, quoted text in the postings (= lines beginning with '>') is removed. News postings are always treated as 'Latin-1' encoded.
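Removing quoted lines from a posting is simple to picture. The following Python sketch shows the idea; the function name strip_quotes is an assumption, and the program's own implementation may differ.

```python
def strip_quotes(posting):
    """Drop every line that begins with '>' (quoted text) from a posting."""
    kept = [line for line in posting.splitlines() if not line.startswith(">")]
    return "\n".join(kept)

posting = "> old text\nMy reply\n>> older text\nBye"
cleaned = strip_quotes(posting)
```

Only the lines 'My reply' and 'Bye' remain, so quoted material is not counted again each time it is repeated in a thread.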



5. WORD FORMS

After compiling a corpus from one or several files or after loading an existing corpus, you can obtain frequency information on the word forms contained in the corpus with the help of the tab sheet: 'Word Forms'. To do this, click on the button: 'Analyze corpus'.

By default, all the words are converted to lower case and then displayed in order of decreasing frequency. You can also have the word forms analyzed case-sensitively, so that upper-case and lower-case variants count as different forms. This, however, causes problems when the words are put in alphabetical order, since capital letters precede small letters. Retrograde sorting (sorting words by their endings) enables you, for example, to find out which words in the corpus have a particular suffix. You can also limit the frequency range to be displayed. Note that '0' means no restriction (so if min.=0 and max.=0, all word forms will be displayed). After changing the display options, you have to click 'Update list'.
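The options described above can be sketched in a few lines of Python. This is an illustration only, not TextSTAT's code: the function names are invented, and splitting words with the pattern \w+ is an assumption about what counts as a word form.

```python
import re
from collections import Counter

def word_forms(text, lowercase=True):
    """Count word forms; by default everything is converted to lower case."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"\w+", text))

def frequency_list(counts, min_freq=0, max_freq=0):
    """Sort by decreasing frequency; a limit of 0 means 'no restriction'."""
    rows = [(word, n) for word, n in counts.items()
            if (min_freq == 0 or n >= min_freq)
            and (max_freq == 0 or n <= max_freq)]
    return sorted(rows, key=lambda row: (-row[1], row[0]))

def retrograde(counts):
    """Sort word forms by their reversed spelling, i.e. by their endings."""
    return sorted(counts, key=lambda word: word[::-1])

counts = word_forms("The cat sat. The dog sat. The end.")
all_forms = frequency_list(counts)              # min.=0 and max.=0: everything
frequent = frequency_list(counts, min_freq=2)   # only 'the' and 'sat'
by_ending = retrograde(counts)                  # 'cat' and 'sat' end up adjacent
```

Retrograde sorting places words with the same ending next to each other, which is what makes suffix questions easy to answer.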

If you double-click on a word form, then it will be searched for in the corpus and a concordance will be created.



6. SEARCH/CONCORDANCE

The tab sheet 'Search/Concordance' shows a word form or a keyword in context. The hits can be sorted according to different criteria, and the length of the context to be displayed can be set. By default, the search term is displayed in upper case. This marking can be deactivated.

When you enter a search string, it is assumed by default that a whole word has been entered. This setting (search for 'whole words only') can be deactivated. A new search or a change in the display options is activated with the button 'Search/Update'.
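A keyword-in-context display of this kind can be sketched as follows. The function concordance and its parameters are hypothetical names chosen for this example; the sketch searches case-sensitively, which is an assumption and may not match the program's behaviour.

```python
import re

def concordance(text, term, context=20, whole_words=True, mark=True):
    """Return (left context, hit, right context) for every match of term.

    whole_words wraps the term in word boundaries; mark shows the hit
    in upper case, as in the default display described above.
    """
    pattern = rf"\b(?:{term})\b" if whole_words else term
    lines = []
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - context):match.start()]
        right = text[match.end():match.end() + context]
        hit = match.group(0).upper() if mark else match.group(0)
        lines.append((left, hit, right))
    return lines

sample = "My sister sold me her house. The house is old."
lines = concordance(sample, "house", context=8)
# first line: (' me her ', 'HOUSE', '. The ho')
whole = concordance(sample, "old")                       # only the word 'old'
partial = concordance(sample, "old", whole_words=False)  # also inside 'sold'
```

Sorting the resulting lines by their left or right context would give the different sort orders mentioned above.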

When searching, you can use regular expressions (see below).

If you double-click on a line of text, this will be searched for in the corpus and the citation (a text passage with more context) will be displayed.



7. CITATION

The tab sheet 'Citation' displays a text passage that shows the sought string in more context. The name of the file from which the passage is taken is also displayed. The position (in characters) of the passage in the original file is given in brackets.

A double-click on the file name opens the original file with the program that is associated with the file extension. In the case of web pages, the original page is opened in the browser via the Internet.



8. SEARCHING WITH REGULAR EXPRESSIONS

When defining the search term (in 'Search/Concordance'), you can use so-called 'regular expressions'. Admittedly, these are not particularly user friendly, but they are extremely powerful. They allow you to define even very complex search requests. The most important special characters are listed below:

  • '.'
    (the dot) stands for any character you like
  • '\w'
    stands for any alphanumeric character
  • '\W'
    stands for any non-alphanumeric character (e.g. space, punctuation marks)
  • '+'
    the preceding character occurs once or more
  • '*'
    the preceding character occurs any number of times, including zero
  • '*?', '+?'
    make '*' and '+' non-greedy, so that they match as few characters as possible (see examples)
  • '|'
    stands for 'or'
  • '[]'
    square brackets define a set of characters, any one of which may match.

Examples:

  • b\wt
    finds 'but', 'bit', 'bet' and 'bat'
  • b\w+t
    finds 'but', 'bit', 'bet', 'bat', 'boat' and 'built'
  • w[ao]nder
    finds 'wander' and 'wonder'
  • (this|that)
    finds 'this' or 'that'
  • so.+e
    finds the string 'sold me her house' in the text: 'My sister sold me her house'
  • so.+?e
    finds the string 'sold me' in the text: 'My sister sold me her house'
  • s.+r
    finds the string 'sister sold me her' in the text: 'My sister sold me her house'
  • s\w+r
    finds the string 'sister' in the text: 'My sister sold me her house'
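The greedy and non-greedy examples can be checked with Python's re module. This is a small sketch that assumes the program's search behaves like re.search and re.findall; the variable names and the extra word 'winder' are for illustration only.

```python
import re

sentence = "My sister sold me her house"

greedy = re.search(r"so.+e", sentence).group(0)       # 'sold me her house'
lazy = re.search(r"so.+?e", sentence).group(0)        # 'sold me'
any_chars = re.search(r"s.+r", sentence).group(0)     # 'sister sold me her'
word_chars = re.search(r"s\w+r", sentence).group(0)   # 'sister'

# A character set and an alternation, tried on short word lists.
set_hits = re.findall(r"w[ao]nder", "wander wonder winder")   # ['wander', 'wonder']
alt_hits = re.findall(r"(?:this|that)", "this and that")      # ['this', 'that']
```

The difference between 's.+r' and 's\w+r' arises because '.' also matches spaces, whereas '\w' stops at the end of the word.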

As already stated, regular expressions are not easy, but they are extremely powerful. The examples shown here can only hint at the possibilities. Much more is possible! A search with Google for 'tutorial regular expressions' will give you a list of useful websites.



9. PRINT RESULTS / EXPORT / SAVE

Word forms and concordances can be directly transferred to an MS Word document. There, they can be processed and also printed out (TextSTAT does not allow results to be printed out directly). The entry 'Results > MS Word' in the 'File' menu opens the word processing program with an empty document and transfers word forms and concordances to this document.

In addition or as an alternative, TextSTAT lets you save the results in a text file. (The encoding of the text depends on the setting under 'Options > File encoding'.)

Finally, TextSTAT can export your frequency data directly to MS Excel.
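If you want to post-process a frequency list yourself, a tab-separated text file is a format that spreadsheet programs can usually open. The following Python sketch writes such a list; it is not the program's export routine, and the function name, the column headers and the sample data are assumptions.

```python
import csv
import io

def export_frequencies(rows, stream):
    """Write (word form, frequency) pairs as tab-separated lines."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(["word form", "frequency"])   # hypothetical header row
    writer.writerows(rows)

# Written to an in-memory buffer here; a real file would be opened with
# open(path, "w", encoding="latin-1", newline="") or another encoding.
buffer = io.StringIO()
export_frequencies([("the", 3), ("sat", 2)], buffer)
exported = buffer.getvalue()
```

The encoding chosen when opening the file plays the same role as the setting under 'Options > File encoding'.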



11. CONTACT

If you have any questions about (or problems with) TextSTAT, you can contact the author:

Matthias Hüning, <mhuening@zedat.fu-berlin>



Source: http://www.niederlandistik.fu-berlin.de/textstat/TextSTAT-Doku-EN.html
