Creating an import format file

Summary: Only inflected and non-inflected text files require an import format file. Both types of import format file use a similar syntax to describe the format of the data file.

To create an import format file:

On the File drop down menu, click Import.

Choose either Inflected or Non-inflected in the Method box.
Click File, then either name your new file or choose an existing import format file (.IMP).

Click Edit.

A Notepad window opens, showing the existing format file, or blank for a new file. Type the text of the import format file (or edit the existing text).

If the file is inflected (tagged), see the paragraph below on Inflected import format files.

If the file is non-inflected, see the paragraph below on Non-inflected import format files.

When you have finished typing in the text of the import format file, click File, Exit.

You are asked whether you want to save the file. Click Yes.

See Importing Records for details of the rest of the import process.

If a field named in an import format file does not exist in the record, DScribe will attempt to append a field of that type to the record. If the field name is omitted entirely, the text for the field will be matched but not imported. This is useful for discarding fields in the source file that are not required in the Calm database. Field names may either be given on their own (e.g. Text) or with a subscript (e.g. Text[3] - referring to the third instance of a Text field in this record).
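The naming rules above can be pictured with a short Python sketch. It is not part of Calm or DScribe; the names FieldSpec and parse_field_line are hypothetical, and the sketch assumes that everything before the first equals sign is the field name (with an optional subscript) and everything after it is the tag or separator. The record type line that begins with a colon is not handled here.

```python
import re
from typing import NamedTuple, Optional


class FieldSpec(NamedTuple):
    name: str             # empty string: the text is matched but not imported
    index: Optional[int]  # subscript such as 3 in Text[3], or None if absent
    pattern: str          # tag or separator, kept exactly as written


# name, optional [n] subscript, an equals sign, then the rest of the line
_FIELD_LINE = re.compile(
    r"^(?P<name>[^=\[\]]*)(?:\[(?P<index>[0-9]+)\])?=(?P<pattern>.*)$"
)


def parse_field_line(line: str) -> FieldSpec:
    """Split one field line of an import format file into its parts."""
    match = _FIELD_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a field line: {line!r}")
    index = match.group("index")
    return FieldSpec(
        match.group("name"),
        int(index) if index else None,
        match.group("pattern"),
    )
```

For example, parse_field_line(r"Text[3]=TX\s") gives the name Text with subscript 3 (the tag TX is made up for illustration), while parse_field_line(r"=AN\s") gives an empty name, which corresponds to a field that is matched and then discarded.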

Inflected import format files

Non-inflected import format files

Regular expressions

The codes Calm uses to represent characters in import format files are called regular expressions. If you are familiar with UNIX or the text programming language AWK, you will probably have used regular expressions before. They are a powerful and logical way to specify tags and separators in import format files.

The simplest regular expression is a straightforward piece of text. The expression Asia matches only with the character string Asia (not ASIA or asia).

Some characters have special meanings, allowing special sequences of characters to be matched:

^ $ . [ ] - * + \

These are called “metacharacters”. A regular expression consisting of a single non-metacharacter matches that character. To match a character that is itself a metacharacter, for example the dollar sign, prefix it with a backslash. Thus, the regular expression \$ matches the character $. The meanings of the metacharacters are as follows:

^ Matches the beginning of a line

$ Matches the end of a line

. Matches any single character

\ Matches the character that follows as a normal character, not as a metacharacter. For example, if fields are separated by dollar signs, use the regular expression \$ (if you used a dollar sign on its own, it would match with the end of a line).

\b Matches backspace (ANSI character 8)

\f Matches form feed (ANSI character 12)

\s Matches space (ANSI character 32)

\t Matches horizontal tab (ANSI character 9)

\\ Matches a single backslash

\nnn Matches ANSI character nnn
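If you want to try expressions outside Calm, the codes above can be translated into patterns for Python's re module. The following is a rough sketch only, not Calm's own behaviour: the function name calm_to_python is hypothetical, \nnn is assumed to be a decimal character number (the list above gives character numbers in decimal), and characters such as | that are special in Python but do not appear in the metacharacter list are assumed to be ordinary characters in Calm.

```python
import re

# Codes from the list above that stand for one literal character.
_SIMPLE_CODES = {"b": "\b", "f": "\f", "s": " ", "t": "\t"}

# Special in Python's re module but absent from the metacharacter list
# above, so assumed here to be ordinary characters in Calm.
_PYTHON_ONLY = set("|(){}?")


def calm_to_python(pattern: str) -> str:
    """Translate a Calm-style expression into a Python re pattern (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in _PYTHON_ONLY:
            out.append("\\" + ch)
            i += 1
        elif ch != "\\":
            out.append(ch)
            i += 1
        elif i + 1 == len(pattern):
            raise ValueError("pattern ends with a lone backslash")
        elif re.fullmatch(r"[0-9]{3}", pattern[i + 1:i + 4]):
            # \nnn: read as a decimal character number (assumption).
            out.append(re.escape(chr(int(pattern[i + 1:i + 4]))))
            i += 4
        elif pattern[i + 1] in _SIMPLE_CODES:
            # \b, \f, \s, \t: replace with the literal character.
            out.append(re.escape(_SIMPLE_CODES[pattern[i + 1]]))
            i += 2
        else:
            # \$, \\, \* and similar already mean the same thing in Python.
            out.append(pattern[i:i + 2])
            i += 2
    return "".join(out)
```

The translation matters mainly for \s and \b: in Python, \s matches any white space and \b outside a class matches a word boundary, whereas the list above defines them as a space and a backspace.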

Classes

[rr] Characters in square brackets form a “class”; a class in an import format file will match with any character within the class, e.g. [AEIOU] will match with any upper case vowel. This may be useful if a data file uses different separators in different records. If a ^ symbol follows the opening bracket (e.g. [^ABC]), the expression will match with any character not inside the brackets (i.e. any character other than A, B, or C). This use of ^ is distinct from its use to match the start of a line:

^[ABC] matches A, B or C at the start of a line.

^[^ABC] matches any character except A, B and C at the start of a line.

[^ABC] matches any character except A, B or C.

Inside a class, all characters have their literal meaning, except \, ^ at the start of the class, and -. Thus, [.] matches a full stop, and ^[^^] matches any character at the start of a line, except a caret (^).

- The hyphen is used to indicate ranges of values (in ANSI order) within a class, e.g. [0-9] matches any digit and [A-Za-z][0-9] matches any letter, whether upper or lower case, followed by a digit.

+ If r is a regular expression, r+ matches with any string consisting of one or more repetitions of r. For example, AB+C matches ABC, ABBC, ABBBC and so on.

* r* has a similar effect to r+, except it will also match with zero repetitions of r. Thus, AB*C will match with AC (A followed by zero repetitions of B, followed by C) as well as with ABC, ABBC and so on.
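Classes, ranges, + and * behave the same way in Python's re module, so the examples above can be checked there. This is only an illustration using Python, not Calm; the helper name whole_match is hypothetical.

```python
import re


def whole_match(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


candidates = ["AC", "ABC", "ABBC"]

# r+ needs at least one repetition; r* also accepts none.
plus_hits = [s for s in candidates if whole_match(r"AB+C", s)]
star_hits = [s for s in candidates if whole_match(r"AB*C", s)]

# A class starting with ^ matches any character not listed.
negated_hits = [c for c in "ABCD^" if whole_match(r"[^ABC]", c)]

# Ranges inside classes: a letter followed by a digit.
range_hits = [s for s in ["a1", "Z9", "11", "aa"] if whole_match(r"[A-Za-z][0-9]", s)]

print(plus_hits)     # ['ABC', 'ABBC']
print(star_hits)     # ['AC', 'ABC', 'ABBC']
print(negated_hits)  # ['D', '^']
print(range_hits)    # ['a1', 'Z9']
```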

Regular expressions match text in one line only: you cannot use them to match text that spreads over more than one line. For example, ^$ is valid (to specify an empty line), but $^ is not (end then start of line).

The following examples may help you to choose a suitable regular expression:

Text to match	Regular expression

C at the beginning of a line	^C
C at the end of a line	C$
A line consisting of the single character C	^C$
A line consisting of any single character	^.$
A line consisting of two spaces	^\s\s$
A line consisting of any number of spaces	^\s*$
Any three consecutive characters	...
A backslash \ at the end of a line	\\$
A hyphen - or a backslash \	[\-\\]
Any string of one or more capital letters	[A-Z]+
Any string of the form 1 of 12, 2 of 12, etc.	[0-9]+\sof\s[0-9]+
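Several rows of the table can be tried out one line at a time, which mirrors the rule that an expression never matches across lines. The sketch below uses Python's re module rather than Calm. Because Python's \s matches any white space, the space code is written as a literal space. The function name matching_lines and the sample text are made up for illustration.

```python
import re
from typing import List


def matching_lines(pattern: str, text: str) -> List[int]:
    """Return the 1-based numbers of the lines that contain a match."""
    compiled = re.compile(pattern)
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if compiled.search(line)  # each line is searched on its own
    ]


# Seven lines: Cat / abC / C / (empty) / (two spaces) / 3 of 12 / end\
SAMPLE = "Cat\nabC\nC\n\n  \n3 of 12\nend\\"

print(matching_lines(r"^C", SAMPLE))                 # [1, 3]
print(matching_lines(r"C$", SAMPLE))                 # [2, 3]
print(matching_lines(r"^C$", SAMPLE))                # [3]
print(matching_lines(r"^ *$", SAMPLE))               # [4, 5]
print(matching_lines(r"\\$", SAMPLE))                # [7]
print(matching_lines(r"[0-9]+ of [0-9]+", SAMPLE))   # [6]
```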

Inflected import format files

A typical inflected (tagged) data file might look like this:

AN 44031535
AU MORRIS-M. COOPER-A-S.
TI BIVALVE ENHANCEMENTS
SO J TRAD ENG

AN 44031536
AU ROE-J.
TI EXTRAPOLATING HERRING FECUNDITY TO STOCK LEVELS
SO J FISH EGG COUNTING

The following are characteristics of inflected files:

The beginning of each field is identified by a tag, which is unique to that field (in the above example, the field tags are AN, AU, TI and SO).

The start of each record is indicated by a recognizable character or sequence of characters (in the above example, the start of a new record is identified by a blank line).

Each record need not necessarily have the same fields.

Provided a data file follows these rules, you can import any such text file into Calm records. To communicate this information, you need to create an import format file, and you need to have chosen a record type in Calm with the appropriately named fields. Calm does the rest.

Suppose that you choose the Calm record type called Article, with the following fields:

Author
Title
Source
Publisher

Note: It does not matter that the record type includes a field (Publisher) that is not present in the data file; any such field is left blank.

The first line of the import format file specifies the record type into which you want to import the data. Type:

:Article

On the same line, you need to specify (after an equals sign) the tag or text string that separates one record from the next. In this case, the start of each new record is marked by a blank line. In import format files, DScribe uses the code ^ to mean the start of a line and $ to mean the end of a line, so a blank line is written as ^$ (other codes are listed in Regular expressions). Complete the first line so that it reads:

:Article=^$

The rest of the import format file specifies the tags identifying each field in much the same way. You can type the field lines in any order. The field in the inflected file that is destined to end up in the Author field is indicated by “AU” followed by a space. If you do not include the space in the defined tag, DScribe will assume it is part of the field and import it with the rest of the text. Thus to define the Author field, add the line:

Author=AU\s

immediately below the first line. (In import format files, DScribe uses the code \s to identify spaces – see Regular expressions.)

The title and source fields follow a similar pattern, so you would add the following lines to the import format file:

Title=TI\s
Source=SO\s

Finally, you need to tell Calm, via the import format file, that you do not want to import the record numbers (identified by the AN tag). Use the same format as before, but omit any field name:

=AN\s

The complete import format file reads:

:Article=^$
Author=AU\s
Title=TI\s
Source=SO\s
=AN\s
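To see how such a file can be read against the sample data, here is a minimal Python sketch. It is not DScribe's implementation. It assumes that each field occupies one line, that tags are matched at the start of a line, and that \s is the only special code needing translation; the function names are hypothetical.

```python
import re

FORMAT = r""":Article=^$
Author=AU\s
Title=TI\s
Source=SO\s
=AN\s"""

DATA = """AN 44031535
AU MORRIS-M. COOPER-A-S.
TI BIVALVE ENHANCEMENTS
SO J TRAD ENG

AN 44031536
AU ROE-J.
TI EXTRAPOLATING HERRING FECUNDITY TO STOCK LEVELS
SO J FISH EGG COUNTING
"""


def to_python(pattern):
    # Only the space code is translated in this sketch.
    return pattern.replace(r"\s", " ")


def import_inflected(format_text, data):
    lines = format_text.splitlines()
    record_type, separator = lines[0][1:].split("=", 1)
    separator = re.compile(to_python(separator))
    fields = []
    for line in lines[1:]:
        name, tag = line.split("=", 1)
        fields.append((name, re.compile(to_python(tag))))

    records, current = [], {}
    for line in data.splitlines():
        if separator.search(line):        # a blank line closes the record
            if current:
                records.append(current)
                current = {}
            continue
        for name, tag in fields:
            found = tag.match(line)       # tag assumed to start the line
            if found:
                if name:                  # no name: matched but discarded
                    current[name] = line[found.end():]
                break
    if current:
        records.append(current)
    return record_type, records


record_type, records = import_inflected(FORMAT, DATA)
print(record_type)    # Article
print(len(records))   # 2
print(records[0])
```

Each record comes back with Author, Title and Source only: the AN lines are matched by the nameless entry and dropped, and Publisher never appears because the data file has no tag for it.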

As a further example, suppose that you wanted to import the following records:

into Calm records of type Contact, with fields called Name, Department and Institute (if the Calm record type called Contact includes other fields as well, these will be left blank). You should end up with an import format file like this:

:Contact=^---\*$
Name=N:-
Department=D:-
Institute=I:-

Note: The backslash \ in the Contact line is necessary because the asterisk * on its own has a special meaning in import format files. See Regular expressions for more details.

To prevent the | characters at the end of the fields from being imported, type | in the Ignore characters box before you import the records.

Non-inflected import format files

Non-inflected files are assumed to contain records of a single type. The fields within the records may vary, but there may not be more than one main record type. Each record consists of a series of fields, in a regular format, without tags. Missing or empty fields are indicated by two consecutive field separators.

The following are characteristics of non-inflected files:

Each record has the same number of fields in the same order.

Fields are not tagged, but are identified by the order or position in which they occur.

Each field is separated from the next by a separator (in the first example below, the field separator is a vertical bar; other non-inflected files may use line breaks, tabs or commas to separate fields).

Each record is separated from the next by a common separator (in the example below, the record separator is a line break; other non-inflected files may use blank lines to separate records).

Suppose that you are importing a data file with records in the following form:

You have already chosen a Calm record type called Contact, with the following fields to match those in the data file:

Name
Department
Organisation
Street
Town
State
ZIP

The first line of the import format file specifies the record type into which you want to import the data. Type:

:Contact

The rest of the import format file describes the fields and how they are separated, in the order in which they appear in the data file.

The first field corresponds to the Name field in the Calm record, and is separated from the following field by a vertical bar. Thus, type:

Name=|

in the import format file, below the :Contact line.

The next five address fields follow the same pattern, so add the following lines to the import format file:

Department=|
Organisation=|
Street=|
Town=|
State=|

The last field in the record, corresponding to the Calm ZIP field, is slightly different. It is not followed by a vertical bar, but by the end of the line. In an import format file, Calm uses the character $ to signify the end of a line (other codes are listed in Regular expressions). Thus the line for the ZIP field reads:

ZIP=$
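The effect of these field lines on a single record can be sketched in Python. This is an illustration only: it assumes the fields appear in the data file in the order Name, Department, Organisation, Street, Town, State, ZIP, and the sample record used in the note after the code is invented, not taken from a real data file.

```python
# Field order assumed to follow the data file (see the lead-in above).
FIELDS = ["Name", "Department", "Organisation", "Street", "Town", "State", "ZIP"]


def split_record(line):
    """Split one bar-separated line into named fields.

    Two consecutive bars give an empty value, which is left out of
    the result so that the field stays blank.
    """
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(values)}")
    return {name: value for name, value in zip(FIELDS, values) if value}
```

For instance, split_record("Ann Example||Example University|1 Example Street|Exampletown|California|CA00000") returns six fields and no Department entry, because the two consecutive bars after the name mark that field as empty.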

The complete import format file reads:

:Contact
Name=|
Department=|
Organisation=|
Street=|
Town=|
State=|
ZIP=$

As a further example, suppose that you wanted to import the following records from a different file into Calm Contact records:

Jay Burgdorf
Blantyre University
Alumni Drive
Berkeley
California
CA15642
Department of Semantics

Cathy Bannerman
URTF Inc.
Waybill Drive
Napa
California
CA25641

Lin Yu
Gordon Hotel
Belvoir Gardens
Daly City
California
CA67521
Catering Department

There are two things to note about this data file:

The fields are in different orders in the data file and the Calm record. This does not prevent you from importing the records. Provided the order of the fields in the import format file matches that of the data file, the order of the fields in the Calm record is unimportant.

The department field (the last) is missing from the second record. This is still a valid non-inflected file; the separator (a line break) is still present, so this is recognisable as a field entry, albeit an empty one. The file still obeys the rules for non-inflected files.

This time, the records are separated by blank lines, and the fields by line breaks. The import format file for this data file would read:

:Contact
Name=$
Organisation=$
Street=$
Town=$
State=$
ZIP=$
Department=^$

Note: If there had not been a blank line in the place of the missing field in the second record, the file would not have obeyed the rules for non-inflected files. You could still have imported the file, but the second record would not have had the correct data in the correct fields.
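The way this format file walks through the data can be sketched in Python. This is an assumed reading of the rules, not DScribe's own code: each separator is searched for from the current position, the text before it becomes the field value, and the line break after a matched separator is skipped. The function name is hypothetical.

```python
import re

FORMAT = """:Contact
Name=$
Organisation=$
Street=$
Town=$
State=$
ZIP=$
Department=^$"""

DATA = """Jay Burgdorf
Blantyre University
Alumni Drive
Berkeley
California
CA15642
Department of Semantics

Cathy Bannerman
URTF Inc.
Waybill Drive
Napa
California
CA25641

Lin Yu
Gordon Hotel
Belvoir Gardens
Daly City
California
CA67521
Catering Department
"""


def import_non_inflected(format_text, data):
    lines = format_text.splitlines()
    record_type = lines[0][1:]
    fields = []
    for line in lines[1:]:
        name, separator = line.split("=", 1)
        # MULTILINE lets ^ and $ match at every line start and end.
        fields.append((name, re.compile(separator, re.MULTILINE)))

    text = data.strip("\n")
    records, pos = [], 0
    while pos < len(text):
        record = {}
        for name, separator in fields:
            found = separator.search(text, pos)
            end = found.start() if found else len(text)
            value = text[pos:end].strip("\n")
            if name and value:            # empty fields are left blank
                record[name] = value
            pos = found.end() if found else len(text)
            if pos < len(text) and text[pos] == "\n":
                pos += 1                  # step over the line break
        records.append(record)
    return record_type, records


record_type, records = import_non_inflected(FORMAT, DATA)
print(len(records))                        # 3
print(records[1].get("Department"))        # None
print(records[2]["Department"])            # Catering Department
```

In the second record the blank line is matched at once by ^$, so Department is empty and the next record starts with Lin Yu. Removing that blank line from DATA shifts every later value into the wrong field, which is the situation the note above describes.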