CodeRead - an autocoding mechanism for text-based data

Status: This document is very much under construction; QCML is under development. Any comments are welcome at andrew_perrin@unc.edu. The version of the software provided here is not working very well, particularly in its pattern-generation engine. Use at your own risk! I hope to have it working better soon.

Licensing information

CodeRead is copyright 1998-2001 Andrew J. Perrin. All rights reserved. Users may download, use, and copy the software for their own use according to the provisions of the GNU General Public License (http://www.gnu.org/copyleft/gpl.html) as long as this credit and licensing information remains with the software. Academic users should reference the software in published papers in a manner something like this:
Perrin, Andrew J. CodeRead. [Computer Program]. Available:
http://www.unc.edu/~aperrin/CodeRead 

@MISC{coderead,
AUTHOR="Andrew J. Perrin",
TITLE="CodeRead (Computer Program)",
NOTE="Available: {\url{http://www.unc.edu/~aperrin/CodeRead}}",
YEAR=2001
}

Very sketchy documentation

Most of CodeRead is encapsulated in the four Perl modules that define the data structures and methods. You can get information at that level by running perldoc CodeRead in the appropriate directory. The perldoc simply documents the Perl objects and their properties and methods. Perl programmers may find this useful; most others will not.

There is, in addition, a frontend script that remains VERY basic. This script makes it possible for non-programmers to get somewhere (although it's not clear where) using CodeRead. The script is called cr.pl and should usually be run from the command line while in the directory where CodeRead is installed. You can type cr.pl -h to get a list of options for using the script. It allows basic, but not very fine-grained, use of the tools. More will be forthcoming, I promise.

How to install and run

Comments and Questions

I'll do my best to answer any comments or questions, but no guarantees on my speed. E-mail me at andrew_perrin@unc.edu to contact me.

CodeRead: the article

There's a somewhat outdated article that will tell you a little more about the ideas and structure behind CodeRead. It's in pdf form here. An older version is here.

Some old junk I wrote about the original idea

Mostly of archaeological interest....

QCML will be released in three distinct parts:


This proposed standard is intended to allow sharing, replication, and collaboration among qualitative researchers by defining an open, flexible format for coding qualitative datasets and sharing them.

Criteria

Such a standard should:

QCML is based on SGML (ISO 8879), the Standard Generalized Markup Language.

Tags should be infinitely levelable; i.e. ...

There are two types of codes: temporal markers have no text within them; content codes encapsulate one or more words.

Content codes last until the closing <\QC > tag.

The system maintains the character number and the total number of characters as built-in temporal markers. Each tagged section of text gets a score, which is just the number of characters (words?) between its opening and closing tags. Marker tags get a score of 0.
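
The scoring rule above can be sketched in a few lines. This is an illustration in Python, not CodeRead's Perl code: the function name score_codes, the (name, start, end) tuple layout, and the choice to leave whitespace out of the character count are all assumptions made for the example.

```python
def score_codes(text, codes, unit="chars"):
    """Score each code by the amount of text it encloses.

    codes is a list of (name, start, end) character offsets, end exclusive
    (a hypothetical layout). A marker has start == end and scores 0.
    """
    scores = {}
    for name, start, end in codes:
        pieces = text[start:end].split()  # whitespace-separated words
        if unit == "words":
            scores[name] = len(pieces)
        else:
            # Assumption: whitespace is not counted toward the character score.
            scores[name] = sum(len(p) for p in pieces)
    return scores


text = "I voted because my neighbors did."
codes = [("reason", 8, 28), ("pause", 7, 7)]  # "pause" is a marker
char_scores = score_codes(text, codes)
word_scores = score_codes(text, codes, unit="words")
```

Here "reason" encloses "because my neighbors", so it scores 18 characters or 3 words, while the marker "pause" scores 0 either way.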

Output:

  • Char. count => number of active codes
  • Char. count => cumulative code score
  • Dummy variables for any two tags or types (=> correspondence analysis). Also allow a distance score: how far apart two tags can be (0 for overlapping) and still show up as corresponding.
  • Text as uncoded dataset.
  • Text as coded dataset.
  • Tag report: list of tags, definitions, levels, counts
  • Pattern scoring: 2 or more tags and a tolerance/distance allowance: how often the pattern holds, as a % of total uses of the input tags
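
One possible reading of the distance score and the pattern score from the list above, sketched in Python for illustration only. The names gap and pattern_score, the use of inclusive (start, end) word-number spans, and the rule that a use "holds" when every other tag in the pattern has a use within the tolerance are assumptions; the notes above do not pin these details down.

```python
def gap(a, b):
    """Distance in words between two inclusive spans; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def pattern_score(uses, pattern, tolerance=0):
    """Percent of all uses of the pattern's tags that sit within
    `tolerance` words of a use of every other tag in the pattern."""
    total = 0
    holding = 0
    for tag in pattern:
        for span in uses.get(tag, []):
            total += 1
            others = [t for t in pattern if t != tag]
            if all(
                any(gap(span, o) <= tolerance for o in uses.get(t, []))
                for t in others
            ):
                holding += 1
    return 100.0 * holding / total if total else 0.0


# Hypothetical tag uses, keyed by tag name, as (first word, last word).
uses = {"civic": [(3, 5), (20, 22)], "time": [(6, 7)]}
loose = pattern_score(uses, ["civic", "time"], tolerance=1)
strict = pattern_score(uses, ["civic", "time"], tolerance=0)
```

With a tolerance of 1, two of the three uses take part in the pattern; with a tolerance of 0 (overlap required), none do.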

    Some tags:

    • Mark the beginning of a dataset
    • Define a new tag to be used in the dataset (optional - should spring into being if not defined first)
    • Mark the beginning of the text portion of the dataset
    • File directory to find input file(s) in.
    • Read the referenced file in-place as one big file.
    • Ignore this section in analysis and coded output, print only in original output
    • Don't interpret but do count words & characters and include in coded output.

    When reading files, should ignore all whitespace in character count (leaning toward a word count instead)

    Technical:

    Words are numbered by the system. Two data structures are created:
    • 1.) Array of words
    • 2.) Hash of tags at word numbers:
          Word # ==> tag 1 && tag 2 && ... && tag n
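
A minimal sketch of these two structures, written in Python for illustration rather than taken from the Perl modules. The helper name build_index, the 0-based word numbering, and the (tag, first word, last word) span layout are assumptions.

```python
from collections import defaultdict


def build_index(spans):
    """Map each word number to the set of tags active at that word.

    spans is a list of (tag, first_word, last_word), inclusive
    (a hypothetical layout for this example).
    """
    tags_at = defaultdict(set)
    for tag, first, last in spans:
        for n in range(first, last + 1):
            tags_at[n].add(tag)
    return tags_at


# 1.) Array of words, numbered by position.
words = "we went to the town meeting last night".split()

# 2.) Hash of tags at word numbers.
spans = [("civic", 3, 5), ("place", 4, 4), ("time", 6, 7)]
tags_at = build_index(spans)

# Word 4 ("town") carries both "civic" and "place".
active = sorted(tags_at[4])
```

Keeping the words and the tags in separate structures means the uncoded text can be printed straight from the array, while any report on tags only needs the hash.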