
DATA COLLECTIONS

This page contains links to corpora and to data collections that can be transformed into corpora.

There are also links to other forensic linguistic and legal language repositories.

 

Public Forensic Linguistic & Legal Language Corpora 

 

DATA NAME

Description

Malicious Forensic Text Corpus

SOURCE NAME

Andrea Nini

Approximately 100 malicious threatening texts. Metadata is included where known.

Data Links

DATA NAME

Description

Shneidman and Farberow Suicide Note Corpus

SOURCE NAME

Shneidman, E.S., & Farberow, N.L. (1957). Clues to Suicide. New York: McGraw-Hill Book Company

The original data from the 1957 Shneidman and Farberow study. The data contain a matched sample of 33 authentic letters and 33 inauthentic letters.

Data Links

DATA NAME

Description

Corpus of State Conventions on the Adoption of the Constitution (COSCAC)

SOURCE NAME

Brigham Young University

Over 1 million words from more than 650 texts. The texts contain debates from several state conventions on the adoption of the U.S. Constitution.

Data Links

DATA NAME

Description

Corpus of Early Statutes at Large (CESAL)

SOURCE NAME

Brigham Young University

Over 470,000 words from more than 480 texts. The corpus includes early laws passed by the U.S. Congress.

Data Links

DATA NAME

Description

The Enron Email Dataset

SOURCE NAME

William W. Cohen, MLD, CMU

Over 500,000 emails from 150 Enron Corporation employees, acquired by the Federal Energy Regulatory Commission during its investigation of Enron's collapse.

Data Links

DATA NAME

Description

NEW THREAT CORPUS

SOURCE NAME

Tammy Gales and Andrea Nini

This collection of threatening communications is a compilation of over 300 publicly available texts from CTARC (the Communicated Threat Assessment Research Corpus, compiled by Tammy Gales), MFC (the Malicious Forensic Texts corpus, compiled by Andrea Nini), and the written texts from CoJO (the Corpus of Judicial Opinions, compiled by Julia Muschalik). Additional threatening texts come from ForensicLing.com (the forensic linguistic data site hosted by Tammy Gales and Dakota Wing). Metadata is supplied where known from the original case research.

Data Links

DATA NAME

Description

BYU-Corpus of Early Modern English (BYU-COEME)

SOURCE NAME

Brigham Young University

Over 1 billion words from more than 40,000 texts from 1475-1800. Texts are from the Evans Bibliography, Early English Books Online, and Eighteenth Century Collections Online, among others.

Data Links

DATA NAME

Description

Corpus of Founding Era American English (COFEA)

SOURCE NAME

Brigham Young University

Over 130 million words from more than 125,000 texts spanning 1760-1799. Documents include those "from ordinary people of the day, the Founders, and legal sources."

Data Links

DATA NAME

Description

Corpus of Supreme Court Opinions of the United States (COSCO-US)

SOURCE NAME

Brigham Young University

Over 94 million words from more than 60,000 texts from U.S. Supreme Court Opinions published through 2017.

Data Links

DATA NAME

Description

Corpus of the Records of the Constitutional Convention (CORCC)

SOURCE NAME

Brigham Young University

Over 680,000 words from more than 800 texts documenting the records of the federal convention of 1787.

Data Links

DATA NAME

Description

Corpus of US Caselaw (CUSC)

SOURCE NAME

Brigham Young University

Over 4 million words from more than 8,000 texts, which include court decisions published between 1760 and 1799.

Data Links

 

Public Data Collections for Corpus Compilation

DATA NAME

Description

U.S. Supreme Court Oral Arguments

SOURCE NAME

U.S. Supreme Court

Transcripts of the oral arguments on cases heard by the U.S. Supreme Court

Data Links

DATA NAME

Description

The ISIS Files

SOURCE NAME

George Washington University and the New York Times

"A collection of more than 15,000 pages of internal ISIS documents collected by New York Times investigative journalist and Program on Extremism fellow Rukmini Callimachi during embeds with the Iraqi army."

Data Links

DATA NAME

Description

Police Interrogation and Confession Videos

SOURCE NAME

Red Circle Interrogations and Confessions

This site hosts videos of police interrogations and confessions from a range of cases.

Data Links

DATA NAME

Description

Police Interrogation Videos

SOURCE NAME

r/interrogationvideos

A variety of videos of police interviews and interrogations

Data Links

DATA NAME

Description

School Shooters.info: Resources on School Shootings, Perpetrators, and Prevention

SOURCE NAME

Peter Langman, Ph.D.

"This site is a compendium of documents relating to a wide range of active shooter incidents in educational settings. The purpose of the site is to help prevent school shootings and to provide insight into the perpetrators of large-scale school violence."

Data Links

DATA NAME

Description

911 calls

SOURCE NAME

A Call for Help r/911Calls

Collections of 911 calls (e.g., disturbing, strange, celebrity)

Data Links

DATA NAME

Description

Live Trial Videos

SOURCE NAME

Law and Crime Trial Network

This site hosts a range of live trial videos. Many postings include multiple videos from the same case at various stages of the court proceedings. Videos are free; they are not transcribed.

Data Links

DATA NAME

Description

Piracy Trial Documents

SOURCE NAME

The Library of Congress: LAW

Trial documents from piracy cases pre-1923

Data Links

DATA NAME

Description

Serial Killer Court Transcripts

SOURCE NAME

Serial Killer Info

A range of legal documents from serial killer cases across the United States

Data Links

DATA NAME

Description

Fire and Police Videos

SOURCE NAME

FireandPoliceVideos.com

A collection of videos from police and firefighters

Data Links

DATA NAME

Description

Court Case Documents

SOURCE NAME

Legal Research Society

Court documents from landmark U.S. Supreme Court cases

Data Links

DATA NAME

Description

The Guantánamo Testimonials Project

SOURCE NAME

UC Davis Center for the Study of Human Rights in the Americas

Various testimonials from inmates, workers, and others affiliated with Guantánamo Bay, Cuba

Data Links

DATA NAME

Description

Canadian Court Opinions

SOURCE NAME

CanLII

Searchable database of Canadian court opinions

Data Links

DATA NAME

Description

Full length police interview videos

SOURCE NAME

Across the Table

A collection of full-length police interview videos

Data Links

DATA NAME

Description

Death Threats

SOURCE NAME

The Smoking Gun

Death threats (primarily against public figures)

Data Links

DATA NAME

Description

Death Row Final Statements: Texas

SOURCE NAME

Texas Department of Criminal Justice

Last statements from death row offenders prior to execution

Data Links

DATA NAME

Description

911 calls

SOURCE NAME

Los Angeles Police Department

Audio recordings of actual 911 calls received in Los Angeles

Data Links

DATA NAME

Description

Death Row Final Statements: California

SOURCE NAME

Evan Wagstaff, Los Angeles Times

Last words of the 13 men executed in California between 1978 and 2014

Data Links

DATA NAME

Description

Senate Judiciary Hearings

SOURCE NAME

Committee on the Judiciary

Videos and transcripts of U.S. Senate Judiciary hearings

Data Links

DATA NAME

Description

Airplane Black Box Last Word Recordings

SOURCE NAME

ListVerse

Audio recordings of last words from cockpit black boxes

Data Links

DATA NAME

Description

Death Row Final Statements: Bizarre

SOURCE NAME

BuzzFuse

A collection of bizarre last words from a range of criminals prior to execution

Data Links

DATA NAME

Description

Chat logs of Child Predators

SOURCE NAME

Perverted Justice

A collection of chat logs between child predators and Perverted Justice volunteers, who portrayed themselves as children. Note: as of 2019, the organization is ceasing operations, but the chat logs are currently still available on its website.

Data Links

DATA NAME

Description

Threats against Congress Members

SOURCE NAME

FBI Records: The Vault

Written threats against members of Congress

Data Links

DATA NAME

Description

Los Angeles Times Legal and Political Documents

SOURCE NAME

Los Angeles Times

A wide collection of legal and political documents that have been in the news.

Data Links

DATA NAME

Description

Airplane Black Box Last Words

SOURCE NAME

PlaneCrashInfo.com

Transcripts of last words from cockpit black box recordings

Data Links

DATA NAME

Description

Civil Court Videos

SOURCE NAME

Caught in Providence

Videos of civil court interactions

Data Links

DATA NAME

Description

Trial Transcript Collection

SOURCE NAME

John Jay College of Criminal Justice

Thousands of New York County criminal trial transcripts from 1883-1927

Data Links

DATA NAME

Description

Civil Rights Court Documents

SOURCE NAME

ACLU-PA

A range of legal documents (e.g., motions, briefs, complaints, decisions) from cases related to civil rights.

Data Links

DATA NAME

Description

Police-Citizen Videos

SOURCE NAME

Audit the Audit

Videos from police body and car cams (embedded within the videos)

Data Links

DATA NAME

Description

911 calls

SOURCE NAME

911 Florida Raw Audio

Audio recordings of 911 calls to the Daytona, FL police department (embedded shortly into the podcasts)

Data Links

 

Private Collections

DATA NAME

Description

Ted Kaczynski papers, 1996-.

SOURCE NAME

University of Michigan

"Collection consists of three series: Correspondence, the bulk of the collection, which includes letters written to Kaczynski since his arrest in 1996; Publications, consisting of pamphlets, serials, and clippings sent to Kaczynski with a few added by archivists during processing; and Legal Documents, containing drafts of briefs, excluding any materials that fall under attorney-client privilege or are significant to the appeal process. Later additions include photographs and documents (some photocopies) from the FBI." See the website for access information.

Data Links

DATA NAME

Description

Serial Killer Archive

SOURCE NAME

serialkillermurderabilia.com

"The Rosetta Stone of Serial Killer Collections". This collection contains authentic artifacts, including letters, artwork, photos, recorded phone calls, and other documents and artifacts from a range of known serial killers. Access to data requires a fee to be negotiated with the collection holder. Use the 'contact' form at the bottom of the home page.

Data Links

 

Data Repositories

DATA NAME

Description

Inside the Courtroom

SOURCE NAME

Inside the Courtroom

A range of videos related to the court (e.g., witness interviews, interrogations, juror interviews, courtroom proceedings)

Data Links

DATA NAME

Description

Forensic Linguistics Databank (FoLD)

SOURCE NAME

Aston Institute for Forensic Linguistics

A range of forensic linguistic and legal language corpora and data collections

Data Links

DATA NAME

Description

Sources of Language and Law (SOULL)

SOURCE NAME

Universität Siegen, International Language and Law Association, and Heidelberger Arbeitskreis der Rechtslinguistik

A range of legal language documents and references

Data Links

DATA NAME

Description

Crime Vault

SOURCE NAME

Crime Vault

A range of videos documenting crimes (e.g., police interviews, social media live streams, confessions, witness interviews)

Data Links

DATA NAME

Description

The Smoking Gun

SOURCE NAME

TSG Industries

A range of documents from popular culture cases. (Note: finding data on this site requires considerable browsing.)

Data Links