
DATA COLLECTIONS
This page links to corpora and to data collections that can be transformed into corpora.
There are also links to other forensic linguistic and legal language repositories.
The resources fall into four groups:
- public corpora (e.g., the Threatening English Language (TEL) corpus);
- public data collections that can be transformed into corpora (e.g., the school shooter database);
- private collections that require permission to access (e.g., the matched sample parole board data);
- data repositories (e.g., Aston University's FoLD repository).
Public Forensic Linguistic & Legal Language Corpora
DATA NAME
Description
Malicious Forensic Text Corpus
SOURCE NAME
Andrea Nini
Approximately 100 malicious threatening texts. Metadata is included where known.
Data Links
DATA NAME
Description
Shneidman and Farberow Suicide Note Corpus
SOURCE NAME
Shneidman, E.S., & Farberow, N.L. (1957). Clues to Suicide. New York: McGraw-Hill Book Company
The original data from the 1957 Shneidman and Farberow study. The data contain a matched sample of 33 authentic letters and 33 inauthentic letters.
Data Links
DATA NAME
Description
Corpus of State Conventions on the Adoption of the Constitution (COSCAC)
SOURCE NAME
Brigham Young University
Over 1 million words from more than 650 texts. The texts contain debates from several state conventions on the adoption of the U.S. Constitution.
Data Links
DATA NAME
Description
Corpus of Early Statutes at Large (CESAL)
SOURCE NAME
Brigham Young University
Over 470,000 words from more than 480 texts. The corpus includes early laws passed by the U.S. Congress.
Data Links
DATA NAME
Description
New Threat Corpus
SOURCE NAME
Tammy Gales and Andrea Nini
This collection of threatening communications compiles over 300 publicly available texts from CTARC (the Communicated Threat Assessment Research Corpus, compiled by Tammy Gales), MFC (the Malicious Forensic Texts corpus, compiled by Andrea Nini), and the written texts from CoJO (the Corpus of Judicial Opinions, compiled by Julia Muschalik). Additional threatening texts come from ForensicLing.com (the forensic linguistic data site hosted by Tammy Gales and Dakota Wing). Metadata is supplied where known from the original case research.
Data Links
DATA NAME
Description
BYU-Corpus of Early Modern English (BYU-COEME)
SOURCE NAME
Brigham Young University
Over 1 billion words from more than 40,000 texts dating from 1475 to 1800. Texts are drawn from the Evans Bibliography, Early English Books Online, and Eighteenth Century Collections Online, among others.
Data Links
DATA NAME
Description
Corpus of Founding Era American English (COFEA)
SOURCE NAME
Brigham Young University
Over 130 million words from more than 125,000 texts spanning 1760-1799. Documents include those "from ordinary people of the day, the Founders, and legal sources."
Data Links
DATA NAME
Description
Corpus of Supreme Court Opinions of the United States (COSCO-US)
SOURCE NAME
Brigham Young University
Over 94 million words from more than 60,000 texts from U.S. Supreme Court Opinions published through 2017.
Data Links
DATA NAME
Description
Corpus of the Records of the Constitutional Convention (CORCC)
SOURCE NAME
Brigham Young University
Over 680,000 words from more than 800 texts documenting the records of the federal convention of 1787.
Data Links
DATA NAME
Description
Corpus of US Caselaw (CUSC)
SOURCE NAME
Brigham Young University
Over 4 million words from more than 8,000 texts, including court decisions published between 1760 and 1799.
Data Links
Public Data Collections for Corpus Compilation
DATA NAME
Description
U.S. Supreme Court Oral Arguments
SOURCE NAME
U.S. Supreme Court
Transcripts of the oral arguments on cases heard by the U.S. Supreme Court
Data Links
DATA NAME
Description
The ISIS Files
SOURCE NAME
George Washington University and the New York Times
"A collection of more than 15,000 pages of internal ISIS documents collected by New York Times investigative journalist and Program on Extremism fellow Rukmini Callimachi during embeds with the Iraqi army."
Data Links
DATA NAME
Description
Police Interrogation and Confession Videos
SOURCE NAME
Red Circle Interrogations and Confessions
This site hosts videos of police interrogations and confessions from a range of cases.
Data Links
DATA NAME
Description
Police Interrogation Videos
SOURCE NAME
r/interrogationvideos
A variety of videos of police interviews and interrogations
Data Links
DATA NAME
Description
School Shooters.info: Resources on School Shootings, Perpetrators, and Prevention
SOURCE NAME
Peter Langman, Ph.D.
"This site is a compendium of documents relating to a wide range of active shooter incidents in educational settings. The purpose of the site is to help prevent school shootings and to provide insight into the perpetrators of large-scale school violence."
Data Links
DATA NAME
Description
911 calls
SOURCE NAME
A Call for Help r/911Calls
Collections of 911 calls (e.g., disturbing, strange, celebrity)
Data Links
DATA NAME
Description
Live Trial Videos
SOURCE NAME
Law and Crime Trial Network
This site hosts a range of live trial videos. Many postings include multiple videos from the same case at various stages of the court proceedings. Videos are free; they are not transcribed.
Data Links
DATA NAME
Description
Piracy Trial Documents
SOURCE NAME
The Library of Congress: LAW
Trial documents from piracy cases pre-1923
Data Links
DATA NAME
Description
Serial Killer Court Transcripts
SOURCE NAME
Serial Killer Info
A range of legal documents from serial killer cases across the United States
Data Links
DATA NAME
Description
Fire and Police Videos
SOURCE NAME
FireandPoliceVideos.com
A collection of videos from police and firefighters
Data Links
DATA NAME
Description
Court Case Documents
SOURCE NAME
Legal Research Society
Court documents from landmark U.S. Supreme Court cases
Data Links
DATA NAME
Description
The Guantánamo Testimonials Project
SOURCE NAME
UC Davis Center for the Study of Human Rights in the Americas
Various testimonials from inmates, workers, and others affiliated with Guantánamo Bay, Cuba
Data Links
DATA NAME
Description
Canadian Court Opinions
SOURCE NAME
CanLII
Searchable database of Canadian court opinions
Data Links
DATA NAME
Description
Full length police interview videos
SOURCE NAME
Across the Table
A collection of full-length police interview videos
Data Links
DATA NAME
Description
Death Threats
SOURCE NAME
The Smoking Gun
Death threats (primarily against public figures)
Data Links
DATA NAME
Description
Death Row Final Statements: Texas
SOURCE NAME
Texas Department of Criminal Justice
Last statements from death row offenders prior to execution
Data Links
DATA NAME
Description
911 calls
SOURCE NAME
Los Angeles Police Department
Audio recordings of actual 911 calls received in Los Angeles
Data Links
DATA NAME
Description
Death Row Final Statements: California
SOURCE NAME
Evan Wagstaff, Los Angeles Times
Last words of the 13 men executed in California between 1978 and 2014
Data Links
DATA NAME
Description
Senate Judiciary Hearings
SOURCE NAME
Committee on the Judiciary
Videos and transcripts of U.S. Senate Judiciary hearings
Data Links
DATA NAME
Description
Airplane Black Box Last Word Recordings
SOURCE NAME
ListVerse
Audio recordings of last words from cockpit black boxes
Data Links
DATA NAME
Description
Death Row Final Statements: Bizarre
SOURCE NAME
BuzzFuse
A collection of bizarre last words from a range of criminals prior to execution
Data Links
DATA NAME
Description
Chat logs of Child Predators
SOURCE NAME
Perverted Justice
A collection of chat logs between child predators and Perverted Justice volunteers, who portrayed themselves as children. Note: as of 2019, the organization is ceasing operations, but the chat logs are currently still available on its website.
Data Links
DATA NAME
Description
Threats against Congress Members
SOURCE NAME
FBI Records: The Vault
Written threats against members of Congress
Data Links
DATA NAME
Description
Los Angeles Times Legal and Political Documents
SOURCE NAME
Los Angeles Times
A wide collection of legal and political documents that have been in the news.
Data Links
DATA NAME
Description
Airplane Black Box Last Words
SOURCE NAME
PlaneCrashInfo.com
Transcripts of last words from cockpit black box recordings
Data Links
DATA NAME
Description
Civil Court Videos
SOURCE NAME
Caught in Providence
Videos of civil court interactions
Data Links
DATA NAME
Description
Trial Transcript Collection
SOURCE NAME
John Jay College of Criminal Justice
Thousands of New York County criminal trial transcripts from 1883 to 1927
Data Links
DATA NAME
Description
Civil Rights Court Documents
SOURCE NAME
ACLU-PA
A range of legal documents (e.g., motions, briefs, complaints, decisions) from cases related to civil rights.
Data Links
DATA NAME
Description
Police-Citizen Videos
SOURCE NAME
Audit the Audit
Videos from police body and car cams (embedded within the videos)
Data Links
DATA NAME
Description
911 calls
SOURCE NAME
911 Florida Raw Audio
Audio recordings of 911 calls to the Daytona, FL police department (embedded shortly into the podcasts)
Data Links
Private Collections
DATA NAME
Description
Ted Kaczynski papers, 1996-.
SOURCE NAME
University of Michigan
"Collection consists of three series: Correspondence, the bulk of the collection, which includes letters written to Kaczynski since his arrest in 1996; Publications, consisting of pamphlets, serials, and clippings sent to Kaczynski with a few added by archivists during processing; and Legal Documents, containing drafts of briefs, excluding any materials that fall under attorney-client privilege or are significant to the appeal process. Later additions include photographs and documents (some photocopies) from the FBI." See the website for access information.
Data Links
DATA NAME
Description
Serial Killer Archive
SOURCE NAME
serialkillermurderabilia.com
"The Rosetta Stone of Serial Killer Collections". This collection contains authentic artifacts, including letters, artwork, photos, recorded phone calls, and other documents and artifacts from a range of known serial killers. Access to data requires a fee to be negotiated with the collection holder. Use the 'contact' form at the bottom of the home page.
Data Links
Data Repositories
DATA NAME
Description
Inside the Courtroom
SOURCE NAME
Inside the Courtroom
A range of videos related to the court (e.g., witness interviews, interrogations, juror interviews, courtroom proceedings)
Data Links
DATA NAME
Description
Forensic Linguistics Databank (FoLD)
SOURCE NAME
Aston Institute for Forensic Linguistics
A range of forensic linguistic and legal language corpora and data collections
Data Links
DATA NAME
Description
Sources of Language and Law (SOULL)
SOURCE NAME
Universität Siegen, International Language and Law Association, and Heidelberger Arbeitskreis der Rechtslinguistik
A range of legal language documents and references
Data Links
DATA NAME
Description
Crime Vault
SOURCE NAME
Crime Vault
A range of videos documenting crimes (e.g., police interviews, social media live streams, confessions, witness interviews)
Data Links
DATA NAME
Description
The Smoking Gun
SOURCE NAME
TSG Industries
A range of documents from popular culture cases. (Note: finding data on this site requires a lot of browsing.)
Data Links