
DATA COLLECTIONS

This page links to corpora and to data collections that can be transformed into corpora.

There are also links to other forensic linguistic and legal language repositories.

 

Public Forensic Linguistic & Legal Language Corpora 

 

DATA NAME

Description

Corpus of Spoken Threats (CoST)

SOURCE NAME

Coming soon

Data Links

DATA NAME

Description

Malicious Forensic Text Corpus

SOURCE NAME

Approximately 100 malicious threatening texts. Metadata is included where known.

Data Links

Andrea Nini

DATA NAME

Description

Corpus of Early Statutes at Large (CESAL)

SOURCE NAME

Over 470,000 words from more than 480 texts. The corpus includes early laws passed by the U.S. Congress.

Data Links

Brigham Young University

DATA NAME

Description

Corpus of State Conventions on the Adoption of the Constitution (COSCAC)

SOURCE NAME

Over 1 million words from more than 650 texts. The texts contain debates from several state conventions on the adoption of the U.S. Constitution.

Data Links

Brigham Young University

DATA NAME

Description

Corpus of Supreme Court Opinions of the United States (COSCO-US)

SOURCE NAME

Over 94 million words from more than 60,000 texts from U.S. Supreme Court Opinions published through 2017.

Data Links

Brigham Young University

DATA NAME

Description

Corpus of the Records of the Constitutional Convention (CORCC)

SOURCE NAME

Over 680,000 words from more than 800 texts documenting the records of the federal convention of 1787.

Data Links

Brigham Young University

DATA NAME

Description

Corpus of Founding Era American English (COFEA)

SOURCE NAME

Over 130 million words from more than 125,000 texts spanning 1760-1799. Documents include those "from ordinary people of the day, the Founders, and legal sources".

Data Links

Brigham Young University

DATA NAME

Description

BYU-Corpus of Early Modern English (BYU-COEME)

SOURCE NAME

Over 1 billion words from more than 40,000 texts from 1475-1800. Texts are from the Evans Bibliography, Early English Books Online, and Eighteenth Century Collections Online, among others.

Data Links

Brigham Young University

DATA NAME

Description

Corpus of US Caselaw (CUSC)

SOURCE NAME

Over 4 million words from more than 8,000 texts, which include court decisions published between 1760 and 1799.

Data Links

Brigham Young University

DATA NAME

Description

Shneidman and Farberow Suicide Note Corpus

SOURCE NAME

The original data from the 1957 Shneidman and Farberow study. The data contain a matched sample of 33 authentic letters and 33 inauthentic letters.

Data Links

Shneidman, E.S., & Farberow, N.L. (1957). Clues to Suicide. New York: McGraw-Hill Book Company

Coming Soon!

DATA NAME

Description

Threatening English Language (TEL) Corpus

SOURCE NAME

This collection of threatening communications is a compilation of over 300 publicly available texts from CTARC (the Communicated Threat Assessment Research Corpus, compiled by Tammy Gales), MFC (the Malicious Forensic Texts corpus, compiled by Andrea Nini), and the written texts from CoJO (the Corpus of Judicial Opinions, compiled by Julia Muschalik). Additional threatening texts come from ForensicLing.com (the forensic linguistic data site hosted by Tammy Gales and Dakota Wing). Metadata is supplied where known from the original case research.

Data Links

Tammy Gales and Andrea Nini

Coming Soon!

DATA NAME

Description

The Enron Email Dataset

SOURCE NAME

Over 500,000 emails from 150 employees from the Enron Corporation (acquired by the Federal Energy Regulatory Commission during its investigation of Enron's collapse).

Data Links

William W. Cohen, MLD, CMU

 

Public Data Collections for Corpus Compilation

DATA NAME

Description

62-B District Court City of Kentwood

SOURCE NAME

This is the official YouTube Channel of the 62-B District Court in Kentwood, Michigan.

Data Links

62-B District Court City of Kentwood

DATA NAME

Description

911 calls

SOURCE NAME

Audio recordings of 911 calls to the Daytona, FL, police department (the calls are embedded shortly into each podcast)

Data Links

911 Florida Raw Audio

DATA NAME

Description

911 calls

SOURCE NAME

Collections of 911 calls (e.g., disturbing, strange, celebrity)

Data Links

A Call for Help r/911Calls

DATA NAME

Description

Civil Rights Court Documents

SOURCE NAME

A range of legal documents (e.g., motions, briefs, complaints, decisions) from cases related to civil rights.

Data Links

ACLU-PA

DATA NAME

Description

Full length police interview videos

SOURCE NAME

A collection of full-length police interview videos

Data Links

Across the Table

DATA NAME

Description

Police-Citizen Videos

SOURCE NAME

Videos from police body and car cams (embedded within the videos)

Data Links

Audit the Audit

DATA NAME

Description

Death Row Final Statements: Notorious killers

SOURCE NAME

A collection of last words from a range of notorious killers prior to execution

Data Links

Brie Stimson, Fox News

DATA NAME

Description

Death Row Final Statements: Bizarre

SOURCE NAME

A collection of bizarre last words from a range of criminals prior to execution

Data Links

BuzzFuse

DATA NAME

Description

Cockpit Voice Recorder Database

SOURCE NAME

Transcripts of last words from cockpit black box recordings (1962-2019)

Data Links

CVR Database

DATA NAME

Description

Canadian Court Opinions

SOURCE NAME

Searchable database of Canadian court opinions

Data Links

CanLII

DATA NAME

Description

Civil Court Videos

SOURCE NAME

Videos of civil court interactions

Data Links

Caught in Providence

DATA NAME

Description

Senate Judiciary Hearings

SOURCE NAME

Videos and transcripts of U.S. Senate Judiciary hearings

Data Links

Committee on the Judiciary

DATA NAME

Description

Criminal Words

SOURCE NAME

"Case Summaries, Interrogation Transcripts, 911 Transcripts, and More!"

Data Links

Criminal Words

DATA NAME

Description

US Securities & Exchange Commission

SOURCE NAME

"EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system, is the primary system for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940.

Containing millions of company and individual filings, EDGAR benefits investors, corporations, and the U.S. economy overall by increasing the efficiency, transparency, and fairness of the securities markets. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average."

Data Links

EDGAR

DATA NAME

Description

Death Row Final Statements: California

SOURCE NAME

Last words of the 13 men executed in California between 1978 and 2014

Data Links

Evan Wagstaff, Los Angeles Times

DATA NAME

Description

Threats against Congress Members

SOURCE NAME

Written threats against members of Congress

Data Links

FBI Records: The Vault

DATA NAME

Description

Fire and Police Videos

SOURCE NAME

A collection of videos from police and firefighters

Data Links

FireandPoliceVideos.com

DATA NAME

Description

Russian Troll Tweets

SOURCE NAME

Nearly three million tweets, posted between February 2012 and May 2018, from accounts associated with the Internet Research Agency, a Russian organization responsible for spreading disinformation and disrupting American politics

Data Links

FiveThirtyEight

DATA NAME

Description

The ISIS Files

SOURCE NAME

"A collection of more than 15,000 pages of internal ISIS documents collected by New York Times investigative journalist and Program on Extremism fellow Rukmini Callimachi during embeds with the Iraqi army."

Data Links

George Washington University and the New York Times

DATA NAME

Description

Trial Transcript Collection

SOURCE NAME

Thousands of New York County criminal trial transcripts from 1883-1927

Data Links

John Jay College of Criminal Justice

DATA NAME

Description

Audio/visual Law Publications (French-Canadian)

SOURCE NAME

"Jurivision is the audiovisual platform for law. Its objective is to make legal knowledge accessible through quality audiovisual content that highlights the work of legal researchers and practitioners in Canada and around the world. Jurivision is an initiative of the Faculty of Law of the University of Ottawa."

Data Links

Jurivision

DATA NAME

Description

Justia

SOURCE NAME

"Justia provides free case law, codes, regulations and legal information for lawyers, business, students and consumers world wide."

Data Links

Justia

DATA NAME

Description

US Supreme Court Center

SOURCE NAME

"Justia provides a searchable and browsable database of all US Supreme Court decisions since the 1790s, as well as links to related sources. We also sponsor the Oyez Project, a multimedia archive that contains audio of Supreme Court oral arguments."

Data Links

Justia

DATA NAME

Description

Live Trial Videos

SOURCE NAME

This site hosts a range of live trial videos. Many postings include multiple videos from the same case at various stages of the court proceedings. Videos are free; they are not transcribed.

Data Links

Law and Crime Trial Network

DATA NAME

Description

Court Case Documents

SOURCE NAME

Court documents from landmark U.S. Supreme Court cases

Data Links

Legal Research Society

DATA NAME

Description

Airplane Black Box Last Word Recordings

SOURCE NAME

Audio recordings of last words from cockpit black boxes

Data Links

ListVerse

DATA NAME

Description

911 calls

SOURCE NAME

Audio recordings of actual 911 calls received in Los Angeles

Data Links

Los Angeles Police Department

DATA NAME

Description

Los Angeles Times Legal and Political Documents

SOURCE NAME

A wide collection of legal and political documents that have been in the news.

Data Links

Los Angeles Times

DATA NAME

Description

Michigan Virtual Courtroom

SOURCE NAME

Michigan Courts' virtual courtrooms, where you can view court hearings

Data Links

Michigan Courts

DATA NAME

Description

Founders Online

SOURCE NAME

"Correspondence and other writings of seven major shapers of the United States: George Washington, Benjamin Franklin, John Adams (and family), Thomas Jefferson, Alexander Hamilton, John Jay, and James Madison. Over 184,000 searchable documents, fully annotated, from the authoritative Founding Fathers Papers projects."

Data Links

National Archives

 

Private Collections

DATA NAME

The Violence Project

Description

The Violence Project is "the most comprehensive database of mass shooters" with over 100 variables. "The Violence Project is a nonprofit, nonpartisan research center dedicated to reducing violence in society and using data and analysis to improve policy and practice."

SOURCE NAME

The Violence Project

Data Links

DATA NAME

Ted Kaczynski papers, 1996-.

Description

"Collection consists of three series: Correspondence, the bulk of the collection, which includes letters written to Kaczynski since his arrest in 1996; Publications, consisting of pamphlets, serials, and clippings sent to Kaczynski with a few added by archivists during processing; and Legal Documents, containing drafts of briefs, excluding any materials that fall under attorney-client privilege or are significant to the appeal process. Later additions include photographs and documents (some photocopies) from the FBI." See the website for access information.

SOURCE NAME

University of Michigan

Data Links

DATA NAME

Serial Killer Archive

Description

"The Rosetta Stone of Serial Killer Collections". This collection contains authentic artifacts, including letters, artwork, photos, recorded phone calls, and other documents and artifacts from a range of known serial killers. Access to data requires a fee to be negotiated with the collection holder. Use the 'contact' form at the bottom of the home page.

SOURCE NAME

serialkillermurderabilia.com

Data Links

 

Data Repositories

DATA NAME

Description

Inside the Courtroom

SOURCE NAME

A range of videos related to the court (e.g., witness interviews, interrogations, juror interviews, courtroom proceedings, etc.)

Data Links

Inside the Courtroom

DATA NAME

Description

Forensic Linguistics Databank (FoLD)

SOURCE NAME

A range of forensic linguistic and legal language corpora and data collections

Data Links

Aston Institute for Forensic Linguistics

DATA NAME

Description

Sources of Language and Law (SOULL)

SOURCE NAME

A range of legal language documents and references

Data Links

Universität Siegen, International Language and Law Association, and Heidelberger Arbeitskreis der Rechtslinguistik

DATA NAME

Description

Crime Vault

SOURCE NAME

A range of videos documenting crimes (e.g., police interviews, social media live streams, confessions, witness interviews)

Data Links

Crime Vault

DATA NAME

Description

The Smoking Gun

SOURCE NAME

A range of documents from popular culture cases. (Note: this site requires a lot of browsing to find data.)

Data Links

TSG Industries

DATA NAME

Description

The FBI Vault

SOURCE NAME

This repository contains 6,700 documents from FBI case files that have been released to the public. (See also the individual data page for specific case documents that have been identified from this site.)

Data Links

The Federal Bureau of Investigation