The BYU Law corpora (updated)

[Cross-posted at Language Log.]

I’d imagine that most people who’ve been actively involved with corpus linguistics are familiar with the BYU corpora—a collection of web-accessible corpora created by Brigham Young University linguistics professor Mark Davies. These corpora (and BYU’s corpus-linguistics program more generally) have played an essential part in the development of what I’ll call the corpus-linguistic turn in legal interpretation. The BYU corpora served as my entry-point into corpus linguistics, and they have provided the corpus data that has been used in most of the law-and-corpus-linguistics work that has been done to date. And beyond that, the BYU Law School has played an enormous role, in a variety of ways, in Law and Corpus Linguistics becoming a thing.

One of the things that the law school has been doing has been happening largely behind the scenes. For the past two or three years, people there have been developing the Corpus of Founding Era American English (COFEA), a historical corpus that is intended as a resource for studying language usage in the time leading up to the drafting and ratification of the U.S. Constitution. At this year’s conference on law and corpus linguistics (the third such conference, all of them hosted by the BYU Law School), we were given a preview of COFEA. And via a tweet by the law school’s dean, Gordon Smith, I’ve now learned that a beta version of COFEA is up and available for public playing-around-with, as are beta versions of two other corpora: the Corpus of Early Modern English and the Corpus of Supreme Court of the United States.

All three corpora are hosted on a new website titled BYU Law Corpus Linguistics, the URL for which seems familiar to me, for some reason that I can’t put my finger on. Be that as it may, here’s how the website describes the three corpora:

Corpus of Founding Era American English (COFEA)
95,133 texts
138,892,619 words
The Corpus of Founding Era American English covers the time period starting with the reign of King George III, and ending with the death of George Washington (1760-1799). COFEA contains documents from ordinary people of the day, the Founders, and legal sources, including letters, diaries, newspapers, non-fiction books, fiction, sermons, speeches, debates, legal cases, and other legal materials. Three sources have provided the majority of texts, the National Archive Founders Online; William S. Hein & Co., HeinOnline; Text Creation Partnership (TCP) Evans Bibliography (University of Michigan).

Corpus of Early Modern English
40,300 texts
1,283,475,411 words
The Corpus of Early Modern English cover texts from 1475–1800 that were included in the Evans Bibliography, the Early English Books Online (EBO), Eighteenth Century Collections Online (ECCO) corrected by the Text Creation Partnership (TCP) Evans Bibliography (University of Michigan).

Corpus of Supreme Court of the United States
31,682 texts
140,853,673 words
The Corpus of the United States Supreme Court includes all opinions in the United States Reports and opinions published by the Supreme Court through the 2017 term.

All three corpora sport a new user interface that is designed to be more lawyer-friendly than the interface for the existing BYU corpora. My initial impression is that the new interface looks like it will be a step in the right direction, with the ways to invoke the site’s functionality being more immediately visible or at least more easily find-able than is the case with the older interface. (User-interface developers undoubtedly have at least 100 words for the kind of thing that I’m talking about. Unfortunately, I don’t know any of them.)

Nevertheless, the interface is definitely still at the beta stage. It’s not self-explanatory, and if there are any help files, I couldn’t find them. In order to take COFEA out for a test-drive, I searched for instances of the string been increased, and although I learned that there were 115 instances of that string in the corpus, I couldn’t figure out how to display any of them. Every time I clicked on what I thought was an appropriate place, all I got was a lot of blankness. And at this point it becomes relevant for me to note that in addition to there apparently being no help files yet, there is no link for reporting issues to the developers. But I am sure that these kinks will be worked out. I will make some inquiries and will report back on what I learn.

Finally, I want to note that although the motivation behind the development of these corpora has been to create tools for dealing with legal issues, it may turn out that the Corpus of Early Modern English, with its wide temporal coverage (325 years as compared to COFEA’s 39), will be of interest more to historical linguists than to law-and-linguists. However, whether that turns out to be the case could depend on a variety of factors, so we’ll have to await the historical linguists’ verdict.


I’ve been informed that the developers are aware of the issue I’ve raised and are working on it.

In the meantime, I’ve figured out how to get the results of my search displayed. I don’t know whether it’s because a fix has been implemented, or because I just stumbled on what needs to be done. So here’s how to display results for a two-word string of the form word1 word2.

  1. At the top of the screen, click on “Matches.”
  2. Enter word1 in the Query box.
  3. Immediately above the Query box, where there are choices “Matches,” “Sections,” and “Collocates,” click on “Collocates.”
  4. In the Collocate box (to the right of the Query box) delete the asterisk and enter word2.
  5. To the right of the Collocate box are boxes labeled “Left” and “Right.” In the Left box, set the value to 0. In the Right box, set the value to 2—even though you’re only interested in word2 when it immediately follows word1. For some reason, if you set the value to 1, you won’t be able to see the results.
  6. Hit Enter. You will then get a line showing the number of hits.
  7. Click on that line.

That’s what worked for me. Hopefully it will work for you, too.
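For anyone wondering what a collocate search with those Left and Right settings is doing conceptually, here is a rough sketch in Python. It is not the BYU Law site's code, and I have no knowledge of how the site implements its search; the function name `collocate_hits`, the simple tokenizer, and the sample sentence are all made up for illustration. The assumption is the usual corpus-linguistic one: a collocate counts as a hit if it falls within a window of so many tokens to the left or right of the query word.

```python
import re


def collocate_hits(text, node, collocate, left=0, right=2):
    """Return (node_index, collocate_index) pairs where `collocate` occurs
    within `left` tokens before or `right` tokens after `node`.

    Hypothetical helper for illustration only; the tokenizer is a crude
    lowercase word splitter, not whatever the real corpus uses.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i and tokens[j] == collocate:
                hits.append((i, j))
    return hits


# Invented sample text, not taken from COFEA.
sample = "The duties have been increased. They had been greatly increased since."

# Left = 0, Right = 2, as in the steps above.
hits = collocate_hits(sample, "been", "increased", left=0, right=2)

# Keep only the hits where word2 immediately follows word1.
adjacent = [(i, j) for i, j in hits if j - i == 1]
```

If the site's window works the way this sketch assumes, a Right value of 2 could also return hits in which one word intervenes between word1 and word2 (as in "been greatly increased"), so it may be worth scanning the displayed results for those.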
