Textual Corpora Workshop 2014: A Review


On October 17-18, 2014, the Digital Islamic Humanities Project at Brown University organized a workshop on “Textual Corpora” in the Digital Scholarship Lab at Rockefeller Library. We had around forty participants from various universities and institutions from around the world, and we spent a couple of days engaged in fruitful discussions and hands-on tutorial sessions. One of the participants, Dr. Sarah Pearce (NYU), wrote up a review of the event for her blog, Meshalim. She has kindly allowed us to reprint it here.


Textual Corpora and the Digital Islamic Humanities (day 1) | by S. J. Pearce

Normally I would not be totally comfortable posting a conference report like this because it’s basically reproducing others’ intellectual work, presented orally and possibly provisionally, in written and disseminated form. However, because video of the event is going to be posted online along with all of the PowerPoint slides, these presentations were made with an awareness that they were going to be disseminated online and so a brief digest does not strike me as a problem. With that said, what I am writing here represents the work of others, which I will cite appropriately.

The workshop convener, Elias Muhanna, began by introducing what he called “digital tools with little to no learning curve.” These included text databases such as quran.com, altafsir.com, and the aggregate dictionary page al-mawrid reader (which is apparently totally and completely in violation of every copyright law on the books). Then there were sources for collections of unsearchable PDFs (waqfeya.com, mostafa.com, and alhadeeth.com; and hathitrust.org, which is pretty much the only one on this list that isn’t violating copyright law in some way) and sources for searchable digital libraries of classical Arabic texts (almeshkat.net, said.net, alwaraq.net, shiaonlinelibrary.com, and noorlib.ir; and al-jami’ al-kabīr, which is a database that has the special feature of mostly not functioning and mostly not being installable). Arabicorpus.byu.edu as well as various databases used by computational linguists are the best bets for modernists looking for things.

With respect to all of these, the question of how the texts are entered is a bit of a mystery. Some are rekeyed from editions, some are scanned as PDFs, and some are scanned with OCR; and even though OCR can be up to 99% accurate, that still translates into a typo every hundred characters, which is not ideal. Regardless of the technology used to upload these texts to these databases, copyright law came up again and again as an issue surrounding the use of these tools, and the current state of play appears to be somewhere between the wild west and don’t ask don’t tell.
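
To put the 99% figure in perspective, here is a back-of-the-envelope sketch in Python. The page length of 2,000 characters and the five-character word are illustrative assumptions of mine, not numbers from the workshop, and the second function assumes that errors fall independently on each character, which real OCR errors do not.

```python
def expected_errors(num_chars: int, char_accuracy: float) -> float:
    """Expected number of misread characters in a text of num_chars characters."""
    return num_chars * (1.0 - char_accuracy)


def clean_word_probability(word_length: int, char_accuracy: float) -> float:
    """Chance that every character of a word is read correctly,
    assuming (unrealistically) that errors are independent."""
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Hypothetical page of 2,000 characters at 99% character accuracy.
    print(round(expected_errors(2000, 0.99)))          # about 20 typos per page
    print(round(clean_word_probability(5, 0.99), 3))   # about 0.951
```

On those assumptions, roughly one five-letter word in twenty would contain at least one error, which matters a great deal for exact-match searching.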

A few sample searches were run to demonstrate what they might be used for: occurrences of the phrase allahu a’lam to gauge epistemological humility (I’m not totally sure about the reliability of the one to gauge the other, but never mind) and an Arabic proverb I did not know previously about mongoose farts (fasā baynahum al-ẓaribān) to illustrate a search to determine how a saying might be used, whether purely for grammatical or sociologically illustrative purposes (as this one apparently is) or whether it occurs within a discourse.


These text collections were a segue into Maxim Romanov’s presentation on the difference between text collections and text corpora and the desiderata for the creation of the latter.

Text collections are what already exist. They are characterized by the following traits:

  • Reproduce books (technically DBs but don’t function as DBs)
  • Book/Source is divided into meaningless units of data, such as “pages”
  • Limited, ideologically biased (shamela is open but BOK format is obscure)
  • Not customizable  (users cannot add content)
  • Limited search options
  • Search results are impossible to handle (have to have your own system on top of the library system)
  • No advanced way for analyzing results (no graphing, mapping)
  • No ability to update metadata

Textual corpora are what we need to be creating. They are characterized by the following traits:

  • Adapted for research purposes (open organization format)
  • Source is divided into meaningful units of data (such as “biographies” for a biographical collection, “events” for chronicles, “hadith reports” for hadith collections)
  • Open and fully customizable
  • Complex searches (with filtering options)
  • Results can be saved for later processing (multiple versions, annotations, links)
  • Visualizations of results
  • Easy to update metadata
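
A minimal sketch of what a few of these traits might look like in practice, written in Python. Everything here is hypothetical: the `Unit` class, the `search` function, the field names, and the toy entries are my own illustrations, not the design of any corpus presented at the workshop.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One meaningful unit of a source, e.g. a single biography."""
    source: str   # title of the work the unit comes from
    kind: str     # "biography", "event", "hadith report", ...
    text: str
    metadata: dict = field(default_factory=dict)   # editable by the user


def search(corpus, term, kind=None, **filters):
    """Return units containing `term`, optionally filtered by kind and metadata."""
    hits = []
    for unit in corpus:
        if term not in unit.text:
            continue
        if kind is not None and unit.kind != kind:
            continue
        if any(unit.metadata.get(k) != v for k, v in filters.items()):
            continue
        hits.append(unit)
    return hits


# Invented toy data, for illustration only.
corpus = [
    Unit("Toy biographical dictionary", "biography",
         "A jurist who settled in Baghdad.", {"death_year": 335}),
    Unit("Toy biographical dictionary", "biography",
         "A merchant who traded in Damascus.", {"death_year": 410}),
    Unit("Toy chronicle", "event",
         "A flood struck Baghdad.", {"year": 335}),
]

results = search(corpus, "Baghdad", kind="biography")
results[0].metadata["place"] = "Baghdad"   # metadata can be updated in place
```

Because each hit is a whole biography or event rather than a page, the results can be kept, annotated, and counted without a second system layered on top.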


Elli Mylonas gave an introduction to the idea of textual markup, which was the piece that was the most general and most theoretical of the day. She raised a number of interesting issues.

One was the question of how archival data can be, and she made the case for XML files being not quite as good as acid-free paper in a box, but basically the digital equivalent. It’s a standard language and it is text-based and therefore should be readable on future technologies, whatever they might be.

She then made the case that text markup is a form of textual interpretation; and when somebody asked a question that was predicated on his being okay with the status quo in which some people do programming and some people analyze texts, she replied that marking up a text in XML really forces you to think more carefully about both the structure and the content of the text; it’s not an either-or proposition. This is not a case where science is trying to impose itself upon the humanities (ahem, quantum medievalism) but rather one where it supplements them methodologically.
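
To make the point about interpretation concrete, here is a small sketch in Python using the standard library’s XML parser. The sentence is invented, and the element and attribute names are only loosely modelled on TEI conventions; this is a hypothetical, simplified scheme, not one shown at the workshop. Note that someone had to decide that “Basra” is a place and that “335” is a date before the text could be tagged at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup of an invented sentence.
sample = """
<entry type="biography">
  <persName>Ahmad</persName> died in <placeName>Basra</placeName>
  in the year <date when="0335">335</date>.
</entry>
"""

entry = ET.fromstring(sample.strip())

# Once the interpretive decisions are encoded, they can be queried.
places = [el.text for el in entry.iter("placeName")]
years = [int(el.get("when")) for el in entry.iter("date")]

# The plain text is still recoverable by discarding the tags.
plain_text = " ".join("".join(entry.itertext()).split())

print(places, years)
print(plain_text)
```

The file itself remains ordinary text, which is the archival argument above: it can be read with or without software that understands the tags.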

One important distinction is between markup and markdown. The latter is a more descriptive, plain-text rendering of, well, text, that allows it to be more easily exported into a variety of schema. (I think?) Markdown is less rule-bound, more abstract, and more idiosyncratic, which means that it is less labor intensive but potentially less-widely useful in the absence of a really robust tagging scheme.

She showed a few examples of marked-up texts, including the Shelley-Godwin archive, which has Mary Shelley’s notebooks marked up to show where her handwriting occurs and where her husband Percy’s does, as a way of trying to put to rest the question of who really wrote Frankenstein; a Brown project on the paleographic inscriptions of the Levant that, she told us, provokes an argument between every new graduate student worker and the PI over how to classify the religious assignation of the inscriptions (see? interpretation!); and the Old Bailey Online, in which you can search court records by crime and punishment.

The one difficulty for Islamic Studies is that XML comes out of the European printing and typesetting tradition and is therefore not natively suited to Arabic and other right-to-left languages.


Maxim Romanov then gave a practical introduction to one element of text markup, namely regular expressions, a way of creating customized searches within digitized text. These are two web sites with some basic instructions and options to practice:

One example of a regular expression is this. If I wanted to find all the possible transliterations of قذافي (the surname of the deposed Libyan dictator) in a given text in a searchable text corpus, I would type [QGK]a(dh?)+a{1,2}f(i|y) as my search term. This would look for any word that began with Q, G, or K, then had an a, then had a d and possibly an h and a possible repetition of that combination, then one or two a’s, an f, and then either an i or a y. There was much practicing and many exercises and that’s really all I have to say about that. (Except that this cartoon and this one suddenly make a lot more sense.)
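
The pattern above can be tried out in Python with the standard `re` module. The pattern is copied unchanged; the function name and the sample sentence are mine, for illustration only.

```python
import re

# The pattern from the text, unchanged.
PATTERN = re.compile(r"[QGK]a(dh?)+a{1,2}f(i|y)")


def find_spellings(text):
    """Return every substring of `text` that the pattern matches."""
    return [m.group(0) for m in PATTERN.finditer(text)]


sample = "Qaddafi, Gaddafi, Kadhafi, Qadhdhafy and Khadafy"
print(find_spellings(sample))
# ['Qaddafi', 'Gaddafi', 'Kadhafi', 'Qadhdhafy']
```

Two things are worth noticing. The spelling “Khadafy” slips through, because the pattern requires an a immediately after the first letter, so a real search would probably need a few more alternatives. And as written the pattern matches anywhere in a string, not only at the start of a word; adding `\b` at the front is one way to anchor it to a word boundary.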


Textual Corpora & the Digital Islamic Humanities (day 2) | by S.J. Pearce


Following up on the Qaddafi-hunt by regular expression of day 1 of the workshop on digital Islamic humanities, here is Maxim Romanov, demonstrating a regular expression to search for terms that describe years within a text corpus that hasn’t been subjected to Buckwalter transliteration but is rather in the original Arabic script.
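
I did not copy down the expression he used, so what follows is only a guess at the general idea, sketched in Python: look for the word سنة (“year”) and capture the number words that follow it, which in Arabic are chained together with و (“and”). The pattern, the function name, and the sample sentence are all my own assumptions.

```python
import re

# Hypothetical pattern: the word for "year", then one word, then any further
# words that begin with the conjunction "wa-" ("and").
YEAR_PHRASE = re.compile(r"سنة\s+(\S+(?:\s+و\S+)*)")


def find_year_phrases(text):
    """Return the number-word phrase that follows each occurrence of 'year'."""
    return YEAR_PHRASE.findall(text)


# Invented sample: "he died in the year five and thirty and three hundred
# (i.e. 335) in Baghdad".
sample = "توفي سنة خمس وثلاثين وثلاثمائة ببغداد"
print(find_year_phrases(sample))
```

A pattern this loose will produce false positives, for example when the word after the year happens to begin with و for some other reason, or when سنة occurs inside a longer word. A serious version would list the actual number words instead.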

Three major topics were covered on day 2.

Scripting. Maxim Romanov covered a basic overview of/introduction to scripting and the automation of repetitive tasks, such as downloading thousands of things from the web, converting text formats, and conducting complex searches (by regular expressions).

The preferred scripting language amongst this crowd of presenters was Python, in no small measure because it is named after Monty Python, but also because it is very straightforward. Maxim illustrated some of the possibilities with Python by walking us through one of his research questions, which was about the chronological coverage of certain historical sources; in other words, how much attention certain time periods get versus others. He demonstrated the methods he used for capturing date information from a really large amount of text by automating specific queries with a script, and then processing the data so it could be output in an easily readable graph. Conference organizer Elias Muhanna emphasized that this was an example of how digital and computational methodologies are not replacements for analysis but rather demand quite a lot of good, old-fashioned philological hard-nosedness, while offering different tools for exploring and expressing it. This is a way of simply speeding up and scaling up what we are already doing.
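
The general shape of that workflow (capture dates with a regular expression, count them by period, draw the counts) can be sketched in a few lines of Python. This is not Maxim’s script. It assumes, purely for illustration, that dates appear in the text as “d. 335”, and it uses invented entries and a plain-text bar chart in place of a real graph.

```python
import re
from collections import Counter

# Hypothetical convention: death dates appear as "d. 335" in the text.
DEATH_YEAR = re.compile(r"\bd\.\s*(\d{1,4})\b")


def coverage_by_period(texts, period=50):
    """Count extracted years per `period`-year bin, keyed by the bin's first year."""
    counts = Counter()
    for text in texts:
        for year in DEATH_YEAR.findall(text):
            counts[(int(year) // period) * period] += 1
    return counts


def text_histogram(counts, period=50):
    """Render the counts as one line of '#' marks per period."""
    lines = []
    for start in sorted(counts):
        lines.append(f"{start:>4}-{start + period - 1:<4} {'#' * counts[start]}")
    return "\n".join(lines)


# Invented toy entries, for illustration only.
texts = [
    "A jurist (d. 335).",
    "A poet (d. 310) and his student (d. 362).",
    "A judge (d. 412).",
]
counts = coverage_by_period(texts)
print(text_histogram(counts))
```

The philological work is all in the first step: deciding what counts as a date in a given source and writing a pattern that catches it without catching everything else.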

We then had a brief presentation from one of the researchers from the Early Islamic Empire at Work project, who showed us how his team is creating search tools for their corpus, tools which will be made publicly available in December as the Jedli toolbox, which will include various types of color-codeable, checklist- and keyword-based searching. One of the major takeaways from this presentation was the idea that by being able to edit open-source code and program things, it’s possible to build upon earlier existing work to make things do specifically what any given researcher wants them to.
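
To give a sense of what checklist-based searching means, here is a minimal sketch in Python. It is emphatically not Jedli’s code or interface, which I have not seen; the function name, parameters, and toy texts are hypothetical.

```python
import re


def checklist_search(texts, checklist, context=20):
    """For each text, list which checklist terms occur, with surrounding context.

    `texts` maps a name to its full text; `checklist` is a list of search terms.
    """
    report = {}
    for name, text in texts.items():
        hits = []
        for term in checklist:
            for m in re.finditer(re.escape(term), text):
                start = max(0, m.start() - context)
                end = min(len(text), m.end() + context)
                hits.append((term, text[start:end]))
        if hits:
            report[name] = hits
    return report


# Invented toy data, for illustration only.
texts = {
    "a": "The governor of Basra sent a letter.",
    "b": "Nothing relevant here.",
}
report = checklist_search(texts, ["Basra", "Kufa"])
print(report)
```

Because a script like this is just a text file, anyone can adapt it, for instance by swapping the literal terms for regular expressions, which is the point about building on existing work.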

This raised the question of citation (a question I asked), and the comments made in response made it seem like a total wild west. One of the participants with quite a lot of programming experience said that citing someone else’s code would be like citing a recipe when you make dinner for friends, and other participants and presenters said that if you were using something really extraordinary from somebody else’s project, you might mention that. However, Elli Mylonas disagreed, arguing that correct citation of existing work is one of the ways that the digital humanities can gain traction within the academy as legitimate scholarship that counts at moments like tenure review rather than languishing, in the same manner as the catalogues and indices that we all rely upon but don’t view as having been built by proper “scholars.” I would tend to think she’s right.

Timelines. Then Elli Mylonas introduced us to various timeline programs. Like yesterday, her presentation was really grounded in the theory and the wherefores and the big issues behind DH. So she started out with the assertion that “timelines lie,” that is, that any kind of timeline looks objective but is, in fact, encoding a historical argument made by the researcher who compiled and presented it. (I think this actually has an interesting parallel with narrative, footnoteless or minimally footnoted writing such as A Mediterranean Society (which has loads of footnotes but leaves a lot out, too), which in effect encodes a massive amount of historical argumentation within something that simply reads as text.)

Important things to look for in choosing a timeline program are: the ability to represent spans of time rather than just single points, the exportability of data, and the ability of the program to handle negative dates (again, encoding an argument about the notions of temporality and the potentiality of time). A free timeline-generating app is Timeline JS, which works with Google spreadsheets. That is the one that we tested out as a group. We also looked at Tiki-Toki, which is gorgeous but requires a paid subscription. (Definitely worth looking into whether one’s institution has an institutional subscription.)
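
Those three criteria are easy to see in a small data sketch. The Python below is a hypothetical illustration, not the format any of these programs expects: the class, the column names, and the choice of entries are mine. It stores years as plain integers because Python’s own date type cannot represent years before 1, which is a small example of a tool encoding an assumption about time.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimelineEntry:
    label: str
    start_year: int                  # negative numbers stand for years BCE
    end_year: Optional[int] = None   # None marks a single point, not a span


def to_csv(entries) -> str:
    """Serialize entries to CSV text; the column names are hypothetical."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["label", "start_year", "end_year"])
    for e in entries:
        writer.writerow([e.label, e.start_year,
                         "" if e.end_year is None else e.end_year])
    return buffer.getvalue()


entries = [
    TimelineEntry("Reign of Augustus", -27, 14),             # a span with a negative date
    TimelineEntry("Abbasid caliphate in Baghdad", 750, 1258),
    TimelineEntry("Textual Corpora workshop", 2014),          # a single point
]
print(to_csv(entries))
```

Keeping the data in a plain, exportable form like this means the timeline can be moved between programs rather than locked into one.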

Maxim Romanov suggested that this might be a useful tool for something like revisiting the chronology in Marshall Hodgson’s Venture of Islam.

Finally, we looked at Orbis, Stanford’s geospatial model of the Roman world, which looks at travel through the Roman empire based upon time and cost. This is a feasible project because of the wealth of data and the relative uniformity of roads and resources and prices within the Roman empire, and it would have to be modified to deal with most of Islamicate history. (Which brings to mind the question of the extent to which Genizah sources as a fairly coherent(ish) corpus can be used to extrapolate for the rest of the Islamic world rather than just the Jewish communities within it; if yes, that might be a feasible data set for this kind of processing. Really not my problem, though.) This was a perfect segue into the final topic of the day.

Geographic information systems. This piece was presented by Bruce Boucek, who is a social sciences librarian at Brown trained as a cartographer. He gave an overview of data sources and potential questions and problems, and then Maxim Romanov gave a final demonstration about how geographic imaging can be used to interrogate medieval geographic descriptions and maps.

Image courtesy of S.J. Pearce

By aligning the latitude and longitude information from a modern map to the cities marked on a medieval one (or simply by making L&L conform on a less contemporary modern map of unknown projection or questionable scale) and observing the distortion of the medieval map when it was made to conform to the modern one, we began to see what kind of view of the world the mapmaker, in this case Muqaddasi, held. What was he making closer and farther away than it really was? What kind of schematic does that yield?
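
The demonstration itself warped a map image in GIS software. As a much cruder stand-in for the same question, the Python sketch below compares distances between pairs of cities on a map with the real distances between them, after scaling both sets so that their averages match. The modern coordinates are approximate, and the “medieval” positions are invented numbers in arbitrary units, not measurements from Muqaddasi’s map.

```python
import math
from itertools import combinations


def planar_distance(p, q):
    """Straight-line distance between two (x, y) points on a flat map."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def great_circle_km(p, q):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pairwise(points, distance):
    """Distance for every pair of named points, keyed by the sorted name pair."""
    return {(a, b): distance(points[a], points[b])
            for a, b in combinations(sorted(points), 2)}


def relative_distortion(map_dist, real_dist):
    """Ratio of map distance to real distance after scaling both to mean 1.

    Above 1: the map puts the pair farther apart; below 1: closer together.
    """
    map_mean = sum(map_dist.values()) / len(map_dist)
    real_mean = sum(real_dist.values()) / len(real_dist)
    return {pair: (map_dist[pair] / map_mean) / (real_dist[pair] / real_mean)
            for pair in map_dist}


# Approximate modern coordinates (lat, lon in degrees).
modern = {"Baghdad": (33.3, 44.4), "Cairo": (30.0, 31.2), "Damascus": (33.5, 36.3)}
# Invented positions on a hypothetical medieval map, in arbitrary units.
medieval = {"Baghdad": (9.0, 5.0), "Cairo": (1.0, 3.0), "Damascus": (4.0, 5.0)}

distortion = relative_distortion(pairwise(medieval, planar_distance),
                                 pairwise(modern, great_circle_km))
for pair, ratio in sorted(distortion.items()):
    print(pair, round(ratio, 2))
```

A ratio well above or below 1 for a given pair is a prompt to go back to the text and ask why the mapmaker stretched or compressed that part of the world.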

And that’s that. Video and a link library should be up online at the workshop web site, and one of my colleagues storified all of the tweets from the conference. I’ll probably write another post or two in the coming week reflecting on how I might begin to start using some of these tools and methods as I finish up the book and start work on a second project.