Distant Reading and the Islamic Archive

The Digital Islamic Humanities Project at Brown University is pleased to announce its third annual conference, titled “Distant Reading and the Islamic Archive,” which will be held on Friday, October 16, 2015. Speaker biographies, paper abstracts, and the conference program may be found here.

Please note that the event is fully subscribed. A live webcast will be available at this link, beginning at 8:45am on the day of the event. A recording of the proceedings will also be available on the website of the Digital Islamic Humanities Project (islamicDH.org).

Harvard CMES: Digital Scholarship Workshop in Islamic Studies

On Thursday, April 23, Prof. Elias Muhanna will lead a Digital Resources Workshop for Islamic Studies, with Professor Roy Mottahedeh and András Riedlmayer. This workshop introduces various digital tools and methodologies that may be of interest to scholars of Islamic civilization. The topics discussed will include online text repositories, social network analysis, mapping tools, text encoding, image research, and other areas. No prior experience is necessary to attend. This workshop is open to Harvard CMES & NELC graduate students, but space is limited. If you would like to attend, please RSVP to Liz Flanagan, elizabethflanagan@fas.harvard.edu, by Friday, April 17.

April 23
2:00-4:00 pm
CMES, Room 102
38 Kirkland Street

Elias Muhanna, Manning Assistant Professor of Comparative Literature, Brown University
qifanabki.com; islamichumanities.org

Roy P. Mottahedeh, Gurney Professor of History, Harvard University

András Riedlmayer, Bibliographer in Islamic Art and Architecture, AKPIA Documentation Center, Fine Arts Library, Harvard University
Documentation Center of the Aga Khan Program
Harvard Library Guide to Islamic Art

Call for Papers: Distant Reading and the Islamic Archive

Each year, the number of digitized books, inscriptions, images, documents, and other artifacts from the Islamic world continues to grow. As this archive expands, so too does the repertoire of digital tools for navigating and interpreting its diffuse and varied contents. Drawing upon such tools as topic modeling, context-based search, social network maps, and text reuse algorithms, the study of large-scale archives and textual corpora is undergoing significant and exciting developments.

With this in mind, the Middle East Studies program at Brown University is pleased to announce its third annual Islamic Digital Humanities Conference, to be held on October 16-17, 2015. We cordially invite proposals for papers related to distant reading and other computational approaches to the study of the pre-modern and early modern Islamic world.

Faculty members, postdoctoral fellows, graduate students, archivists, librarians, curators, and other scholars are welcome to apply. Candidates are requested to submit a title and abstract of 300 words and a CV to the conference organizers at digitalhumanities@brown.edu. The deadline for submissions is April 30, 2015, and successful applicants will be notified by the end of May.

Papers should be no longer than twenty minutes and should be delivered in English. A collection of abstracts from previous conferences and workshops may be found on our website (islamichumanities.org), along with recorded webcasts, a list of digital resources, and announcements for related events.

There may be limited funding available to cover travel expenses and hotel accommodation for junior scholars. All other participants are asked to cover their own expenses. The conference will begin at noon on Friday, October 16 and conclude by the early afternoon of Saturday, October 17.

Brown University is located in Providence, Rhode Island, one hour south of Boston and easily accessible by train and plane. For any questions, please contact Dr. Elias Muhanna at the email address above.

Here is a PDF version of this call for papers; please feel free to circulate it.

Textual Corpora Workshop 2014: A Review

On October 17-18, 2014, the Digital Islamic Humanities Project at Brown University organized a workshop on “Textual Corpora” in the Digital Scholarship Lab at the Rockefeller Library. We had around forty participants from universities and institutions around the world, and we spent two days engaged in fruitful discussions and hands-on tutorial sessions. One of the participants, Dr. Sarah Pearce (NYU), wrote up a review of the event for her blog, Meshalim. She has kindly allowed us to reprint it here.

***

Textual Corpora and the Digital Islamic Humanities (day 1) | by S. J. Pearce

Normally I would not be totally comfortable posting a conference report like this because it’s basically reproducing others’ intellectual work, presented orally and possibly provisionally, in written and disseminated form. However, because video of the event is going to be posted online along with all of the PowerPoint slides, these presentations were made with an awareness that they were going to be disseminated online and so a brief digest does not strike me as a problem. With that said, what I am writing here represents the work of others, which I will cite appropriately.

The workshop convener, Elias Muhanna, began by introducing what he called “digital tools with little to no learning curve.” These included text databases such as quran.com, altafsir.com, and the aggregate dictionary page al-mawrid reader (which is apparently totally and completely in violation of every copyright law on the books). Then there were sources for collections of unsearchable PDFs (waqfeya.com, mostafa.com, and alhadeeth.com; and hathitrust.org, which is pretty much the only one on this list that isn’t violating copyright law in some way) and sources for searchable digital libraries of classical Arabic texts (almeshkat.net, said.net, alwaraq.net, shiaonlinelibrary.com, and noorlib.ir; and al-jami’ al-kabīr, a database whose special feature is mostly not functioning and mostly not being installable). Arabicorpus.byu.edu, as well as various databases used by computational linguists, are the best bets for modernists looking for things.

With respect to all of these, the question of how the texts are entered is a bit of a mystery. Some are rekeyed from print editions, some are scanned as image PDFs, and some are OCR-scanned; and even though OCR can be up to 99% accurate, that still translates into a typo every hundred characters, which is not ideal. Regardless of the technology used to upload texts to these databases, copyright law came up repeatedly as an issue surrounding the use of these tools, and the current state of play appears to be somewhere between the wild west and don’t-ask-don’t-tell.

A few sample searches were run to demonstrate what these tools might be used for: occurrences of the phrase allahu a’lam to gauge epistemological humility (I’m not totally sure about the reliability of the one to gauge the other, but never mind), and an Arabic proverb I did not know previously about mongoose farts (fasā baynahum al-ẓaribān) to illustrate a search to determine how a saying might be used, whether purely for grammatical or sociologically illustrative purposes (as this one apparently is), or whether it occurs within a larger discourse.

***

These text collections were a segue into Maxim Romanov’s presentation on the difference between text collections and text corpora and the desiderata for the creation of the latter.

Text collections are what already exist. They are characterized by the following traits:

  • Reproduce books (technically they are databases, but they don’t function as databases)
  • The book/source is divided into meaningless units of data, such as “pages”
  • Limited and ideologically biased (Shamela is open, but its BOK format is obscure)
  • Not customizable (users cannot add content)
  • Limited search options
  • Search results are impossible to handle (you have to build your own system on top of the library’s system)
  • No advanced way of analyzing results (no graphing, no mapping)
  • No ability to update metadata

Textual corpora are what we need to be creating. They are characterized by the following traits:

  • Adapted for research purposes (open organization format)
  • Source is divided into meaningful units of data, such as “biographies” for a biographical collection, “events” for chronicles, or “hadith reports” for hadith collections (see the sketch after this list)
  • Open and fully customizable
  • Complex searches (with filtering options)
  • Results can be saved for later processing (multiple versions, annotations, links)
  • Visualizations of results
  • Easy to update metadata
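
To make the contrast concrete, here is a minimal sketch (my own illustration, not a schema from the workshop) of what one such “meaningful unit” might look like in code: a single biography from a biographical collection, carried together with its own metadata so that it can be filtered, annotated, and re-queried.

```python
# A hypothetical record structure for one biography in a research corpus.
# Field names and values are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Biography:
    source: str                 # title of the biographical collection
    volume: int                 # printed volume in which the entry appears
    entry_no: int               # running number of the entry
    subject: str                # the person described
    death_year_ah: int | None   # hijri death date, if one is stated
    nisbas: list[str] = field(default_factory=list)
    toponyms: list[str] = field(default_factory=list)

entry = Biography(
    source="Ta'rikh al-Islam",
    volume=23,
    entry_no=112,
    subject="a hypothetical Baghdadi jurist",
    death_year_ah=310,
    nisbas=["al-Baghdadi", "al-Shafi'i"],
    toponyms=["Baghdad"],
)
print(entry.death_year_ah)  # 310
```

Because each unit carries its metadata explicitly, complex searches (“all Shafi'i biographies with death dates between 300 and 350 AH”) become simple filters rather than string hunts.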

***

Elli Mylonas gave an introduction to the idea of textual markup, which was the piece that was the most general and most theoretical of the day. She raised a number of interesting issues.

One was the question of how archival digital data can be. She made the case that XML files are not quite as good as acid-free paper in a box, but are basically the digital equivalent: XML is a standard, text-based format, and should therefore remain readable on future technologies, whatever they might be.

She then made the case that text markup is a form of textual interpretation. When somebody asked a question predicated on his being okay with the status quo, in which some people do programming and some people analyze texts, she replied that marking up a text in XML really forces you to think more carefully about both the structure and the content of the text; it’s not an either-or proposition. This is not a case of science trying to impose itself upon the humanities (ahem, quantum medievalism) but rather to supplement them methodologically.
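
A tiny example (mine, not from the workshop) of why explicit markup repays the labor: once structure has been tagged, it can be queried directly rather than guessed at with string searches.

```python
# Parse a small TEI-flavored fragment and pull out the tagged place names.
# The fragment and tag names are invented for illustration.
import xml.etree.ElementTree as ET

fragment = """
<text>
  <body>
    <p>He settled in <placeName>Baghdad</placeName> and later moved to
    <placeName>Damascus</placeName>, where he taught hadith.</p>
  </body>
</text>
"""

root = ET.fromstring(fragment)
places = [el.text for el in root.iter("placeName")]
print(places)  # ['Baghdad', 'Damascus']
```

The markup itself encodes a decision, namely that these strings are places and not, say, nisbas, which is exactly the sort of interpretive judgment Mylonas was pointing to.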

One important distinction is between markup and markdown. The latter is a lighter-weight, plain-text rendering of, well, text, which allows it to be exported fairly easily into a variety of schemas. (I think?) Markdown is less rule-bound and more idiosyncratic, which means that it is less labor-intensive but potentially less widely useful in the absence of a really robust tagging scheme.

She showed a few examples of marked-up texts, including the Shelley-Godwin Archive, which has Mary Shelley’s notebooks marked up to show where her handwriting occurs and where her husband Percy’s does, as a way of trying to put to rest the question of who really wrote Frankenstein; a Brown project on the paleographic inscriptions of the Levant that, she told us, provokes an argument between every new graduate student worker and the PI over how to classify the religious assignation of the inscriptions (see? interpretation!); and the Old Bailey Online, in which you can search court records by crime and punishment.

The one difficulty for Islamic Studies is that XML comes out of the European printing and typesetting tradition and is therefore not natively suited to Arabic and other right-to-left languages.

***

Maxim Romanov then gave a practical introduction to one element of text markup, namely regular expressions, a way of creating customized searches within digitized text, and shared two websites with basic instructions and options for practice.

One example of a regular expression is this. If I wanted to find all the possible transliterations of قذافي (the surname of the deposed Libyan dictator) in a given searchable text corpus, I would type [QGK]a(dh?)+a{1,2}f(i|y) as my search term. This would look for any word that began with Q, G, or K, then had an a, then a d possibly followed by an h (with that combination possibly repeated), then one or two a’s, an f, and finally either an i or a y. There was much practicing and many exercises, and that’s really all I have to say about that. (Except that this cartoon and this one suddenly make a lot more sense.)
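
A quick way to convince yourself of what the expression does is to run it over a handful of candidate spellings; this check is mine, not part of the workshop materials.

```python
# Test the transliteration pattern against some common (and one tricky)
# spellings of the name.
import re

pattern = re.compile(r"[QGK]a(dh?)+a{1,2}f(i|y)")

for name in ["Qaddafi", "Gadhafi", "Kadafi", "Qadhdhafi", "Gaddafy", "Khadafy"]:
    print(f"{name}: {'match' if pattern.search(name) else 'no match'}")
```

The last spelling fails because the pattern requires an a immediately after the initial letter, so Kh- variants slip through: a reminder that every regular expression encodes assumptions about the data.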

***

Textual Corpora & the Digital Islamic Humanities (day 2) | by S.J. Pearce

Following up on the Qaddafi-hunt by regular expression of day 1 of the workshop on digital Islamic humanities, here is Maxim Romanov, demonstrating a regular expression to search for terms that describe years within a text corpus that hasn’t been subjected to Buckwalter transliteration but is rather in the original Arabic script.
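
Romanov’s actual expression was not distributed with the notes, but a hedged sketch of the same idea might look like this: match the word sana (“year”) followed by a spelled-out number ending in a hundreds or thousands term. The word lists and the sample sentence are my own invention.

```python
# Find spelled-out hijri year statements in unvocalized Arabic text.
# This is a rough illustration; real texts need a fuller number vocabulary.
import re

text = "توفي في سنة خمس وعشرين وثلاثمائة ودفن ببغداد"

year_re = re.compile(
    r"\bسنة"                          # the word "year"
    r"(?:\s+\S+){0,4}?"                # up to four words of units and tens
    r"\s+و?(?:\S*مائة|\S*مئة|ألف)"     # ending in a hundreds/thousands word
)

match = year_re.search(text)
if match:
    print(match.group(0))  # سنة خمس وعشرين وثلاثمائة
```

Because the search runs over the original script, no Buckwalter transliteration step is needed, which was precisely the point of the demonstration.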

Three major topics were covered on day 2.

Scripting. Maxim Romanov gave a basic overview of and introduction to scripting and the automation of repetitive tasks, such as downloading thousands of files from the web, converting text formats, and conducting complex searches (by regular expressions).

The preferred scripting language amongst this crowd of presenters was Python, in no small measure because it is named after Monty Python, but also because it is very straightforward. Maxim illustrated some of the possibilities with Python by walking us through one of his research questions, which concerned the chronological coverage of certain historical sources; in other words, how much attention do certain time periods get relative to others? He demonstrated the methods he used for capturing date information from a really large amount of text by automating specific queries with a script, and then processing the data so it could be output as an easily readable graph. Conference organizer Elias Muhanna emphasized that this was an example of how digital and computational methodologies are not replacements for analysis: they demand quite a lot of good, old-fashioned philological hard-nosedness, but offer different tools for exploring and expressing it. This is a way of simply speeding up and scaling up what we are already doing.
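
As a rough sketch of what such a pipeline might produce (the years below are invented, and the plotting library is my choice, not necessarily Romanov’s), one could bin extracted dates and chart the coverage:

```python
# Tally extracted hijri years into 50-year bins and chart the result.
# The `years` list stands in for the output of a date-matching regex.
from collections import Counter

import matplotlib.pyplot as plt

years = [125, 310, 320, 325, 326, 402, 561, 563, 570, 748, 748, 852]

bins = Counter(50 * (y // 50) for y in years)
xs = sorted(bins)
plt.bar(xs, [bins[x] for x in xs], width=40)
plt.xlabel("AH year (50-year bins)")
plt.ylabel("dated notices")
plt.title("Chronological coverage of a source (illustrative data)")
plt.show()
```

The analytical work, deciding what counts as a date, which dates belong to which notices, and what the peaks mean, remains philological; the script only does the counting.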

We then had a brief presentation from one of the researchers on the Early Islamic Empire at Work project, who showed us how his team is creating search tools for their corpus, tools which will be made publicly available in December as the Jedli toolbox and which will include various types of color-codeable, checklist- and keyword-based searching. One of the major takeaways from this presentation was the idea that by being able to edit open-source code, it is possible to build upon existing work to make tools do specifically what any given researcher wants them to.
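
As a toy illustration of the keyword-with-highlighting idea (this is not Jedli’s code; the function and the sample line are invented), one might colorize hits directly in the terminal:

```python
# Highlight keyword matches in red using ANSI escape codes.
import re

RED, RESET = "\033[31m", "\033[0m"

def highlight(text: str, keywords: list[str]) -> str:
    pattern = re.compile("|".join(map(re.escape, keywords)))
    return pattern.sub(lambda m: f"{RED}{m.group(0)}{RESET}", text)

line = "وذكر البلاذري في فتوح البلدان أن المدينة فتحت صلحا"
print(highlight(line, ["البلاذري", "فتوح"]))
```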

This raised the question of citation, which, judging from the comments made in response to the question (which I asked), seems to be a total wild west. One of the participants with quite a lot of programming experience said that citing someone else’s code would be like citing a recipe when you make dinner for friends, and other participants and presenters said that if you were using something really extraordinary from somebody else’s project, you might mention it. However, Elli Mylonas disagreed, arguing that correct citation of existing work is one of the ways that the digital humanities can gain traction within the academy as legitimate scholarship that counts at moments like tenure review, rather than languishing in the same manner as the catalogues and indices that we all rely upon but don’t view as having been built by proper “scholars.” I would tend to think she’s right.

Timelines. Then Elli Mylonas introduced us to various timeline programs. As on day 1, her presentation was really grounded in the theory, the wherefores, and the big issues behind DH. She started out with the assertion that “timelines lie”; that is, any kind of timeline looks objective but is, in fact, encoding a historical argument made by the researcher who compiled and presented it. (I think this actually has an interesting parallel with narrative, footnoteless, or minimally footnoted writing such as A Mediterranean Society, which has loads of footnotes but leaves a lot out, too, and which in effect encodes a massive amount of historical argumentation within something that simply reads as text.)

Important things to look for in choosing a timeline program are: the ability to represent spans of time rather than just single points, the exportability of data, and the ability of the program to handle negative dates (again, encoding an argument about the notions of temporality and the potentiality of time). A free timeline-generating app is Timeline JS, which works with Google spreadsheets. That is the one that we tested out as a group. We also looked at Tiki-Toki, which is gorgeous but requires a paid subscription. (Definitely worth looking into whether one’s institution has an institutional subscription.)
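
For those curious what feeding data to such a tool looks like, here is a hedged sketch of building a TimelineJS-style JSON file in Python. The field names follow my recollection of the TimelineJS3 JSON format and should be verified against its documentation; the events are illustrative.

```python
# Write a minimal timeline with time spans (start and end dates) to JSON.
import json

timeline = {
    "title": {"text": {"headline": "Early Abbasid reigns (illustrative)"}},
    "events": [
        {
            "start_date": {"year": "750"},
            "end_date": {"year": "754"},
            "text": {"headline": "Reign of al-Saffah"},
        },
        {
            "start_date": {"year": "754"},
            "end_date": {"year": "775"},
            "text": {"headline": "Reign of al-Mansur"},
        },
    ],
}

with open("timeline.json", "w", encoding="utf-8") as f:
    json.dump(timeline, f, ensure_ascii=False, indent=2)
```

Note the end_date fields: representing spans rather than single points was one of Mylonas’s criteria for choosing a tool.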

Maxim Romanov suggested that this might be a useful tool for something like revisiting the chronology in Marshall Hodgson’s Venture of Islam.

Finally, we looked at Orbis, Stanford’s geospatial model of the Roman world, which models travel through the Roman empire in terms of time and cost. This is a feasible project because of the wealth of data and the relative uniformity of roads, resources, and prices within the Roman empire; it would have to be modified to deal with most of Islamicate history. (Which brings to mind the question of the extent to which Genizah sources, as a fairly coherent(ish) corpus, can be used to extrapolate for the rest of the Islamic world rather than just the Jewish communities within it; if yes, that might be a feasible data set for this kind of processing. Really not my problem, though.) This was a perfect segue into the final topic of the day.
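
At bottom, a model like Orbis treats the road and sea network as a weighted graph and asks for cheapest or fastest paths across it. Here is a toy version of that idea (the cities, connections, and day-counts are invented, and this is in no way Orbis’s actual data or code):

```python
# Dijkstra's algorithm over an invented travel network, weighted in days.
import heapq

routes = {
    "Rome":       [("Ostia", 1), ("Capua", 3)],
    "Ostia":      [("Rome", 1), ("Carthage", 4)],
    "Capua":      [("Rome", 3), ("Brundisium", 5)],
    "Carthage":   [("Ostia", 4)],
    "Brundisium": [("Capua", 5), ("Dyrrachium", 2)],
    "Dyrrachium": [("Brundisium", 2)],
}

def fastest(origin: str, dest: str) -> float:
    best = {origin: 0}
    queue = [(0, origin)]
    while queue:
        days, city = heapq.heappop(queue)
        if city == dest:
            return days
        if days > best.get(city, float("inf")):
            continue  # stale queue entry
        for nxt, d in routes[city]:
            if days + d < best.get(nxt, float("inf")):
                best[nxt] = days + d
                heapq.heappush(queue, (days + d, nxt))
    return float("inf")

print(fastest("Rome", "Dyrrachium"))  # 10 (days, via Capua and Brundisium)
```

An Islamicate version would need far messier inputs, which is exactly the data problem the discussion turned on.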

Geographic information systems. This piece was presented by Bruce Boucek, who is a social sciences librarian at Brown trained as a cartographer. He gave an overview of data sources and potential questions and problems, and then Maxim Romanov gave a final demonstration about how geographic imaging can be used to interrogate medieval geographic descriptions and maps.

Image courtesy of S.J. Pearce

By aligning the latitude and longitude information from a modern map with the cities marked on a medieval one (or simply by making latitude and longitude conform on a less contemporary modern map of unknown projection or questionable scale) and observing the distortion of the medieval map when it was made to conform to the modern one, we began to see what kind of view of the world the mapmaker, in this case Muqaddasi, held. What was he making closer and farther away than it really was? What kind of schematic does that yield?
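
The core of that alignment step can be sketched in a few lines: fit an affine transform from medieval map positions to modern coordinates at the shared control points, and read the leftover misfit as distortion. The coordinates below are invented, and real georeferencing (for example in GIS software) uses richer transforms.

```python
# Least-squares affine fit from medieval map (x, y) to modern (lon, lat).
import numpy as np

# Positions of four cities on the medieval map (arbitrary units)
medieval = np.array([[1.0, 2.0], [4.0, 2.5], [3.0, 5.0], [6.0, 6.0]])
# Matching modern (longitude, latitude) for the same cities (made up)
modern = np.array([[44.4, 33.3], [46.2, 30.5], [36.3, 33.5], [39.8, 21.4]])

# Solve modern ~ [x, y, 1] @ A for a 3x2 affine matrix A.
X = np.hstack([medieval, np.ones((len(medieval), 1))])
A, *_ = np.linalg.lstsq(X, modern, rcond=None)

residuals = np.linalg.norm(X @ A - modern, axis=1)
for i, r in enumerate(residuals):
    print(f"control point {i}: misfit {r:.2f} degrees")
```

Large residuals at particular cities are precisely the “what did the mapmaker pull closer or push away” signal described above.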

And that’s that. Video and a link library should be up online at the workshop web site, and one of my colleagues storified all of the tweets from the conference. I’ll probably write another post or two in the coming week reflecting on how I might start using some of these tools and methods as I finish up the book and start work on a second project.

A Database and Handbook of Classical Islamic Pedagogy: A Digital Islamic Studies Project at the University of Göttingen

Given the challenges Arabic and Islamic studies are facing in the increasingly culturally diverse contexts of contemporary societies, meaningful new methodologies and tools of research need to be explored. The Göttingen Database and Handbook of Classical Islamic Pedagogy is devoted to addressing some of these issues in a three-year research project conducted at the University of Göttingen, Germany.

The main objectives of the project are, in a first step, to identify, collect, and systematically analyze large amounts of data on Islamic educational theory and practice, drawn from a wide variety of classical Arabic texts. This will make it possible, in a second step, to elucidate key principles and theories of classical Islamic education, reintroduce them into contemporary intellectual discourse, and thus respond to the very real need to better understand the larger purposes and values that underlie and animate Islamic education on social, ethical, psychological, and religious levels.

In order to document, administer, and examine the data on Islamic education extracted from the classical Arabic sources, a specific database was designed. This presentation discusses the underlying themes and theoretical premises of this database and handbook project, along with its structure and research opportunities within the context of digital humanities.

Author: Sebastian Günther (Univ. of Göttingen)

Abstract Models for Islamic History

Recent developments in the digital sphere have offered new opportunities and challenges to humanists. Equipped with new digital methods of text analysis, scholars in various fields of the humanities are now trying to make sense of huge corpora of literary and historical texts. Perhaps the most prominent of such attempts is the work of Franco Moretti and his abstract models for literary history, which trace long-term patterns in English fiction. Inspired by Moretti’s approach, I seek to develop abstract models for the analysis of pre-modern Arabic historical literature, relying mainly on various text-mining techniques that are being developed at the intersection of statistics, linguistics, and computer science. At the moment, I concentrate primarily on biographical collections, a genre that includes several hundred multi-volume titles (the largest collection, al-Dhahabī’s “History of Islam,” covers 700 years and contains about 30,000 biographies). Working with a corpus of 10 biographical collections (about 125 printed volumes; 45,000 biographical accounts), I am developing an analytical tool that can later be used to study other biographical collections (ideally, all of them together). In the long run I hope that the results of my work will pave the way for the development of analytical tools for other genres of pre-modern Arabic literature, such as chronicles, ḥadīth collections, interpretations of the Qur’ān, compendia of legal decisions, etc.

In working with my biographical collections, I look primarily at such kinds of biographical data as “descriptive names” (nisbas), dates, toponyms, and, more recently, rather loosely defined linguistic formulae and wording patterns. The analysis of different combinations of these data allows one to trace various social, religious, and cultural patterns in time and space. I am particularly interested in how the Islamic world changed over the period 640–1300 CE: how cultural centers shifted; how different social, professional, and religious groups replaced and displaced each other; how different regions were connected with each other; and how these connections changed over time. The results of my analysis will be presented in the form of graphs and geographical maps (some current examples of my work can be found at www.alraqmiyyat.org).

Author: Maxim Romanov (Univ. of Michigan)

Analytical Database of Arabic Poetry

The Analytical Database of Arabic Poetry will represent an important contribution to the emerging field of digital studies in Arabic philology. The database will include comprehensive data on the vocabulary of early Arabic poetry (6th-8th centuries A.D.) in the form of an electronic dictionary. With the help of the database’s analytical tools, each lexeme of the entire lexical corpus will be assessed in relation to the literary framework of its attestation, including information on the genre of the relevant poetic text and on the tribal, chronological, and geographical background of its author. Moreover, the database will record in detail the textual transmission of the works of early Arabic poetry in the context of Arab-Muslim scholarship of the 8th to 10th centuries. This comprehensive collection of data and its analytical classification will for the first time allow systematic investigation into the process of semantic change in Arabic and into the development of a philological approach to the language.

A ground-breaking feature of the database results from the possibility of including cross-references to parallel linguistic material provided by inscriptions, papyri and the Qur’ān, which have never been studied in relation to each other. Thus, the Analytical Database of Arabic Poetry promises to become the cornerstone of the common digital platform of the Arabic language, which will bring together several current European projects in the field of digital Arabic philology, including the two ERC funded projects “Glossarium graeco-arabicum” (ERC Ideas Advanced Grant 249431, Cristina D’Ancona, Università di Pisa, Italy, Gerhard Endress, Universität Bochum, Germany) and “Digital Archive for the Study of pre-Islamic Arabian Inscriptions (DASI)” (ERC-AG-SH5 ERC Advanced Grant 269774-DASI, Alessandra Avanzini, Università di Pisa, Italy) as well as other initiatives, such as the “Safaitic Database Project” (Michael C. A. Macdonald, Oxford University, UK), the “South Arabian Lexicographical Database of the University of Jena” (Peter Stein, Jena, Germany), the “Arabic Papyrology Database” (Andreas Kaplony, LMU München, Germany, Johannes Thomann, Universität Zürich, CH), the “Corpus Coranicum” project (Michael Marx, BBAW, Berlin, Germany) and “Arabic and Latin Glossary” (Dag Nikolaus Hasse, Universität Würzburg, Germany).

Author: Kirill Dmitriev (St. Andrews)

Comparing Canons: Examining Two Seventeenth-Century Fatawa Collections from the Ottoman Lands

In recent years, growing attention has been paid to the circulation of texts and to various textual practices throughout the Islamic world in general and the Ottoman Empire in particular. Most studies, however, have been qualitative in nature. My paper seeks to demonstrate the advantages of the digital humanities for the study of the circulation of manuscripts and the ways in which they were used and consulted. To illustrate these advantages, the paper takes as its case study the circulation of legal texts across the Ottoman Empire. More specifically, the case study is based on a comparison of two fatawa collections from the mid-seventeenth century that I have digitized for my research: the fatawa collection of the chief imperial mufti, şeyhülislam Minkarizade Efendi (1609-1677 or 1678), and that of the famous Palestinian mufti Khayr al-Din al-Ramli (1585-1671). By focusing on the special features of each of the fatawa collections, I hope to draw attention, on the one hand, to the advantages these databases and the digital humanities more broadly offer for this kind of study, and, on the other, to what the databases conceal. Finally, through this case study, the paper intends to discuss how this methodology can be applied to the study of texts and their circulation in other contexts and time periods in Islamic history.

Author: Guy Burak (Bobst Library, New York University)