NLP Workshop Report

On Tuesday, May 18, the East Asian Digital Humanities (EADH) Working Group (https://eadh.princeton.edu/) held a workshop on applying natural language processing methods to East Asian language material.

  • Screenshot showing output from code identifying parts of speech in a Chinese novel.

The workshop was led by Dr. Jeffrey Tharsen, Computational Scientist for Digital Humanities at the Research Computing Center of the University of Chicago. It addressed the first steps in any DH project: OCR of sources, cleaning of scanned text, building corpora or datasets, and identifying the right tools for natural language processing and analysis of text in East Asian languages, such as tokenization, part-of-speech tagging, lemmatization, named entity recognition (NER), word vectors, stylometry, and topic modeling.

The workshop was well attended by students, faculty, and researchers from neighboring institutions.

A recording of the event is available on Media Central. The workshop material can be made available on an individual basis. Please contact jseufert@princeton.edu for more information.

EADH is an initiative spearheaded by the East Asian Library in close cooperation with faculty and the Center for Digital Humanities.

Workshop: Natural Language Processing of East Asian material with Jeffrey Tharsen

For the second event organized by the East Asian Digital Humanities Working Group at Princeton, we have invited Jeffrey Tharsen, Computational Scientist for Digital Humanities at the Research Computing Center of the University of Chicago, to teach a workshop addressing the first steps in any DH project, including OCR of sources, cleaning of scanned text, building corpora or datasets, and identifying the right tools.

Jeffrey Tharsen holds dual appointments as Computational Scientist for the Digital Humanities and as Lecturer in the Digital Studies Program at the University of Chicago, serving as university-wide technical domain expert for digital and computational approaches to humanistic inquiry. He received his doctorate in 2015 from the University of Chicago’s East Asian Languages & Civilizations department, specializing in the fields of premodern Chinese philology, phonology, poetics and paleography.

In his work, Jeffrey advises researchers and leads teams creating new resources, platforms and methods for humanistic research, designs curricula and teaches courses and workshops on data science, computational linguistics and natural language processing, serves on thesis committees and mentors students interested in developing new digital and computational research methods, and regularly presents his work at national and international conferences and symposia.

The workshop will run for 90 minutes, followed by a 30-minute Q&A session, and will focus on the following topics, with an emphasis on the specific challenges of East Asian scripts:

  • Optical character recognition (Chinese, Japanese, Korean), cleaning and formatting source texts
  • Platforms for editing & collaboration
  • Tokenization 
  • Part-of-Speech tags, Lemmatization (Japanese only), Named entity recognition (NER)
  • Word Vectors (cosine similarities)
  • Stylometry (HCA Dendrogram & k-means PCA)
  • Topic Modeling (gensim LDA + spaCy)

Time: May 18, 2021 04:00-06:00 PM (Eastern Time)

Register at: https://princeton.zoom.us/meeting/register/tJEvdOiqrjwtHNY6YyrG_2LY8w3NxxYmaokI

The workshop is open to all.

We hope that many of you will be able to join us.

cover photo by TAKA@P.P.R.S (https://www.flickr.com/photos/takapprs_flickr/)

Kickoff Meeting Report

On September 28, 2020, the East Asian Digital Humanities Working Group held its inaugural meeting. Students, faculty, and staff at Princeton from a wide range of subjects were invited to join.

With 47 participants, the event was very well attended, drawing undergraduate and graduate students, alumni, faculty, developers, librarians, and museum curators from a wide range of subjects (including, but not limited to, East Asian Studies, History, Art & Archaeology, Neuroscience, Politics, Sociology, Religion, and Comparative Literature). Although the event was not widely announced outside of Princeton, six external participants (from Rutgers, Temple, Montclair State, Uni Jena, WUSTL, and New College Florida) joined the discussion.

The working group discussed its plans for the coming weeks and months. A number of Princeton students, faculty, and staff shared their ongoing or planned DH projects in short presentations.

Speakers were:

In a lively discussion, the group identified several areas of interest for future events, including dataset creation, optical character recognition (OCR) of East Asian texts, the application of geographic information systems (GIS), computer-aided textual analysis, and topic modeling.

Kickoff Meeting Invitation

Sep 28, 2020 4:30 PM Eastern Time (US and Canada)

zoom link / join by telephone

meeting ID 2520754010, passcode EADH

The East Asian Studies Department and the Center for Digital Humanities at Princeton are excited to jointly announce the formation of a new East Asian Digital Humanities Working Group at Princeton. The Working Group plans to regularly organize meetings, workshops and other events over the academic year. We would like to invite you to a virtual kick-off meeting – registration is not required.

In this first meeting, the group will share its plans for the coming weeks and months. A number of Princeton students, faculty, and staff from a variety of disciplines, working on different parts of East Asia, have been invited to share their ongoing or planned DH projects in very short presentations, including:

  • Anna Shields (East Asian Studies)
  • Chan Yong Bu (East Asian Studies)
  • Gian Duri Rominger (East Asian Studies) / Nick Budak (Center for Digital Humanities)
  • Hannah Waight (Sociology) 
  • Caitlin Karyadi (Art & Archaeology)
  • Joshua Seufert (East Asian Library)

We hope that many of you will be able to join us.

Best wishes,

Joshua Seufert, Xin Wen, Nick Budak