NLP Workshop Report

On Tuesday, May 18, the East Asian Digital Humanities (EADH) Working Group (https://eadh.princeton.edu/) held a workshop on applying natural language processing (NLP) methods to East Asian language materials.

  • Screenshot showing output from code identifying parts of speech in a Chinese novel.

The workshop was led by Dr. Jeffrey Tharsen, Computational Scientist for Digital Humanities at the Research Computing Center of the University of Chicago. It addressed the first steps in any DH project: OCR of sources, cleaning of scanned text, building corpora or data sets, and identifying the right tools for natural language processing and analysis tasks such as tokenization, part-of-speech tagging, lemmatization, named entity recognition (NER), word vectors, stylometry, and topic modeling of text in East Asian languages.

The workshop was well attended by students, faculty, and researchers from neighboring institutions.

A recording of the event is available on Media Central. The workshop materials can be made available on an individual basis. Please contact jseufert@princeton.edu for more information.

EADH is an initiative spearheaded by the East Asian Library in close cooperation with faculty and the Center for Digital Humanities.

By nbudak

Digital Humanities Developer at the Center for Digital Humanities @ Princeton.