For the second event organized by the East Asian Digital Humanities Working Group at Princeton, we have invited Jeffrey Tharsen, Computational Scientist for Digital Humanities at the University of Chicago's Research Computing Center, to teach a workshop addressing the first steps in any DH project: OCR of sources, cleaning scanned text, building corpora and data sets, and identifying the right tools.
Jeffrey Tharsen holds dual appointments as Computational Scientist for the Digital Humanities and as Lecturer in the Digital Studies Program at the University of Chicago, serving as university-wide technical domain expert for digital and computational approaches to humanistic inquiry. He received his doctorate in 2015 from the University of Chicago’s East Asian Languages & Civilizations department, specializing in the fields of premodern Chinese philology, phonology, poetics and paleography.
In his work, Jeffrey advises researchers and leads teams creating new resources, platforms and methods for humanistic research, designs curricula and teaches courses and workshops on data science, computational linguistics and natural language processing, serves on thesis committees and mentors students interested in developing new digital and computational research methods, and regularly presents his work at national and international conferences and symposia.
The workshop will run 90 minutes, followed by a 30-minute Q&A session, and will focus on the following topics, with an emphasis on the specific challenges of East Asian scripts:
- Optical character recognition (Chinese, Japanese, Korean), cleaning and formatting source texts
- Platforms for editing & collaboration
- Part-of-speech (POS) tagging, lemmatization (Japanese only), named entity recognition (NER)
- Word vectors (cosine similarity)
- Stylometry (HCA dendrogram & k-means PCA)
- Topic modeling (gensim LDA + spaCy)
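To give a flavor of one topic on the list, cosine similarity measures how alike two word vectors are by the angle between them, independent of their lengths. The sketch below is a minimal, dependency-free illustration; the vectors and words here are hypothetical toy values (real embeddings, e.g. from gensim's word2vec, have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional "word vectors" for illustration only.
vec_poem  = [0.8, 0.1, 0.3, 0.5]
vec_verse = [0.7, 0.2, 0.4, 0.5]
vec_tax   = [0.1, 0.9, 0.0, 0.2]

# Words used in similar contexts get similar vectors, so their
# cosine similarity is closer to 1 than that of unrelated words.
print(cosine_similarity(vec_poem, vec_verse))
print(cosine_similarity(vec_poem, vec_tax))
```

The same scoring idea underlies larger pipelines: a corpus is embedded, and similarities between vectors surface words (or documents) with related usage.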
Time: May 18, 2020, 4:00–6:00 PM (Eastern Time)
The workshop is open to all.
We hope that many of you will be able to join us.