Alexander Kain — Curriculum Vitae
1 Present Positions
Skyworks Solutions, Inc.
Principal Artificial Intelligence Systems Engineer
Oregon Health & Science University
Associate Professor
3181 SW Sam Jackson Park Road
Portland, OR 97239
kaina@ohsu.edu
2 Professional Education
Undergraduate and Graduate
- 2001, Ph. D. in Computer Science and Engineering
Oregon Graduate Institute, Portland, OR
- 1995, B. A. in Computer Science and Mathematics
Rockford College, Rockford, IL
Postgraduate
- 2002–2005, Postdoctoral Training,
OGI School of Science & Engineering, Portland, OR
3 Professional Experience
Academic
- 2014–present, Associate Professor
2007–2014, Assistant Professor
2005–2007, Senior Research Associate
Oregon Health & Science University, Portland, OR
Industrial
- 2020–present, Principal Artificial Intelligence Systems Engineer
Skyworks Solutions, Inc., Hillsboro, OR
- 2018–2020, Chief Technology Innovation Officer
2005–2017, Chief Scientist
BioSpeech, Inc., Portland, OR
- 2001–2008, Lead Speech Synthesis Technologist
Sensory Incorporated, Santa Clara, CA
- 1999, Visiting Researcher
AT&T Research Labs, Florham Park, NJ
4 Scholarship
4.1 Areas of Research/Scholarly Interest
I am interested in innovation, application, and education in the fields of machine learning and biological signal processing.
4.2 Collaborators
- Kris Tjaden, Fredrik Van Brenk (University at Buffalo): Dysarthria
- Jun Wang (University of Texas at Austin): Silent Speech Interface
- Deanna Britton (PSU/OHSU): Dystussia
- Miranda Lim (OHSU): REM Sleep Behavior Disorder
- Jeanne-Marie Guise (OHSU): Epidemiology of Preventable Safety Events, Functional near-infrared spectroscopy (fNIRS)
- Marian Dale (OHSU): Speech analysis of patients with progressive supranuclear palsy receiving cerebellar transcranial magnetic stimulation
- Derek Lam and Holden Richards (OHSU): Airway obstruction
- Lina Reiss and Nishad Sathe (OHSU): Cochlear implants
- Matthew Brodsky and Linda Bryans (OHSU): Speech assessment protocol for Deep Brain Stimulation
4.3 Grants
Current
- 2015/09/01–2021/08/31: National Institutes of Health 2R01DC004689-11A1, “Therapeutic Approaches to Dysarthria: Acoustic and Perceptual Correlates”, PIs: Tjaden (State University of New York at Buffalo) and Kain (OHSU). 90% of the one million Americans living with idiopathic Parkinson’s disease (PD) and 50% of the 500,000 Americans living with Multiple Sclerosis (MS) will experience dysarthria at some point during the disease. The perceptual sequelae of dysarthria have devastating consequences for quality of life and participation in society by virtue of their effect on social and psychological variables such as employment, leisure activities, and relationships. Knowledge of therapy techniques for maximizing perceived speech adequacy, as indexed by intelligibility, is therefore of paramount importance. As a result of our incomplete knowledge of the comparative merits of dysarthria therapy techniques and their variants, however, the choice of a particular technique is not based on a rigorous research base, but on either trial and error or the clinician’s educational and experiential biases. The proposed project will address these barriers by comparing the acoustic and perceptual consequences of rate reduction, increased vocal intensity, and clear speech variants in MS and PD. Our approach is to employ established acoustic measures and perceptual paradigms as well as a state-of-the-art speech re-synthesis technique that will permit conclusions concerning which underlying speech production characteristics, as inferred from the acoustic signal, cause improved intelligibility. Amount: $3.3M.
Completed
- 2020/06/01–2020/08/31: National Institutes of Health R01DC013307-06A1, “Binaural Spectral Integration with Hearing Loss and Hearing Devices”, PI: Reiss (OHSU). Approximately 22 million Americans have a hearing impairment. While hearing devices such as hearing aids (HAs) and cochlear implants (CIs) are successful in improving speech recognition for many hearing-impaired individuals, there is still significant variability in benefit, and speech recognition in noise remains a problem. One factor that may limit benefit, especially binaural benefit, is abnormal binaural spectral integration. A prerequisite for binaural integration is binaural fusion — the fusion of stimuli from the two ears into a single auditory object. Our findings from the previous grant showed that unlike normal-hearing (NH) listeners, many HA and CI users experience abnormally “broad” binaural fusion in the spectral domain, such that pitches that differ greatly in frequency between the two ears are still heard as a single percept. Individuals with broad fusion also experience abnormal binaural spectral integration — averaging and thus distortion of spectral information across the ears when disparate sounds are fused. We also showed broad fusion to be associated with binaural interference — poorer speech recognition with two ears compared to one. More importantly, preliminary data show that broad fusion is associated with greater difficulty with understanding speech in challenging multi-talker listening situations, such as noisy restaurants. Difficulties with speech understanding in noise are a major complaint of both HA and CI users. In order to help hearing-impaired listeners reduce binaural interference and improve speech understanding in background noise, we need to understand the underlying causes and factors in broad fusion. The long-term goal of this research program is to investigate the effects, causes, and potential treatments for abnormal binaural spectral integration in hearing-impaired listeners. In this proposal we will: 1) determine how broad binaural fusion affects speech perception in quiet and in background talkers; and 2) investigate potential causes of broad binaural fusion in children and adults with HAs and CIs, focusing on peripheral spectral resolution, auditory experience, and top-down auditory processing factors. The proposed research will indicate the role of broad binaural spectral fusion in difficulties faced by hearing-impaired listeners, especially for speech understanding in background noise. Determination of the factors underlying broad fusion will inform future rehabilitation approaches to treat broad fusion, and help hearing-impaired listeners attain the same benefits of binaural hearing as NH listeners. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project.
- 2019/12/19–2020/08/31: National Institutes of Health 1R44DC017403-01A1, “Audiobooks for Hearing Loss”, PI: Lindaas-Hamilton (BioSpeech). The proposed project's goal is to create an audiobook app for individuals with hearing loss, with enhanced audio and visual features to support listening, serving both to provide auditory training for individuals who want to improve their hearing skills and to provide access to audiobooks for individuals whose hearing ability precludes listening to standard audiobooks. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $225K.
- 2019/12/15–2020/08/31: National Institutes of Health 1R01DC016621-01A1, “Wearable Silent Speech Technology to Enhance Impaired Oral Communication”, PI: Jun Wang (University of Texas at Austin). A silent speech interface (SSI) maps articulatory movement data to speech output. Although still in experimental stages, silent speech interfaces hold significant potential for facilitating oral communication in persons after laryngectomy or with other severe voice impairments. Despite recent efforts on silent speech recognition algorithm development using offline data analysis, online tests of SSIs have rarely been conducted. In preliminary work, we conducted an online test of a real-time, interactive SSI based on electromagnetic motion tracking. The SSI played back synthesized speech sounds in response to the user’s tongue and lip movements. Three English talkers participated in this test, in which they mouthed (silently articulated) phrases using the device to complete a phrase-reading task. Among the three participants, 96.67% to 100% of the mouthed phrases were correctly recognized and corresponding synthesized sounds were played after a short delay. Furthermore, one participant demonstrated the feasibility of using the SSI for a short conversation. These experimental results demonstrate the feasibility and potential of silent speech interfaces based on electromagnetic articulography for future clinical applications. My role: I provide signal processing and machine-learning expertise.
- 2017/05/01–2020/04/30: National Institutes of Health 4R44DC015145-02, “SBIR Phase 2: Prosody Assessment Toolbox”, PI: Lindaas-Hamilton (BioSpeech). Abnormal receptive or expressive prosody is present in a wide range of disorders, including Autism Spectrum Disorder (ASD), Cognitive Impairment, Down’s syndrome, dysarthria, Parkinson’s disease, depression, schizophrenia, aphasia, Alzheimer’s disease, TBI, Language Impairment, bipolar disorder, ADHD, and PTSD. The characteristics of these prosodic abnormalities and underlying brain dysfunction are still largely unknown, due to the dearth of instruments for assessing prosodic deficits. Building on our broad expertise in computerized prosody assessment, we propose to build a system for researchers that performs automated scoring and acoustic analysis of expressive prosody, allows stimuli to be acoustically modified for detailed perceptual assessment of receptive prosody, and can be extended by researchers to include novel tasks. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $706K.
- 2015/12/10–2019/04/30: National Institutes of Health 5R01DC013996-02, “Automatic Voice-Based Assessment of Language Abilities”, PI: van Santen (OHSU). Since untreated language disorder can lead to serious behavioral and educational problems, large-scale early language assessment is urgently needed not only for early identification of language disorder but also for planning interventions and tracking progress. However, such large-scale efforts would pose a large burden on professional staff and on other scarce resources. As a result, clinicians, educators, and researchers have argued for the use of computer-based assessment. Recently, progress has been made with computer-based language assessment, but it has been limited to language comprehension. One contributing factor is that a key technology needed for this, Automatic Speech Recognition (ASR), is perceived as inadequate for accurate scoring of language tests, since even the best ASR systems have word error rates in excess of 20%. However, this perception is based on a limited perspective of how ASR can be used for assessment, in which a general-purpose ASR system provides an (often inaccurate) transcript of the child's speech, which then would be scored automatically according to conventional rules. We take an alternative perspective, and propose an innovative approach that comprises two core concepts: (1) creating special-purpose, test-specific ASR systems whose search space is carefully matched to the space of responses a test may elicit, and (2) integrating these systems with machine-learning based scoring algorithms whereby the latter operate not on the final, best transcript generated by the ASR system, but on the rich layers of intermediate representations that the ASR system computes in the process of recognizing the input speech. My role: (1) developing automatic voice-based scoring methods for each language test, (2) developing pronunciation screening methods to detect atypical speech, and (3) evaluating the accuracy of automatic voice-based scoring, stopping, and pronunciation screening systems, and comparing the typically developing (TD) group with groups with neurodevelopmental disorders. Amount: $638K.
- 2014/09/01–2017/08/31: National Institutes of Health 1R43MH101978-01A1, “System for automatic classification of rodent vocalizations”, PI: Lahvis (BioSpeech). Development of treatments for neuropsychiatric disorders presents a formidable challenge. To advance drug discovery, assessments of laboratory rodents are widely employed by academia and industry to model neuropsychiatric disorders. Substantial recent advances in digital recordings of rodent ultrasonic vocalizations (USVs) have engendered interest in assessment of USVs to measure behavior change. A practical obstacle to USV assessment is that USVs are classified manually. We propose a software system that allows a user to rapidly interrogate recordings of rodent USVs for prosodic content. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $324K.
- 2013/12/01–2017/08/31: National Institutes of Health 1R43DA037588-01A1, “Screening for Sleep Disordered Breathing with Minimally Obtrusive Sensors”, PI: Snider (BioSpeech). Sleep disordered breathing (SDB) is believed to be a widespread, under-diagnosed condition associated with detrimental health problems, at a high cost to society. The current gold standard for diagnosis of SDB is a time-consuming, expensive, and obtrusive (requiring many attached wires) sleep study, or polysomnography (PSG). The immediate objective of our research is to develop and evaluate a hardware design and a set of algorithms for automatically detecting obstructive, central, or mixed apneas and hypopneas from acoustic, peripheral oxygen saturation (SpO2), and pulse rate data, using an ambient microphone and a wireless pulse oximeter. The long-term goal is to create a low-cost, easy-to-operate, minimally obtrusive, at-home device that can be used for early and frequent screening for SDB in patients' homes, significantly increasing patient comfort while capturing more representative sleep data compared to a clinical sleep study. In collaboration with Chad Hagen, M. D., at the Sleep Disorders Program at OHSU, we aim to (1) develop a screening system by selecting minimally obtrusive sensor hardware and extending state-of-the-art algorithms for automatically detecting SDB from acoustic, SpO2, and pulse rate data; (2) collect patient data in the sleep lab and at home from representative populations using the proposed system; (3) determine the screening accuracy by comparing the performance of the proposed system on the collected data against standard PSG-derived clinical results; and (4) measure the usability of an at-home screening device by the target population, by asking subjects who participated in the at-home data collection to complete a survey on various aspects of the setup and operation of the proposed system. My role: I provide signal-processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $205K.
- 2016/01/01–2017/04/01: National Institutes of Health 1R44DC015145-01, “SBIR Phase 1: Prosody Assessment Toolbox”, PI: Connors (BioSpeech). Current instruments for assessing prosodic deficits are decades behind those that are used for clinical assessment of other aspects of language. We propose to build a system that addresses these shortcomings. The system performs automated scoring and acoustic analysis of expressive prosody, allows stimuli to be acoustically modified for detailed perceptual assessment of receptive prosody, and can be extended by researchers to include novel tasks. It is evaluated with individuals who have ASD (adults and children), DS (adults and children), or MCI, and a typically developing control group. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $1.6M.
- 2010/09/27–2016/09/30: National Science Foundation BCS-1027834, "Computational Models for the Automatic Recognition of Non-Human Primate Social Behaviors", PI: Kain (OHSU). To develop methods that will permit researchers to remotely and automatically monitor behavior of primates and other highly social animals. Amount: $578K.
- 2012/04/01–2015/03/31: National Institutes of Health 5R44DC009515-03, "SBIR Phase 2: Computer-based auditory skill building program for aural (re)habilitation", PI: Connors (BioSpeech). To extend an adaptive computer-guided software program that focuses on learning phoneme discrimination and identification. See Phase I description. Amount: $400K.
- 2011/12/01–2015/08/31: National Institutes of Health R21DC012139, "Computer-Based Pronunciation Analysis for Children with Speech Sound Disorders", PI: Kain (OHSU). In this work we are developing speech-production assessment and pronunciation training tools for children with speech sound disorders. To date, computer-assisted pronunciation training has not yet been successfully extended to help children with speech sound disorders, primarily because of a lack of accuracy in phoneme-level analysis of the speech signal. My role: I am creating a set of algorithms that will reliably identify and score the intelligibility of a phoneme within an isolated target word, providing immediate, relevant, and understandable feedback about pronunciation errors. The use of human perceptual data during training is an important and new component of the proposed approach. As PI, I am also responsible for overall project supervision and management. Amount: $416K.
- 2010/06/09–2015/05/31: National Science Foundation IIS-0964102, "Semi-Supervised Discriminative Training of Language Models", PI: Kain (OHSU). To conduct fundamental research in statistical language modeling to improve human language technologies, including automatic speech recognition (ASR) and machine translation (MT). Amount: $519K.
- 2010/05/15–2015/04/30: National Science Foundation IIS-0964468, "HCC: Medium: Synthesis and Perception of Speaker Identity", PI: Kain (OHSU). Millions of Americans with impaired or absent speech communication ability rely on Augmentative and Alternative Communication devices with voice output (Speech Generating Devices, or SGDs) to communicate. A psychologically important and desirable feature is the ability to speak with one's own voice, i. e. the ability for the SGD to produce speech that mimics the individual's pre-morbid speech or speech that the individual may be able to intermittently produce. However, current text-to-speech (TTS) systems can only create speech with one or very few supplied speaker characteristics, and cannot be trained to take on the user's voice. My role: Together with Ph. D. students and co-investigators, I am creating a TTS synthesis system that generates speech that sounds like that of a specific individual (Speaker Identity Synthesis, or SIS). In the process we are building and evaluating analysis and synthesis models of the relevant acoustic features, including pitch, duration, and spectrum. Since the system includes a trainability component, this project also involves use of advanced mapping technology in the form of a joint-density Gaussian mixture model. I first proposed this approach in a 1998 publication which has since been cited over 370 times. As PI, I am also responsible for overall project supervision, management, and mentorship of graduate student Mohammadi. Amount: $905K.
- 2011/04/01–2012/03/31: National Institutes of Health 5R42DC008712, "User Adaptation of AAC Device Voices - Phase 2", PI: Klabbers (BioSpeech). Developing and evaluating voice transformation and prosody modification technologies to customize synthetic voices in AAC devices, mimicking the individual user's pre-morbid speech. See Phase 1 description. Amount: $410K.
- 2011/03/01–2013/03/31: National Institutes of Health 1R43DC011706-01, "SBIR Phase 1: Computerized System for Phonemic Awareness Intervention", PI: Connors (BioSpeech). Phonemic awareness, defined as “the ability to notice, think about, and work with the individual sounds in spoken words”, is considered a necessary skill for literacy. The financial and quality-of-life costs of these impairments are significant, not only because of the link with reading difficulties and hence with future employability, but also because there may exist further links between reading difficulties and a range of psychiatric disorders. This argues for phonemic awareness intervention beyond what can be taught in a regular pre-school or elementary school curriculum. Such intervention is typically provided in the form of one-on-one sessions with a specialized professional (e. g. a Speech Language Pathologist). However, responding to cost concerns and poor access to these services, and also recognizing the importance of frequent intervention sessions, usage of computerized intervention systems is becoming more common. These computerized intervention systems have been steadily improving. However, one significant drawback continues to be their restricted response modalities, typically consisting of the child using a touch screen or a pointing device to select from a set of pictures. By confining the phonemic awareness skills that the system addresses to those that can be tapped into via picture-point-and-click, these systems have a restricted scope of what they can teach. A second drawback of many current systems is that their user interface (e. g. visual layout, tempo) is typically not tunable to the individual characteristics of the child. Given the prevalence of phonemic awareness issues in a broad range of neurodevelopmental disorders, including Autism Spectrum Disorder and Developmental Language Disorder, individual tuning may be critical to address individual neurocognitive weaknesses, such as problems in memory, attention, visual scanning, perceptual motor coordination, and processing speed. We have addressed these drawbacks by (1) taking advantage of drag-and-drop and other touch response modalities that current low-cost touch screen computers are capable of processing and that children are increasingly more familiar with, and (2) by incorporating multiple dimensions of individual tunability into the system. My role: Since 2005, I have been the primary developer of the BioSpeech text-to-speech system, a medium-size software project. For this project, I assisted with integration with the graphical user interface, as well as provided solutions to the problem of synthesizing illegal (i. e. not found in normal use of English) phoneme sequences. Amount: $216K.
- 2009/09/01–2013/08/31: National Science Foundation IIS-0915754, "RI: Small: Modeling Coarticulation for Automatic Speech Recognition", PI: Kain (OHSU). We have developed a data-driven, triphone formant trajectory model and methodology for estimating its parameters. In this model, formant targets are speaker dependent, but independent of speaking style. We have validated this model using perceptual listening tests. An analysis of conversationally and clearly spoken speech confirmed that (1) formant trajectories in clear vowels reach their targets more frequently, (2) formants show considerable asynchronicity, and (3) phoneme formant targets approximate their expected values. We also found preliminary evidence that targets derived from clear speech alone perform better at modeling both styles than targets from conversational speech. Having created and validated this model, we are now in the process of applying the approach to disordered speech, paving the way for an objective diagnosis of the degree of coarticulation in dysarthria. Another application is an objective evaluation of the effectiveness of specific speech interventions for certain kinds of dysarthria, e. g. the Lee Silverman Voice Treatment. Finally, this research may also provide an avenue for automatically transforming conversationally-spoken speech to sound as if it had been spoken clearly, thus increasing its intelligibility. A real-time, transparent version of this algorithm would be a desirable feature in many general telecommunications devices. My role: As PI, I am responsible for all aspects of the project, including overall project supervision and management, as well as mentoring of graduate student Bush. Amount: $466K.
- 2009/07/15–2012/06/30: National Science Foundation IIS-0905095, "HCC: Automatic detection of atypical patterns in cross-modal affect", PI: van Santen (OHSU). The expression of affect in face-to-face situations requires the ability to generate a complex, coordinated, cross-modal affective signal, having gesture, facial expression, vocal prosody, and language content modalities. This ability is compromised in neurological disorders such as Parkinson's disease and autism spectrum disorder (ASD). The long-term goal is to build computer-based interactive systems for remediation of poor affect communication and diagnosis of the underlying neurological disorders based on analysis of affective signals. A requirement for such systems is technology to detect atypical patterns in affective signals. We developed a play situation for eliciting affect and collected audio-visual data from approximately 60 children between the ages of 4 and 7 years, half of them with ASD and the other half constituting a control group of typically developing children. We labeled the data on relevant affective dimensions, developed algorithms for the analysis of affective incongruity, and then tested the algorithms against the labeled data in order to determine their ability to differentiate between ASD and typical development. My role: I created special delexicalized speech stimuli, using a novel delexicalization algorithm that rendered the lexical content of an utterance unintelligible while preserving important acoustic prosodic cues. Preference tests showed that the proposed method preserved drastically more speaker identity, and sounded more natural, than conventional methods. These delexicalized speech stimuli were used in perceptual tests to exclude the effect of lexical content on affect.
- 2009/07/17–2012/06/30: National Institutes of Health 5R21DC010035, "Quantitative Modeling of Segmental Timing in Dysarthria", PI: van Santen (OHSU). The project seeks to apply a quantitative modeling framework to segment durations in sentences produced by speakers with a variety of neurological diagnoses and dysarthrias. My role: I was responsible for software development for custom recording of speech data and for the extension of my previously published hybridization algorithm for the purposes of creating special perceptual speech stimuli.
- 2008–2009: Nancy Lurie Marks Family Foundation award, "In Your Own Voice: Personal AAC Voices for Minimally Verbal Children with Autism Spectrum Disorder", PI: van Santen (OHSU). My role: I performed research and development to adapt a text-to-speech voice to sound like a particular child's voice; a task made particularly challenging by the difficulty of extracting reliable acoustic features from children's speech.
- 2007/09/01–2011/08/31: National Science Foundation IIS-0713617, "HCC: High-quality Compression, Enhancement, and Personalization of Text-to-Speech Voices", PI: Kain (OHSU). My role: Together with Ph. D. students and co-investigators, I developed text-to-speech (TTS) technologies that focus on elimination of concatenation errors and improved accuracy in the areas of coarticulation, degree of articulation, prosodic effects, and speaker characteristics, using an asynchronous interpolation model that Jan van Santen and I proposed in 2002. These algorithmic advances added to the general acceptability of Speech Generating Devices (SGDs), used by individuals with impaired or absent speech communication.
- 2007/01/01–2008/06/30: National Institutes of Health 1R41DC008712, "User Adaptation of AAC Device Voices - Phase 1", PI: van Santen (BioSpeech). Speech communication ability is impaired or absent in millions of Americans due to neurological disorders and diseases and to trauma, including autism, Parkinson's disease, and stroke. Augmentative and Alternative Communication (AAC) devices that are operated via switches, keyboards, and a broad range of other input devices, and that have synthetic speech as output, are often the only manner in which these individuals can communicate. A psychologically important feature that no currently available systems have is the ability to speak with the user's voice, i.e., the ability to produce speech that mimics the individual's pre-morbid speech or speech that the individual may be able to intermittently produce. This project used voice transformation (VT) technology to accomplish this goal. My role: I developed and evaluated voice transformation and prosody modification technologies to customize synthetic voices using concatenative speech synthesis technologies, with the aim of mimicking the individual user's pre-morbid speech.
- 2006/09/01–2008/03/31: National Institutes of Health 1R41DC007240, "Voice Transformation for Dysarthria - Phase 1", PI: van Santen (BioSpeech). Dysarthria is a motor speech disorder due to weakness or poor coordination of the speech muscles. Affected muscles include the lungs, larynx, oro- and nasopharynx, soft palate, and articulators (lips, tongue, teeth, and jaw). The degree to which these muscle groups are compromised determines the particular pattern of speech impairment. For example, poor lung function affects the overall volume or loudness, while problems with specific articulators may cause mispronunciations of certain phonemes. There is a great variety of diseases that can cause dysarthria, including Parkinson’s, Multiple Sclerosis, and strokes. My role: I continued development of software that transforms speech compromised by dysarthria into easier-to-understand and more natural-sounding speech. In addition, I designed a hardware configuration that allowed the software to reside on a wearable computer, with a headset microphone as input and powered speaker as output, giving the user full mobility while wearing the speaking-aid.
- 2005/01/10–2010/12/31: National Institutes of Health 5R01DC007129, "Expressive cross-modal affect integration in Autism", PI: van Santen (OHSU). Autistic Spectrum Disorders (ASD) form a group of neuropsychiatric conditions whose core behavioral features include impairments in reciprocal social interaction and communication, as well as repetitive, stereotyped, or restricted interests and behaviors. The importance of prosodic deficits for the adaptive communicative competence of speakers with ASD, as well as for a fuller understanding of the social disabilities central to these disorders, is generally recognized; yet current studies are few in number and have significant methodological limitations. The objective of the proposed project is to detail prosodic deficits in young speakers with ASD through a series of experiments that address these disabilities and related areas of function. My role: I developed a delexicalization algorithm that rendered the lexical content of an utterance unintelligible, while preserving important acoustic prosodic cues.
- 2005/01/01–2006/06/30: National Science Foundation IIP-0441125, "STTR Phase 1: Small Footprint Speech Synthesis", PI: Kain (BioSpeech). Text-to-speech (TTS) systems have recognized societal benefits for universal access, education, and information access by voice. For example, TTS-based augmentative devices are available for individuals who have lost their voice; and reading machines for the blind have been available for several decades. My role: I developed and implemented a novel algorithm that led to dramatic decreases in disk and memory requirements at a given speech quality level and minimization of the amount of voice recordings needed to create a new synthetic voice. The latter point enabled building personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds or for individuals who are about to undergo surgery that will irreversibly alter their speech.
- 2001/10/01–2005/09/30: National Science Foundation IIS-0117911, "Making Dysarthric Speech Intelligible", PI: van Santen (OHSU). My role: I developed software that transforms speech compromised by dysarthria into easier-to-understand and more natural-sounding speech. The strategy for improving intelligibility is the manipulation of a small set of highly relevant speech features; specifically the energy, pitch, and formant frequencies of an input speech waveform. Pitch and energy are appropriately smoothed, and formant frequencies are mapped with a joint-density Gaussian mixture model, a technique I first introduced in 1998 that since has become the most often used mapping technique in the field. Results from perceptual tests indicated that the transformation improved intelligibility, and that the accompanying removal of the vocal fry improved perceived naturalness.
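For illustration, the joint-density Gaussian mixture mapping referenced in the project above can be sketched compactly. The following is a minimal, generic example with synthetic one-dimensional features using scikit-learn and scipy; it is not the project's actual implementation, which operated on formant frequencies extracted from dysarthric speech.

```python
# Minimal sketch of a joint-density GMM mapping: model p(x, y) with a GMM,
# then convert a source feature x to the conditional expectation E[y | x].
# Synthetic 1-D data for illustration only.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (500, 1))                          # source features
y = np.sin(3 * x) + 0.05 * rng.normal(size=x.shape)      # target features

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([x, y]))                               # joint density p(x, y)

def convert(x_new, dx=1):
    """Map source features of shape (N, dx) to expected target features E[y | x]."""
    # responsibilities of each mixture component under the marginal p(x)
    resp = np.array([
        w * multivariate_normal.pdf(x_new, m[:dx], C[:dx, :dx])
        for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)
    ]).T
    resp /= resp.sum(axis=1, keepdims=True)
    out = np.zeros((len(x_new), gmm.means_.shape[1] - dx))
    for k, (m, C) in enumerate(zip(gmm.means_, gmm.covariances_)):
        # per-component conditional mean: mu_y + (x - mu_x) Sigma_xx^-1 Sigma_xy
        cond = m[dx:] + (x_new - m[:dx]) @ np.linalg.solve(C[:dx, :dx], C[:dx, dx:])
        out += resp[:, [k]] * cond
    return out

print(convert(np.array([[0.25], [0.75]])))
```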
4.4 Publications/Creative Work
In the following lists, students or interns under my mentorship are underlined.
Peer-reviewed Journal Articles and 4–5 page Conference Papers
In preparation / submitted
- N. Sathe, A. Kain, and L. Reiss, “Fusion and Identification of Dichotic Consonants in Normal-Hearing and Hearing-Impaired Listeners”, JSLHR
Published
2021
- F. Van Brenk, A. Kain, K. Tjaden, “Investigating Acoustic Correlates of Intelligibility Gains and Losses During Slowed Speech: A Hybridization Approach”, American Journal of Speech-Language Pathology, May 2021
- F. Van Brenk, K. Stipancic, A. Kain, K. Tjaden, “Intelligibility across a reading passage: The effect of dysarthria and cued speaking styles”, American Journal of Speech-Language Pathology
2020
- D. Britton, A. Kain, Y-W. Chen, J. Wiedrick, J. O. Benditt, A. L. Merati, D. Graville, “Extreme sawtooth sign in motor neuron disease (MND) suggests laryngeal resistance to forced expiratory airflow”, Annual Convention of the American Speech-Language-Hearing Association, 2020.
- A. Kain, A. Roten, R. Gale, “Diacritic-Level Pronunciation Analysis using Phonological Features”, ICASSP 2020.
- P. Wallis, D. Yaeger, A. Kain, X. Song, M. Lim, “Automatic Event Detection of REM Sleep Without Atonia from Polysomnography Signals using Deep Neural Networks”, ICASSP 2020.
- T. Dinh, A. Kain, K. Tjaden, “Improving Speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion”, Interspeech 2020.
- T. Dinh, A. Kain, R. Samlan, B. Cao, J. Wang, “Increasing the Intelligibility and Naturalness of Speech of Laryngectomees”, Interspeech 2020
2019
- F. Van Brenk, K. Tjaden, A. Kain, “Identifying Acoustic Correlates of Speaker-Dependent Variation in Slowed Speech Intelligibility: A Hybridization Approach”, International Congress of Phonetic Sciences (ICPhS), 2019.
- T. Dinh, A. Kain, K. Tjaden, “Using a Manifold Vocoder for Spectral Voice and Style Conversion”, Interspeech, 2019.
2018
- S. Dudy, S. Bedrick, M. Asgari, A. Kain, “Automatic Analysis of Pronunciations for Children with Speech Sound Disorders”, Computer Speech & Language Journal, 2018.
2017
- A. Kain, M. Del Giudice, K. Tjaden, “A Comparison of Sentence-level Speech Intelligibility Metrics”, Interspeech, 2017.
- S. Mohammadi, A. Kain, “Siamese Autoencoders for Speech Style Extraction and Switching Applied to Voice Identification and Conversion”, Interspeech, 2017.
- S. Mohammadi, A. Kain, “An Overview of Voice Conversion Systems”, Speech Communication, 2017.
2016
- S. Mohammadi, A. Kain, “A Voice Conversion Mapping Function based on a Stacked Joint-Autoencoder”, Interspeech, 2016.
- B. Snider and A. Kain, “Classification of Respiratory Effort and Disordered Breathing during Sleep from Audio and Pulse Oximetry Signals”, ICASSP, 2016.
2015
- M. Langarani, J. van Santen, S. Mohammadi, A. Kain, “Data-driven Foot-based Intonation Generator for Text-to-Speech Synthesis”, Interspeech, 2015.
- S. Mohammadi, A. Kain, “Semi-supervised Training of a Voice Conversion Mapping Function using a Joint-Autoencoder”, Interspeech, 2015.
- S. Dudy, M. Asgari, and A. Kain, “Pronunciation Analysis for Children with Speech Sound Disorders”, IEEE Engineering in Medicine and Biology Society (EMBC), Milan, 2015. (PMC4710861).
2014
- A. Amano-Kusumoto, J.-P. Hosom, A. Kain, J. Aronoff, “Determining the relevance of different aspects of formant contours to intelligibility”, Speech Communication, vol. 59, April 2014.
- K. Tjaden, A. Kain, J. Lam, “Hybridizing Conversational and Clear Speech to Investigate the Source of Increased Intelligibility in Parkinson’s Disease”, Journal of Speech, Language, and Hearing Research, Volume 57, August 2014.
- S. Mohammadi, A. Kain, “Voice conversion using Deep Neural Networks with speaker-independent pre-training”, IEEE Spoken Language Technology Workshop (SLT), 2014.
- B. Bush, A. Kain, “Modeling Coarticulation in Continuous Speech”, Interspeech 2014.
2013
- S. Mohammadi, A. Kain, “Transmutative Voice Conversion”, ICASSP, 2013.
- B. Bush, A. Kain, “Estimating Phoneme Formant Targets and Coarticulation Parameters of Conversational and Clear Speech”, ICASSP, 2013.
- B. Snider and A. Kain, “Automatic Classification of Breathing Sounds during Sleep”, ICASSP, 2013.
2012
- S. Mohammadi, A. Kain, J. van Santen, “Making Conversational Vowels More Clear”, Proceedings of Interspeech, 2012.
- E. Morley, E. Klabbers, J. van Santen, A. Kain, S. Mohammadi, “Synthetic F0 can Effectively Convey Speaker ID in Delexicalized Speech”, Interspeech, 2012.
2011
- E. Morley, J. van Santen, E. Klabbers, A. Kain, “F0 Range and Peak Alignment across Speakers and Emotions”, ICASSP, 2011.
- B. Bush, J.-P. Hosom, A. Kain, and A. Amano-Kusumoto, “Using a genetic algorithm to estimate parameters of a coarticulation model”, Interspeech, 2011.
2010
- A. Kain and T. Leen, “Compression of Line Spectral Frequency Parameters using the Asynchronous Interpolation Model”, Proceedings of 7th ISCA Workshop on Speech Synthesis, September 2010.
- A. Kain and J. van Santen, “Frequency-domain delexicalization using surrogate vowels”, Interspeech, 2010.
- A. Amano-Kusumoto, J.-P. Hosom, and A. Kain, “Speaking style dependency of formant targets”, Interspeech, 2010.
- E. Klabbers, A. Kain, and J. van Santen, “Evaluation of speaker mimic technology for personalizing SGD voices”, Interspeech, 2010.
2009
- A. Kain, J. van Santen, “Using Speech Transformation to Increase Speech Intelligibility for the Hearing- and Speaking-impaired”, Proceedings of ICASSP, April 2009.
- Q. Miao, A. Kain, J. van Santen, “Perceptual Cost Function for Cross-fading Based Concatenation”, Proceedings of Interspeech, 2009.
- R. Moldover, A. Kain, “Compression of Line Spectral Frequency Parameters with Asynchronous Interpolation”, Proceedings of ICASSP, April 2009.
2008
- A. Kain, A. Amano-Kusumoto, and J.-P. Hosom, “Hybridizing Conversational and Clear Speech to Determine the Degree of Contribution of Acoustic Features to Intelligibility”, Journal of the Acoustical Society of America, vol. 124, issue 4, October 2008, pp. 2308–2319.
2007
- A. Kain, J. Hosom, X. Niu, J. van Santen, M. Fried-Oken, J. Staehely, “Improving the Intelligibility of Dysarthric Speech”, Speech Communication, vol. 49, issue 9, September 2007, pp. 743–759.
- E. Klabbers, J. van Santen, A. Kain, “The Contribution of Various Sources of Spectral Mismatch to Audible Discontinuities in a Diphone Database”, IEEE Transactions on Audio, Speech, and Language Processing Journal, Volume 15, Issue 3, pp. 949–956, March 2007.
- A. Kusumoto, A. Kain, P. Hosom, and J. van Santen, “Hybridizing Conversational and Clear Speech”, Proceedings of Interspeech, August 2007.
- A. Kain, Q. Miao, J. van Santen, “Spectral Control in Concatenative Speech Synthesis”, Proceedings of 6th ISCA Workshop on Speech Synthesis, August 2007.
- A. Kain and J. van Santen, “Unit-Selection Text-to-Speech Synthesis Using an Asynchronous Interpolation Model”, Proceedings of 6th ISCA Workshop on Speech Synthesis, August 2007.
2006
- X. Niu, A. Kain, J. van Santen, “A Noninvasive, Low-cost Device to Study the Velopharyngeal Port During Speech and Some Preliminary Results”, Proceedings of Interspeech, September 2006.
2005
- J. van Santen, A. Kain, E. Klabbers, and T. Mishra, “Synthesis of Prosody using Multi-level Unit Sequences”, Speech Communication Journal, vol. 46, issues 3–4, pp. 365–375, July 2005.
- X. Niu, A. Kain, J. van Santen, “Estimation of the Acoustic Properties of the Nasal Tract during the Production of Nasalized Vowels”, Proceedings of EUROSPEECH, September 2005.
2004
- A. Kain, X. Niu, J. Hosom, Q. Miao, J. van Santen, “Formant Re-synthesis of Dysarthric Speech”, Proceedings of 5th ISCA Workshop on Speech Synthesis, June 2004.
- J. van Santen, A. Kain, and E. Klabbers, “Synthesis by Recombination of Segmental and Prosodic Information”, Speech Prosody 2004, March 2004.
- H. Duxans, A. Bonafonte, A. Kain, and J. van Santen, “Including Dynamic and Phonetic Information in Voice Conversion Systems”, Proceedings of ICSLP, October 2004.
2003
- J. Hosom, A. Kain, T. Mishra, J. van Santen, M. Fried-Oken, J. Staehely, “Intelligibility of modifications to dysarthric speech”, Proceedings of ICASSP, May 2003.
- A. Kain and J. van Santen, “A speech model of acoustic inventories based on asynchronous interpolation”, Proceedings of EUROSPEECH, pp. 329–332, August 2003.
- J. van Santen, L. Black, G. Cohen, A. Kain, E. Klabbers, T. Mishra, J. de Villiers, X. Niu, “Applications of computer generated expressive speech for communication disorders”, Proceedings of EUROSPEECH, pp. 1657–1660, August 2003.
2002
- A. Kain and J. van Santen, “Compression of Acoustic Inventories using Asynchronous Interpolation”, Proceedings of IEEE Workshop on Speech Synthesis, pp. 83–86, September 2002.
- J. van Santen, J. Wouters, and A. Kain, “Modification of Speech: A Tribute to Mike Macon”, Proceedings of IEEE Workshop on Speech Synthesis, September 2002.
2001
- A. Kain and M. Macon, “Design and Evaluation of a Voice Conversion Algorithm based on Spectral Envelope Mapping and Residual Prediction”, Proceedings of ICASSP, May 2001.
2000 and earlier
- A. Kain and Y. Stylianou, “Stochastic Modeling of Spectral Adjustment for High Quality Pitch Modification”, Proceedings of ICASSP, June 2000, vol. 2, pp. 949–952.
- J. House, A. Kain, and J. Hines, “ESP - Metaphor for learning: an evolutionary algorithm”, Proceedings of GECCO 2000, Las Vegas, NV.
- A. Kain and M. Macon, “Personalizing a speech synthesizer by voice adaptation”, Third ESCA / COCOSDA International Speech Synthesis Workshop, November 1998, pp. 225–230.
- A. Kain and M. Macon, “Text-to-speech voice adaptation from sparse training data”, Proceedings of ICSLP, November 1998, vol. 7, pp. 2847–50.
- A. Kain and M. Macon, “Spectral Voice Conversion for Text-to-Speech Synthesis”, Proceedings of ICASSP, May 1998, vol. 1, pp. 285–288.
- S. Sutton, R. Cole, J. de Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan, E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, M. Cohen, “Universal Speech Tools: The CSLU Toolkit”, Proceedings of ICSLP, November 1998, vol. 7, pp. 3221–24.
- N. Malayath, H. Hermansky, A. Kain and R. Carlson, “Speaker-independent Feature Extraction by Oriented Principal Component Analysis”, Proceedings of EUROSPEECH 1997.
Abstracts
- F. van Brenk, A. Kain, and K. Tjaden, “Acoustic Correlates of Intelligibility in Dysarthria: Findings from Between-Speaker Hybridization”, Boston Speech Motor Control Symposium, 2021.
- F. van Brenk, A. Kain, and K. Tjaden, “Investigating Intelligibility Gains for a Slowed Rate Using Hybridization”, Twentieth Biennial Conference on Motor Speech: Motor Speech Disorders & Speech Motor Control, 2020.
- D. Britton, A. Kain, Y-W. Chen, J. Wiedrick, J. O. Benditt, A. L. Merati, D. Graville, “Extreme sawtooth sign in motor neuron disease (MND) suggests laryngeal resistance to forced expiratory airflow”, Dysphagia Research Society 28th Annual Meeting, 2020.
- N. Sathe, A. Kain, and L. Reiss, “Fusion and Identification of Dichotic Consonants in Normal-Hearing and Hearing-Impaired Listeners”, ARO Mid-Winter Meeting, 2019.
- D. Britton, A. Kain, Y-W. Chen, J. Wiedrick, J. O. Benditt, A. L. Merati, D. Graville, “Extreme sawtooth sign in motor neuron disease (MND) suggests laryngeal resistance to forced airflow”, Fall Voice Conference, 2018.
- B. Snider and A. Kain, “Estimation of Localized Ideal Oximetry Sensor Lag via Oxygen Desaturation–Disordered Breathing Event Cross-Correlation”, SLEEP: Journal of Sleep and Sleep Disorders Research, 40, page A232, 2017.
- J.-P. Hosom, A. Kain, and B. Bush, “Towards the recovery of targets from coarticulated speech for automatic speech recognition”, The Journal of the Acoustical Society of America, 130(4), page 2407, 2011.
- A. Kain, "Speech transformation: Increasing intelligibility and changing speakers", Journal of the Acoustical Society of America, 126(4), page 2205, 2009.
Ph. D. Thesis
Technical Reports
- B. R. Snider and A. Kain, “Adaptive Reduction of Additive Noise from Sleep Breathing Sounds”, CSLU-2012-001.
- A. Kain, J.-P. Hosom, S. H. Ferguson, B. Bush, “Creating a speech corpus with semi-spontaneous, parallel conversational and clear speech”, CSLU-11-003.
Patents
- A. Kain and A. Roten, OHSU. Mispronunciation Detection with Phonological Feedback. (Provisional, filed 2020-04-08)
- J. van Santen and A. Kain, OHSU. System and Method for Compressing Concatenative Acoustic Inventories for Speech Synthesis.
- A. Kain and Y. Stylianou, AT&T Research Laboratories. Stochastic Modeling Of Spectral Adjustment For High Quality Pitch Modification.
Datasets
- A. Kain, The VOICES dataset, Linguistic Data Consortium Catalog, LDC2006S01, ISBN 1-58563-363-1, 2006.
OHSU Disclosures
- #2867 Phonological Feature-based Automatic Pronunciation Analysis and Feedback System, 02/11/2020
- #2805 Decomposing autoencoder, 09/18/2019
- #2803 Joint Deep autoencoder, 09/18/2019
- #2717 Children's pronunciation database, 04/08/2019
- #2489 TimeView software, 08/15/2017, non-exclusively licensed under the MIT open-source license
- #2275 PyTTS Text-to-Speech software with 16 voices, 05/12/2016, Exclusively Licensed
- #1365 Mexican Spanish female diphone voice, 12/08/2008
- #1364 Mexican Spanish male diphone voice, 12/08/2008
- #1362 American English female diphone voice (AS), 12/08/2008
- #1361 American English male speaker diphone voice, 12/08/2008
- #1360 German male speaker diphone voice, 12/08/2008
- #1359 German female speaker diphone voice, 12/08/2008
- #1358 New Flinger singing synthesis, 12/08/2008
- #1195 Clear-Speech Corpus, Speaker JPH, 05/07/2007
- #1065 Controlling Formant Frequencies in Concatenative Speech Synthesis Systems, 05/16/2006
- #1061 Noninvasive Nasal Flow Measurement Device and Algorithm, 05/11/2006
- #0868 CSLU System and Method for Synthesis Based Speech Enhancement, 09/24/2004, Exclusively Licensed
- #0844 CSLU Voice transformation for Dysarthria with Formant Re-synthesis, 06/03/2004, Exclusively Licensed
- #0665 Voice Transformation (High Resolution), 11/13/2002
- #0566 Method to compress concatenative acoustic inventories for speech synthesis, 07/01/2001, Exclusively Licensed.
4.5 Invited Lectures, Conference Presentations, or Professorships
International and National
- Conference presentation: “Mispronunciation Detection and Feedback for Children with Speech Sound Disorders”, Machine Learning for Health, Portland, OR, 2020.
- Conference presentation: “Siamese Autoencoders for Speech Style Extraction and Switching Applied to Voice Identification and Conversion”, Interspeech, Stockholm, Sweden, 2017.
- Conference presentation: “A Comparison of Sentence-level Speech Intelligibility Metrics”, Interspeech, Stockholm, Sweden, 2017.
- Conference presentation: “Semi-supervised Training of a Voice Conversion Mapping Function using a Joint-Autoencoder”, Interspeech, Dresden, Germany, 2016.
- Conference presentation: “Hybridizing Conversational and Clear Speech to Investigate the Source of Intelligibility Variation in Parkinson’s Disease”, Conference on Motor Speech, Sarasota, Florida, 2014.
- Conference presentation: “Transmutative Voice Conversion”, ICASSP, Vancouver, Canada, 2013.
- Conference presentation: “Frequency-domain delexicalization using surrogate vowels”, Interspeech, Makuhari, Japan, 2010.
- Conference presentation: “Compression of Line Spectral Frequency Parameters using the Asynchronous Interpolation Model”, 7th ISCA Workshop on Speech Synthesis, Kyoto, Japan, 2010.
- Conference presentation: “Hybridizing Conversational and Clear Speech to Determine the Degree of Contribution of Acoustic Features to Intelligibility”, Meeting of the Acoustical Society of America, 2009, San Diego, CA.
- Conference presentation for a Special Session on Voice Transformation: “Using Speech Transformation to Increase Speech Intelligibility for the Hearing- and Speaking-impaired”, ICASSP, Taipei, Taiwan, 2009.
- Conference presentation: “Compression of Line Spectral Frequency Parameters with Asynchronous Interpolation”, ICASSP, Taipei, Taiwan, 2009.
- Conference presentation: “Hybridizing Conversational and Clear Speech”, Interspeech, Antwerp, Belgium, 2007.
- Conference presentation: “Spectral Control in Concatenative Speech Synthesis”, 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 2007.
- Conference presentation: “Unit-Selection Text-to-Speech Synthesis Using an Asynchronous Interpolation Model”, 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 2007.
- Conference presentation: “Formant Re-synthesis of Dysarthric Speech”, 5th ISCA Workshop on Speech Synthesis, Pittsburgh, PA, USA, 2004.
- Conference presentation: “A speech model of acoustic inventories based on asynchronous interpolation”, EUROSPEECH, Geneva, Switzerland, 2003.
- Conference presentation: “Compression of Acoustic Inventories using Asynchronous Interpolation”, IEEE Workshop on Speech Synthesis, Santa Monica, CA, 2002.
- Conference presentation: “Design and Evaluation of a Voice Conversion Algorithm based on Spectral Envelope Mapping and Residual Prediction”, ICASSP, Salt Lake City, UT, 2001.
- Conference presentation: “Spectral Voice Conversion for Text-to-Speech Synthesis”, ICASSP, Seattle, WA, 1998.
Regional and Local
- Presentation at CSLU Seminar Series approximately 1–2 times annually
4.6 Awards
- 2020 OHSU Early-stage Technology Development award
- 2017, 2013 OHSU Technology Transfer and Business Development Award
- 2017 NVIDIA Academic Hardware Donation Program
- 2005 OHSU Commercialization Award
5 Service
5.1 Membership in Professional Societies
- International Speech Communication Association (ISCA)
- Institute of Electrical and Electronics Engineers (IEEE)
- Acoustical Society of America (ASA)
5.2 Granting Agency Review Work
- Spinal Cord Injury/Disease Research Program, 2018.
- National Science Foundation, 2010, 2013.
5.3 Editorial and Ad Hoc Review Activities
- Review of 1–3 journal articles annually. I have reviewed for:
- Journal of the Acoustical Society of America
- Computer Speech & Language
- IEEE Transactions on Audio, Speech, and Language Processing (guest editor for the special issue on Voice Transformation, Volume 18, Issue 5, July 2010)
- Speech Communication Journal
- ACM Transactions on Accessible Computing
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Journal of Speech, Language, and Hearing Research
- International Journal of Speech-Language Pathology
- Review of 4–8 conference papers annually for the international conferences Interspeech and ICASSP. These conference papers are five pages long, and are scored along several dimensions.
5.4 Committees
International/National
- Publications Chair for the international conference Interspeech 2012 in Portland, Oregon. Over several months, I coordinated with the Technical Program Committee, the Organizing Committee, and the Professional Conference Organizer to produce the electronic proceedings and the abstract book.
- Member of the Portland Institute for Computational Science (www.pi4cs.org).
Departmental
- Participation in multiple distinct Dissertation Advisory Committees (DACs, for Ph. D. students) and Thesis Advisory Committees (TACs, for Master's students). These are typically semi-annual half-hour meetings wherein a student meets with his/her research advisor and other faculty to discuss progress, course work, and future plans.
- Participation in the Qualifying Exam Committee, an annual one-day meeting wherein pre-qualifying Ph. D. students present their Qualifying Exam Work to faculty and other students. Faculty are assigned to be readers on several papers, and the written work and presentation are scored along several dimensions.
- Participation in Ph. D. Thesis Committees as needed. Prior to the Ph. D. defense, several faculty are assigned as evaluators of the written thesis, a task that typically takes 1–3 days due to the volume of information (usually over 100 pages).
- Member of the Admissions Committee reviewing applications of M. S. and Ph. D. students to the CS/EE program.
- From 2014 to 2017, I was a member of the Faculty Council Committee, which provides the Dean with informed, representative faculty and departmental opinion and counsel on the affairs and problems of the Medical School, especially in areas of administrative and operational policies directly concerned with educational matters.
5.5 Activities
- Co-organized the 2020 OHSU-PSU Workshop on Statistical Learning for Health (www.pi4cs.org/mlhworkshop)
- Created and maintain several open-source projects:
- TimeView (https://github.com/lxkain/timeview) is a cross-platform (Windows, MacOS, Linux) desktop application for viewing and editing Waveforms, Time-Value data, and Segmentation data. These data can easily be analyzed or manipulated using a library of built-in processors; for example, a linear filter can operate on a waveform, or an activity detector can create a segmentation from a waveform. Processors can be easily customized or created from scratch.
- Multiple Isotonic Regression (https://github.com/lxkain/multi-isoreg) is an algorithm that, given a sequence, finds the minimum error and any number of optimal inflection points of segments that are either monotonically rising or falling. This allows finding shapes like up-down (one peak), down-up-down, or up-down-up-down (two peaks), etc.; special emphasis was placed on performance. A minimal sketch of the single-peak case appears at the end of this section.
- Joint-Density Regression (https://github.com/lxkain/jd-reg) can create non-linear mapping functions using K-Means or GMMs.
- Worked on designing a formal mentee feedback mechanism to mentors, as an outcome of the 2017 “Making a Meaningful Difference” Leadership class offered by OHSU's Niki Steckler.
- Offered Speech Processing Consultation to the OHSU community.
- Maintained CSLU's Virtual Reality Laboratory, which features an HTC Vive headset, driven by Valve's SteamVR software. Custom worlds can be created using Epic's Unreal Engine.
- Maintained CSLU's Audio Laboratory hardware and software systems, featuring two WhisperRoom sound-proof booths, a 10-channel 32-bit 96 kHz Focusrite recording audio interface connected to a Digital Audio Workstation running custom software, a Kay Elemetrics laryngograph for capturing a high-quality voicing signal, high-quality condenser microphones, numerous digital video cameras, a teleprompter, and other support equipment.
- In 2012–2013, Izhak Shafran and I worked with Portland State University (PSU) on increasing educational collaboration. As a result, CSLU and PSU cross-listed certain Computer Science and Electrical Engineering courses. In 2016, this became an institution-wide agreement.
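As referenced in the Multiple Isotonic Regression entry above, the single-peak ("up-down") case can be sketched with scikit-learn's one-directional isotonic regression. This is an illustrative brute-force version under assumed synthetic data; the released package generalizes the idea to arbitrary numbers of inflection points with an emphasis on speed.

```python
# Sketch of the "up-down" (single-peak) case: try every split point, fit an
# increasing isotonic segment before it and a decreasing one after it, and keep
# the split with the lowest squared error.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.arange(50)
y = np.concatenate([np.linspace(0, 5, 25), np.linspace(5, 1, 25)]) + rng.normal(0, 0.3, 50)

best = (np.inf, None)
for k in range(2, len(x) - 2):                                    # candidate inflection point
    up = IsotonicRegression(increasing=True).fit_transform(x[:k], y[:k])
    down = IsotonicRegression(increasing=False).fit_transform(x[k:], y[k:])
    err = np.sum((y[:k] - up) ** 2) + np.sum((y[k:] - down) ** 2)
    if err < best[0]:
        best = (err, k)
print("estimated inflection point:", best[1])
```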
6 Education
6.1 Students
- Mentored Ph. D. students: Dinh, Snider, Mohammadi, Bush, Dudy, Khan, Moldover. I typically meet each of them one-on-one for 1–1.5 hours weekly to discuss their research. I have also co-mentored Ph. D. students: Langarani, Wallis, Resalat, Sathe, Bayestehtashk, Niu, Amano-Kusumoto, Miao. These students have found employment at Apple, Microsoft, Amazon, and other similar high-tech firms.
- Co-/mentored Master's students: Roten, Soethiha, Yaeger, Alder, Velata, Moore.
- When appropriate, I lead weekly 1.5-hour project group meetings wherein 2–7 students and possibly additional faculty members report on and discuss their research with each other.
- When appropriate, I lead weekly 1-hour reading group meetings, wherein one student presents a published paper of his/her choice to a group of students and faculty (invitation is open to all of CSLU), with subsequent discussion.
- Supervised 6 undergraduate students in the summers of 2008, 2009, 2013, and 2016, funded through the National Science Foundation's Research Experiences for Undergraduates (REU) program and through the University Center for Excellence in Developmental Disabilities (UCEDD).
- Worked with volunteers who would like to become Research Assistants, students, or co-authors on a publication with me.
6.2 Courses
Nearly all of my lectures are taught using jupyter notebooks, which, in addition to regular text and equations, allow for interactive code examples, graphical widgets, and data visualization during class. Students can download these notebooks and use them as reference or starting points for their own projects.
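As an illustration, a minimal interactive cell of the kind these notebooks enable might look as follows (assuming the ipywidgets and matplotlib packages; the actual course notebooks are more elaborate):

```python
# A sine wave whose frequency can be adjusted live with a slider during lecture.
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

t = np.linspace(0, 1, 1000)

def plot_sine(frequency=5.0):
    plt.plot(t, np.sin(2 * np.pi * frequency * t))
    plt.xlabel("time [s]")
    plt.show()

interact(plot_sine, frequency=(1.0, 20.0))
```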
CS 627 Data Science Programming
This course is a best-of compilation of concepts, practices, and python-based software libraries (all free, open-source, and unrestricted) that allow for rapid, straightforward, and easy-to-maintain implementation of new ideas and scientific questions. Students will gain awareness and initial working knowledge of some of the most fundamental computational tools for performing a wide variety of academic research. As such, it will focus on providing breadth instead of depth, which means that for each concept we will talk about motivation, key concepts, and concrete usage scenarios, but without exhaustive mathematical background or proofs, which can be acquired in more specialized classes. In this class we will: write programs in python; perform numeric tasks using numpy and scipy; manage data using pandas; discuss audio, image and text processing using scipy.signal, scikit-image, nltk, and pynini; apply machine learning algorithms such as deep neural networks, convolutional neural networks, and autoencoders using scikit-learn and keras; visualize data using matplotlib and pyqtgraph; use PyQt/Qt to build graphical user interfaces; address performance issues via compilation/profiling/parallelization tools; and much more (winner of the 2019 Sakai Torchbearer Award).
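A small sketch in the spirit of the course material, using scikit-learn's built-in iris dataset (illustrative only, not an actual assignment):

```python
# Load a small tabular dataset as a pandas DataFrame, train a tiny neural
# network, and report held-out accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris = load_iris(as_frame=True)           # features as a pandas DataFrame
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```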
I have created the curriculum for, and teach, this 3-credit course (20 1.5-hour lectures). Creating the curriculum required approximately 200 hours. Due to the quickly changing landscape at the edge of technology, updating and teaching the course requires approximately 100 hours each time it is offered. Grading students' answers and evaluating their project outcomes requires a total of approximately 3 hours per student over the course of the class (unless a TA is available). Students' evaluation scores averaged 5.0/5.0 in Fall 2015, 5.15/6.0 in Fall 2016, 5.64/6.0 in Winter 2018, and 5.83/6.0 in Spring 2019.
EE 682 Digital Signal Processing
This course teaches students the core principles of digital signal processing. We survey a variety of topics in class lecture/discussion based on assigned readings while exploring specific topics/applications in depth through lab assignments and a final project. Specifically, we cover the core topic areas in digital signal processing, including an overview of discrete-time signals and systems, the discrete-time Fourier transform, the z-transform and transform analysis, the discrete Fourier series, the discrete Fourier transform, circular convolution, network structures for FIR systems, design of IIR and FIR filters, and multi-rate processing.
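A lab-style sketch of the kind of filter-design exercise the course includes, using scipy.signal (parameters are illustrative, not taken from actual assignments):

```python
# Design a 63-tap low-pass FIR filter and inspect its frequency response.
import numpy as np
from scipy import signal

fs = 8000                                              # sampling rate [Hz]
taps = signal.firwin(numtaps=63, cutoff=1000, fs=fs)   # low-pass, 1 kHz cutoff
w, h = signal.freqz(taps, worN=512, fs=fs)             # frequency response (w in Hz)
gain_db = 20 * np.log10(np.abs(h))
print("gain at 2 kHz: %.1f dB" % gain_db[np.argmin(np.abs(w - 2000))])
```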
I co-teach this 3-credit course (10 1.5-hour lectures). Creating the lectures required approximately 180 hours. Students' evaluation scores averaged 5.13/6.0 in Winter 2019.
EE 658 Speech Signal Processing
Speech systems are becoming commonplace in today's computer systems and Augmentative and Alternative Communication (AAC) devices. Topics include speech production and perception by humans, linear predictive features, pitch estimation, speech coding, speech enhancement, prosodic speech modification, Voice Conversion (VC), Text-to-Speech (TTS), and automatic speech recognition (ASR).
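A minimal sketch of autocorrelation-method linear-prediction (LPC) analysis on a synthetic frame, illustrative of the linear predictive features topic above (not taken from actual course materials):

```python
# Estimate LPC coefficients for one frame of a synthetic vowel-like signal
# via the autocorrelation method (normal equations solved with a Toeplitz solver).
import numpy as np
from scipy.linalg import solve_toeplitz

fs = 16000
t = np.arange(0, 0.032, 1 / fs)                                # one 32-ms frame
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 720 * t)
frame *= np.hamming(len(frame))

order = 12
r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation
a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])     # predictor coefficients
lpc = np.concatenate(([1.0], -a))                              # A(z) = 1 - sum a_k z^-k
print("LPC polynomial coefficients:", np.round(lpc, 3))
```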
I have created the curriculum for, and teach, this 3-credit course (20 1.5-hour lectures). Creating the curriculum required approximately 180 hours. Due to the quickly changing landscape at the edge of technology, updating and teaching the course requires approximately 200 hours each time it is offered. Grading students' answers and evaluating their project outcomes requires a total of approximately 3 hours per student over the course of the class (unless a TA is available). Students' evaluation scores averaged 4.7/5.0 in Winter 2016.
CS 653 Text-to-Speech Synthesis
This course will introduce students to the problem of synthesizing speech from text input. Speech synthesis is a challenging area that draws on expertise from a diverse set of scientific fields, including signal processing, linguistics, psychology, statistics, and artificial intelligence. Fundamental advances in each of these areas will be needed to achieve truly human-like synthesis quality and advances in other realms of speech technology (like speech recognition, speech coding, speech enhancement). In this course, we will consider current approaches to sub-problems such as text analysis, pronunciation, linguistic analysis of prosody, and generation of the speech waveform. Lectures, demonstrations, and readings of relevant literature in the area will be supplemented by student lab exercises using hands-on tools.
I have created the curriculum for, and teach, the second half of this 3-credit course (10 1.5-hour lectures). Creating the curriculum required approximately 120 hours of my time. Students' evaluation scores averaged 4.4/5.0 in 2015.
CS 606 Computational Approaches to Speech and Language Disorders
This course covers a range of speech and language analysis algorithms that have been developed for measurement of speech or language based markers of neurological disorders, for the creation of assistive devices, and for remedial applications. Topics include introduction to speech and language disorders, robust speech signal processing, statistical approaches to pitch and timing modeling, voice transformation algorithms, speech segmentation, and modeling of disfluency. The class uses a wide array of clinical data, and is closely tied to several ongoing research projects.
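As an illustration of the pitch-modeling topic, a toy autocorrelation-based pitch estimator might look as follows (synthetic signal only; robust pitch tracking of clinical data requires considerably more machinery):

```python
# Toy autocorrelation-based F0 estimator for a single synthetic voiced frame.
import numpy as np

fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 200 * t)                    # synthetic 200-Hz "voiced" frame

ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lo, hi = int(fs / 400), int(fs / 60)                   # search lags for 60-400 Hz
lag = lo + np.argmax(ac[lo:hi])
print("estimated F0: %.1f Hz" % (fs / lag))
```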
I have created and taught two 1.5-hour lectures for this course.
6.3 Presentations
- Speaker at the 2017 OHSU Symposium on Educational Excellence, on “Interactive lecturing with jupyter notebooks”.
- Twice-yearly Course Advertisement Talks to preview, for prospective students, my courses scheduled for the following quarter.
6.4 Awards
- 2019 Sakai Torchbearer Award for the use of jupyter notebooks in education
- 2017 nominated for OHSU Excellence in Graduate Teaching Award
- 2014 OHSU Excellence in Graduate Teaching Award