Interspeech 2014 Students Roundtable - Table ASR1: Automatic Speech Recognition

Submitted by andi on Mon, 09/29/2014 - 10:57

 

Table: ASR1

Expert: Isabel Trancoso

Reporter: Jochen Weiner

As people arrived, we talked about where we are from and what we are currently working on. The talk then turned to thesis writing. Isabel Trancoso gave us some advice for the writing process, such as not to start (and get stuck) on the introduction. Instead, start with the content chapters (because their content has often already been published in papers) and only then turn to the introduction and conclusion. We also discussed the meaning of a thesis for the research community. While most researchers mainly read papers, which is where cutting-edge work is published, a thesis should give an overview of a research area. Thus a well-written thesis can become a commonly read (and well-cited) text for people new to the research area, such as new PhD students.

The talk then went back to the students’ projects and their respective research institutions, during which time we were joined by representatives of two of the sponsors.

The talk then turned to publishing in a journal. The students shared some of their experience with review processes and reviewing times, while Isabel Trancoso shared some of her experience as Editor-in-Chief of a journal.

Finally, we talked about ISCA-SAC, which Isabel Trancoso is very fond of. We were joined by Catharine Oertel, ISCA-SAC’s current General Coordinator, and talked about ISCA-SAC’s activities, their search for new volunteers and how we could get involved.

It was a very nice discussion, so I would like to thank Isabel Trancoso, Catharine and the other students for taking part in this student lunch event.

 

Report written by Jochen Weiner


Interspeech 2014 Students Roundtable - Table DIA: Machine Processing of Dialogue and Spontaneous Speech

Submitted by andi on Mon, 09/29/2014 - 11:03

 

Table: DIA

Experts: Joakim Gustafson, Alexander Rudnicky

Reporter: Inigo Casanueva

At this table we met five students with different research interests:

  • Sheng Li, from Kyoto University, interested in academic lecture transcription
  • I-Fan Chen, from Georgia, interested in keyword search for low-resource languages
  • David Guy Brizan from New York, interested in accent in dialogue interaction
  • Sheng-Syun Shen, from Taiwan university, interested in summarization
  • Iñigo Casanueva, from University of Sheffield, interested in personalised dialogue systems

We were lucky to share the lunch with two well-known experts in the field of dialogue:

  • Joakim Gustafson, head of the Speech group and deputy manager of the Department of Speech, Music and Hearing at KTH, Sweden
  • Alexander I. Rudnicky, Research Professor at the Computer Science Department in the School of Computer Science at Carnegie Mellon University, in the Carnegie Mellon Speech Group.

Joakim and Alexander were really helpful, and during the lunch they shared their opinions on diverse topics related to dialogue and on the main challenges that researchers in this field are currently dealing with. We started by talking about how humans adapt their accent to the person (or machine) they are interacting with, and went on to the challenges of multi-party dialogue interaction and of long-term interaction with a dialogue system. The conversation continued with the differences between goal-oriented dialogue and conversational interaction, and the difficulty of defining an objective function for a conversational agent. Many issues that arise when designing a dialogue system may look trivial, such as timing during turn-taking or handling mixed-initiative systems, but they remain unresolved and are the subject of a lot of ongoing research. Two other interesting questions were asked during the lunch: "How does a machine know when it is not understanding something?" and "How do humans know when we don’t understand something, so that we know when it’s the moment to ask for clarification?" The lunch finished with other interesting topics such as disfluencies, realistic synthesis and emotion detection.

In general, it was a very pleasant lunch, especially because Joakim and Alexander were really friendly and helpful throughout, and they made us feel very comfortable even though we were talking with world-class researchers.

 

Report written by Inigo Casanueva


Interspeech 2014 Students Roundtable - Table MM1: Multimodality/Multimedia

Submitted by andi on Mon, 09/29/2014 - 11:08

Table: MM1

Expert: Alexandros Potamianos

Reporter: Fei Tao

First, we introduced ourselves and talked about our research. Even though all of us were assigned to the same table, we found that our work was diverse. The idea of "Multimodal and Multimedia" covers a lot, including visual and textual information besides audio.

Second, we went further and talked about the concept of multimodality. We discussed which kinds of modality are useful and whether more modalities would help an ASR system.

Third, we talked about how to be a good PhD student (or, say, researcher). We agreed that one should have interest in and passion for research, because PhD study lasts about 4-5 years. Only these things can keep us focused on the research; otherwise, we may get bored. We also felt rather honored when we talked about the differences between PhD students and other graduate students. We realized that we do research to create new things: we are innovators and inventors, while others will follow our work.

Fourth, we talked about how, besides doing research, a PhD student should also have the skill of presenting their own work, as many researchers did at the conference. The conference gave us a great chance to learn this kind of skill.

 

Report written by Fei Tao


Interspeech 2014 Students Roundtable - Table REC1: Speaker Recognition/Verification/Diarization

Submitted by andi on Mon, 09/29/2014 - 11:12

 

Table: REC1

Expert: John Hansen

Reporter: Karthika Vijayan

As student attendees of Interspeech 2014, we were provided with this wonderful opportunity to have lunch with eminent personalities, who work in different research areas of speech signal processing. Several discussions ranging from recent trends in scientific research to career planning in industry and academia were triggered over the lunch table.

I was fortunate enough to have Prof. John H. L. Hansen as the expert for discussions. Along with my fellow PhD scholars, I got the chance to discuss my anxieties and expectations in pursuing research in speech processing. Prof. Hansen was kind enough to discuss the differences between working in industry and in academia. He described the challenges of being a university professor, including requesting and obtaining research funds from different funding agencies, managing a student community by properly motivating them, and so on. Working in industry, on the other hand, is mainly about teamwork. Also, the kind of company one works in determines the nature and role of the job. As a person with no major job experience, I found these discussions quite fruitful.

Prof. Hansen elaborated on the importance of building research relationships at conferences. He advised all of us to meet new people at conferences like this one, rather than only talking to and maintaining relationships with people from our own countries. He strongly recommended that we come out of our comfort zones and explore. He also asked us to seek help from our PhD advisors in getting contacts for postdoc as well as industrial research positions. Prof. Hansen asked each of us personally about our career ambitions and took time to comment on each of our choices.

In the meantime, Dr. Sharmistha Gray, core researcher at Nuance, joined us, and we got to learn about her experiences of working in different companies. She also talked about the importance of maintaining a proper relationship with one's advisor or manager while doing research in a university or in industry. As a woman working in industrial research, her experiences were worth learning from.

In the end, Prof. Hansen offered to give further guidance on decisions about our future research and careers. I personally found the student lunch event a valuable experience, and I am quite sure that my fellow scholars felt the same. I would like to thank the Interspeech 2014 organizing team for providing us with this opportunity, and I look forward to similar student interaction sessions in the future.


(from left to right: Saeid Safavi, Bing Jiang, Prof. John H.L. Hansen and Karthika Vijayan.)

 

Report written by Karthika Vijayan


Interspeech 2014 Students Roundtable - Table REC2: Speaker Recognition/Verification/Diarization

Submitted by andi on Mon, 09/29/2014 - 11:13

 

Table: REC2

Expert: Najim Dehak

Reporter: Rohan Kumar Das

We were assigned to speaker recognition table 2, where our expert was Dr. Najim Dehak, research scientist from MIT. He needs no introduction, as he is well known for introducing i-vector based modeling to the field of speaker recognition. Three of us joined him for the lunch: Mr. Brecht Desplanques from Ghent University, Miss Qian Zhang from UT Dallas and myself, Rohan Kumar Das from Indian Institute of Technology Guwahati.

First, each of us introduced ourselves to Dr. Dehak and explained in detail the area in which we were working. I described my research on speaker verification with short utterances, whereas Brecht and Qian were working on language identification and explained their work. For short utterances, Dr. Dehak suggested using source features, because phonetic content hurts the speaker recognition task. The other two students explained the hurdles that they were facing in their research, and our expert gave them some insights into dealing with those problems.

It was a really great lunch with Dr. Dehak, as he not only motivated us technically but also inspired us to set high goals and not to be afraid of asking anyone questions. He also introduced us to some team members from Nuance during our lunch, which was very useful for us. The lunch ended with all of us sharing our contact details and presentation times at Interspeech 2014, as Dr. Dehak was interested in attending our presentations.

I thank ISCA and Interspeech 2014 on behalf of all the members of our table for organizing such an event.


(from left to right: Qian Zhang, Prof. Najim Dehak, Brecht Desplanques, Rohan Kumar Das)

 

Report written by Rohan Kumar Das


Interspeech 2014 Students Roundtable - Table TTS1: Text To Speech

Submitted by andi on Mon, 09/29/2014 - 11:17

 

Table: TTS1

Experts: Alan Black, David Winarsky

Reporter: Xin Wang

Students:

  • Anandaswarup Vadapalli (International Institute of Information Technology, India)
  • Peng Liu (Tsinghua University, P.R.China)
  • S Aswin Shanmugam (Indian Institute of Technology Madras, India)
  • Xixin Wu (Tsinghua University, P.R.China)
  • Xin Wang (University of Science and Technology of China, P.R.China)

In general, three topics were covered during the lunch session: 1. the difficulty of pushing speech synthesis towards ultimate naturalness; 2. the role that industry plays in speech synthesis research; 3. the difficulty of deploying speech synthesis systems in a multilingual community.

On the first topic, expressing emotion in speech synthesis systems drew most of the attention. None of the experts thought that the current neutral style of synthesized speech would satisfy the needs of recent applications of speech synthesis, e.g. audiobook reading. A key question, as Xin Wang put it, is the definition of emotion. What's more, according to Prof. Black, if we cannot measure and define different levels of emotion, the power of emotion cannot be fully exploited. Conventional methods may define emotion in a discrete space; as a result, emotion prediction must be incorporated. Unfortunately, explicit emotion modelling requires an annotated corpus and extra effort to train a classification model. For a large corpus, emotion modelling may be daunting because of the prohibitive labor of emotion annotation and the unacceptable inconsistency among annotations from different annotators.

As Xixin said, automatic classification methods may be used instead. A key notion is to classify the emotion in a database in an unsupervised way. Taking this further, a continuous emotion space can be defined, so that a discrete emotion tag is replaced by a continuous vector. Given input text, a continuous vector may be computed to represent the emotion in the text implicitly. Of course, the emotion space is corpus dependent, but an automatic training algorithm may facilitate construction of that space. Due to limited time, further details on the issue were not covered.
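
To make the idea of replacing a discrete emotion tag with a continuous vector more concrete, here is a minimal sketch. It is not the method discussed at the table, which was not described in detail. It assumes that each utterance is already summarised by a small normalised feature vector (the two features used here, pitch and energy, and all values are made up), clusters the utterances without labels using plain k-means, and then represents each utterance by its soft membership in the clusters.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def soft_emotion_vector(point, centroids, temperature=1.0):
    """Continuous representation: softmax over negative squared distances."""
    scores = [-squared_distance(point, c) / temperature for c in centroids]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical, already normalised (pitch, energy) features per utterance.
utterances = [
    (0.1, 0.2), (0.9, 0.8), (0.2, 0.1),
    (0.8, 0.9), (0.15, 0.15), (0.85, 0.85),
]
centroids = kmeans(utterances, k=2)
vectors = [soft_emotion_vector(u, centroids) for u in utterances]
```

Each entry of `vectors` is a point in a small continuous space rather than a hard label, and the space depends entirely on the corpus used to estimate the centroids, which matches the remark that the emotion space is corpus dependent.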

But Prof. Black initiated another interesting topic after the above discussion. He said that the judgment of emotion is subjective. If we could acquire opinions from a mass audience, we might build a powerful model to guide emotional speech synthesis. Unfortunately, university labs have no access to such audience feedback; only large companies can collect large amounts of feedback from consumers.

So, what is the role of industry in speech synthesis? On the one hand, companies can afford the investment in fine-tuning speech products; on the other hand, they are in direct touch with consumers about the shortcomings of current products. In general, industry promotes the application of speech products. But this doesn't mean that academia cannot find a place in speech technology research. For example, companies would not be willing to focus on languages with no writing system, but this is a good topic for research in universities and colleges.

Finally, experts and students discussed speech synthesis in a multilingual environment. An interesting example discussed during the session is that people in P.R.China would speak 'iPhone 6' as 'iPhone six' or 'iPhone liu4' (liu4 is Chinese Pinyin for six) or 'ai4 feng4 liu4' (ai4 feng4 approximates the English pronunciation of 'iPhone'). In a dynamic community where language use is not confined to the native language, a key issue in speech synthesis is how the system can capture the habits of users in the local community. Of course, this is not a simple task.
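
As a small illustration of the 'iPhone 6' example, the sketch below keeps several spoken variants per written token and picks the one heard most often in a given community. The class name `CommunityLexicon` and its methods are hypothetical and only serve this illustration; a real system would need far more than usage counts.

```python
from collections import Counter

# The three spoken variants mentioned in the discussion.
VARIANTS = {"iPhone 6": ["iPhone six", "iPhone liu4", "ai4 feng4 liu4"]}

class CommunityLexicon:
    """Hypothetical helper that tracks which variant a community uses."""

    def __init__(self, variants):
        self.variants = variants
        self.counts = {token: Counter() for token in variants}

    def observe(self, token, spoken):
        """Record one observed spoken form for a written token."""
        if spoken not in self.variants.get(token, []):
            raise ValueError(f"unknown variant {spoken!r} for {token!r}")
        self.counts[token][spoken] += 1

    def preferred(self, token):
        """Return the most frequently observed variant for the token."""
        if token not in self.variants:
            return token  # no variants known: read the token as written
        counts = self.counts[token]
        if not counts:
            return self.variants[token][0]  # nothing observed yet
        # Ties are broken by the order of the variant list.
        return max(self.variants[token], key=lambda v: counts[v])

lexicon = CommunityLexicon(VARIANTS)
default_form = lexicon.preferred("iPhone 6")
for heard in ["iPhone liu4", "iPhone six", "iPhone liu4"]:
    lexicon.observe("iPhone 6", heard)
learned_form = lexicon.preferred("iPhone 6")
```

Before any observations the synthesizer would fall back to the first listed variant; after hearing the community it would switch to the most common one.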

All in all, the event was a great chance to meet students from different parts of the world, to make new contacts, and to learn more about the work that others are conducting in the wide field of TTS.

 

Report written by Xin Wang


Interspeech 2014 Students Roundtable - Table ASR2: Automatic Speech Recognition

Submitted by andi on Wed, 10/01/2014 - 07:59

 

Table: ASR2

Expert: Murat Akbacak

Reporter: Angel Mario Castro Martinez

Students: Angel Mario Castro Martinez, Tim Schlippe, Zhen Huang, Min Ma, Dongpeng Chen

Topics:

  1. DNN adaptation
    • Accent, speaker and language adaptation
    • Minimize errors on each speaker condition
    • Minimize cross-entropy on the speaker-dependent side without deviating from the speaker-independent side
    • Bottleneck linear transformation adaptation
    • Adapt only the input layer
  2. Pronunciation dictionaries
    • Small improvements
    • Mean improvements are misleading, check for particular cases
    • Non-native pronunciation (e.g. how Germans pronounce “iphone”)
    • Keyword search for language improvement
    • Phonemes depend on data (grapheme representation)
    • Pronunciation models for language ID (SVM)
  3. Training low resource ASR
    • RATS (speaker, language ID, voice activity detection)
    • Extract and identify context
    • Machine translation
    • Error detection
    • 0.5 dB SNR conditions
    • Military applications
    • Semantically based taggers
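
The adaptation bullet under topic 1 about minimizing cross-entropy for the speaker while not deviating from the speaker-independent model can be read in several ways; the notes do not name a specific method. One common formalisation, assumed here purely for illustration, is KL-divergence regularisation: the training target for each frame is an interpolation between the true label and the speaker-independent model's posterior, with a weight `rho` (a hypothetical name) that controls how far the adapted model may move. All numbers in the sketch are made up.

```python
import math

def interpolated_target(one_hot, si_posterior, rho):
    """Mix the true label with the speaker-independent (SI) posterior.

    rho = 0 trusts only the adaptation label; rho = 1 keeps the SI output.
    """
    return [(1.0 - rho) * y + rho * p for y, p in zip(one_hot, si_posterior)]

def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, predicted))

def adaptation_loss(one_hot, si_posterior, adapted_posterior, rho):
    """Cross-entropy of the adapted model against the interpolated target."""
    target = interpolated_target(one_hot, si_posterior, rho)
    return cross_entropy(target, adapted_posterior)

# Made-up posteriors over three output classes for a single frame.
label = [0.0, 1.0, 0.0]
si_output = [0.2, 0.6, 0.2]
adapted_output = [0.1, 0.8, 0.1]

loss_label_only = adaptation_loss(label, si_output, adapted_output, rho=0.0)
loss_regularised = adaptation_loss(label, si_output, adapted_output, rho=0.5)
```

Minimising this loss is equivalent, up to a term that does not depend on the adapted model, to minimising the usual cross-entropy weighted by `1 - rho` plus `rho` times the KL divergence from the speaker-independent posterior, so a larger `rho` keeps the adapted model closer to the speaker-independent one.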

 

Report written by Angel Mario Castro Martinez