IS2011 Student Event Report: Multimedia and Multimodal Interaction

Submitted by abe on Sat, 11/03/2012 - 22:50

 

The student lunch at the table I was assigned to was hosted by Murat Saraclar (Google and Bogazici University) and Gareth Jones (Dublin City University), and the topic of discussion was spoken term detection and spoken document retrieval. The students attending were: Tsung-Hsien Wen (National Taiwan University), Deepak Krishnarajanaga Thotappa (IIT Guwahati), Haiyang Li (Harbin Institute of Technology), Atta Norouzian (McGill), Hungya Lee (National Taiwan University), Maria Eskevich (Dublin City University), and myself, Abe Kazemzadeh (USC SAIL Lab). We spent most of the time doing introductions and learning about each other's research, but later it became more of a conversational discussion. Some topics that came up were:

  • the differences between term detection and document retrieval
  • differences in data sources, e.g., broadcast news vs. meeting data
  • discriminative language models
  • the work of Amit Singhal at Google
  • the MALACH corpus of spoken video interviews
  • an older NIST overview paper on spoken document retrieval

Here are some pictures of the student lunch:

 

Report written by Abe Kazemzadeh


Report on Student Lunch Event "Students meet experts" at InterSpeech 2012

Submitted by Hannes Pessentheiner on Mon, 10/15/2012 - 09:38

 


Table 3: ASR: Signal processing with Steve Renals and Miroslav (Mirek) Novak

 

Experts:

  1. Steve Renals, Professor of Speech Technology, University of Edinburgh, UK; interested in understanding human communication using machine learning and statistical models, and constructing systems that can recognize and interpret communication scenes.
  2. Miroslav (Mirek) Novak, Research Staff Member at IBM T. J. Watson Research Center, USA; interested in large-vocabulary speech recognition and efficient algorithms for speech recognition.

 

Participants (see picture):

  1. Kailash Patil (Johns Hopkins University)
  2. Jeffrey Kallay (The Ohio State University)
  3. Preethi Jyothi (The Ohio State University)
  4. Fethi Bougares (LIUM, Le Mans, France)
  5. Antti Hurmalainen (Tampere University of Technology)
  6. Hannes Pessentheiner (Graz University of Technology)

Text:

Get in touch to benefit from each other's knowledge and experience, make contacts for future collaborations, become friends: just a couple of words, just a couple of consecutive events, with a giant impact on a young scientist's career. (A personal thought about this event.)

This year's student lunch event wasn't just a nice meal around midday; it was a big opportunity for six young scientists to share a delicious lunch with two well-known experts in academic and industrial research: Steve Renals and Miroslav (Mirek) Novak.

Let me introduce Mirek first. He is a research staff member at the IBM T. J. Watson Research Center, USA. His interests range from large-vocabulary speech recognition to efficient algorithms for speech recognition. Steve, our second expert, is a professor of speech technology at the University of Edinburgh, UK. He focuses on understanding human communication using machine learning and statistical models, as well as on constructing systems that can recognize and interpret communication scenes. The two work in different research environments (Steve at a university, Mirek at an industrial research center), which often led to colorful and interesting discussions.

After a short introduction of each participant's working area, we started a discussion far away from automatic speech recognition (ASR) and signal processing, but closely related to everyone's career and life: the big differences between academic and industrial research, family and science, communication problems between different research groups, hiring at universities and industrial enterprises, and funding flexibility. Besides that, we talked about the difficulty of pursuing the same research interests in industry. In the second part of our event we moved on to topics closely related to ASR: real-time processing on old PCs, the advantages of C++ in comparison to C, HTK and its performance, and Sphinx and its toolkits, among others.

All in all, everyone enjoyed this year's student lunch event (especially the extraordinary sandwiches and side dishes) and gained a lot of new experience, contacts, and motivation for the upcoming research and social challenges.

 

Report written by Hannes Pessentheiner


Notes from the students’ roundtable on TTS, Interspeech 2012

Submitted by admin on Sat, 11/03/2012 - 14:25

 

This roundtable hosted eight participants with varied backgrounds and interests in TTS: one expert from academia, one expert from industry, and six doctoral students working on topics such as hybrid synthesis systems, speech intelligibility enhancement, and voice morphing.

The discussion was primarily driven by the industry expert, who identified a range of challenges in the field and prompted an involved discussion among the participants. The expert from academia provided valuable insights into some of these challenges from a theoretical perspective and outlined research paths to address them.

Some of the discussed issues are listed below:

  • The evolution of hybrid (waveform + parametric) TTS systems towards a fully parametric solution
      ◦ Unit selection vs. HMM-based synthesis
      ◦ Level of waveform injection (word, phone)
      ◦ Lack of expressiveness in HMM-based synthesis
      ◦ Over-smoothing issues in HMM-based synthesis (see the toy sketch after this list)
      ◦ Quality issues in HMM-based synthesis
  • Emotions and style in TTS
      ◦ The choice of words carries an emotional attachment: extracting the author's intention through semantic analysis and applying it to style adaptation
      ◦ Memory in TTS and its relation to the emotional attachment of words
      ◦ Feature deficiency for style representation: duration and pitch are not enough
      ◦ Pause prediction: a difficult problem that requires modeling at the paralinguistic level
      ◦ Change of emotion within an utterance
  • Data-related issues
      ◦ Powerful machine learning tools could improve TTS, but insufficient training data is a bottleneck for using them
      ◦ The importance of initialization in model training and its dependence on data availability
      ◦ Automated data acquisition through web-based applications
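
The over-smoothing point is worth a small illustration. The toy sketch below is my own (in Python, with made-up numbers, loosely in the spirit of global-variance compensation, and not anything presented at the roundtable): statistical averaging shrinks the variance of generated parameter trajectories, and rescaling around the mean can restore it.

    import numpy as np

    # Toy illustration of over-smoothing in statistical parametric synthesis:
    # averaging many natural parameter tracks yields a trajectory whose
    # variance is far below that of natural speech.
    rng = np.random.default_rng(0)
    natural = rng.normal(0.0, 1.0, size=(50, 200))  # 50 natural F0-like tracks
    generated = natural.mean(axis=0)                # averaging = over-smoothing
    print(natural.var(), generated.var())           # generated variance is tiny

    # Global-variance-style compensation: rescale the generated track around
    # its mean so its variance matches that of natural speech.
    scale = np.sqrt(natural.var() / generated.var())
    compensated = (generated - generated.mean()) * scale + generated.mean()
    print(compensated.var())                        # close to natural variance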

 

Report written by Petko Petkov


Interspeech 2012 - Students Meet Experts - Table 7

Submitted on Thu, 10/11/2012 - 14:52

 

Experts: Prof. Mark Hasegawa-Johnson is a professor in the Department of Electrical and Computer Engineering at the University of Illinois. Prof. Keikichi Hirose is a professor in the Department of Information and Communication Engineering at the University of Tokyo.

Students: Andrew Fandrianto, Luying Hou, Anna Katharina Fuchs, Tim Mahrt, and Barbara Samlowski

We started off the session by having each scholar introduce themselves. In addition to providing their names and university affiliations, the students also briefly discussed some of their current research.

Andrew Fandrianto is a Master's student at the Language Technologies Institute at Carnegie Mellon University. Andrew works on the automatic detection of anger and hyperarticulation.

Luying Hou is studying at Shanghai International Studies University, China. Her research is about dubbing in Chinese films and how dubbed speech in Mandarin takes on aspects of the target foreign language of a film.

Anna Katharina Fuchs is at the Signal Processing and Speech Communication Laboratory, Graz University of Technology. Her work is on speech pathology and the use of prosody to enhance comprehension.

Tim Mahrt is a PhD student at the University of Illinois at Urbana-Champaign. His work is on the analysis of acoustic cues to prosodic prominence, particularly with respect to speaker differences.

Barbara Samlowski is at Bielefeld University, Germany. Her most recent work involved an experiment with stressed and unstressed syllables in different sentential positions.

After introductions, and while we ate lunch, we had a discussion about the nature of prosody and its role in speech. It was acknowledged that there are difficulties in reconciling results from studies of spontaneous speech with results from read speech or speech collected in a controlled environment. No one had a suggestion for how this might be overcome.

Luying described her work on dubbing in films in more detail. Moviegoers in China may get the feeling that the dubbers in foreign films take on the qualities of the film's target language: for example, French-accented Mandarin in a French film or American-English-accented Mandarin in an American film. Through acoustic analysis of speech in Mandarin, the target language of the movie being dubbed, and the dubbed productions by native speakers of Mandarin, Luying found that the qualities of the dubbed speech lie somewhere between the dubbers' native Mandarin and the target language, showing that subjects were sensitive to, and could imitate, some of the qualities that distinguish different languages.

Overall, the lunch was a good opportunity to gain some exposure to scholars working on different areas of prosody; even though everyone at the table worked on prosody, the specific subdomains were quite different.

 

Report written by Tim Mahrt


Interspeech 2012 Student Roundtable Lunch Event - Report

Submitted by gezhenhao on Thu, 09/27/2012 - 19:43

 

Table 13

Participants:

Experts: Prof. Helmer Strik, Prof. Martin Russell

Students: Ann Lee, Zhenhao Ge, Christos Koniaris, Hyuksu Ryu, Khairun-nisa Hassanali, Shou-Chun Yin

Main topics: mispronunciation detection and pronunciation learning techniques

 

Self-introductions:

Prof. Helmer Strik is a professor in the Department of Language and Speech at Radboud University Nijmegen in the Netherlands. He works on developing Computer-Assisted Language Learning (CALL) systems based on Automatic Speech Recognition (ASR).

Prof. Martin Russell is a professor in the School of Electronic, Electrical and Computer Engineering at the University of Birmingham in the UK. His research interests are in speech and language technology and the integration of speech with other modalities, such as gaze and gesture.

Ann Lee is a PhD student at MIT, currently working on sentence boundary detection using multiple annotations. Christos Koniaris, from Greece, is a PhD student at KTH Royal Institute of Technology in Sweden who defended his PhD thesis last week. His PhD research has two directions: speech recognition, and mathematical methods for mispronunciation detection.

Hyuksu Ryu is a master's student at Seoul National University, working in the spoken language processing lab in the Department of Linguistics. His research interests include comparing the quality of transcriptions produced by native and non-native transcribers and improving the agreement of a non-native English speech corpus transcribed by non-natives.

I am Zhenhao Ge, working on pronunciation training and accent detection. I have worked on several projects, including an online tutor that helps American students correct mispronunciations when learning Spanish, and the optimization of MFCC frequency scales to improve accuracy in mispronunciation detection. I am also working on two other speech-related projects for a software company: grammar-based name recognition and accent detection using the Foreign Accented English (FAE) corpus.

Shou-Chun Yin is a PhD student from Korea, currently studying at McGill University in Canada. His research is in the area of pronunciation verification for speech therapy applications and the development of tools for CALL.

Khairun-nisa Hassanali is a PhD student at the University of Texas at Dallas. She is currently working on detecting language impairment from child-language transcripts, which involves interesting problems such as measuring language development, detecting grammatical errors, parsing challenges, coherence analysis, and topic detection.

I discussed some practical issues of accent detection with Prof. Russell, such as how to remove the influence of the speaker before detecting accents; otherwise we would detect speakers rather than their accents. Prof. Russell suggested that I use intersession variability compensation techniques to remove speaker differences. Real-time accent detection is really difficult even for human beings, so we should first test our algorithms on longer durations, such as 30 seconds of speech, rather than aggressively working on speech that is too short, such as 10 seconds.
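
For readers unfamiliar with the idea, here is a minimal sketch of what such compensation might look like. This is my own toy illustration (a simplified nuisance-attribute-projection over hypothetical utterance-level features), not the specific technology Prof. Russell had in mind; all data and dimensions are made up.

    import numpy as np

    def speaker_nuisance_projection(X, speakers, accents, k=5):
        """Estimate the top-k directions of between-speaker variability
        within each accent class and project them out, so that a
        downstream classifier keys on accent rather than speaker."""
        d = X.shape[1]
        scatter = np.zeros((d, d))
        for acc in np.unique(accents):
            in_acc = accents == acc
            mu_acc = X[in_acc].mean(axis=0)
            for spk in np.unique(speakers[in_acc]):
                mu_spk = X[in_acc][speakers[in_acc] == spk].mean(axis=0)
                diff = (mu_spk - mu_acc)[:, None]
                scatter += diff @ diff.T
        _, vecs = np.linalg.eigh(scatter)   # eigenvalues in ascending order
        U = vecs[:, -k:]                    # top-k nuisance (speaker) directions
        P = np.eye(d) - U @ U.T             # projector that removes them
        return X @ P                        # P is symmetric

    # Hypothetical data: 300 utterance-level vectors, 30 speakers, 4 accents
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))
    speakers = rng.integers(0, 30, size=300)
    accents = rng.integers(0, 4, size=300)
    X_compensated = speaker_nuisance_projection(X, speakers, accents)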

Some of the participants discussed the paper on DBN-HMM mispronunciation detection presented on Wednesday morning and explained why DBN pre-training before the HMM stage can maximize the probability of generating the data without introducing class labels.
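
For context, the unsupervised pre-training they described can be illustrated with a single restricted Boltzmann machine layer. The sketch below is my own toy version of contrastive-divergence (CD-1) training, not the code from the paper; every update nudges the model towards assigning higher probability to the training vectors, and no class label ever appears.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def rbm_cd1_epoch(V, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
        """One sweep of CD-1 training for a binary RBM over data rows V."""
        for v0 in V:
            p_h0 = sigmoid(W @ v0 + c)                  # infer hidden units
            h0 = (rng.random(p_h0.shape) < p_h0) * 1.0  # sample them
            v1 = sigmoid(W.T @ h0 + b)                  # one-step reconstruction
            p_h1 = sigmoid(W @ v1 + c)
            # approximate gradient of the data log-likelihood (no labels)
            W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
            b += lr * (v0 - v1)
            c += lr * (p_h0 - p_h1)
        return W, b, c

    # Toy usage: 100 binary vectors, 30 visible units, 10 hidden units
    rng = np.random.default_rng(1)
    V = (rng.random((100, 30)) < 0.3) * 1.0
    W = 0.01 * rng.normal(size=(10, 30))
    b, c = np.zeros(30), np.zeros(10)
    W, b, c = rbm_cd1_epoch(V, W, b, c)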

 

Report written by Zhenhao Ge


Report for IS2012 event from Table 14, Robust ASR

Submitted by admin on Tue, 10/09/2012 - 14:52

 

Expert: Roger Moore

Students: Joris Pelemans, Arun Narayanan, Yali Zhao, Matthew Seigel, Philip Harding, Jose Gonzales and Harm Buisman (Reporter)

In the beautiful Heritage Ballroom of the Governor Hotel in Portland we had the opportunity to have lunch with Roger Moore, the 007 of the speech community and an expert on robust ASR. Upon arrival the tables were already set in an artistic fashion (see the photo). As we introduced ourselves we found that our backgrounds and research topics were very diverse, covering large-vocabulary speech recognition, robust speaker recognition, robust ASR, speech enhancement, social speech processing, and even military espionage.

The discussion started with the accessible topic of tourist attractions in Portland, but soon ventured into the depths of everyone's research area. Moore turned out to be an expert not just on robust ASR, but also on each of our topics, and he had useful suggestions for each of us.

Moore challenged us with philosophical questions and ideas. He asked us whether there is such a thing as noise. One of my fellow students said no, which was clearly the answer Moore had in mind: it is common practice in speech recognition to add noise, but what we fail to recognize is that distortions in communication have structure, and the human auditory system compensates for this. Later in the discussion this idea returned. Speech recognition, Moore says, is not about recognition but about communication, and in our research we should keep that perspective in mind. That is the only way to make progress.

But where are we now? Without giving an explicit answer, consider the two following short anecdotes. Every year there is a survey among speech researchers asking at what date they expect ASR to reach the level of human speech recognition. Every year, the average response date moves further into the future, at an ever-increasing rate: the closer we get, the harder we realize the problem is. But what about systems such as Siri? Moore sometimes gives talks to the general public and asks the audience whether they know of dictation systems or apps such as Siri. Almost 100% do. When he asks how many have actually tried one, about 80% raise their hands. How many had a good experience? Only a few hands. How many plan to continue using it? Even fewer. The general public is now aware of our technology, but large steps have to be taken in usability.

Should we be worried that we picked the wrong field? No, says Moore. We picked a field that we will fall in love with, and on top of that, we are very employable. Speech technology is a complex area: we use sophisticated techniques, yet realize that we do not have all the answers. We should be confident in our capabilities and our future. As Moore put it: “If you have a go at speech, you can have a go at pretty much anything.”

While the future shines bright for us, I look back at an enlightening lunch on a sunny day at Interspeech 2012.


 

Report written by Harm Buisman

 


IS2012 Students meet Experts report - Table 20: Spoken language as media content

Submitted by Joao Felipe Santos on Thu, 10/25/2012 - 13:29

The Students Meet Experts event was organized by the ISCA Student Advisory Committee as an extra activity for students attending InterSpeech 2012. Students and experts met and had lunch together, while discussing interesting problems within their fields of expertise.

This year, we had the opportunity to enjoy a wonderful lunch with Kay Berkling from DHBW-Karlsruhe and Martha Larson from TU-Delft. Our table topic was "Spoken language as media content"; however, this was just one of the many topics we discussed.

These were the student participants:

  • Ehsan Variani (Center for Language and Speech Processing, Johns Hopkins University)
  • João Felipe Santos (MuSAE Lab, INRS-EMT)
  • Tim Polzehl (Quality and Usability Lab, TU-Berlin)
  • Dogan Can (Signal Analysis and Interpretation Lab, USC)
  • Majid Mirbagheri (Institute for System Research, UMD)
  • Kartik Audhkhasi (Signal Analysis and Interpretation Lab, USC)
  • Yun-Nung Chen (Language Technologies Institute, CMU)

The discussions started with a presentation by all participants. And when I say that the discussions started there, I mean it! Even describing our research topics led to interesting conversations and interaction.

At some point, we ended up discussing career paths and the future of education. One interesting point was that we do not know exactly how teaching will evolve over the next 10 years. The success of initiatives like Coursera and Udacity suggests that in a few years the figure of the professor giving a lecture in a university room may not be as common as it is today, and we should be prepared for that.

We also discussed the importance of events like Students Meet Experts for networking. It was indeed a wonderful opportunity to meet and discuss interesting topics not only with colleagues from other institutions, but also with great experts from our field.

To finish, here are a couple of pictures from our table at the event, taken from the album posted to the ISCA-SAC Facebook group by Maria Eskevich.

 

Report written by Joao Felipe Santos