Interspeech 2013 Student Roundtable discussion - "ASR in industry"

Submitted by benlambert on Tue, 09/03/2013 - 17:37

 

This blog post is a brief summary of our discussion at table #1 of the Interspeech 2013 student round-table event. Our discussion focused on issues relating to the general topic of "ASR in industry." We had eight students attending, and two experts, Michiel Bacchiani and Kate Knill.

We talked a lot about the transition from academia to industry, and the differences in research approach between the two. We discussed a number of aspects of this, such as: the pros and cons of internships, the practicality of research (e.g., in industry one may give up on hard problems sooner), how directed versus independent the work is, and the sharing of ideas with other members of the community in light of patents. We also talked a little about Google's process with new hires: there is a lot of infrastructure to learn about at first.

We also talked a bit about recent algorithms. The main topic was neural networks: the current hype and whether it will last, as well as the history of the field and why neural networks have come back into vogue lately. Finally, we talked a bit about system combination.

In addition to all that, we had a good time getting to know one another, and talking about various other topics such as the local cuisine in Lyon and international politics.

 

Report written by Benjamin Lambert


Interspeech 2013 Student Roundtable - Table #3 - "ASR & General Questions"

Submitted by andrea86 on Thu, 09/05/2013 - 13:37

 

This report is a summary of events during the Student Roundtable discussion for Table #3 at Interspeech 2013. The expert present was Najim Dehak from MIT, and the topics varied in scope.

The discussion started off with an insight into the i-vector paradigm, a number of pitfalls related to short utterances, and how modern applications are focusing on the use of uncertainty to make decisions. The discussion then turned rather philosophical: there seems to be a lack of people working directly in DSP to devise new feature vectors for the multitude of tasks we apply speech technology to. Most people work on modelling, and the feature frontends are not varied as much as they should be at this point.

Following this, the discussion delved into whether, and how, it might be possible to mathematically quantify the complementarity of different fused systems. We recalled ideas from information theory which suggest that fusion at the frontend level is better, and better understood, than fusion at the score level.

Going back to i-vectors and adaptation, it was noted that ASR is on the verge of a number of changes: one related to the technology being used, Deep Neural Networks (DNNs), and one related to using modern adaptation techniques, such as eigenvoice adaptation, within an ASR context. It was noted that DNNs are not really new; they have been around for a very long time. What has changed, however, is the available hardware, which makes the use of DNNs much more practical.

The discussion ended with a number of pointers for careers in academia. The main points were that it is a good idea to read, and read a lot, about many different areas of research, and that it is easier to find funding and publish on new topics that not too many people are working on. Another important aspect is networking, and the usefulness of conferences not just at an academic level, but for getting in contact with possible future principal investigators who might offer a research position later on. Also, having referees outside of your direct circle of research will give you more options when it comes to starting a career in academia.

All in all it was an interesting and varied discussion, with many technical and philosophical points being made.

 

Report written by Andrea DeMarco


Interspeech 2013 Students Roundtable - Table #5 - ASR: Language Model

Submitted by sarah.f.juan on Mon, 09/30/2013 - 09:27

 

At our Student Meets Experts session, we were fortunate to have Frederic Bechet and Tanja Schultz to answer our questions. There were nine of us studying in different countries: Switzerland, England, Germany, France, the USA and Italy. The students included Ali Orkan Bayer (University of Trento), Justin Chiu (Carnegie Mellon University), Maryam Najafian (University of Birmingham), Ramya Rasipuram and Marzieh Razavi (Idiap Research Institute), Zixing Zhang (Technical University of Munich) and myself, from the Grenoble Informatics Laboratory.

We started off by introducing ourselves and getting to know some background of our experts. Despite the noisy environment in the restaurant, we managed to talk (briefly!) about our PhD work to each other at the table.

Some of the valuable advice we received from our experts was about the PhD journey. For instance, the first year should concentrate on networking: searching for potential experts/researchers in the same domain or research area (thus, this session is crucial!). The second and third years should then be spent working on our research (of course!) and also starting to make ourselves visible in the research community through workshops and conferences. With this strategy, current researchers are able to recognize our work early, which is beneficial for our careers after the PhD.

In terms of research and engineering, we got some tips on text normalization issues and on dealing with acoustic models built from the speech of professional and non-professional readers. Besides that, we discussed issues regarding spoken term detection and the "code-switching" phenomenon among multilingual speakers.

Before our session ended, Tanja advised us to think beyond speech in our future research. She expects more scientific work, rather than too much focus on engineering, as she wants to see more discoveries that further our understanding of speech. Personally, I will always keep this in mind!

In summary, the lunch was great, and with great company we definitely enjoyed the discussion. Apparently, we were the last ones to leave the restaurant :)

To my fellow friends at Table #5, au revoir et à bientôt.

Merci beaucoup, Interspeech!

 

Report written by Sarah Samson Juan

Interspeech 2013 Students Roundtable - Table #6 - Speech synthesis and voice conversion

Submitted by gangchen on Fri, 10/04/2013 - 19:44

 

In this Student Meets Experts session, we had two experts: Nick Campbell and Yao Qian. Four students working in fields related to speech synthesis attended. We started by briefly introducing our own research interests, and then discussed the challenges in speech synthesis research.

Nick Campbell (Trinity College Dublin, The University of Dublin):
Corpus-based approaches to speech synthesis and the collection of natural conversational speech in a multimodal environment.

Yao Qian (MSRA):
TTS at MSRA: a hybrid system, with language experts helping the team.

Zhizheng Wu (Nanyang Technological University, Singapore):
Voice conversion; spectral envelope conversion; converting speech to attack speaker verification systems; evaluated by means of spectral distortion and listening tests.

Gang Chen (University of California, Los Angeles, USA):
Glottal source modeling; voice quality analysis; using the glottal source model to generate various voice qualities. Voice quality/glottal source is critical for personalizing a voice.

Jani Nurminen (Tampere University of Technology, Finland):
Low bit rate speech coding; compression of speech for storage.

Xinyu Na (Idiap Research Institute, Switzerland):
Prosody refinement in HTS; pitch production model; changing tone in Mandarin; using STRAIGHT for pitch modification. Low bit rate coding (e.g., 100 bps for military communication): run ASR to convert the speech signal to text, apply a linear transformation of the acoustic model, encode the speaker identity and speech content, and decode (re-synthesize) to generate personalized speech.

Challenges in recent speech synthesis research:

Parametric/HMM synthesis can still be regarded by non-experts as unnatural, although speech experts know how much it has improved in recent years. A general audience tends to judge synthetic speech as non-human and simply dislike it. By comparison, in the computer graphics industry, audiences don't seem to complain that cartoons or anime are "unreal". For human-computer interaction, people expect speech to be real, natural, and human-like.

HMM synthesis needs to capture the variability of a voice, not just the average voice. Intelligibility is not a problem for concatenative synthesis or HTS, but personalized voice is a challenge. A voice might change throughout a recording or over a long time span.

For voice conversion, there is always a tradeoff between transformation and naturalness: the more modification there is, the more distortion there will be.

I would like to thank ISCA-SAC for organizing this great event. I hope it will continue at future Interspeech conferences!

 

Report written by Gang Chen


Interspeech 2013 Student Roundtable - Table #8 - Prosody

Submitted by samsibar on Wed, 09/18/2013 - 16:36

 

Date: 28th August, 2013

Topic: Prosody

Experts: Plínio A. Barbosa (University of Campinas, Brazil), Petra S. Wagner (Bielefeld University, Germany)

Students: George Christodoulides (Université catholique de Louvain, Belgium), Pierre-Edouard Honnet (Idiap Research Institute, Switzerland), Albert Lee (University College London, UK), Rachel Rakov (City University of New York, USA), Barbara Samlowski (Bielefeld University, Germany), Andreas Windmann (Bielefeld University, Germany), Rui Xia (University of Texas at Dallas, USA)

---

This year's "Students meet Experts" event took place at the restaurant "La Scène", which was very close to the conference venue. While enjoying a delicious three-course dinner in a relaxed atmosphere, our table discussed a variety of aspects related to prosody with two distinguished experts on the subject: Professor Petra Wagner from Bielefeld University in Germany and Professor Plínio Barbosa from the University of Campinas in Brazil.

Rather than following the suggested agenda point by point, we opted for an informal, spontaneous discussion session. The long, narrow tables as well as the presence of two experts encouraged individual dialogues rather than one central group discussion. Soon, several animated conversations were taking place simultaneously. Participants shared information about their current research topics and interests, and discussed differences and commonalities of the various projects.

During our discussions we found that we each approached the topic of prosody from different angles. One such angle is the various types of semantic meaning which can be conveyed through prosody. Pierre-Edouard Honnet is working on automatic methods of transferring prosodic meaning from one language to another in speech-to-speech translation systems. Rachel Rakov is investigating how sarcastic utterances can be automatically distinguished from sincere ones on the basis of prosodic cues, while Rui Xia is examining emotion recognition by humans and through automatic systems in different languages.

Prosodic structure reflects the situation of speakers even when it is not specifically used to convey information. George Christodoulides is investigating simultaneous conference interpreting and the effect that the high cognitive load of this task has on prosody. When it comes to the duration of segments, syllables, and words, speakers have to reconcile their wish to reduce articulatory effort with their desire to be understood. Andreas Windmann is working on a model of speech timing which optimizes these competing demands.

Finally, prosody is also influenced by linguistic factors. Albert Lee is investigating how lexical tones interact with word accent in the pronunciation of pitch accents in Japanese, while my focus is on the effects of factors such as word stress, lexical class, and syllable frequency on acoustic prominence and syllable duration.

Considering the different angles, discussion topics were widely varied and included, among other things, conversations about the difficulties of interpreting between languages with different grammatical structures, the effect of media on foreign language pronunciation, and advantages and disadvantages of using forced-alignment systems for data segmentation.

All in all, the event was a great chance to meet students from different parts of the world, to make new contacts, and to learn more about the work that others are conducting in the wide field of prosody.

 

Report written by Barbara Samlowski


Interspeech 2013 Students Roundtable - Table #9 - Speech Applications

Submitted by lezzoum on Thu, 09/19/2013 - 01:23

 

Eight students from all over the world met around a lunch table with two experts in speech processing: Julia Hirschberg and Maxine Eskenazi.

At the beginning of the meeting, each student introduced themselves:

  • Sudarsana is doing his master's degree at IIIT Hyderabad, India. He is working on emotional speech analysis and recognition.
  • Audrey is a PhD student at the Université de Grenoble, France. Her PhD thesis focuses on cortical speech recovery after oral resection following cancer.
  • Emad Grais is pursuing his PhD at Sabanci University, Turkey. He is working on source separation and speech enhancement.
  • Raphael is a PhD student at the Idiap Research Institute / EPFL, Switzerland. His thesis focuses on the perception and information content of background noises in speech signals.
  • Mahnoosh is a PhD student at the Center for Robust Speech Systems, University of Texas at Dallas, USA. Her research interests include the robustness of speech systems (such as speaker and language identification and speech recognition) to speaking styles such as singing.
  • Nathan is doing his PhD at IRISA/Université de Rennes 1, France. He is working on source separation for speech recognition in movies.
  • Thuy is pursuing her PhD in source separation at the University of South Australia. She is also interested in speech applications in medicine and health care, especially using acoustic information for diagnosis.
  • Narimene (me) is doing her PhD at Ecole de Technologie Superieure in Montreal, Canada, working on a smart hearing protection device that guarantees protection while transmitting speech signals to the ear. My research interests include voice activity detection at low SNRs and speech enhancement.

 

After introducing ourselves, the two experts introduced themselves:

  • Julia Hirschberg is a professor and Chair of Computer Science at Columbia University. She previously worked at AT&T Labs on text-to-speech applications, and has also worked on spoken dialogue systems and emotion recognition in speech.
  • Maxine Eskenazi is a Principal Systems Scientist at the Language Technologies Institute in Pittsburgh. She created Fluency, a system that helps improve and correct speech pronunciation, in addition to other speech systems.

 

In this meeting, the discussion revolved around what we (the students) want to do after our studies: teaching and staying in academia, working in a company, or starting our own company. Almost all students preferred working in a company or in academia to starting a new company. The experts therefore shared some of their experiences in companies and outlined advantages and disadvantages of working in a company versus academia. Furthermore, we discussed new speech technologies, such as those used in smartphones, around two questions: how do people accept new speech technologies, and where is speech technology going?

This meeting gave us some insights for the future and a realistic vision of a scientific career, whether in academia or in a company.

The two photos below were taken at the end of the meeting; two of the eight students (Audrey and Nathan) are missing, as well as Maxine.

 

Report written by Narimene Lezzoum


Students meet experts at Interspeech Lyon 2013, table 10, speech production, perception, articulatory models and phonetics

Submitted by pianordgren on Wed, 09/04/2013 - 08:24

 

At Interspeech 2013 in Lyon, we were invited to participate in a Student Roundtable Lunch Event, “Students meet Experts”. The event took place at a very nice restaurant, La Scène in the Hotel de la Cite Concorde. The topics at our table were: speech production, perception, articulatory models and phonetics. We had the opportunity to meet Professor Catherine Best, MARCS Institute, University of Western Sydney, and Slim Ouni, University of Nancy 2, France.

The students present were: Laurence Bruggeman, MARCS Institute, University of Western Sydney; Adele Gregory, Linguistics Department, La Trobe University; Ann-Kathrin Grohe, University of Konstanz; Jessica Siddins, Institute for Phonetics and Speech Processing, LMU Munich; Rosario Signorello, Grenoble-Alps University, GIPSA-lab and Roma Tre University, Educational Department; and Pia Nordgren, Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg.

At first, we introduced ourselves to each other. We spoke about which country we came from originally and in which country we work at present. We also presented our current work and research projects to each other. Some of the students had begun their PhD studies recently or were about to start soon, while other students had been working for several years. Mutual interests were discovered and some contacts were exchanged. Between courses, we moved around so that we had the possibility to speak to several people.

The lunch discussions between the experts and the students continued in a more specific way, where we discussed various topics, for example:

  1. Speech production/perception and articulatory models. Which articulators produce a certain sound? We discussed the ultrasound machine, which may be used for studying tongue movements. It is a non-invasive method compared to x-ray, although some people are better suited to this specific method than others. The articulograph was also mentioned in this discussion as a research method.
  2. Language in autism: We discussed the importance of studying language in autism spectrum disorder. Some research projects were discussed and compared, which was very interesting!
  3. We discussed sociolinguistics - culture and language, attitudes and problems with racism.

In summary, many interesting discussions and thoughts to bring home! We had a very nice experience meeting the experts and the other students in very nice surroundings!

 

Report written by Pia Nordgren


Table #11 Multimodality/Multimedia

Submitted by catha on Sat, 10/05/2013 - 21:47

 

At this year’s student lunch event I was placed at the multimodal/multimedia table. The expert at this table was Helen Meng. With only four students, we were a pretty small table compared to the others, which made it really nice and easy to talk to each other.

Helen was a really engaging and interested expert. She had read all our research interests and asked very interesting questions. All of us students had pretty different research backgrounds, but this did not inhibit a very lively and interesting discussion.

We actually ended up spending a lot of the time discussing my own PhD project, which is concerned with the multimodal modelling of group involvement and individual engagement in conversation. Everybody had questions and seemed interested. It was particularly interesting because, with everybody coming from a different research field, the questions people posed were quite different from what people normally ask, so I got a new perspective on certain aspects of my work.

One issue which came up during our discussion was that for certain research topics lying at the intersection of speech technologies and other research fields, it is hard to publish and get to present at conferences. For example, information retrieval of audio content is less present at Interspeech than the students maybe would want it to be; only a very small portion of the studies at Interspeech are actually relevant to this topic. We concluded that for this reason it is nice to have the student lunch event, which brings together the students who share the same research background; especially at such a big conference as Interspeech, it might be difficult to find each other otherwise.

The question of future work and the academia-versus-industry dilemma also came up. Helen shared her experience of her own career development: how it is important not to be afraid to become part of challenging new projects, and maybe even to try building a new research lab where there was none before. It was a very motivating discussion, as we all need to make this decision at some point soon.

All in all, I think that this year’s student lunch event was very successful. It was held in a nice atmosphere, the discussions were really interesting and the food was great too!

 

Report written by Catharine Oertel


Interspeech2013 Student Roundtable - Table #12 - Machine Processing of Dialogue and Spontaneous Speech

Submitted by Raveesh on Mon, 10/07/2013 - 07:56

 

Interspeech2013 Student Roundtable Lunch Event "Students meet Experts"

Table No.: 12

Table topic: Machine Processing of Dialogue and Spontaneous Speech

Experts:

  • Sophie Rosset, Orsay, Cedex, France
  • Olivier Pietquin, Supélec Campus de Metz

 

Reporter

  • Raveesh Meena, KTH Royal Institute of Technology

 

Participants

  • Francesca Bonin, Trinity College Dublin
  • Iñigo Casanueva, University of Sheffield
  • Shammur Absar Chowdhury, University of Trento
  • Matthew Henderson, University of Cambridge
  • Pierre Lison, University of Oslo
  • Juan Rafael Orozco-Arroyave, Friedrich-Alexander-Universität
  • Pei-Hao Su, National Taiwan University

 

We slowly started off with the delightful lunch arranged by the ISCA student body, and soon discussions were under way at both ends of the table. At one end, Olivier briefed us about the Dialogue State Tracking Challenge that had been held at SIGdial the week before in Metz. He first gave an overview of dialogue systems, discussed what exactly the “state” of a dialogue is, what features are needed to represent it, and whether a very precise representation of the state is really necessary to obtain good dialogue management. He discussed why MDP-based approaches, especially the POMDP variants, are suitable for modeling dialogue.

We later talked about the explosion of Deep Neural Networks in speech recognition, and why DNNs are gradually substituting for all other machine learning techniques in speech: because of the huge amount of transcribed data available today and the higher computational power of machines. We also discussed reinforcement learning and its limitations, and how inverse reinforcement learning can be used to automatically learn reward functions.

At the other end of the table, Sophie Rosset presented her work on spoken question answering, as well as where the field is heading in the coming years. We discussed the upcoming Horizon 2020 programme, currently in negotiation at the EU level, and the perspectives (or lack thereof) it offers for funding NLP research. One of the students presented an idea for a project proposal regarding the integration of dialogue modelling in statistical machine translation; the key is to see whether discourse and dialogue structure could be used to enhance the quality of automatic machine translation.

Alongside the engaging discussions, we enjoyed the rather delicious meal, and it was soon time for the French specialty – macarons. It was a yummy end to a very rewarding discussion.

 

Report written by Raveesh Meena