Vol.1 RICOH THETA Developer Interview 360°Spatial Audio, Ask the developer.

Ricoh Company, Ltd.
Atsushi Matsuura (Left)/Takafumi Ohkuma (Right)

“Turning high-resolution 4K images into a highly immersive video experience. That is Spatial Audio.”

First, please tell us what “360° Spatial Audio” means.

Ricoh Company, Ltd. Industrial Product Division,
Advanced Technology Development Office
Takafumi Ohkuma

Ohkuma: “Spatial Audio” is a term that has recently come into common use. Put simply, it is a framework for recording sound in three dimensions and then playing it back. The expression “stereophonic sound” is also used. In addition to the conventional left and right directions, it also captures sound from above and below, as well as depth. With the listener at the center, audio can be enjoyed from all directions across 360°, just as the name suggests.

It is based on Ambisonics, a stereophonic sound technology developed in Europe in the 1970s. However, Ambisonics required bulky equipment and time to set up, and with the technology of the time it could not easily be combined with video. Today, in VR (virtual reality), the video can be rotated to follow the head position of an HMD (head-mounted display). If the audio can be rotated as well, a highly immersive video experience becomes possible.
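To make the idea of rotating the audio along with the head concrete, here is a minimal sketch in Python. It assumes first-order Ambisonics with four components (W, X, Y, Z), with X pointing front, Y left and Z up. The function names and the encoding convention are illustrative assumptions, not a description of how THETA or any particular player implements this.

```python
import math

def encode_first_order(sample, azimuth_rad, elevation_rad=0.0):
    """Encode a mono sample arriving from a direction into (W, X, Y, Z).

    Assumed convention: X points front, Y points left, Z points up,
    and azimuth is measured counterclockwise from the front.
    """
    w = sample
    x = sample * math.cos(azimuth_rad) * math.cos(elevation_rad)
    y = sample * math.sin(azimuth_rad) * math.cos(elevation_rad)
    z = sample * math.sin(elevation_rad)
    return (w, x, y, z)

def compensate_head_yaw(frame, head_yaw_rad):
    """Rotate the sound field by -head_yaw so sources stay fixed in the world."""
    w, x, y, z = frame
    c, s = math.cos(head_yaw_rad), math.sin(head_yaw_rad)
    # W (omnidirectional) and Z (vertical) are unaffected by a yaw rotation.
    return (w, x * c + y * s, -x * s + y * c, z)

# A source straight ahead; the listener then turns 90 degrees to the left.
front = encode_first_order(1.0, azimuth_rad=0.0)
turned = compensate_head_yaw(front, math.radians(90))
apparent_azimuth = math.degrees(math.atan2(turned[2], turned[1]))
print(round(apparent_azimuth))  # -90: the source now sits to the listener's right
```

Only the X and Y components change for a turn of the head about the vertical axis, which is part of why this kind of rotation is cheap enough to follow an HMD in real time.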

Why did you decide to build 360° Spatial Audio into the new THETA?

Ohkuma: Originally, from the first THETA model in 2013, through the second-generation m15, which added video support, and again during THETA S development, we considered adopting 360° Spatial Audio. However, at the time the video resolution across the omnidirectional ultra-wide angle was not all that high, so the image and the sound would not have been in balance. As a result, we kept looking for the best time to build it in.

Matsuura: This time we have built in a high-performance main processor of the kind also used in smartphones, which made it possible to process Ambisonics inside the THETA body.
Until now, the THETA S and SC did not have enough processing power for Ambisonics.
Ambisonics itself is, as we said at the beginning, an old technology, but with the spread of 360° VR video and the improved processing power of the smartphones that play it back, we feel that the environment for making use of Ambisonics has now arrived.

Ohkuma: In addition, the new model can shoot 4K video. If the sound is poor compared with the high-resolution video, it could spoil the sense of realism that was so carefully achieved. So our development team set to work on an audio upgrade.

Microphones based on the Ambisonics method just mentioned are also sold by other companies. That said, simply recording the sound requires a specialized, expensive recorder, and after recording, the audio has to be synchronized with the video and then aligned with the sound direction. These technical steps are an extremely high hurdle for ordinary users.

Matsuura: What is true of the equipment is true of the software as well. Both cost money, and even once you have obtained them and set up the hardware environment, using them to create a finished product requires considerable skill.
This may be possible for people who produce video professionally, but for ordinary people it is just too hard... The hurdle is very high.

Ohkuma: The new THETA achieves 360° Spatial Audio with the body alone. A mechanism automatically aligns the front of the video with the front of the sound at a single touch, so without any time-consuming steps the spatial audio is linked to the 360° video and played back with it. As an entry-level device for 360° Spatial Audio, it should broaden the reach of VR, just as the THETA series has done as an entry-level device for VR video.

“Above all, we thought about the characteristic thin design and the balance between the microphones' frequency characteristics and directionality.”

What proved most difficult in building in 360° Spatial Audio?

Ohkuma: The new model has four built-in microphones. The most difficult part was where to place them. Depending on their positional relationship, the frequency range over which directionality can be maintained through signal processing changes. For example, a person's voice occupies one frequency range, and a musical instrument occupies another. Naturally, covering a wider frequency range leads to more realistic sound.
The problem here was the design. A characteristic of THETA is that it is thin compared with other companies' products, so the issue was how to maintain microphone performance while fitting the microphones neatly inside the body and striking a balance that includes the internal wiring. In the new THETA, the four built-in microphones are positioned symmetrically, top and bottom and left and right, and the design preserves as much as possible both the directivity across the frequency range and the quality of the recorded sound.

Ricoh Company, Ltd. Smart Vision Business Group,
Product Development Center
Atsushi Matsuura

Matsuura: We spent quite a while studying how to store the spatial audio inside the MP4 video file. The difficulty was coping with the repeated specification changes that occurred while still being able to play it back.
Also, because the video and audio are processed together, it would naturally be a problem if they went out of sync, so we verified this quite extensively. On this point, Ambisonics is an old technology and known to be stable, which made verification easy.

Speaking of which, I remember that when people listened to video shot on an actual device using provisional parameters carried over from the prototype, they would say, “It is strange to hear the sound coming from one direction only. Isn't this an installation error?” This went on for some time. Even when I said it was a parameter problem, I couldn't get anybody to accept that. (laughs)

Ohkuma: Now that you mention it, yes, it was like that. (laughs)
The first prototype we made was a 3D-printed shell with nothing inside it, so the inside of the casing produced nothing but echoes. But I think that filling the inside was a separate issue from the parameters... Sorry about that. (laughs)

When shooting, in what sort of scenes is 360° Spatial Audio recorded most effectively?

Ohkuma: I am currently running field tests and checking the results myself. The locations include the Shibuya scramble crossing, where sound comes from every direction, and the middle of a forest, where a downpour of cicada song can be heard from all around. Among these, the place that excited me most was the park right next to Haneda Airport. The roar of the airplanes taking off and landing seemed to pass right over my head. It is an interesting shot, with the sound naturally part of it, that I would want to show off to someone.
Also, I play the sax as a hobby, and I shot myself playing a piece.
Compared with previous devices, the sound had changed completely. I think it was recorded in a form close to the sound actually heard by ear. It clearly showed the strengths for live music recording.

Tennis is another of my hobbies, and when I recorded from the center of the court near the net while playing, I felt a realism of image and sound I had never experienced before. If I don't pay attention, I could get hit. (laughs)
I think it is good to try shooting in a variety of locations.
The sounds of taiko drums and flutes at summer festivals were also good.
The atmosphere of theme parks and tourist attractions can also be captured without leaving anything out. I have the feeling that it will change the quality of our memories.
Because of the high sensitivity, it picks up sounds that the person shooting was not even aware of, and it is fun to listen for these later on.

Matsuura: My scenes were quite different from Mr. Ohkuma's, but I think the effect is easy to understand in closed spaces.
Setting aside whether they are visually interesting, scenes of people talking in a quiet room, or locations without loud noise, are scenes “where the source of each sound is clear”, so the effect of spatial audio is easy to understand. Shooting a drinking party at a friend's house is another example.
Mountain climbing is my hobby, so I took some shots while walking in the mountains, but the mountains were not really suited to spatial audio. (laughs) The 4K video was really effective, but the natural sounds arrived from all directions, there were no other sounds, and being surrounded by nature meant that direction was not really relevant. Still, a waterfall or something like that might be interesting, so it could be fun to go and shoot one.

“Combining over-ear (closed-type) headphones with an HMD gives the most effective experience.”

Can the videos with 360° Spatial Audio be published or shared on the Internet?

Ohkuma: As before, you can use the THETA site. I also think it should be possible to publish 360° Spatial Audio on YouTube or Facebook. However, as Mr. Matsuura said, video formats differ in how they record spatial audio, so this is something I would like to achieve in the future. Then again, if we could get the social networking services to support it on their side, it would be even more convenient for users. (laughs)

What sort of environment is most suitable for listening to 360° Spatial Audio? Can the effects be noticed with ordinary speakers?

Ohkuma: What I recommend most is over-ear (closed-type) headphones, because they are characterized by high resolution.
In particular, I think so-called monitor headphones, which output the recorded sound directly, are good.
On the other hand, headphones that color the sound heavily impose too much of their own character, and the result departs from the actual realistic feel.
The new THETA's spatial audio is mainly mixed down to 2 channels optimized for headphones.
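As a rough illustration of what mixing a spatial recording down to 2 channels can mean, the sketch below derives left and right signals from horizontal first-order Ambisonics components using two virtual cardioid microphones. This is a deliberately simple stand-in: it is not binaural (HRTF-based) rendering, and it is not claimed to be the method used in THETA. The function names, the 90° aiming angle and the encoding convention are assumptions made for the example.

```python
import math

def virtual_cardioid(w, x, y, direction_rad):
    """Output of a virtual cardioid microphone aimed at direction_rad.

    Assumes horizontal first-order components where a source of amplitude s
    at azimuth a is encoded as W = s, X = s*cos(a), Y = s*sin(a).
    """
    return 0.5 * (w + x * math.cos(direction_rad) + y * math.sin(direction_rad))

def decode_to_stereo(w, x, y, half_angle_rad=math.radians(90)):
    """Return (left, right) from virtual cardioids aimed at +/- half_angle."""
    left = virtual_cardioid(w, x, y, +half_angle_rad)
    right = virtual_cardioid(w, x, y, -half_angle_rad)
    return left, right

# A unit-amplitude source directly to the listener's left (azimuth +90 degrees).
az = math.radians(90)
left, right = decode_to_stereo(1.0, math.cos(az), math.sin(az))
print(f"left={left:.2f} right={right:.2f}")
```

With this toy decoder a source on the left lands almost entirely in the left channel, while a source straight ahead is split equally between the two.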

Matsuura: The environment Mr. Ohkuma recommends raises the hurdle, so first of all people should try it with ordinary earphones. (laughs)

Ohkuma: Earphones produce an even smaller sound than headphones and are limited in frequency range, so personally I would not recommend them except to people with highly sensitive ears. But I want people to experience the sound on any device first, and then I recommend headphones to those who want a more realistic feel. (laughs)

When listening through component stereo or other speakers, the spatial audio balance collapses unless the listener stays at a fixed distance from the 2ch speakers. If the person moves, the positional relationship with the speakers changes, and the realistic feel is steadily lost.
It is the same as the listening position in a 5.1ch surround setup. With an HMD (head-mounted display), by contrast, the image and the sound follow the head rotation together, which gives a more realistic feel.

“Better audio was also what users wanted”

So, to begin with, were various users saying “we want better audio”?

Ohkuma: Previous models had issues with audio distortion in some situations, and resolving this was also one of our themes this time. In feedback from users, there were many voices calling for better sound quality, and we wanted to respond to that.

In the new THETA model, the performance of the microphone element itself has been sharply upgraded. Specifically, we switched from the long-used analog microphone to a MEMS microphone, which is fabricated with microelectronic processes. In recent years this type has also been used in smartphones, and it can record even very quiet voices.
In other words, for the thin new THETA with its four built-in microphones, the MEMS microphone is itself one of our best current solutions.

A characteristic of the MEMS microphone is its low unit-to-unit variation in quality. With ordinary microphones, mounting four of them means their characteristics will vary considerably from one to another, which makes it difficult to maintain spatial audio performance.
Because the characteristics vary so little, spatial audio can be recorded that much more accurately.
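The effect of microphone-to-microphone variation on direction can be shown with a toy model. The sketch below assumes four ideal cardioid microphones facing front, left, back and right, and estimates a source's direction from the front-back and left-right level differences. The layout, the gain values and the function names are hypothetical and chosen only for illustration; the actual THETA microphone arrangement and signal processing are not described in this interview.

```python
import math

# Hypothetical layout: four ideal cardioid microphones facing front, left, back, right.
MIC_DIRECTIONS = [math.radians(d) for d in (0, 90, 180, 270)]

def capture(source_azimuth_rad, gains):
    """Signal level at each microphone for a unit source, scaled by per-mic gain."""
    return [
        g * 0.5 * (1.0 + math.cos(source_azimuth_rad - d))
        for g, d in zip(gains, MIC_DIRECTIONS)
    ]

def estimate_azimuth_deg(levels):
    """Recover the source direction from front-back and left-right differences."""
    front, left, back, right = levels
    return math.degrees(math.atan2(left - right, front - back))

true_azimuth = math.radians(90)  # source directly to the left
matched = estimate_azimuth_deg(capture(true_azimuth, [1.0, 1.0, 1.0, 1.0]))
scattered = estimate_azimuth_deg(capture(true_azimuth, [1.2, 1.0, 0.8, 1.0]))
print(f"matched: {matched:.1f} deg, scattered: {scattered:.1f} deg")
```

In this model, matched microphones recover the true 90° direction, while a ±20% gain mismatch between the front and back microphones pulls the estimate to roughly 79°, which is the kind of error that tightly matched elements avoid.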

And even if a loud sound is input, it is recorded without distortion.
Two recording modes are provided: a mode in which the gain is suppressed so that even loud input is distorted as little as possible, and a mode in which sound is not distorted as long as it stays within a reasonable range.
In particular, scenes such as live concerts, which were previously difficult to record, should show the benefit of choosing the mode to suit the scene.
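The trade-off between the two modes can be sketched with a simple model in which a recorder applies a fixed gain and then clips anything beyond full scale. The gain values and names below are hypothetical, since the interview does not give the actual settings, and real devices are more sophisticated than a hard clip.

```python
FULL_SCALE = 1.0  # largest magnitude the recorder can store

def record(samples, gain):
    """Apply a fixed gain, then hard-clip anything beyond full scale."""
    return [max(-FULL_SCALE, min(FULL_SCALE, s * gain)) for s in samples]

def is_clipped(recorded):
    """True if any sample has reached the full-scale limit."""
    return any(abs(s) >= FULL_SCALE for s in recorded)

# Hypothetical gain values for the two modes; the real settings are not given.
NORMAL_GAIN = 1.0       # faithful as long as the input stays within range
SUPPRESSED_GAIN = 0.25  # trades level for headroom on very loud input

loud_passage = [0.5, 1.8, -2.4, 0.9]
normal = record(loud_passage, NORMAL_GAIN)
suppressed = record(loud_passage, SUPPRESSED_GAIN)
print(normal, is_clipped(normal))
print(suppressed, is_clipped(suppressed))
```

In the normal setting the loud peaks are flattened at full scale, which is heard as distortion, whereas the suppressed setting keeps the waveform shape intact at a lower level. Because the gain is fixed rather than varying over time, the relative loudness of quiet and loud passages is preserved in both cases.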

Matsuura: There were a great many comments from users who had actually been using the THETA S regularly. They were worried that the sound would crack when there were loud noises, to the point of wanting a mode that shoots video with no sound at all.
So you had this picture of shooting a fun scene with friends where the sound is not so much “wai-wai” (excited chatter) as “bari-bari” (crackle). (laughs)

Ohkuma: In many cameras to date, the gain has been controlled to stay within a certain range to make the sound easier to listen to.
While this is fine when aiming for a certain fixed result, it differs from the actual sound, so the realistic feel is lost.
Because orchestral or instrumental concerts, vocals, and so on express themselves through dynamics, this method is not suited to music.
In spatial audio, too, differences in distance and direction are conveyed through these variations in the sound.
In the THETA S, too, the gain was adjusted for the better compared with the THETA m15, the first model to support video, but there is a limit to the capture and encoding performance.

Matsuura: Well, yes, I think that leaving out the auto gain can lead to more realistic audio.
Since cameras like the new THETA barely exist in the world yet, we ourselves have a hard time finding references to compare against, and deciding what to aim for is a matter of trial and error.

We are now at the stage of spreading it. First, on the ordinary players out in the world, it should play back sound and video that surpass the THETA S or SC; then, when played on Ricoh's own players (apps), spatial audio playback becomes available.
With that in mind, we designed the new THETA and its spatial audio with the video format in consideration.

We would love to see it used in a variety of scenes and tried with a variety of sounds.

Ohkuma: Well, let's see.
First, we get people to buy it (laughs) so they can have a new experience!


Ohkuma Takafumi

Ricoh Company, Ltd.
Industrial Product Division,
Advanced Technology Development Office

Joined Ricoh in 2006
While at university, engaged in research into environmental electromagnetic engineering
At Ricoh, has worked on compact digital cameras such as the GR series and on THETA electrical hardware development (analog, systems), and is also in charge of software behavior specifications alongside the hardware.
Enjoys playing tennis and saxophone on holidays and workdays alike.

Matsuura Atsushi

Ricoh Company, Ltd.
Smart Vision Business Group,
Product Development Center

Joined Ricoh in 2010.
At university was involved in color engineering, and at Ricoh has worked as a software engineer, including on the GR GUI and on audio integration in the new THETA. On holidays, likes mountain climbing, watching movies, and other pleasures not available on a normal day, all while looking forward to daily development work.

*The content of this interview is based on information as of Aug. 10th, 2017