Which came first: The Picture or the Sound?

By Larry Bloomfield

for Tektronix to be published in World Broadcast Engineering

(1586 words)

The performance that got the greatest press in the early attempts to give the public sound with pictures was the 1927 Warner Brothers Vitaphone “talkie,” The Jazz Singer.  Although it was not the first film with synchronized sound, it was the first feature-length “talkie” in which spoken dialogue was used as part of the dramatic action.

  Al Jolson's first words on screen are probably the most famous lines of dialogue in media history:  “Wait a minute! Wait a minute! You ain't heard nothin' yet. Wait a minute, I tell ya, you ain't heard nothin'!” continuing then with a rendition of 'Toot, Toot, Tootsie!'  Ever since Jolson broke into song, audiences have become accustomed to listening to what they are seeing.

 From the very beginning, keeping the Vitaphone's 33 1/3-rpm discs synchronized with the 35mm projectors was no easy task.  Things got simpler once optical soundtracks were printed onto the film, but even then, more than one presentation ran with the wrong loop length between the picture “gate” and the sound head, causing a similar problem and assuring the term “lip-sync” a permanent place in the vocabulary of the entertainment industry.

 As television emerged from the days of the Iconoscope and Image Orthicon cameras, it too found that it was not immune to lip-sync issues.  Early on, about the only lip-sync issues were found in a now nearly forgotten part of the station, telecine, where film was a big part of the broadcast facility’s daily appetite.

 It doesn’t make any difference whether the material is film, tape or live: being out of lip-sync is probably one of the most annoying and distracting things that can happen while experiencing a performance of any kind.

 As television grew and the old “genlock” method of synchronizing video signals from different sources gave way to digital devices, such as the frame synchronizer, delays began to creep into the video path, retarding the video with respect to the audio.

 Although the small individual delays caused by the various pieces of equipment inserted along the video path to perform particular or special functions are usually not too perceptible, these errors compound, adding up as additional equipment is introduced into the video path.  Even though we are told that the electrical signals that together make up the picture move at nearly the speed of light, it still takes time for them to go through each and every device.  A few microseconds here and a few milliseconds there begin to add up.

 If there were some way of guaranteeing that the video path would always be the same, each and every time through a facility, it would be a simple task to delay (retard) the audio accordingly, but we know that is nearly impossible in today’s plants.  Even if frame synchronizers, which bear the burden of the greatest delays, are removed from the equation, the plethora of other video equipment, such as distribution amplifiers, routing switchers, production switchers, patch bays and patch cords, together with the very volatile exigencies of day-to-day operations, keeps anything close to consistency of operation from being a given in most any facility.

One of the early attempts to keep audio and video in sync was the introduction by Tektronix of the 118AS, a companion unit to the very familiar 110S frame synchronizer.  This device proved to be invaluable in the days of analog NTSC plant operations.  Fixed, predetermined amounts of audio delay have also been used, often set to ensure that the audio is always late with respect to the video.  With such an audio delay device added to the audio path at each point where there’s a frame synchronizer, the audio is never perfect, but it is kept within reasonable lip-sync tolerances.

There is only one major problem: this only takes into account what takes place inside the plant; it does not take into account the diversity of paths the audio/video program material may have taken getting there.  Nor does it take into account the variable delays that can and do occur inside or outside the plant due to routing of the video through different equipment.

Say, for example, your station is covering a soccer game a reasonable distance from the television station and there is no direct microwave shot to a relay station or to the studios.  Most likely the audio would be sent via conventional high-quality telephone circuits while the video would be sent via satellite.  Comparatively speaking, even if an outside broadcast (OB), Electronic News Gathering (ENG) event or remote is only a short distance away, the audio will travel only a few kilometers compared to the nearly 72,000-kilometer (44,600 US mile) path the video will travel.  Since radio waves and light travel at approximately 300,000 kilometers per second (186,000 miles/sec.), the video will arrive roughly a quarter of a second late, a delay on the order of hundreds of milliseconds.  This is not only very annoying to the human brain, it is unacceptable to all concerned.
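
As a rough check on the satellite example, the following Python sketch works the arithmetic.  It assumes the free-space speed of light (about 300,000 kilometers per second) and takes the 72,000-kilometer path from the text; the 5-kilometer audio path and the 25 frames-per-second video rate are hypothetical values chosen purely for illustration, and all equipment and codec delays are ignored.

```python
# Rough propagation-delay estimate for the satellite example.
# Assumptions (not from the article): free-space propagation for both paths,
# a 5 km terrestrial audio path, and 25 frames-per-second video.

SPEED_OF_LIGHT_KM_S = 299_792.458   # free-space speed of light
SATELLITE_PATH_KM = 72_000.0        # up-and-down path quoted in the text
AUDIO_PATH_KM = 5.0                 # hypothetical "few kilometers"
FRAME_RATE = 25.0                   # hypothetical video frame rate


def propagation_delay_ms(path_km, speed_km_s=SPEED_OF_LIGHT_KM_S):
    """Time in milliseconds for a signal to cover path_km at speed_km_s."""
    return path_km / speed_km_s * 1000.0


video_delay_ms = propagation_delay_ms(SATELLITE_PATH_KM)
audio_delay_ms = propagation_delay_ms(AUDIO_PATH_KM)

# How late the picture arrives relative to the sound.
offset_ms = video_delay_ms - audio_delay_ms
offset_frames = offset_ms / (1000.0 / FRAME_RATE)

print(f"video path delay: {video_delay_ms:.1f} ms")
print(f"audio path delay: {audio_delay_ms:.3f} ms")
print(f"picture late by:  {offset_ms:.1f} ms (about {offset_frames:.1f} frames)")
```

Under these assumptions the picture trails the sound by roughly 240 milliseconds, or about six frames, from propagation alone.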

As if this audio-video delay business wasn’t complicated enough, broadcasters began running a second channel of audio to accommodate stereo programming.  This only tended to compound these problems. The insertion of a predetermined fixed delay in the audio path helped but just wasn’t a complete solution.

With the advent of digital equipment at television studios and post-production houses over the past several years, the problem seems to have gotten worse rather than better with the addition of new whiz-bang digital processing devices and special effects equipment.  As each new device finds its way into the video path, it only serves to increase the latency of the video signal and compound the lip-sync issue.  Add to all this the introduction of on-line and off-line editing sessions, and it’s a small wonder that any of the sound ever matches the video.

One other contributing factor to be considered is the possibility of improperly functioning time-code, which can create sudden shifts or gradual variations in the relative timing of the audio and video signals.  Add this to the fact that, to date, there are only limited controls over the lip-sync issue, and the problem appears to continue to grow.

Common sense dictates that to address the lip-sync problem in television, one must start at the very beginning and ensure that some form of “indelible relationship” between video and audio is established at the very source, or very early on in the life of the material under consideration, and is imprinted on either the audio or video.  This “indelible relationship”, then, can be carried throughout the diverse, often separate paths these two elements may chance to take.  Doing so would then permit the pictures and sound to be resynchronized at or near the end of their travels through the myriad of devices, editing scenarios, etc.

One criterion that would probably be imposed in solving this problem is not to reinvent the wheel but, where possible, to use technology that already exists, so let’s look around.

Bingo!  The embedding of information into video has been around for a while and is probably most closely associated with identification and copyright protection of intellectual property.  Does the term “watermarking” ring a bell?  Watermarking, which was developed during the early part of this past decade, is currently best known for its application to digital media ranging from digitally recorded music to DVD and digital television broadcasts.

Digital watermarking technology uses an arrangement of very low-level “patterns”, representing digital bits, to encode extremely low-level ID information. Although embedded within the program material (video or audio), it is imperceptible to the listener or viewer, but can easily be decoded for identification or other purposes that might be desired. 

A reference code can be derived from the audio program’s natural envelope.  This is called its “signature.”  That signature can then be embedded as a “watermark” into the video, thus giving a reference point, which can be decoded and used for comparison purposes at any point in the program material’s travels through the electronic maze.

Any time shift between the audio timing reference carried in the watermark and the audio signal that accompanies the video is an indication that an audio-to-video delay “error” has occurred.  This “error” information can then be used to automatically control whatever corrective audio delay circuitry may be desired, while remaining transparent to the end viewer.
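
The comparison described above can be illustrated with a minimal Python sketch.  It is not the Tektronix method: the envelope measure (mean absolute level per 1-millisecond block), the correlation search, the 48 kHz sample rate and the synthetic test signal are all assumptions made for the example, and the embedding and decoding of the watermark itself is left out.  The sketch only shows how a reference signature and the arriving audio might be compared to measure a timing error.

```python
import random

SAMPLE_RATE = 48_000   # assumed audio sample rate (Hz)
BLOCK_SIZE = 48        # assumed block length: 1 ms of audio per envelope value


def envelope_signature(samples, block_size=BLOCK_SIZE):
    """Coarse envelope: mean absolute level of each whole block of samples."""
    return [
        sum(abs(s) for s in samples[i:i + block_size]) / block_size
        for i in range(0, len(samples) - block_size + 1, block_size)
    ]


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (sum((x - mean_x) ** 2 for x in xs)
           * sum((y - mean_y) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0


def estimate_lag(reference, observed, max_lag):
    """Lag in blocks at which `observed` best matches `reference`.

    Negative means the observed audio is early relative to the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(reference[i], observed[i + lag])
                 for i in range(len(reference))
                 if 0 <= i + lag < len(observed)]
        score = correlation([p[0] for p in pairs], [p[1] for p in pairs])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic "program audio": noise whose level changes every block.
rng = random.Random(0)
audio = []
for _ in range(200):
    level = rng.uniform(0.1, 1.0)
    audio.extend(level * rng.uniform(-1.0, 1.0) for _ in range(BLOCK_SIZE))

# Signature taken at the source; this is what the watermark would carry.
reference_signature = envelope_signature(audio)

# Simulate video (and its watermark) arriving 7 ms late: relative to the
# watermark reference, the audio is 7 blocks early.
early_audio = audio[7 * BLOCK_SIZE:] + [0.0] * (7 * BLOCK_SIZE)
observed_signature = envelope_signature(early_audio)

lag_blocks = estimate_lag(reference_signature, observed_signature, max_lag=20)
audio_lead_ms = -lag_blocks * 1000.0 * BLOCK_SIZE / SAMPLE_RATE
print(f"audio leads video by {audio_lead_ms:.1f} ms; delay the audio by that amount")
```

In a real plant the observed audio would also have passed through level changes, processing and compression, so a practical signature would need to be far more robust than this block-average envelope.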

The only downside to this type of correction is the slew rate that may become noticeable when switching between video sources with little or no error and those sources that contain serious errors.

Similar types of circuitry are used in the 7-second automatic delay devices used on radio talk shows.  When a caller says something that has to be deleted, it takes the equipment a while to build back up to the full 7 seconds of delay.  Station owners and managers prefer to live with a minor glitch in those rare instances than to suffer the adverse consequences.

The same would be true of watermarking video with a digitized version of the audio envelope’s signature.  As a production tool, the slew rate could be adjusted to compensate for any abrupt changes.  It is truly a production call whether to have annoying lip-sync issues throughout a clip, story or show or a momentary advance or delay to compensate for timing errors.  Few will disagree that it is better to have the capability to make such corrections, with their momentary glitches, than to have an audience suffer through one of the most annoying issues in the entertainment business.
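
To make the slew-rate trade-off concrete, here is a small Python sketch of a slew-limited delay correction.  It is an illustration only, not a description of any actual product: the function names, the 20-millisecond step limit and the 120-millisecond error are hypothetical.  On each update the applied audio delay moves toward the measured target, but never by more than a set amount.

```python
def slew_limited_delay(current_ms, target_ms, max_step_ms):
    """Move the applied audio delay toward the target, at most max_step_ms per update."""
    error = target_ms - current_ms
    step = max(-max_step_ms, min(max_step_ms, error))
    return current_ms + step


def track(targets_ms, max_step_ms, start_ms=0.0):
    """Apply the slew limit across a sequence of measured target delays."""
    applied = []
    current = start_ms
    for target in targets_ms:
        current = slew_limited_delay(current, target, max_step_ms)
        applied.append(current)
    return applied


# Hypothetical source switch: the measured error jumps from 0 ms to 120 ms.
targets = [0.0] * 3 + [120.0] * 8

gentle = track(targets, max_step_ms=20.0)    # slow slew: several updates to catch up
abrupt = track(targets, max_step_ms=1000.0)  # effectively unlimited: immediate jump

print("gentle:", gentle)
print("abrupt:", abrupt)
```

A small step limit makes the correction less obvious but leaves the material out of sync for longer after a switch; a large one corrects at once at the cost of a more audible jump, which is the production call described above.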

END

++++++++++++++++++++++++++

Stay tuned for more.

(Side Bar) – 105 words

Tektronix Fixes Lip-Sync Problems

During NAB2000, in Las Vegas, Nevada, earlier this year, Tektronix announced the development of a new product that would automatically correct real-time errors in audio-to-video delays. The technology employed by Tektronix is based on watermarking, which is an ultra-low-level 3D pseudo-random noise digital pattern subliminally embedded into the active portion of a video frame. 

In trials, Tektronix’ digital watermarking withstood compression and decompression using MPEG-2 and JPEG formats. The product is expected to be introduced shortly. If you want to treat your audiences to program material that is nearly free of lip-sync errors, contact your nearest Tektronix representative or visit their web site at: http://www.tektronix.com.

END