Overview of the Audio Tools

In Open Hypermedia, links are first class citizens - they are managed separately from documents.

This is a powerful approach and it is well established in the hypermedia community, but there is little experience of its application to temporal media.

We have built our prototype audio tools to demonstrate and explore the issues involved in applying the principles of Open Hypermedia to Audio.

Audio Production

Before exploring linking in Open Hypermedia, it is useful to consider a simpler application of open principles - the production of a news item for radio.

An interview has been transferred to linear digital audio (e.g. a Microsoft Windows wave file), complete with mistakes and lengthy answers. Segments of the interview are identified: questions, full answers, edited answers and sound bites. A segment is given as a start and end time (in milliseconds) in a piece of linear audio, with an optional description of the segment.

The segments are then arranged in a sequence to suit the program. Segments may be repeated, omitted or placed out of order. There may be several sequences of the same interview, e.g. one for hourly news and one for a full news program.
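The segment and sequence model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, field names and the file name `interview.wav` are hypothetical and are not taken from the prototype tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A span of linear audio: start and end time in milliseconds."""
    source: str            # name of the digital audio file (hypothetical)
    start_ms: int
    end_ms: int
    description: str = ""  # optional description of the segment

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def sequence_duration_ms(sequence):
    """Total playing time of a sequence (an ordered list of segments)."""
    return sum(segment.duration_ms for segment in sequence)


# Segments are identified once...
question = Segment("interview.wav", 0, 4000, "question")
full_answer = Segment("interview.wav", 4000, 30000, "full answer")
sound_bite = Segment("interview.wav", 12000, 17000, "sound bite")

# ...and re-used in several sequences without touching the audio itself.
hourly_news = [sound_bite]
full_programme = [question, full_answer, sound_bite]
```

Because a sequence holds only references to spans of audio, it stays small and can be edited or exchanged without the audio file being present.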

Screenshot: Sequence Player

The sequences can then be played using a special player or mastered to a new digital audio file. We have produced a prototype sequence player to demonstrate this. The screen shot shows it playing the Thunder.seq sequence. The sunken window identifies the segment being played (and gives the first glimpse of the format we are using to represent endpoints).

This approach has several advantages:

  1. Editing is non-destructive
  2. Editing of a sequence does not require access to the digital audio
  3. Segments may be re-used in different sequences which saves space

This principle is not limited to one digital audio source; documentaries may have several interviews linked by a narrative. Once the segments have been identified the producer can arrange the sequence without needing access to the audio. This makes it easier to work on a production as sequences can be transferred by something as simple as e-mail.

The same approach can also be applied to arranging music and video.

There is a need for tools to help create the sequences.

Prototype tool for linking speech transcripts

Historians conduct research mostly from text sources. A lot can be learnt from the way a speech is structured but the ability to link a speech transcript to the audio would greatly improve understanding. We have written a tool to do just that.

It is a small HTTP server which runs on the local machine on an arbitrary port number and is accessed using a standard Internet browser, such as Netscape. The digital audio file and the transcript file are specified in a browser form.

The 'All words' option produces a form in which each word is a hyperlink. The user clicks on any word to start the audio file playing. Then, just before a word is spoken, the user clicks the matching word. This sends a request to the tool, which records the time offset of that word. This would typically be done for the first word of each sentence.

Screenshot: All words

Once the offsets have been recorded, the audio is stopped and the 'Clicked words only' option is selected. This displays only those words which have an associated offset. Clicking on one of these words starts the audio playing from that point. While the audio is playing, clicking on a word stores the time offset again, allowing the user to make adjustments.

Screenshot: Clicked words only

When the file is 'exported' it is displayed in a format which can be saved from the browser, intended for delivery and for later import (this format includes the time offset data). Clicking on words now causes playback only.

When displaying clicked words only, sequences can be captured by clicking alternately on the beginning and end words of each 'endpoint' (anchor). The sequence can be exported and saved from the browser.

Screenshot: Sequence capture

When displaying sequences, links can be captured by clicking alternately on the beginning and end endpoints (anchors). The set of links (linkbase data) can be exported and saved from the browser.

Screenshot: Linkbase export
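The bookkeeping behind these steps can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names and the sample transcript are hypothetical, and the sketch assumes offsets are stored in milliseconds and keyed by word position.

```python
def record_click(offsets, word_index, offset_ms):
    """Store the time offset for a word; a later click overwrites the
    earlier value, which is how an offset would be adjusted."""
    offsets[word_index] = offset_ms


def clicked_words(words, offsets):
    """Return (index, word, offset_ms) for words that have an offset,
    in transcript order (the 'clicked words only' view)."""
    return [(i, words[i], offsets[i]) for i in sorted(offsets)]


def pair_endpoints(clicks_ms):
    """Pair alternating begin/end clicks into (start_ms, end_ms) endpoints."""
    if len(clicks_ms) % 2 != 0:
        raise ValueError("an endpoint needs both a beginning and an end click")
    return [(clicks_ms[i], clicks_ms[i + 1]) for i in range(0, len(clicks_ms), 2)]


words = "It is indeed a pleasure to be here".split()  # hypothetical transcript
offsets = {}
record_click(offsets, 0, 400)
record_click(offsets, 2, 950)
record_click(offsets, 0, 450)   # second click on the same word adjusts it
```

Calling `pair_endpoints([450, 950, 4000, 6000])` on four alternating clicks would give two endpoints, `(450, 950)` and `(4000, 6000)`.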

Link Manager

The Link Manager is the central application which resolves the links. It is loaded once and remains on screen for the other applications to communicate with. The Link Manager displays the last source endpoint at the bottom of the dialog box, with the resolved links in the list box. A link is followed by double clicking on its entry in the list box or, if the 'Auto Follow Links' option is enabled, as soon as it is added to the list box. Note that the source endpoint is displayed for demonstration purposes only; it is not envisaged that the user would need this information.

Screenshot: The Link Manager

The linkbase format

A link consists of one or more source endpoints, one or more destination endpoints and, optionally, other attributes of the link (such as a description or a script). A linkbase is a collection of links. A textual representation has been chosen, making it simple to transfer and view linkbases. An extract of a sample linkbase follows:

<src="sinew01.wav?start=400&end=4000" dest="church1.jpg">
<src="sinew01.wav?start=4000&end=6000" dest="sinews.html">
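To illustrate how such a textual linkbase might be read and queried, here is a minimal sketch. It handles only the single-source, single-destination form shown in the extract, and it assumes that `start` and `end` are offsets in milliseconds and that an endpoint covers the half-open range from `start` up to, but not including, `end`. The function names are hypothetical and this is not the Link Manager's own code.

```python
import re
from urllib.parse import parse_qs

# Matches one link of the form shown in the extract above.
LINK_RE = re.compile(r'<src="([^"]*)"\s+dest="([^"]*)">')

SAMPLE = '''
<src="sinew01.wav?start=400&end=4000" dest="church1.jpg">
<src="sinew01.wav?start=4000&end=6000" dest="sinews.html">
'''


def parse_linkbase(text):
    """Parse the textual linkbase into a list of dictionaries."""
    links = []
    for src, dest in LINK_RE.findall(text):
        document, _, query = src.partition("?")
        params = parse_qs(query)
        links.append({
            "document": document,
            "start": int(params["start"][0]),
            "end": int(params["end"][0]),
            "dest": dest,
        })
    return links


def resolve(links, document, offset_ms):
    """Return the destinations whose source endpoint contains the offset
    (assumption: start is inclusive, end is exclusive)."""
    return [
        link["dest"]
        for link in links
        if link["document"] == document and link["start"] <= offset_ms < link["end"]
    ]


links = parse_linkbase(SAMPLE)
```

With the sample extract, an offset of 1000 in `sinew01.wav` would resolve to `church1.jpg`, and an offset of 4000 to `sinews.html`.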

Link Player

Screenshot: The Link Player

The Link Player looks very much like the standard Media Player, a tool most users are already comfortable with. The main difference is the addition of a Link button, which sends the current endpoint information to the Link Manager.

We have written our own tool, which some might argue is not very Open. However, it should be noted that the Media Player is only a wrapper application which communicates with the MS Windows MCI (Media Control Interface). The MCI can be given commands like 'play sinew01.wav' and 'stop'; the Media Player only issues those commands.

Screenshot: Link Player options

The player has several options which control when an endpoint is sent to the Link Manager for resolving. Link on Play sends an endpoint whenever the play button is pressed. Auto Link sends an endpoint more frequently (approximately every second).

Linking from HTML

Screenshot: HTML linking to the Link Manager

The Audio Linking tool can communicate with the Link Manager. This allows links from the text / audio to be held in a linkbase. This is achieved by selecting the 'Enable linkman' option and exporting clicked words only.

The screenshot shows the links resulting from clicking on "indeed". The word has been converted into a time offset which is sent to the Link Manager as an endpoint.

The Audio Linking tool is used for authoring the links, but there is also a cut-down version which can only play the audio and pass the endpoints to the Link Manager.


A radio programme can be regarded as a "guided tour" through available resources. If branching material is delivered, the user can interact and take their own path. A simple example of this is a news broadcast, where the headlines are given and the user selects the headlines they are interested in.

Real-Time Streaming Protocol (RTSP)

How might hyper-radio content be delivered? We need a network protocol that may be used to support the real-time delivery of multimedia data. One such protocol is the Real Time Streaming Protocol.

The Real Time Streaming Protocol is a proposed Internet standard which is being jointly developed by Netscape Communications and RealNetworks (formerly known as Progressive Networks). The current internet draft describes RTSP as being an "application-level protocol", which controls the delivery of streaming media with real-time properties. RTSP itself does not actually deliver the media data; this is handled by a separate protocol and therefore RTSP can be described as a "network remote control" to the server that is streaming the media.
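The "network remote control" idea can be made concrete with a sketch that only builds the text of a control request; it sends nothing and delivers no media. The request line, `CSeq`, `Session` and `Range` headers follow the RTSP 1.0 syntax as later published in RFC 2326, which may differ in detail from the draft current at the time of writing. The server URL, session identifier and function names are hypothetical.

```python
def rtsp_request(method, url, cseq, headers=None):
    """Build the text of an RTSP/1.0 request (control message only)."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Header lines end with CRLF; an empty line terminates the request.
    return "\r\n".join(lines) + "\r\n\r\n"


def play_segment(url, cseq, session, start_ms, end_ms):
    """Ask the server to play one time range, given in milliseconds,
    using the normal play time (npt) range format in seconds."""
    npt = f"npt={start_ms / 1000:.3f}-{end_ms / 1000:.3f}"
    return rtsp_request("PLAY", url, cseq, {"Session": session, "Range": npt})


# Hypothetical server and session identifier, for illustration only.
request = play_segment("rtsp://example.com/sinew01.wav", 3, "12345678", 4000, 6000)
```

A time range expressed this way corresponds naturally to the start and end offsets of an endpoint, which is one reason such a protocol is attractive for delivering linked audio.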

Streaming Links

We are working on streaming links alongside the media. This is useful in situations where a linkbase may not contain all the appropriate information. Live broadcast is an example of this: temporal links cannot (usually) be precomputed and stored in a linkbase.

Content Based Retrieval (CBR)

Links require source documents, but how do you locate the first one? This is a straightforward and well researched problem for text documents. Internet users are familiar with search engines which find documents based on keywords. We need something similar for audio files.

We have developed a tool which allows a database of MIDI files to be searched using a pitch contour. MIDI is an acronym for Musical Instrument Digital Interface. It is a hardware and software specification for exchanging information between musical instruments and related devices (e.g. sequencers, light controllers, mixers, etc.). It consists of very low-level information, i.e. note on and note off messages. MIDI files tell the computer what notes to play and when.

A pitch contour describes a series of relative pitch transitions, an abstraction of a sequence of notes. A note in the input is classified in one of three ways: it is either the same as the previous note (S), higher than the previous note (U), or lower than the previous note (D). Thus, the input is converted into a string over a three letter alphabet (U, D, S). For example, the introductory theme of Beethoven's 5th Symphony would be converted into the sequence: - S S D U S S D (the first note is ignored as it has no previous pitch).
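The conversion is simple enough to sketch directly. The function name is hypothetical, and the input is assumed to be a list of MIDI note numbers in playing order; the Beethoven example uses G, G, G, E flat, F, F, F, D written as note numbers.

```python
def pitch_contour(notes):
    """Convert a sequence of pitches into a U/D/S contour string.

    The first note produces no symbol because it has no previous pitch.
    """
    symbols = []
    for previous, current in zip(notes, notes[1:]):
        if current > previous:
            symbols.append("U")   # higher than the previous note
        elif current < previous:
            symbols.append("D")   # lower than the previous note
        else:
            symbols.append("S")   # same as the previous note
    return "".join(symbols)


# Opening theme of Beethoven's 5th Symphony as MIDI note numbers.
beethoven_5th = [67, 67, 67, 63, 65, 65, 65, 62]
contour = pitch_contour(beethoven_5th)
```

Because only the direction of each transition is kept, the same contour is produced whatever key the theme is played in.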

The query contour need not be typed in; another MIDI file may be used to find similar files. The tool could form the 'back-end' of a search-by-humming system, where the user hums a query which is converted into a contour by a pitch tracker.

The screen shot shows a contour from the theme of Axel F. which has been searched for. The three versions of Axel F. are in the top four matches. The positions of the match in the file can be calculated, which is what the Time Index window is showing. These times may be used as endpoints for links. Note that the contour matches more than once in the file, as the theme is repeated.
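How match positions could be turned into time indexes is sketched below. This uses exact substring matching only, which is an assumption made for brevity; the ranked results described above suggest the real tool scores approximate matches, and its method is not described here. The function names and the note data are hypothetical.

```python
def contour_of(pitches):
    """U/D/S contour of a pitch sequence (first note has no symbol)."""
    return "".join(
        "U" if b > a else "D" if b < a else "S"
        for a, b in zip(pitches, pitches[1:])
    )


def match_times(note_events, query):
    """Return the onset time (ms) of the first note of every exact match.

    note_events is a list of (onset_ms, pitch) pairs in playing order.
    Contour symbol i describes the step from note i to note i + 1, so a
    match starting at contour position i begins at note i.
    """
    if not query:
        return []
    contour = contour_of([pitch for _, pitch in note_events])
    times = []
    position = contour.find(query)
    while position != -1:
        times.append(note_events[position][0])
        position = contour.find(query, position + 1)  # allow overlaps
    return times


# A made-up melody in which the same four-note theme occurs twice.
events = [(0, 60), (500, 62), (1000, 62), (1500, 59), (2000, 64),
          (2500, 60), (3000, 62), (3500, 62), (4000, 59)]
times = match_times(events, "USD")
```

Each returned time could then serve as the start of an endpoint in the audio.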

The ability to search MIDI might appear to be limiting but the technology to convert from digital audio to MIDI is emerging. The Audio CBR search engine could then be used to index the audio track from a movie. The screen shot shows an example of this, where a MIDI file has been associated with a music video. The video can be displayed using the Link Player, which uses the Link Manager to resolve any links from the video.

Pitch contours could be used as generic links. (note difference between CBR and CBN?)

Link Hider

One of the problems with the Open approach is that it is not always possible to have a separate linkbase, such as in Digital Audio Broadcast or the digital audio on a music CD.

We have developed a tool for hiding text in a digital audio file in such a way that it is not noticeable when listening to the audio.

In the example shown, the text "L:\Sview\media\cant.wav" has been hidden in the "L:\Media\major.wav" digital audio file. Note that this is a link, not an endpoint; the text can be sent to the Link Manager, which treats it as a resolved link. This is because the Link Manager only supports temporal link resolving, which is initiated via the Link Player.

There is a concept of positional data when encoding: the number of seconds into the file at which the text is placed. This is fixed in the demonstration but could be used in an editing environment. Markers could be placed every few seconds and used as endpoints instead of time based data. This ensures endpoints in the linkbase remain valid, even if the data is edited.
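The encoding method used by the tool is not described here. As an illustration of the general idea only, the sketch below uses least-significant-bit coding, a common and simple technique that is not necessarily the one the tool uses. It works on a list of integer PCM samples, and `start` is a sample index (seconds into the file multiplied by the sample rate). All names and the two-byte length prefix are assumptions.

```python
def hide_text(samples, text, start=0):
    """Hide text in the least significant bit of successive samples.

    A two-byte length prefix is stored first so the text can be recovered.
    Each affected sample changes by at most 1, which is why the change is
    hard to hear.
    """
    payload = text.encode("utf-8")
    data = len(payload).to_bytes(2, "big") + payload
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    if start + len(bits) > len(samples):
        raise ValueError("audio too short for the text at this position")
    result = list(samples)
    for k, bit in enumerate(bits):
        result[start + k] = (result[start + k] & ~1) | bit
    return result


def reveal_text(samples, start=0):
    """Recover text written by hide_text from the same start position."""
    def read_bytes(offset, count):
        value = bytearray()
        for b in range(count):
            byte = 0
            for i in range(8):
                byte = (byte << 1) | (samples[offset + b * 8 + i] & 1)
            value.append(byte)
        return bytes(value)

    length = int.from_bytes(read_bytes(start, 2), "big")
    return read_bytes(start + 16, length).decode("utf-8")


# Synthetic samples standing in for real audio data.
audio = [((i * 37) % 2001) - 1000 for i in range(1000)]
link_text = r"L:\Sview\media\cant.wav"
marked = hide_text(audio, link_text, start=100)
```

Reading the marked samples back with `reveal_text(marked, start=100)` should return the original text. A scheme this simple would not survive lossy compression.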

The problem comes when the audio is compressed. Modern audio compression technologies work by removing sounds we cannot hear, whereas we are trying to encode data in the audio in such a way that it cannot be heard.

Future work


The tools were written by the following grad students in the Multimedia Research Group at the University of Southampton, UK:

Under the guidance of Dr David DeRoure.

See http://www.mmrg.ecs.soton.ac.uk for more information on the research group.

Department of Electronics and Computer Science
University of Southampton
Highfield, Southampton SO17 1BJ, United Kingdom
Tel. +44 (0)1703 592493     Fax +44 (0)1703 594489