This thesis addressed the problem of reducing the time to market and costs of audio software development by defining a software architecture that enables the construction of complete and rich audio applications by visual means.
Existing platforms provide visual prototyping of the processing aspect using data-flow visual languages. When constructing final user products, either the processing prototype is embedded into a fully programmed application logic, which implies a lot of development, or the visual prototyping tool itself is used as the final user interface, which clutters the interface with useless, or even inconvenient, functionalities.
Visual user interface builder tools cover the visual specification of the interface layout and a minimal reactivity of an application's interface. Even using such tools, the developer must still do low-level programming to connect the user interface to the processing core.
The proposed architecture reuses concepts from both visual user interface builders and data-flow visual languages for signal processing. Both are broadly used but hard to combine to build full applications. The architecture enables the designer to visually bind both sides, letting her ignore most implementation details such as communication among real-time threads and platform-dependent audio subsystem programming.
Thanks to the architecture, a designer can build an application without programming: visually defining the processing network with a data-flow tool, visually composing the user interface with an interface builder, and binding both sides by relating their elements.
To evaluate the architecture, an instance of it was built using the CLAM framework and the Qt toolkit. The Qt toolkit was extended with new audio-related widgets that can be reused for audio data visualization and audio system control. Such widgets were made available as components for visual composition of interfaces using the Qt toolkit prototyping tool. A visual audio prototyping tool was built on top of the CLAM framework so that it integrates well with the prototyping architecture, and a run-time engine was implemented to join both worlds and provide a minimal application logic.
The development process of existing open source audio projects has been compared with the development of similar functionality using tools based on such an architecture, ascertaining that the architecture helps shorten development times and increases software quality by enabling more iterations over the design.
The benefits of using such an architecture have been argued according to different criteria that have proved valid for historical software development tools, such as learning threshold, complexity ceiling, and path of least resistance, concluding the following: The architecture lowers the learning threshold by not requiring programming skills and by hiding hard audio-related implementation details from the designer. Still, the architecture provides a relatively high complexity ceiling for the set of applications that can be built just visually. It also offers mechanisms to raise that ceiling by extending existing components or by programming a more complex application logic.
It has also been shown that a tool based on such an architecture offers a path of least resistance which leads to good design decisions: for example, separating processing and interface into different threads, setting up controlled real-time and thread-safe means for communication between them, modularizing the processing and the visual elements, and reusing components among designs.
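As a minimal illustration of the kind of real-time-safe channel such a design decision implies, the following sketch shows a single-producer/single-consumer lock-free ring buffer between the processing thread and the interface thread. The class and its API are illustrative assumptions, not CLAM's actual implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Minimal single-producer/single-consumer lock-free ring buffer, in the
// spirit of the channel the architecture sets up between the processing
// thread (producer) and the interface thread (consumer).
// Illustrative sketch; not CLAM's actual classes.
template <typename T>
class SpscRingBuffer {
public:
    explicit SpscRingBuffer(std::size_t capacity)
        : _buffer(capacity + 1), _read(0), _write(0) {}

    // Called from the real-time thread: never blocks, never allocates.
    bool push(const T& value) {
        std::size_t write = _write.load(std::memory_order_relaxed);
        std::size_t next = (write + 1) % _buffer.size();
        if (next == _read.load(std::memory_order_acquire))
            return false; // full: drop rather than block the audio thread
        _buffer[write] = value;
        _write.store(next, std::memory_order_release);
        return true;
    }

    // Called from the interface thread, e.g. on a periodic timer.
    bool pop(T& value) {
        std::size_t read = _read.load(std::memory_order_relaxed);
        if (read == _write.load(std::memory_order_acquire))
            return false; // empty: nothing to visualize yet
        value = _buffer[read];
        _read.store((read + 1) % _buffer.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> _buffer;
    std::atomic<std::size_t> _read;
    std::atomic<std::size_t> _write;
};
```

The interface thread polls the buffer on a timer instead of being notified, which keeps the processing side free of locks and priority inversions.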
This thesis makes significant contributions to the state of the art in audio software engineering. Such contributions are summarized in this section.
A way of systematically analysing the requirements of an audio application. Existing literature on audio software engineering addresses concrete aspects of audio application development. This thesis provides, in chapter chap:AudioApplications, systematic and generic criteria to determine when such issues must be addressed. We have analyzed several facets of audio applications such as data sources and sinks, data-time dependencies, processing modes, and user interface interactions. Such elements make up the application logic, and each one is bound to a set of low-level issues.
An architecture that transparently solves low-level issues related to real-time audio applications. The architecture presented in chapter chap:PrototypingArchitecture transparently solves most of the issues related to the application logic of real-time applications: for instance, multi-threading, lock-free thread-safe communication with the user interface, system context handling, and buffered file access.
An architecture that enables visual building of real-time applications. The architecture presented in chapter chap:PrototypingArchitecture also enables visual building by reusing existing technologies, such as data-flow languages and user interface visual builders, and by providing means to dynamically join their outputs: relating them at definition time and binding them at run time. The application logic can be specified at a high level by relating entities of the interface with those of the processing core.
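A minimal sketch of the run-time binding idea: the engine pairs interface entities and processing entities by name, so no glue code is needed for this part of the application logic. All class and member names here are hypothetical, not the actual CLAM or Qt API.

```cpp
#include <functional>
#include <map>
#include <string>

// Sketch of run-time binding by name matching: the engine walks the
// interface definition and the processing network definition produced
// by the two visual tools, and pairs entities with the same name.
// Hypothetical names; not the actual CLAM/Qt classes.
class BindingEngine {
public:
    // Interface side: a widget exposes an outgoing slot the engine fills in.
    void addWidget(const std::string& name, std::function<void(double)>* outSlot) {
        _widgetSlots[name] = outSlot;
    }
    // Processing side: a named incoming control of the processing core.
    void addControl(const std::string& name, std::function<void(double)> receiver) {
        _controls[name] = receiver;
    }
    // Run-time binding: match by name; unmatched entities are ignored.
    // Returns how many bindings were established.
    unsigned bind() {
        unsigned bound = 0;
        for (auto& widget : _widgetSlots) {
            auto control = _controls.find(widget.first);
            if (control == _controls.end()) continue;
            *widget.second = control->second; // the widget now feeds the control
            ++bound;
        }
        return bound;
    }

private:
    std::map<std::string, std::function<void(double)>*> _widgetSlots;
    std::map<std::string, std::function<void(double)>> _controls;
};
```

Because the pairing happens at run time, the same interface definition can be redesigned and reloaded without touching the processing network, and vice versa.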
Means of reusing visually built components in other stereotypes of audio applications. Although visual building is limited to real-time audio applications, section sec:ReusingInNonRealTime provides some hints on how visually built components can be reused as-is in other stereotypes of applications.
Component extensibility. This thesis provides means to extend the available components of a visual prototyping architecture in a very flexible way. The architecture allows defining protocols between interface and processing components (section sec:TypedConnections) without fixing such protocols at compile time, while still allowing protocol checking. The architecture's ceiling can be extended on many fronts.
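The idea of checking protocols at run time without fixing them at compile time can be sketched as ports that carry a token-type tag which is compared at connect time. The class names are illustrative, not the actual classes of section sec:TypedConnections.

```cpp
#include <typeinfo>

// Sketch of typed connections checked at run time: each port records its
// token type, so new data protocols can be added by components without
// touching the framework, yet mismatched connections are still rejected
// when the designer tries to wire them. Illustrative names only.
class OutPortBase {
public:
    explicit OutPortBase(const std::type_info& type) : _type(&type) {}
    const std::type_info& tokenType() const { return *_type; }
private:
    const std::type_info* _type;
};

class InPortBase {
public:
    explicit InPortBase(const std::type_info& type) : _type(&type) {}
    // Connecting succeeds only when both ends agree on the token type.
    bool connectFrom(const OutPortBase& out) {
        return out.tokenType() == *_type;
    }
private:
    const std::type_info* _type;
};

// Concrete ports just stamp their token type on the untyped base.
template <typename Token>
class OutPort : public OutPortBase {
public:
    OutPort() : OutPortBase(typeid(Token)) {}
};

template <typename Token>
class InPort : public InPortBase {
public:
    InPort() : InPortBase(typeid(Token)) {}
};
```

The framework only sees the untyped bases, so components shipping new token types remain pluggable while ill-typed wirings are caught before any data flows.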
An analysis of how the architecture performs when developing real use cases. The implementation of the architecture and its use to implement several use cases has provided insights on how the tool performs and how it could be improved.
As a result of the work in this thesis, a number of papers have been published in conferences and journals, listed in appendix chap:Publications. The work presented in this thesis was also awarded at the ACM Multimedia Conference 2007 Open Source Competition.
The set of target applications for this first approach to visual prototyping was intentionally limited to real-time processing applications, that is, applications which process streams of audio-related data. They are often referred to as audio analysis, synthesis, and transformation tools, depending on how many audio inputs and outputs they take. For instance, a chord extraction application which takes an audio stream and shows the chords while playing is an analysis tool. An application which takes an input human voice and changes the speaker's gender is an instance of a transformation tool. And an application that takes a point on the vowel triangle and produces the corresponding vowel sound is an instance of a synthesis tool.
They all share a common trait: data flows synchronously and processing elements take decisions based on past and present data. Some kinds of processing fall outside this description. Some algorithms perform what we call 'off-line processing'. This kind of processing is commonly used in Music Information Retrieval when, taking all the data extracted from an audio file, the algorithm does an overall reasoning.
Applications with an extended application logic also fall outside the scope of the solution provided by the architecture. The architecture just gives the user means to view processing data, to send controls to the processing core, and to control the audio sources and sinks. This is insufficient for a large number of audio applications. For instance, audio authoring tools introduce the concept of a timeline and are audio-object-centered instead of processing-centered. Audio authoring tools are designed to work with several audio objects to which different processing pipelines can be applied. Portions of the architecture could be used, but they are not enough to fully build such applications.
We have seen that the described architecture has some clear limits on how much can be done with it by visual means alone. Visual means are not as rich as programming languages, but they are easier to learn and use.
For users accustomed to current audio tool development, which is mostly done by programming, falling back to non-visual programming is an acceptable learning threshold. Furthermore, the architecture also lowers the cost of the programming task in the cases where it is needed. So, for them, the ceiling of visual means is not that critical: they are used to working in worse scenarios, and the architecture offers a better one, even if it could be enhanced. But visual design introduces a new segment of users who are not skilled in programming. For them, programming may represent an abrupt increase in the slope of the learning curve. That is why the ceiling of visual means becomes important here.
Moreover, programming is not just a learning threshold; it is also a workload threshold. Programming involves a work context change and dealing with tedious and time-consuming tasks such as setting up a build environment, facing compile errors, or debugging run-time errors.
There are several aspects where the visual means have a clear ceiling:
As explained before, the set of target applications has been limited to real-time processing applications. But even when deploying a real-time processing application, the designer might want to allow more interaction than just controlling the audio sinks and sources, visualizing data, and sending controls. For instance, in a synthesiser application we might want to choose the instrument being used, choose presets, configure the application, or show the help. Also, the user interface design is limited to a single window. As soon as the target requires more application logic, the architecture cannot solve it visually and programming is required.
Another ceiling for visual means is the extensibility of the architecture by providing new interface and processing components. Such extensibility is offered to raise the ceiling of what the architecture is able to build. But again, although the architecture eases the process, programming is still required.
The last ceiling is the human limit of dealing with complex wired networks. When a network gets large, visual design becomes harder. Connections are more difficult to trace, and display limitations do not allow the designer to fit a large network on the canvas to analyze it. This can be solved by aggregating several processing elements into a single higher-level one. Again, with the current proposal, this can be done only by programming.
As this thesis has been a first approach to the visual building of audio applications, the scope was limited to simple use cases. Further research might address other stereotypes of audio applications which may have enough homogeneity to be constructed mostly visually. Chapter chap:AudioApplications gave insights on further generalizations, and in section sec:ReusingInNonRealTime we suggested new ways of reusing the components of the architecture in other kinds of processing modules. We have seen two such stereotypes that could be addressed due to their homogeneity but fell outside the scope of this thesis: audio authoring tools and music information retrieval systems.
There is still enough room to raise the visual prototyping ceiling. Further research could address new ways of extending the set of available components that minimize the learning threshold of programming. Several proposals might be explored. One would be integrating the component programming cycle into the audio processing prototyping tool. This would save the designer a lot of workload, including the work context change, the deployment of the development environment, and the compile-and-install cycle before testing. Another would be using scripting languages to develop new components. Scripting languages are known to be easy to learn, and, because of the lack of a compile cycle, users feel more comfortable with them. Both proposals, the integrated development environment in the processing prototyping tool and the scripting approach, could be combined.
There are also many ways in which the usability of the processing prototyping tool could be enhanced. For example, providing debugging facilities such as step-by-step execution. Using a console with completion could speed up the definition of the network, as the current dragging interface requires mouse precision. Also, interfaces to edit the structure of a new processing module, with actions such as 'add port', 'add control', 'add configuration parameter', 'edit the code', and 'recompile', could help to close the development cycle.
Another way to enable the user to provide new components is by composition: aggregating several processing elements into a bigger one to be reused in a different network. Besides being a way of creating new processing elements, aggregation is a means to hide processing design complexity, addressing the aforementioned problem of wiring complexity. Further research could address this and other ways of reducing it. For instance, a common issue in audio is that several channels are piped through equivalent processing chains, multiplying the number of elements and wires. A solution to this problem would reduce the design complexity.
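Aggregation can be sketched with a composite that exposes the same interface as an atomic processing element, so a whole sub-network becomes reusable as a single element in another network. The names and the single-sample processing interface below are simplifying assumptions, not the actual CLAM classes.

```cpp
#include <memory>
#include <vector>

// Common interface for atomic and composite processing elements.
// Simplified to one float in, one float out per scheduling step.
class Processing {
public:
    virtual ~Processing() = default;
    virtual float process(float input) = 0;
};

// An atomic example element: scales its input by a fixed factor.
class Gain : public Processing {
public:
    explicit Gain(float factor) : _factor(factor) {}
    float process(float input) override { return input * _factor; }
private:
    float _factor;
};

// Aggregates children and runs them in chain order. To the outer
// network it looks like a single element, hiding the inner wiring,
// which is exactly what makes large designs manageable.
class CompositeProcessing : public Processing {
public:
    void add(std::unique_ptr<Processing> child) {
        _children.push_back(std::move(child));
    }
    float process(float input) override {
        for (auto& child : _children)
            input = child->process(input);
        return input;
    }
private:
    std::vector<std::unique_ptr<Processing>> _children;
};
```

Because the composite is itself a `Processing`, composites can nest, so the same mechanism scales from hiding a pair of elements to packaging a full sub-network.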
Ideas presented in this research could also be extended to multimedia areas other than audio. Video and image processing environments could benefit from prototyping tools.
The ability to handle multiple types of data tokens is well suited to high-level processing of video. Platforms that provide such data flexibility already exist. GStreamer, for instance, provides full routing of complex objects through a processing network and, being a free software platform, has wide adoption in the industry. But such platforms are available only at the programming level. Full visual prototyping and visual binding to a user interface to control and monitor the stream would be valuable.
Current interface prototyping tools address the common arena of desktop computers, where the interface paradigm has been stable for many years. Nowadays, we have a large diversity of computerized devices to interact with. Ubiquitous computing brings new opportunities to architectures similar to the one exposed here, by embedding rich audio applications in such devices: for instance, adding voice transformation to cellular phones. But those devices impose new requirements on interface building, and old tools for interface prototyping are becoming obsolete.
For example, on new devices we cannot assume a keyboard and a pointer as input devices: cellular phones have just a numeric keypad, some devices have no keyboard and just a pointer, and others provide their own custom buttons. Display capabilities may change and, also, we should cope with processing and audio input/output limitations.
Such diversity of devices is a new challenge for interface prototyping, but also for audio processing prototyping. The architecture already provides audio back-end abstraction, but there is no way of abstracting the interface over the device's input/output capabilities.
As such devices become a commodity, an interesting research area would be to define a device prototyping tool where the software and hardware interface and the audio processing core can all be designed.
This research has demonstrated how the development process of audio applications can be accelerated by providing visual means to construct them. We have focused on a reduced but very common subset of audio applications, offering an architecture that joins two existing technologies: data-flow visual languages and graphical interface builders. Exercising the architecture on different problems has also provided useful insight for future research on extending visual prototyping beyond its current limits.