Apple Files for Text-to-Speech Synthesis System Patent, Updates Objective–C Trademark, more
On August 16, the US Patent & Trademark Office published a number of Apple’s patent applications respectfully titled Multi-unit approach to text-to-speech synthesis, Method and apparatus for measuring die-level integrated circuit power variations and Method and apparatus for managing Internet transactions. And in Europe today, Apple has filed an updated trademark application for Objective-C.
A Text-to-Speech Synthesis System
Apple’s patent FIG. 1 is a block diagram illustrating a system 100 for text-to-speech synthesis. The system includes one or more applications such as application 110, an operating system 120, a synthesis block 130, an audio storage 135, a digital to analog converter (D/A) 140, and one or more speakers 145. The system is merely exemplary. The proposed system can be distributed, in that the input, output and processing of the various streams and data can be performed in several or one location. The input and capture, processing and storage of samples can be separate from the processing of a textual entry. Further, the textual processing can be distributed, where for example the text that is identified or received can be at a device that is separate from the processing device that performs the text to speech processing. Further, the output device that provides the audio can be separate or integrated with the textual processing device. For example, a client server architecture can be provided where the client provides or identifies the textual input, and the server provides the textual processing, returning a processed signal to the client device. The client device can in turn take the processed signal and provide an audio output. Other configurations are possible.
Returning to the exemplary system, application can output a stream of text, having individual text strings, to synthesis block either directly or indirectly through operating system. Application can be, for example, a software program such as a word processing application, an Internet browser, a spreadsheet application, a video game, a messaging application (e.g., an e-mail application, an SMS application, an instant messenger, etc.), a multimedia application (e.g., MP3 software), a cellular telephone application, and the like. In one implementation, application displays text strings from various sources (e.g., received as user input, received from a remote user, received from a data file, etc.). A text string can be separated from a continuous text stream through various delimiting techniques described below. Text strings can be included in, for example, a document, a spread sheet, or a message (e.g., e-mail, SMS, instant message, etc.) as a paragraph, a sentence, a phrase, a word, a partial word (i.e., sub-word), phonetic segment and the like. Text strings can include, for example, ASCII or Unicode characters or other representations of words. In one implementation, application includes a portion of synthesis block (e.g., a daemon or capture routine) to identify and initially process text strings for output. In another implementation, application provides a designation for speech output of associated text strings (e.g., enable/disable button).
Operating system can output text strings to synthesis block. The text strings can be generated within operating system or be passed from application. Operating system can be, for example, a MAC OS X operating system by Apple Computer, Inc. of Cupertino, Calif., a Microsoft Windows operating system, a mobile operating system (e.g., Windows CE or Palm OS), control software, cellular telephone control software, and the like. Operating system may generate text strings related to user interactions (e.g., responsive to a user selecting an icon), states of user hardware (e.g., responsive to low battery power or a system shutting down), and the like. In some implementations, a portion or all of synthesis block is integrated within operating system. In other implementations, synthesis block interrogates operating system to identify and provide text strings to synthesis block.
Apple’s Patent Summary
This disclosure generally describes systems, methods, computer program products, and means for synthesizing text into speech. A proposed system can provide more natural sounding (i.e., human sounding) speech. The proposed system can form speech from phonetic segments or a combination of higher level sound representations that are enunciated in context with surrounding text. The proposed system can be distributed, in that the input, output and processing of the various streams or data can be performed in several or one location. The input and capture, processing and storage of samples can be separate from the processing of a textual entry. Further, the textual processing can be distributed, where for example the text that is identified or received can be at a device that is separate from the processing device that performs the text to speech processing. Further, the output device that provides the audio can be separate or integrated with the textual processing device. For example, a client server architecture can be provided where the client provides or identifies the textual input, and the server provides the textual processing, returning a processed signal to the client device. The client device can in turn take the processed signal and provide an audio output. Other configurations are possible.
The resulting speech takes into account prosody characteristics including the tune and rhythm of the speech. Moreover, the proposed system can be trained with a human voice so that the resulting speech is even more convincing.
In one aspect, a method is provided that includes matching first units of a received input string to audio segments from a plurality of audio segments including using properties of or between the first units, such as adjacency, to locate matching audio segments from a plurality of selections, parsing unmatched first units into second units, matching the second units to audio segments using properties of or between the second units to locate matching audio segments from a plurality of selections and synthesizing the input string, including combining the audio segments associated with the first and second units.
Aspects of the invention can include one or more of the following features. Properties can include those associated with unit and concatenation costs. Unit costs can include considerations of one or more of pitch, duration, accentuation, and spectral characteristics. Unit costs measure the similarity or difference from an ideal model. Predictive models can be used to create ideal pitch, duration etc. predictors that can be used to evaluate which unit from a group of similar units (i.e., similar text unit but different audio sample) should be selected. Concatenation costs can include those associated with articulation relationships such as adjacency between units in samples. Concatenation costs measure how well a unit fits with a neighbor unit. Matching the first and second units can include searching metadata associated with the plurality of audio segments and that describes properties of or between the plurality of audio segments. The method can further include parsing unmatched second units into third units having properties of or between the units, matching the third units to audio segments including, searching metadata associated with the plurality of audio segments and that describes the properties of the plurality of audio segments.
The method can further include providing an index to the plurality of audio segments and generating metadata associated with the plurality of audio segments. Generating the metadata can include receiving a voice sample, determining two or more portions of the voice sample having shared properties and generating a portion of the metadata associated with a first portion of the voice sample to associate a second portion of the voice sample, and a portion of the metadata associated with the second portion of the voice sample to associate the first portion of the voice sample.
The first units can each comprise one or more of one or more sentences, one or more phrases, one or more word pairs, or one or more words. The input string can be received from an application or an operating system. The method can further include transforming unmatched portions of the input string to uncorrelated phonemes or other sub-word units. The input string can comprise ASCII or Unicode characters. The method can further include outputting amplified speech comprising the combined audio segments.
Aspects of the invention can include one or more of the following features. Synthesizing can include synthesizing both matching audio segments for successfully matched portions of the input stream and uncorrelated phonemes or other sub-word units for unmatched portions of the input stream.
In another aspect, a computer program product including instructions tangibly stored on a computer-readable medium is provided. The product includes instructions for causing a computing device to match first units of an input string that have desired properties to audio segments from a plurality of audio segments, parse unmatched first units into second units having desired properties, match the second units to audio segments and synthesize the input string, including combining the audio segments associated with the first and second units.
In another aspect, a system is provided that includes an input capture routine to receive an input string that includes first units having properties, a unit matching engine, in communication with the input capture routine, to match the first units to audio segments from a plurality of audio segments, a parsing engine, in communication with the unit matching engine, to parse unmatched first units into second units having properties, the unit matching engine configured to match the second units to audio segments, a synthesis block, in communication with the unit matching engine, to synthesize the input string, including combining the audio segments associated with the first and second units and a storage unit to store audio segments and properties.
In another aspect a method is provided that includes providing a library of audio segments and associated metadata defining properties of or between a given segment and another segment, the library including one or more levels of units in accordance with a hierarchy, and matching, at a first level of the hierarchy, units of a received input string to audio segments, the received input string having one or more units at a first level having defined properties. The method includes parsing unmatched units to units at a second level in the hierarchy, matching one or more units at the second level of the hierarchy to audio segments having defined properties and synthesizing the input string including combining the audio segments associated with the first and second levels.
Apple lists the inventors of this patent as Neeracher; Matthias; (Santa Clara, CA) ; Naik; Devang K.; (San Jose, CA) ; Aitken; Kevin B.; (Cupertino, CA) ; Bellegarda; Jerome R.; (Los Gatos, CA) ; Silverman; Kim E.A.; (Mountain View, CA).
Today’s Other Apple Patent Filings
Other patents were published today including one titled Method and apparatus for measuring die-level integrated circuit power variations and another, being a continuation patent by the title Method and apparatus for managing Internet transactions.
Apple Updates Objective-C Trademark
Found today on the European Trademark Office’s website is Apple’s latest trademark update for Objective-C. Apple’s original filing was noted in a report filed by MacNN in February of this year with Apple covering International Class 009 for computer software. Today, Apple has added International Classes 038 relating to communications and telecommunication services and International Class 042 relating to computer hardware and software consulting services.
NOTICE: MacNN presents only a brief summary of patents with associated graphic(s) for journalistic news purposes as each such patent application and/or grant is revealed by the U.S. Patent & Trade Office. Readers are cautioned that the full text of any patent applications and/or grants should be read in its entirety for further details.
Written and researched by Neo

