The basic purpose of this application is the batch analysis of spoken audio to identify profanity. The language analysis itself is already in development with a partner of mine; the actual analysis is all they will handle. My end is providing them with at least 10,000 files daily, scalable to 100,000 files daily, and I understand this will have certain hardware demands.
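To give a sense of those hardware demands, here is a back-of-the-envelope throughput estimate. The 60-second per-file processing time is purely an assumption for illustration, not a measured figure:

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def required_workers(files_per_day, seconds_per_file=60):
    """Minimum number of parallel workers needed to sustain the daily volume,
    assuming each worker spends seconds_per_file on one download + extraction."""
    files_per_second = files_per_day / SECONDS_PER_DAY
    return math.ceil(files_per_second * seconds_per_file)

print(required_workers(10_000))   # baseline volume
print(required_workers(100_000))  # scaled-up volume
```

Under that assumption, the baseline volume needs single-digit parallel workers and the scaled-up volume needs on the order of seventy; real numbers depend entirely on actual download and extraction times.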
This will entail setting up a process by which videos (primarily YouTube, majority Flash) are downloaded and their audio extracted. I know this can already be done on an individual basis, but this system will:
-take as input a list of video IDs from a MySQL-driven list
-download the files from the original source / API
-extract the audio to an appropriate WAV file for analysis (perhaps MP3; the analysis mechanism is still in development - (below))
-export said audio file to a temporary queue for the analysis process - (below)
(a separate patent-pending process, currently in development, will be run on the audio file)
-after the process is complete, delete the audio file and the original video file, with confirmation of the deletion written back into the MySQL database.
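The steps above could be sketched roughly as follows. This is only an illustration of the shape of the pipeline: the tool names (the youtube-dl and ffmpeg command-line tools), the directory paths, and the table and column names in the SQL are my assumptions, not part of this spec.

```python
import os

def download_cmd(video_id, out_dir="/tmp/videos"):
    """Build the command to download one YouTube video by ID
    (assumes the youtube-dl CLI; path is hypothetical)."""
    out_path = os.path.join(out_dir, f"{video_id}.flv")
    cmd = ["youtube-dl", "-o", out_path,
           f"https://www.youtube.com/watch?v={video_id}"]
    return cmd, out_path

def extract_cmd(video_path, wav_dir="/tmp/audio_queue"):
    """Build the command to extract mono 16 kHz WAV audio for analysis
    (assumes the ffmpeg CLI; sample rate is a placeholder, since the
    analysis mechanism is still in development)."""
    base = os.path.splitext(os.path.basename(video_path))[0]
    wav_path = os.path.join(wav_dir, base + ".wav")
    cmd = ["ffmpeg", "-i", video_path,
           "-vn",           # drop the video stream
           "-ac", "1",      # mono
           "-ar", "16000",  # 16 kHz sample rate
           wav_path]
    return cmd, wav_path

# Hypothetical SQL for the delete-confirmation step
# (table and column names assumed):
CONFIRM_DELETE_SQL = (
    "UPDATE video_queue SET status = 'analyzed', deleted_at = NOW() "
    "WHERE video_id = %s"
)
```

In practice each command would be run with `subprocess.run`, the WAV handed to the analysis queue, both files removed with `os.remove`, and the UPDATE executed against the MySQL database.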
NOTE - this is designed for analysis only, and once the process is complete, the original video file and the extracted audio file will be deleted. This is in no way going to be used to obtain or distribute any copyrighted material.
This data will be used in a PHP / MySQL site currently in development. The code will have to be minimal, and the timeline, once terms are reached, is ASAP.
I am posting this as a small project because I know open-source code already exists to do this on an individual basis. If I am wrong in that calculation, show me.