I would like to have a simple GUI that runs on Ubuntu (optional: cross-platform) and gives the following options:
1. Hook video input. This option lets the user select either a region of the screen (for scraping) or a video file from disk (optional: camera input).
2. Hook audio input. If the video input does not contain sound, this option lets the user read an audio file or scrape sound from the speakers (optional: microphone input).
3. Start button. When hit, the program starts reading the hooked inputs (it's OK if one is missing). Video is parsed as a stream of images. Each image, together with the audio (if present) corresponding to its time frame, is packaged into a "temporal state object" along with its timestamp. The object holds the inputs as modular sub-objects and should be extensible to support more input types in the future.
4. Each object is sent to two processing functions, one for audio and one for video (each receives the top-level object). The functions themselves should be mocked, not implemented; they are left for me to implement.
5. Each function returns an object containing one string and two floating-point numbers between 0 and 1 (the mocks can return random values, or whatever you wish). The GUI should display, in real time, the list of strings with the two floats shown as bars next to each (0 is an empty bar, 1 is a full bar, and so on).
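As a rough illustration of items 3 to 5, the data flow could look something like the Python sketch below. It is only one possible shape, not a required design: every name in it (TemporalState, ProcessingResult, process_video, process_audio, render_bar) is a placeholder, the image and audio payloads are stand-ins for real frames and samples, and the text bar merely stands in for whatever bar widget the chosen GUI toolkit provides.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class TemporalState:
    """One time slice: a timestamp plus modular inputs keyed by name.

    Hypothetical layout: new input types are added as new keys,
    so the class itself does not change when more inputs are supported.
    """
    timestamp: float                                      # seconds since start
    inputs: Dict[str, Any] = field(default_factory=dict)  # e.g. "video", "audio"

    def get(self, name: str) -> Optional[Any]:
        """Return the named input, or None if it was not hooked."""
        return self.inputs.get(name)


@dataclass
class ProcessingResult:
    """One string and two floats in the range 0..1 (item 5)."""
    label: str
    score_a: float
    score_b: float


def process_video(state: TemporalState) -> ProcessingResult:
    """Mock: receives the top-level object; real logic is left to the owner."""
    return ProcessingResult("video", random.random(), random.random())


def process_audio(state: TemporalState) -> ProcessingResult:
    """Mock: receives the top-level object; real logic is left to the owner."""
    return ProcessingResult("audio", random.random(), random.random())


def render_bar(value: float, width: int = 20) -> str:
    """Text stand-in for a GUI bar: 0 is empty, 1 is full."""
    value = min(max(value, 0.0), 1.0)   # clamp out-of-range values
    filled = round(value * width)
    return "[" + "#" * filled + " " * (width - filled) + "]"


if __name__ == "__main__":
    # Placeholder payloads; a real build would store an image and audio samples.
    state = TemporalState(timestamp=0.0, inputs={"video": None, "audio": None})
    for result in (process_video(state), process_audio(state)):
        print(f"{result.label:<8}"
              f"{render_bar(result.score_a)} {render_bar(result.score_b)}")
```

Keeping the inputs in a dictionary keyed by name is just one way to make the object extensible; typed sub-objects per input would work equally well.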
This project has to integrate OpenCV and TensorFlow into its build or, if you convince me otherwise, alternative libraries.
All libraries used must be open source with permissive licenses (MIT, BSD, Apache, etc.; no GPL or similar). This includes the GUI.
All code should be editable by me, so it must be provided in a properly documented and organised manner.
Use either C++, Java or Python.
Good results may lead to further improvement requests for this project.