Version of documentation: 2021-02-16
Current version of the Valossa Core metadata JSON format: 1.4.2 (changelog).
Current version of the "frames_faces" metadata (face bounding boxes) JSON format: 1.0.3.
Current version of the "seconds_objects" metadata (object bounding boxes, seconds-based) JSON format: 1.0.2.
Current version of the "frames_objects" metadata (object bounding boxes, frames-based) JSON format: 1.0.0.
The Valossa Video Recognition API (which is the endpoint for the Video Analysis function and was formerly known as the Core API) is a REST API for automatic video content analysis and metadata creation.
This Video Analysis API provides access to a broad set of high-quality audiovisual AI Features for recognition and analysis, and returns analysis results with details and associated video time segments in the Valossa Metadata format.
Valossa AI is built from a set of AI Features (formerly known as capabilities), which can be unlocked in various Subscription Options. Subscriptions enable a set of AI Feature configurations for a specific function, such as providing Standard Metadata for describing video details in full, or executing Face Expression and Voice Sentiment analysis. (Read further for more details on Subscriptions.)
Each AI Feature is specialized in detecting and recognizing a set of concepts that define the semantic output used to describe a content entity. Detections are grouped and compiled in JSON format as Valossa Metadata; the main downloadable output format is called Valossa Core metadata. There are also additional output formats for special, separately downloadable results, such as bounding boxes for faces or objects, as well as SRT files containing the speech-to-text results from the audio track of the videos.
Video AI features (available in different Video Analysis Subscription configurations)
Valossa Subscription Options (contact sales to order and activate):
Additional capabilities available in customized setups (contact sales to request more information). For example:
Detections are also provided in different practical groupings in the metadata: grouped by detection type (best concepts first, and including time-coded occurrences of the concept when applicable) and grouped by second ("What is detected at 00:45 in the video?"). See the explanation of the practical detection groupings.
The REST API is invoked using HTTP (HTTPS) requests. You can also assign new video analysis jobs to the API on the easy API call page. The responses from the REST API are in the JSON format. The Valossa Report tool helps you to visualize the results. Please use the secure HTTPS transport in all API calls to protect your valuable data and confidential API key: unencrypted HTTP transport is not allowed by the API.
Get your API key from under "My account" - "API keys" in Valossa Portal. If you have several applications, you may request a different API key for each of them; for this, contact Valossa service personnel.
Note: As the administrator user of your organization, you can create new users under "My account" - "Manage users". If several people in your organization need to use the Portal, add them in "Manage users" so that they are all mapped to your organization and can view analysis results and post new jobs (if you grant them the rights in the Portal). Permissions are mappings between users and API keys ("this user has read-write access to this API key, so she can both view results and make new job requests"), so please configure the permissions with this in mind; the API key permissions per user can be edited in "Manage users". Your company/organization must have only one customer account (created by the Valossa sales staff), but there can be multiple users under that customer account.
The API consists of 6 different functions: new_job, job_status, job_results, list_jobs, cancel_job and delete_job. You can conveniently monitor the status of your jobs in Valossa Portal. There you can also call the new_job function of the API with an easy API request generator.
Your API key is shown in Valossa Portal on the subscriptions page. Keep the key confidential.
Please note regarding speech analysis:
Supported video formats: we support most typical video formats, including but not limited to MP4, MPEG, AVI, FLV, WebM, with various codecs. Currently, we cannot provide a fixed list of supported formats and codecs, but for example MP4 with the H.264 codec works.
Video file size limit: 5GB per input video file.
Video duration limit: 3 hours of playback time per input video file.
Video vertical resolution limit: 1080 pixels.
Currently, the supported languages for speech-based detections are English, French, German, Spanish, Finnish, Italian, Brazilian Portuguese, European Portuguese, Swedish and Dutch. By default, speech is analyzed as English language. See more information about language selection.
If the video file contains several video streams, only the first one is analyzed.
If the video file contains several audio streams, only the first one is analyzed. (Please note that audio keyword detection and audio speech-to-text will be performed only if you did not provide your own SRT-based speech transcript; however, providing or omitting the SRT transcript does not affect the audio.context detections.) The audio stream can be either mono or stereo.
Supported transcript format: SRT.
File size limit: 5MB per input SRT file.
Currently, the only supported transcript language is English.
Video analysis is a paid service. How do you pay and gain access to Valossa AI? Start a subscription by contacting sales and pay the invoices according to the agreement.
Which of the many video recognition capabilities of Valossa are used for a specific job depends on the configuration of the API key used in the new_job request. The subscription you agreed with Valossa determines the video recognition capabilities of your API key(s). One example of a subscription type is the Standard Metadata subscription, which contains a wide range of audiovisual detection types. View your API keys on the subscriptions page in Valossa Portal. Keep the API keys confidential; otherwise someone can impersonate you or your application with a leaked key.
How to create a new video analysis job using the REST API? Send an HTTP POST to the URL:
https://api.valossa.com/core/1.0/new_job
Example new_job request body in JSON format:
{ "api_key" : "kkkkkkkkkkkkkkkkkkkk", "media": { "title": "The Action Movie", "description": "Armed with a knife, Jack Jackson faces countless dangers in the jungle.", "video": { "url": "https://example.com/content/Action_Movie_1989.mp4" }, "transcript": { "url": "https://example.com/content/actionmovie.srt" }, "customer_media_info": { "id": "469011911002" }, "language": "en-US" } }
There are two different ingestion methods for your video file: download and upload. With downloading, you provide a URL that points to your downloadable video file and use that download URL directly in your new_job request. With uploading, you first use the upload functionality of the Valossa Video Recognition API; after the upload has been completed, you use the resulting valossaupload:// URL in your new_job request to refer to the specific uploaded file. See the instructions for video file uploading below.
In the download video ingestion method, the video URL and transcript URL can be either http:// or https:// or s3:// based. If the URL is s3:// based, you should first communicate with us to ensure that our system has read access to your S3 bucket in AWS (Amazon Web Services).
Whether you used the download or upload video ingestion method, the video URL is mandatory in the new_job request. The URL must directly point to a downloadable video file (in which case, our system will download the file from your system) or it must be a valossaupload:// URL of your uploaded file.
The transcript URL is optional – but recommended, because an existing SRT transcript is a more reliable source of speech information than audio analysis. The URL must directly point to a downloadable SRT transcript file. Our system will download the file from your system.
The title is optional – but recommended: a human-readable title makes it easy for you to identify the video on the results page of Valossa Portal, and will also be included in the metadata file.
The description is optional. The description is any free text, in English, that describes the video.
If title and/or description are provided in the call, the text in them will also be analyzed, and the detected concepts will be included in the analysis results (the "external" concepts in the metadata JSON).
The customer media info is optional. If you provide a customer media ID in the "id" field inside the "customer_media_info" field, you may use the customer media ID (a string from your own content management system) to refer to the specific job in the subsequent API calls, replacing the "job_id" parameter with a "customer_media_id" parameter in your calls. Note: Our system will NOT ensure that the customer media ID is unique across all jobs. Duplicate IDs will be accepted in new_job calls. It is the responsibility of your system to use unique customer media IDs, if your application logic requires customer media IDs to be unique. If you use duplicate customer media IDs, then the latest inserted job with the specific customer media ID will be picked when you use the "customer_media_id" parameter in the subsequent API calls.
The language is optional. It specifies the language model to be used for analyzing the speech in the audio track of your video. The allowed values are "de-DE" (German), "en-US" (US English), "es-ES" (Spanish), "fi-FI" (Finnish), "fr-FR" (French), "it-IT" (Italian), "pt-BR" (Brazilian Portuguese), "pt-PT" (European Portuguese), "sv-SE" (Swedish) and "nl-NL" (Dutch). More languages will be supported in the future. If the language parameter is not given, the default "en-US" will be used so the speech in the video is assumed to be in US English.
Please note that for other languages than US English, the following exceptions apply.
The language-specific details are subject to change in the future.
If the analysis is technically successful (i.e. if the job reaches the "finished" state), the job will be recorded as part of the used volume on your ongoing subscription period. No subscription yet? Contact sales.
Here is an example new_job request body with only the mandatory fields present:
{ "api_key": "kkkkkkkkkkkkkkkkkkkk", "media": { "video": { "url": "https://example.com/my-car-vid.mpg" } } }
Here is an example new_job request body with the specification to use a non-default, self-created face gallery, so the faces in that gallery will be used for identifying the persons in the video:
{ "api_key": "kkkkkkkkkkkkkkkkkkkk", "media": { "video": { "url": "https://example.com/my-car-vid.mpg" } }, "analysis_parameters": { "face_galleries": { "custom_gallery": { "id": "468a2b70-3b55-46f8-b209-8ad2fcabd5c8" } } } }
If you have a default face gallery and want to use it in your video analysis job, just leave out the gallery selection in the new_job call. The default face gallery will be implicitly selected, when no other gallery is explicitly selected.
The response of a successful new_job call always includes the job_id of the created job.
Example response in an HTTP 200 OK message:
{ "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958" }
Jobs are identified by UUIDs, which appear in "job_id" fields in various messages. Your script that calls the API must, of course, save the job_id from the new_job response in order to be able to query for the status and results later.
Example test call with Curl on the command line, assuming your test request JSON is in a file you have created:
curl --header "Content-Type:application/json" -X POST -d @your_request.json https://api.valossa.com/core/1.0/new_job
If you want an HTTP POST callback and/or an email notification when your video analysis job reaches an end state, you may specify one or both of them in the new_job request. The HTTP POST callback mechanism in our system expects your system to send a 200 OK response to the request (callback) initiated by our system. Our system will retry the request once, if the first attempt to access your specified callback URL returns a non-200 code from your system or times out. Because of possible network problems and other issues, you should not rely on the HTTP POST callback always being received by your system. In any case, whether the HTTP POST callback event was received or not, your system can always check the status of the job using the job_status function in the REST API. The email notification will be sent to those users that have the permission to view job results for the chosen API key.
Example of a job request with an HTTP POST callback:
{ "api_key": "kkkkkkkkkkkkkkkkkkkk", "callback": { "url": "https://example.com/your_callback_endpoint" }, "media": { "title": "Lizards dancing", "video": { "url": "https://example.com/lizards_dancing.mkv" } } }
The HTTP POST callback message is formatted as JSON, and contains the job ID in the "job_id" field and the reached end status of the job in the "status" field. It also contains the customer media ID in the "customer_media_id" field, if you had given a customer media ID for the job. Here is an example of an HTTP POST callback message body:
{ "job_id": "ad48de9c-982e-411d-93a5-d665d30c2e92", "status": "finished" }
Example of a job request with an email notification specified:
{ "api_key": "kkkkkkkkkkkkkkkkkkkk", "email_notification": { "to_group": "users_with_access_to_api_key_results" }, "media": { "title": "Lizards dancing", "video": { "url": "https://example.com/lizards_dancing.mkv" } } }
The generated email notification message is intended for a human recipient. So, unlike the HTTP POST callback, the email notification message is not intended for machine parsing.
There are 3 different API requests that make it easy to upload your file. Because video files are often large, they must be uploaded in chunks.
First, you initialize the upload. Send an HTTP POST to the URL:
https://api.valossa.com/core/1.0/initialize_file_upload
You must specify the size of the file to be uploaded (in bytes), and the response will tell you the chunk size (in bytes) that you must use when sending your file chunks. Save the upload ID (from the "upload_id" field of the response) in order to be able to refer to the same upload action in the subsequent requests.
Curl example:
curl -F "api_key=kkkkkkkkkkkkkkkkkkkk" -F "file_size_bytes=12740839" https://api.valossa.com/core/1.0/initialize_file_upload
Example response in an HTTP 200 OK message:
{ "upload_id": "7c62bc7b-e143-4a81-aa83-a7eb0ec37077", "file_chunk_size_bytes": 4194304 }
The Content-Type header of the initialize_file_upload request must be "multipart/form-data". If you use Curl and its "-F" option, Curl will set this Content-Type as default and will also use POST as the HTTP method in the request.
Next, upload each chunk of your video file. The size of each chunk must be exactly the number of bytes indicated in the "file_chunk_size_bytes" field in the response you received for your initialize_file_upload request. However, the last chunk of the file may have a different size; this is natural, because the total size of the file usually is not an exact multiple of the chunk size.
How to split your video file into chunks? In Linux-style operating systems, the "split" command is suitable for the task. We recommend using the "-d" option of the "split" command: the chunks will then be named so that the chunk index in the chunk file names is a number sequence rather than the default letter sequence. The number-based indexing is probably easier to use in your own helper scripts.
Split example:
split -d --bytes=4194304 supervideo.mp4 supervideo.mp4.chunk
The above command creates the files supervideo.mp4.chunk00, supervideo.mp4.chunk01 and supervideo.mp4.chunk02, each with a size of 4194304 bytes, and the file supervideo.mp4.chunk03 with a size of 157927 bytes.
When uploading file chunks, you must specify the upload ID and chunk index. The chunk index is an integer, and the first chunk of the file has the index 0. You do not have to send the send_file_chunk requests in the temporal order 0, 1, 2..., but you must specify the correct index for each chunk, so that the uploaded file is correct when the chunks are reassembled into the full file in our system. In practice, it is probably easiest to use a helper script that simply sends the chunks in the order 0, 1, 2..., increasing the value of chunk_index in a loop.
Send an HTTP POST to the URL:
https://api.valossa.com/core/1.0/send_file_chunk
Curl example for uploading the first chunk:
curl -F "api_key=kkkkkkkkkkkkkkkkkkkk" -F "upload_id=7c62bc7b-e143-4a81-aa83-a7eb0ec37077" -F "chunk_index=0" -F "file_data=@supervideo.mp4.chunk00" https://api.valossa.com/core/1.0/send_file_chunk
The Content-Type header of the send_file_chunk request must be "multipart/form-data".
Example response in an HTTP 200 OK message:
{}
When all chunks have been uploaded, you need to finalize the upload, referring to the correct upload ID. Send an HTTP POST to the URL:
https://api.valossa.com/core/1.0/finalize_file_upload
Curl example:
curl -F "api_key=kkkkkkkkkkkkkkkkkkkk" -F "upload_id=7c62bc7b-e143-4a81-aa83-a7eb0ec37077" https://api.valossa.com/core/1.0/finalize_file_upload
The Content-Type header of the finalize_file_upload request must be "multipart/form-data".
Example response in an HTTP 200 OK message:
{ "uploaded_file_url": "valossaupload://7c62bc7b-e143-4a81-aa83-a7eb0ec37077" }
Save the valossaupload:// URL from the "uploaded_file_url" field of the response, so you can use it in your new_job request for that video.
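To tie the three steps together, here is a minimal Python sketch (assuming the third-party requests library; the file name and API key are placeholders). The (None, value) tuples in the files parameter make requests send the plain form fields as multipart/form-data, as required by these upload functions.

import os
import requests

API = "https://api.valossa.com/core/1.0"
API_KEY = "kkkkkkkkkkkkkkkkkkkk"
PATH = "supervideo.mp4"  # placeholder file name

# 1. Initialize the upload; the response tells the mandatory chunk size.
init = requests.post(API + "/initialize_file_upload",
                     files={"api_key": (None, API_KEY),
                            "file_size_bytes": (None, str(os.path.getsize(PATH)))}).json()
upload_id = init["upload_id"]
chunk_size = init["file_chunk_size_bytes"]

# 2. Send the chunks in order; the last chunk may be smaller than chunk_size.
with open(PATH, "rb") as f:
    chunk_index = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        requests.post(API + "/send_file_chunk",
                      files={"api_key": (None, API_KEY),
                             "upload_id": (None, upload_id),
                             "chunk_index": (None, str(chunk_index)),
                             "file_data": ("chunk", chunk)}).raise_for_status()
        chunk_index += 1

# 3. Finalize the upload and save the valossaupload:// URL for the new_job request.
final = requests.post(API + "/finalize_file_upload",
                      files={"api_key": (None, API_KEY),
                             "upload_id": (None, upload_id)}).json()
print(final["uploaded_file_url"])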
For an entirely manual upload of a file in a graphical user environment, use the Analyze page in Valossa Portal.
The following pertains to the HTTP error responses, which are returned immediately for your API call if your request was malformed or missing mandatory fields. In other words, the following does not pertain to the separate HTTP callback messages, which were discussed above. (Callback events are not even generated for the errors that are returned immediately in the HTTP error response of an API call.)
Error responses from the API calls (new_job calls or any other calls) contain an error message, and can be automatically separated from 200 OK responses, because error responses are sent along with an HTTP error code (non-200). Error responses are also formatted as JSON, and they contain an "errors" array, where one or more errors are listed with the corresponding error messages.
Example error response in an HTTP 400 message:
{ "errors": [ { "message": "Invalid API key" } ] }
The status of a single analysis job is polled using HTTP GET.
Example request:
https://api.valossa.com/core/1.0/job_status?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=6faefb7f-e468-43f6-988c-ddcfb315d958
Example response in an HTTP 200 OK message:
{ "status": "processing", "media_transfer_status": "finished", "details": null, "poll_again_after_seconds": 600 }
Possible values for the "status" field: "queued", "on_hold", "preparing_analysis", "processing", "finished", and "error". More status values may be introduced in the future.
If the job status is "error", something went wrong during the analysis process. If there is an explanation of the error in the "details" field, please see if the cause of the error is something you can fix for yourself (such as a non-video file in the video URL of the job request). Otherwise, contact us in order to resolve the issue.
If the job status is "queued" or "processing", you should poll the status again after some time.
If the job status is "finished", you can fetch the job results using the job_results function.
The "details" field may contain some additional details about the status of the job.
The "media_transfer_status" field indicates whether the media to be analyzed has been transferred from your system to our system. Possible values for the "media_transfer_status" field: "queued", "downloading", "finished" and "error". If "media_transfer_status" is "finished", your video (and the transcript if you provided it) have been successfully transferred to our system.
The value in "poll_again_after_seconds" is just a suggestion about when you should poll the job status again (expressed as seconds to wait after the current job_status request).
If there was a problem with the job_status query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
After a job has been finished, the resulting video metadata can be fetched using HTTP GET.
Example request:
https://api.valossa.com/core/1.0/job_results?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=6faefb7f-e468-43f6-988c-ddcfb315d958
Response data is in the JSON format. For more details, see chapter "Output metadata JSON format".
Save the metadata and use it from your own disk or database for quick and easy access. We will not necessarily store the results perpetually.
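A minimal sketch of fetching and storing the metadata in Python (requests library assumed; the output file name is a placeholder):

import requests

response = requests.get("https://api.valossa.com/core/1.0/job_results",
                        params={"api_key": "kkkkkkkkkkkkkkkkkkkk",
                                "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958"})
response.raise_for_status()  # raises an exception on a non-200 response
with open("valossa_core_metadata.json", "w") as f:
    f.write(response.text)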
If there was a problem with the job_results query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
Want to search for specific recognized things among your accumulated video analysis job results, by typing search words such as "airplane" or "Emilia Clarke"? Please try the easy-to-use Search functionality in Valossa Portal.
A convenience function for listing all your jobs, optionally with their job statuses (optional parameter "show_status" with the value "true"), using HTTP GET:
Example request:
https://api.valossa.com/core/1.0/list_jobs?api_key=kkkkkkkkkkkkkkkkkkkk&show_status=true
Example response in an HTTP 200 OK message:
{"jobs": [ { "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958", "job_status": { "status": "finished", "media_transfer_status": "finished", "details": null, "poll_again_after_seconds": null } },{ "job_id": "36119563-4b3f-44c9-83c6-b30bf69c1d2e", "customer_media_id": "M4070117", "job_status": { "status": "processing", "media_transfer_status": "finished", "details": null, "poll_again_after_seconds": 600 } } ]}
If you had given a customer media ID when creating the job, the "customer_media_id" field exists and contains the customer media ID value.
Showing video titles and other media information in the job listing is often useful. This can be done by using the optional GET parameter "show_media_info" with the value "true". Example request:
https://api.valossa.com/core/1.0/list_jobs?api_key=kkkkkkkkkkkkkkkkkkkk&show_status=true&show_media_info=true
Example response in an HTTP 200 OK message:
{"jobs": [ { "job_id": "36119563-4b3f-44c9-83c6-b30bf69c1d2e", "customer_media_id": "M4070117", "job_status": { "status": "finished", "media_transfer_status": "finished", "details": null, "poll_again_after_seconds": null, "media_info": { "title": "Birds clip #22", "description": "Birds having a bath", "video": { "url": "https://example.com/contentrepository/project1/aabhk-gg4rt-gi5aq-jjv6t/birds_22_original.mp4" } } } },{ "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958", "job_status": { "status": "finished", "media_transfer_status": "finished", "details": null, "poll_again_after_seconds": null, "media_info": { "video": { "url": "https://example.com/my-car-vid.mpg" } } } } ]}
By adding the optional GET parameter "n_jobs" to the request (example: n_jobs=500), you can control how many of your jobs will be listed if your job list is long. The default is 200. The maximum possible value for "n_jobs" is 25000.
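A minimal sketch of listing jobs from Python (requests library assumed; the API key is a placeholder):

import requests

listing = requests.get("https://api.valossa.com/core/1.0/list_jobs",
                       params={"api_key": "kkkkkkkkkkkkkkkkkkkk",
                               "show_status": "true", "n_jobs": 500}).json()
for job in listing["jobs"]:
    # "customer_media_id" is present only if one was given when the job was created.
    print(job["job_id"], job["job_status"]["status"], job.get("customer_media_id", "-"))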
If there was a problem with the list_jobs query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
Cancel a job by sending an HTTP POST to the URL:
https://api.valossa.com/core/1.0/cancel_job
Example cancel_job request body:
{ "api_key": "kkkkkkkkkkkkkkkkkkkk", "job_id": "be305b1e-3671-45b1-af88-1f052db3d1bb" }
Example response in an HTTP 200 OK message:
{ "job_status": "canceled" }
The job must be in a cancellable state for this function to succeed. For example, a finished job is not cancellable.
If there was a problem with the cancel_job query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
Delete a job by sending an HTTP POST to the URL:
https://api.valossa.com/core/1.0/delete_job
Example delete_job request body:
{ "api_key": "kkkkkkkkkkkkkkkkkkkk", "job_id": "f3cd3108-444a-4c06-84a6-730ac231e431" }
Example response in an HTTP 200 OK message:
{}
Deleting a job will remove it from the set of your existing jobs. When a job is deleted, the assets (video file etc.) associated with the job are also deleted. There may be some delay between completing the delete_job query and the purging of all assets from the storage systems in the background.
If there was a problem with the delete_job query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
The Valossa Training API is part of those Valossa AI subscriptions that contain faces-related functionality. In addition to REST API access, the functionalities of the Valossa Training API can be accessed using a graphical user interface in Valossa Portal.
Using the Valossa Training API, you can train the system to detect custom faces. The custom faces will be detected in those videos that you analyze after the training. By training your custom faces, you acknowledge and accept the fact that using custom-trained faces may cause some additional delays in the processing of your video analysis jobs.
The detected face identities will appear in the "similar_to" fields of the "human.face" detections in Valossa Core metadata. Your API key(s) that work for creating new analysis jobs with the Valossa Video Recognition API will also work for face training with the Valossa Training API, if faces-related functionality is included in your active Valossa AI subscription.
How to create your custom face gallery? There are two ways:
Based on these Curl request-and-response examples, it is easy to adapt the REST API calls for use in your application. Just like with the Video Recognition API, the HTTP response code 200 indicates a successful operation, and a non-200 code indicates an error (an error message is provided in that case). The response body is in the JSON format.
As you can see from the examples, any "read data" requests use the HTTP GET method, while any "write data" or "erase data" requests use the HTTP POST method.
The trained faces can be used in your subsequent video analysis jobs to create the "similar_to" items in the "human.face" detections with the correct personal identities. Of course, the faces to use are picked from the gallery specified in the new_job request. If no gallery is specified, then your default face gallery is used for the video analysis job. A non-default gallery must be explicitly specified in the new_job request in order to be used in the video analysis job.
Adding sample images for a face has been designed to work with both file uploads and file downloads. Thus, the file reference mechanism used in the add_face_image request of the Valossa Training API uses an easy URL-based syntax for both file input styles. Currently, only uploads are supported, but download support will be added in the future. Download means that our system downloads the image file from an HTTP(S) URL provided by your system. Each uploaded file is assigned a valossaupload:// URL that uniquely identifies the successfully received file in our storage system.
When you upload a file, first use the REST function upload_image to transfer the file content. Then use the valossaupload:// URL of that file when referring to it in the add_face_image request during the actual face identity training.
Before using the service in a way that makes you a "processor" and/or "controller" of the personal data of EU-resident natural persons, you are required to make sure that your actions are compliant with the General Data Protection Regulation. See the Terms and Conditions of Valossa services
An image must be in the JPG or PNG format. The maximum image file size is 8MB. The maximum width of an image is 4096 pixels. The maximum height of an image is 4096 pixels.
At least 10 different sample images of each face, photographed from different angles etc., should be given in order to get good training results. The more images the better. Training may in some cases work even with only a few images, but the results are better with more samples: a lot of clear, diverse, high-quality images of the face to be trained.
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/upload_image
Curl example:
curl -F "image_data=@ricky_1.jpg" -F "api_key=kkkkkkkkkkkkkkkkkkkk" https://api.valossa.com/training/1.0/upload_image
The upload_image call is similar regardless of the gallery type (default vs. non-default), because this request does not operate on a gallery; it just uploads an image for further use that probably involves a gallery.
Example response in an HTTP 200 OK message:
{ "uploaded_file_url": "valossaupload://ff357efe-1086-427d-b90c-1d1887fb1017" }
The Content-Type header of the file upload request must be "multipart/form-data". If you use Curl and its "-F" option, Curl will set this Content-Type as default and will also use POST as the HTTP method in the request. There must be one file per upload_image request.
Note! As you can see from the Curl request example above, the API key must be sent as a form parameter (not URL parameter). This is quite natural, taking into account that the Content-Type of the request is "multipart/form-data".
All the POST-based REST functions listed below accept a JSON-formatted input string, which contains the parameters of the specific function. The GET-based REST functions read their parameters from the request URL.
You may decide to use your default face gallery that is created implicitly when you start training faces without specifying a face gallery. In that case, please skip face gallery creation. Otherwise, please explicitly create your non-default face gallery based on the following instructions.
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/create_face_gallery
Curl example:curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "gallery":{"name":"My Special Faces"}}' https://api.valossa.com/training/1.0/create_face_gallery
Example response in an HTTP 200 OK message:
{ "gallery_id": "468a2b70-3b55-46f8-b209-8ad2fcabd5c8" }
Save the gallery ID locally. You will need it when you add face identities to the gallery or when you do any other operations with the specific gallery.
The maximum length of the "name" parameter of a face gallery is 1024 characters.
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/update_face_gallery
The "updates" structure contains the face gallery parameters to update. Currently the only allowed parameter is "name". The data type for this value is string. The maximum length of the value of "name" is 1024 characters.
Curl example:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "gallery":{"id":"468a2b70-3b55-46f8-b209-8ad2fcabd5c8", "updates":{"name":"My Extra-Special Faces"}}}' https://api.valossa.com/training/1.0/update_face_gallery
Example response in an HTTP 200 OK message:
{}
Only a non-default face gallery can be updated. The default face gallery cannot be modified by this request. If you need named galleries, you should be using non-default, explicitly created face galleries.
Send an HTTP GET to the URL:
https://api.valossa.com/training/1.0/list_face_galleries
Curl example:
curl 'https://api.valossa.com/training/1.0/list_face_galleries?api_key=kkkkkkkkkkkkkkkkkkkk'
Example response in an HTTP 200 OK message:
{ "face_galleries": [ { "id": "cc003b8e-6c67-491b-95c2-9155dc894549" }, { "id": "468a2b70-3b55-46f8-b209-8ad2fcabd5c8" } ] }
It is also possible to list existing face galleries with details.
Curl example:
curl 'https://api.valossa.com/training/1.0/list_face_galleries?api_key=kkkkkkkkkkkkkkkkkkkk&show_details=true'
Example response in an HTTP 200 OK message:
{ "face_galleries": [ { "id": "cc003b8e-6c67-491b-95c2-9155dc894549", "created_at": "2018-01-27 10:03:29", "is_default": true }, { "id": "468a2b70-3b55-46f8-b209-8ad2fcabd5c8", "name": "My Extra-Special Faces", "created_at": "2018-10-18 12:22:47", "is_default": false } ] }
The string fields "name" and "gender" are optional. We recommend setting at least the name, because a nameless face identity might cause confusion for you later on (however, it is perfectly acceptable to have a nameless face identity, if your application logic requires creating such an identity). The maximum length of the value of "name" is 1024 characters. The gender is "male" or "female". The response contains the unique identifier of the face identity (person).
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/create_face_identity
Curl example if you are using your default gallery:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"name":"Ricky Rickson", "gender":"male"}}' https://api.valossa.com/training/1.0/create_face_identity
Curl example if you are using a non-default gallery:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "gallery":{"id":"468a2b70-3b55-46f8-b209-8ad2fcabd5c8"}, "face":{"name":"Ricky Rickson", "gender":"male"}}' https://api.valossa.com/training/1.0/create_face_identity
Example response in an HTTP 200 OK message:
{ "face_id": "bb254a82-08d6-4498-9ddb-3de4c88f1f66" }
Save the face ID locally. You will need it when you add images for the face or when you do any other operations with the specific face identity.
Referring to your previously uploaded files, you add the correct files to a specific existing face identity, one image file per add_face_image request. The response contains a unique identifier for the processed and accepted training image, from which a sample face has been detected. You will need this ID later if you want to perform any operations on this training image that has been added to a specific face identity.
There must be exactly one face visible per image. This REST function may take a few seconds to complete, because the system checks that exactly one face is clearly visible (otherwise, an error response is generated).
In the future, image download URLs can also be used with the same easy add_face_image call syntax. Currently, only the valossaupload:// URLs created as a result of file uploads are supported.
Please make sure that each of the images is actually an image of the correct person. Typically, checking this involves some human work. Wrong images will deteriorate the quality of face detections.
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/add_face_image
In this request, there is no need to specify the gallery, even when using a non-default gallery. The face ID is a unique identifier for the correct face.
Curl example:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"id":"bb254a82-08d6-4498-9ddb-3de4c88f1f66"}, "image":{"url":"valossaupload://ff357efe-1086-427d-b90c-1d1887fb1017"}}' https://api.valossa.com/training/1.0/add_face_image
Example response in an HTTP 200 OK message:
{ "image_id": "8ac7ab90-44d1-4860-9a2f-2afbb175638a" }
Send an HTTP GET to the URL:
https://api.valossa.com/training/1.0/list_face_identities
Curl example if you are using your default gallery:
curl 'https://api.valossa.com/training/1.0/list_face_identities?api_key=kkkkkkkkkkkkkkkkkkkk'
Curl example if you are using a non-default gallery:
curl 'https://api.valossa.com/training/1.0/list_face_identities?api_key=kkkkkkkkkkkkkkkkkkkk&gallery_id=468a2b70-3b55-46f8-b209-8ad2fcabd5c8'
Example response in an HTTP 200 OK message:
{ "face_identities": [ { "id": "a99a59e3-ba33-4b00-8114-8bdd92a71dfa" }, { "id": "bb254a82-08d6-4498-9ddb-3de4c88f1f66" } ] }
It is also possible to list existing face identities with details.
Curl example if you are using your default gallery:
curl 'https://api.valossa.com/training/1.0/list_face_identities?api_key=kkkkkkkkkkkkkkkkkkkk&show_details=true'
Curl example if you are using a non-default gallery:
curl 'https://api.valossa.com/training/1.0/list_face_identities?api_key=kkkkkkkkkkkkkkkkkkkk&gallery_id=468a2b70-3b55-46f8-b209-8ad2fcabd5c8&show_details=true'
Example response in an HTTP 200 OK message:
{ "face_identities": [ { "id": "a99a59e3-ba33-4b00-8114-8bdd92a71dfa", "name": "Lizzy Blythriver", "gender": "female" }, { "id": "bb254a82-08d6-4498-9ddb-3de4c88f1f66", "name": "Ricky Rickson", "gender": "male" } ] }
Send an HTTP GET to the URL:
https://api.valossa.com/training/1.0/list_face_images
In this request, there is no need to specify the gallery, even when using a non-default gallery. The face ID is a unique identifier for the correct face.
Curl example:
curl 'https://api.valossa.com/training/1.0/list_face_images?api_key=kkkkkkkkkkkkkkkkkkkk&face_id=bb254a82-08d6-4498-9ddb-3de4c88f1f66'
Example response in an HTTP 200 OK message:
{ "face_images": [ { "id": "8ac7ab90-44d1-4860-9a2f-2afbb175638a" }, { "id": "b5559837-62a5-4f10-b250-a554ab2ce54c" } ] }
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/update_face_identity
In this request, there is no need to specify the gallery, even when using a non-default gallery. The face ID is a unique identifier for the correct face.
The "updates" structure contains one or more face parameters to update. The allowed parameters are "name" and "gender". The data type for these values is string. The maximum length of the value of "name" is 1024 characters. The value for "gender" is "male" or "female".
Note: To unset a field such as "name" or "gender" completely, just set it to null in an update_face_identity call. In an update, a value that is not mentioned in the "updates" structure will retain its old value if it had one (in other words, omitting the field from the update does not unset the value of the field, while setting it explicitly to null will unset it).
Curl example:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"id":"bb254a82-08d6-4498-9ddb-3de4c88f1f66", "updates":{"name":"Ricky Rixon-Borgmann"}}}' https://api.valossa.com/training/1.0/update_face_identity
Example response in an HTTP 200 OK message:
{}
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/remove_face_image
In this request, there is no need to specify the gallery, even when using a non-default gallery. The image ID is a unique identifier for the correct image.
Curl example:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "image":{"id":"b5559837-62a5-4f10-b250-a554ab2ce54c"}}' https://api.valossa.com/training/1.0/remove_face_image
Example response in an HTTP 200 OK message:
{}
Send an HTTP POST to the URL:
https://api.valossa.com/training/1.0/remove_face_identity
In this request, there is no need to specify the gallery, even when using a non-default gallery. The face ID is a unique identifier for the correct face.
Curl example:
curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"id":"bb254a82-08d6-4498-9ddb-3de4c88f1f66"}}' https://api.valossa.com/training/1.0/remove_face_identity
Example response in an HTTP 200 OK message:
{}
Valossa Portal provides an easy-to-use visualization tool, called the Valossa Report, for you to get a quick visual overview of the most prominent detections, and also a more detailed heatmap for browsing the results. Remember to contact sales if you do not have access to Valossa Portal and Valossa Report yet.
On the home page, each displayed information box that is related to a successfully analyzed video contains a link to the Valossa Report of the video analysis results. To see examples of Valossa Report, click "Demos" on the home page (you must be logged in to Valossa Portal in order to do this).
Below you'll find example screenshots of Valossa Report.
(Actually, the Valossa Report is a tool for viewing your Valossa Core metadata in a human-friendly way. When you're ready to integrate Valossa Core metadata into your application, please see the instructions for machine-reading the Valossa Core metadata.)
The Valossa Report's Overview gives you a quick visual overview of the analyzed video content.
The tags are an overview of the detected concepts. By clicking the arrows you can browse through the detections in the video. You can also search for a concept within the video by clicking the magnifying glass symbol.
The Valossa Report's Heatmap displays the timeline of a video, and detections of concepts are placed on the timeline. Each detection is shown on its own row (its own timeline). Detections are grouped by their detection type such as human.face, visual.context, audio.context, etc. Please note that different colors are given to different detection types for distinguishing them visually.
Within a detection type, detections are grouped by prominence. For example, the most prominent faces are shown first.
With the Valossa Report controls, you can change the resolution of the timeline (how many distinct timeslots are shown) and the number of detections shown. You can also adjust the confidence threshold for several detection types. The detections below the chosen threshold are hidden.
The color intensity of the spots on a detection's timeline shows how many detections of that concept are in that timeslot and/or how confident the detections are. Click on a colored spot, and the video player on the Valossa Report page will play back the video from the corresponding timecode. Thus, you are able to see the main concepts of the video arranged by time and prominence, and verify their correctness. With the main timeline and the seek bar under the video player, you can also move to any time position in the video.
The Tag & Train naming tool can be used to edit names and genders of the detected faces. Changes will be saved to the metadata of the video analysis job and indexed into the search automatically. Training functionality that allows the AI to learn from the changes is available.
Click the "Tag & Train" button above the face detections or the pencil next to a person name to open the tool.
Valossa Core metadata is provided in downloadable JSON files, which are available via the REST API (function job_results) or via the results page in Valossa Portal that shows the results and other info about your most recent jobs.
The sizes of the JSON files vary depending on the size of the videos and the number of detections, ranging from a few kilobytes to several megabytes. You should save the metadata JSON in your local database or file system. The metadata will not necessarily be stored perpetually in our system, download count limits may be imposed in the future, and it is also faster for your application to access the metadata from your local storage space.
The version number of the metadata format is updated whenever the format changes (version changelog of Valossa Core metadata). The version number is a concatenation of three integers, with a dot (.) as the delimiter: from the beginning of the string, the version number x.y.z contains a major version number, a minor version number and a patch number. If only the patch version number (z in x.y.z) changes, the changes are purely additions to the structure, i.e. they cannot break your parsing code.
Valossa Core metadata has been designed to address several needs. It answers questions such as:
Please see the images below for a quick explanation of how to read these things from the metadata.
Valossa Video Recognition AI addresses the needs 1 and 4 by detecting a variety of things and then ranking the most dominant detections from the video, so that the Valossa Core metadata can be used for answering questions such as "What are the visuals about?", "Who are the faces appearing in the video?", "What sounds are in the audio track?", "What are the spoken words about?", "What is the entire video about?", etc. The detections are conveniently grouped by detection type; see more below. The needs 2 and 3 are addressed by Valossa Video Recognition AI with a smart time-coding logic that makes it easy to read either all the temporal occurrences of a specific detection or all the detections at a specific time position, whichever way is most useful for your application.
A more detailed explanation of the fields "detections" and "by_detection_type" can be found in the subchapter Detections.
Detections are grouped by Valossa Video Recognition AI in a way that makes it easy for your application code to iterate over all instances (occurrences) of, for example, cats:
By reading the "by_second" field, your application code can easily list everything at a given time position. More details about the "by_second" field are provided in the subchapter Detections.
Using IAB categories, the metadata tells the topics of the video to your application code:
Valossa Core metadata about your videos is hierarchical and straightforward to parse for your application code. High-level structure of the current Valossa Core video metadata JSON format, not showing detailed subfields:
{ "version_info": { "metadata_type": "core", "metadata_format": "...", "backend": "..." }, "job_info": { "job_id": "...", "request": {...} }, "media_info": { ... }, "detections": { ... }, "detection_groupings": { "by_detection_property": { ... }, "by_detection_type": { ... }, "by_frequency": { ... }, "by_second": [ ... ] }, "segmentations": { ... } }
Currently there are four supported values for the "metadata_type" field: "core", "frames_faces", "seconds_objects" and "frames_objects". The default type is "core" (Valossa Core metadata) — if you need "frames_faces" metadata that contains the bounding box information for the detected faces or "seconds_objects" or "frames_objects" metadata that contain the bounding box information for the detected visual objects, you must specify this in your API call when downloading metadata.
The version number of the metadata format (x.y.z, explained above) can be found in the "metadata_format" field under "version_info". Of course, the version numbering of Valossa Core metadata (the files with "core" as the value of the "metadata_type" field) is separate from the version numbering of "frames_faces" or "seconds_objects" or "frames_objects" metadata for the same video.
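For example, a reader application could implement a simple compatibility check along these lines (a sketch only; the accepted major/minor pair is an assumption you would adjust to whatever format version your parser was written against).

def is_compatible(metadata, supported_major_minor=(1, 4)):
    # "metadata_format" is an x.y.z string; patch-level differences are purely additive
    # and therefore safe to accept.
    major, minor, patch = (int(part) for part in
                           metadata["version_info"]["metadata_format"].split("."))
    return (major, minor) == supported_major_minor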
Using special metadata types such as "frames_objects" requires, obviously, that your video analysis job was run with capabilities that produce the special metadata files (in addition to the corresponding detection IDs in the Core metadata file).
You will best understand the details of the metadata structure by viewing an actual metadata JSON file generated from one of your own videos. As a first step, you will probably want to view your results using the easy Valossa Report visualization tool.
Note: In order to save storage space, JSON provided by the API does not contain line-breaks or indentations. If you need to view JSON data manually during your software development phase, you can use helper tools in order to get a more human-readable (pretty-printed) version of the JSON. For example, the JSONView plugin for your browser may be of help, if you download JSON metadata from the Portal: the browser plugin will display a pretty-printed, easily navigable version of the JSON. In command-line usage, you can use the "jq" tool or even Python: cat filename.json | python -m json.tool > prettyprinted.json
In the following subchapters, the JSON metadata format is described in more detail.
All concept detections from the video are listed in the field "detections". This is an associative array, where the key is a detection ID and the value is the corresponding detection. Please note that the detection ID is a string, and you must not assume that the string always represents an integer, even though the IDs often look like "1" or "237". So, the ID is a string, unique within the key space of the "detections" structure, but your code cannot assume that the string has a specific internal format.
The detection IDs are used in "detection_groupings" to refer to the specific detection, so the detailed information about each detection resides in one place in the JSON but may be referenced from multiple places using the ID. Inside the field "detection_groupings", four practical groupings of detections are given for you:
The following image helps understand the usage of detection IDs as references within the JSON data:
How to get an overview of the most prominent detections? That's easy: in "by_detection_type", start reading detections from the beginning of the lists under each detection type. Because the detections are sorted with the most relevant ones first, reading e.g. the first 20 detections from "human.face" gives you an overview of the most prominent faces in the video. For an easy and quick overview of detections, you may view the Valossa Report (visualization of detections) of the video in Valossa Portal.
However, please note that the "audio.speech" detections (speech-to-text results) are not ordered by prominence, as they are just raw snippets of speech detected from a specific time-range from within the video's audio track. The complete speech-to-text data of a video are also available in the SRT format from the Valossa Video Recognition API (see speech-to-text SRT download) and in Valossa Portal on the page that lists the most recent analysis results. The content of the downloadable SRT file is generated from the "audio.speech" detections from the Valossa Core metadata JSON file, so the information is the same whether you read the speech-to-text results from the metadata or from the SRT downloaded from Valossa Portal. Please note that the newlines in the generated speech-to-text SRT file are Unix-newlines (LF only, not CRLF).
Every detection in the JSON has, at minimum, the fields "t" (detection type identifier) and "label". The "label" is just the default human-readable label of the detected concept, and for many detection types, more specific information is available in additional data fields. The following is the list of currently supported detection type identifiers.
Fields that exist or don't exist in a detection, depending on the detection type and situation, include "occs", "a", "ext_refs", "categ" and "cid".
Currently, the following detection types are supported.
visual.context
visual.object.localized
audio.context
audio.speech
human.face
human.face_group
transcript.keyword.compliance
transcript.keyword.novelty_word
transcript.keyword.name.person
transcript.keyword.name.location
transcript.keyword.name.organization
transcript.keyword.name.general
audio.keyword.compliance
audio.keyword.novelty_word
audio.keyword.name.person
audio.keyword.name.location
audio.keyword.name.organization
audio.keyword.name.general
external.keyword.novelty_word
external.keyword.name.person
external.keyword.name.location
external.keyword.name.organization
external.keyword.name.general
topic.iab
topic.general
explicit_content.nudity
explicit_content.audio.offensive
explicit_content.transcript.offensive
visual.color
visual.text_region.full_frame_analysis
visual.text_region.lower_third
visual.text_region.middle_third
visual.text_region.upper_third
The identifiers are mostly self-explanatory. Please note that "visual.context" offers a broad range of visual detections such as objects; "audio.context" offers a broad range of audio-based detections; "topic.iab" and "topic.general" are categories for the entire video; "external.keyword.*" refers to keywords found in the video description or title; "human.face_group" denotes groups of people whose temporal correlation is high enough that they probably have meaningful interaction with each other.
In addition to the clearly offensive, explicit words or phrases (such as swearwords) detected from speech and having the detection type "explicit_content.audio.offensive", any Content Compliance related keywords from speech or transcript have the detection types "audio.keyword.compliance" and "transcript.keyword.compliance", respectively. They include words and phrases related to violence, sex and substance use. Currently, these Content Compliance specific keyword detections are only available for English-language content.
The field "occs" contains the occurrence times of the detection. There is a start time and an end time for each occurrence. For example, a visual object "umbrella" might be detected 2 times: first occurrence from 0.3 seconds to 3.6 seconds, and another occurrence from 64.4 seconds to 68.2 seconds — so there would be 2 items in the "occs" array. Time values are given as seconds "ss" (seconds start) and "se" (seconds end), relative to the beginning of the video.
Detections that are not time-bound (such as topic.iab and external.keyword.*) cannot contain "occs".
If applicable to the detection type, occurrences have a maximum confidence ("c_max") detected during the occurrence period. (Because confidence varies at different moments during the occurrence, it makes sense to provide just the maximum value here. To find out the confidence during a particular moment, check out the "c" field of each second in the "by_second" data.) Currently, only visual.context and audio.context detections have "c_max".
Please note that if you want to answer the question "What is in the video at time xx:xx?", then you should see the "by_second" array in the "detection_groupings". Occurrences, on the other hand, are good when you want to answer the question "At what time-sections is Y detected?"
Each occurrence also contains the ID of the shot where the occurrence starts. The shot ID, stored as an integer in the field "shs", is just a numerical index into the array "detected_shots" within segmentations (the first shot is at index 0, the next one is at index 1, and so on). Similarly, the field "she" provides the ID of the shot where the occurrence ends. These shot references make it easy to integrate detection occurrences into a video workflow that utilizes the shot boundaries for clipping or for a similar purpose. For example, when your use case involves finding Content Compliance related concepts (such as nudity-related concepts), the entire shot can easily be exported to your video editor application or MAM information system instead of just the occurrence, if this is what your workflow needs.
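As a sketch of reading occurrences (field names as described above; the label "umbrella" and the file name are placeholders), the following lists the time ranges and enclosing shot indices for every detection with that label.

import json

with open("valossa_core_metadata.json") as f:
    metadata = json.load(f)

for detection_id, detection in metadata["detections"].items():
    if detection["label"] == "umbrella":
        # Not all detection types have "occs", hence the .get() with a default.
        for occ in detection.get("occs", []):
            print("%.1f s - %.1f s (shots %d to %d)"
                  % (occ["ss"], occ["se"], occ["shs"], occ["she"]))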
As you remember, "t" and "label" are always given for a detection. The field "occs" might not be there. Besides "occs", there are other optional fields for a detection: "a", "ext_refs", "categ"
If exists, the object-field "a" contains attributes of the detection. For example, the "human.face" detections may have attributes: "gender" that includes the detected gender, "similar_to" that includes the possible visual similarity matches to persons in a face gallery, and "s_visible" i.e. the total screen-time of the face (note: nearly always less than the combined duration of the occurrences of the face, because during an occurrence some frames usually do not have the detected face or other concept — some frame-gaps are allowed in occurrences in order to create a practical simplification of time-bound visibility, while "s_visible" is the combined duration of only those frames where this face has actually been seen by the AI). The "gender" structure also contains the field "c" that provides the confidence of the detected gender (0.0 to 1.0).
If exists, the string-field "cid" contains the unique identifier of the concept in the Valossa Concept Ontology. All visual.context detections and audio.context detections have "cid". However, for example audio.speech detections don't have "cid".
If exists, the array-field "ext_refs" contains references to the detected concept in different ontologies. Most visual.context detections have "ext_refs", expressing the concept identity in an external ontology, such as the Wikiedata ontology or the Google Knowledge Graph ontology (or several ontologies, depending on the availability of the concept in the various external ontologies). Inside "ext_refs", the ontology identifier for Wikidata is "wikidata" and the ontology identifier for Google Knowledge Graph is "gkg" (see examples). If a specific external ontology reference object such as "wikidata" exists, there is an "id" field inside the object; the "id" field contains the unique identifier of the concept within that external ontology. Then you may search information about the concept from exteral services such as https://www.wikidata.org/. For "topic.iab" detections, the "ext_refs" field contains the ontology identifier "iab", and the ontology reference object describes the topic (IAB category) in the industry-standard IAB classification.
If exists, the object-field "categ" provides a useful list of the concept categories (such as "food_drink", "sport", "violence"...) for the detection. See detailed information on reading and understanding the detection categories.
If present in a detection, the object field "categ" contains the key "tags", whose value is an array of one or more category identifier tags (string-based identifiers such as "flora" or "fauna") for the concept of the detection. For example, a "dog" detection has the category tag array ["fauna", "pets"], and a "train station" detection has the category tag array ["place_scene", "public_transport", "traffic", "buildings_architecture"]. Many visual.context detections and some audio.context detections have "categ". Note! This describes the categories of a specific detection (a single concept), which is a completely different thing from the categories of the entire video (such as IAB categories).
By checking the "categ":"tags" of your detections, your reader code can easily filter detections, for example, to find all Content Compliance detections in your metadata. This is very useful in a Content Compliance use scenario, and Valossa Video Recognition AI is especially good at detecting Content Compliance related detections for the media industry (a typical use case is to detect audiovisual content that is inappropriate for a specific viewing time or viewer demographic). You may be interested in the blog post about Valossa reaching best accuracy in visual Content Compliance benchmark.
Currently, the following tag categories are supported.
Category name shown in Valossa Report | Identifier tag of the tag category in metadata | Explanation |
---|---|---|
Accidents and destruction | accident | Severe situations such as accidents, explosions, conflicts and destruction after natural or civil catastrophes. |
Act of violence | act_of_violence | Act of violence that could injure a victim. |
Automotive | automotive | Cars, trucks and motorcycles. |
Aviation | aviation | Airplanes and spacecraft. |
Boats and ships | boats_ships | Boats and ships. |
Bombs and explosions | bomb_explosion | Explosions and smoke. |
Brand or product | brand_product | Brands and products such as branded vehicles. This category does not include logos. A logo detection model can be purchased separately. |
Buildings and architecture | buildings_architecture | Different buildings and architectural details. |
Celebrations and holidays | celebrations_holidays | Personal celebrations and holidays like wedding, graduation, Christmas and Thanksgiving. |
Children, family and play | children_family_play | Children, toys and accessories. Also board games and amusement park rides, which can be enjoyed by people of all ages. |
Computers and video games | computers_video_games | Computers and video games. |
Consumer electronics | consumer_electronics | Mobile devices, gadgets, cameras, televisions, home appliances, etc. |
Content compliance | content_compliance | This tag is present on all the Content Compliance concepts. In addition to this tag, a Content Compliance concept will also have a more specific tag. Usually, the more specific tag is one of these: act_of_violence, threat_of_violence, gun_weapon, injury, sensual, sexual, substance_use, violence, video_structure. |
Explicit content | explicit_content | Groups highly visual sexual and violent content. All concepts in this category also belong to more specific categories like "sexual", "violence" or "injury". |
Fashion and wear | fashion_wear | Clothing, shoes, accessories, jewelry and makeup. |
Animals | fauna | Animals. |
Plants and mushrooms | flora | Plants and mushrooms. |
Food or drink | food_drink | Foods and drinks, also eating and drinking. |
Graphics | graphics | Graphics that enrich the media content. |
Guns and weapons | gun_weapon | Weapons such as guns, knives and bows. |
Home and garden | home_garden | Interior design & decoration, furniture, rooms, home textiles, tableware, garden etc. |
Basic human actions | human_basic | Basic human actions like sitting, standing, sleeping etc. |
Human features and body parts | human_features | Human features and body parts, also some similar non-human body parts such as animal eyes. This category does not contain emotion or Content Compliance detections. Emotions (from faces) are available by contacting Valossa and requesting us to enable them. Content Compliance detections are in their specific categories. |
Social situations and human life | human_life_social | Human activities in social contexts, such as a pride parade, gambling, a stunt, dog walking or a student. Sports, professional and religious activities are not included; they can be found in their own categories. |
Industrial | industrial | Machinery, power plants, cables etc. |
Injury | injury | Signs of injury such as blood, wounds and bruises. |
Lights and effects | lights_effects | Media enhancing light-based effects. |
Materials | materials | Materials such as concrete, wood, iron etc. |
Military equipment and people | military | Military staff, military vehicles, aircraft and vessels. |
Music | music | Musical instruments, events and settings. |
Natural disasters and severe weather | natural_disaster_severe_weather | Natural disasters such as flood and severe weather such as thunderstorm. |
Natural phenomena | natural_phenomena | Natural phenomena and events. |
Landscape and environment | nonlive_natural | Natural objects such as rock, mountain, sun, river and glacier. |
Other man-made objects | other_manmade_object | Objects that are not in any other specific category. |
Pets | pets | Dogs, cats and their accessories. |
Place, location or scene | place_scene | Places and locations like living room, stadium or road. |
Professions and work | professions_work | Humans at work. Please note that special professionals like athletes and military personnel are in their specific categories. |
Public transport | public_transport | Public transport such as bus and train. |
Religion | religion | Religion-related symbols, persons and places of worship. |
Sensual | sensual | Hinting towards sexuality such as bikinis and underwear, kissing, navel. |
Sexual | sexual | Clearly sexual material and intimate body parts. |
Sports | sport | Sports, sporting events and athletes. Please note that Valossa offers a separate football (soccer) event model, which is not included in the general sports category. We can also train a custom sports event model for your purposes (a specific type of sport etc.). |
Sports equipment | sport_equipment | Sports equipment, protection and clothing. |
Sport locations | sport_locations | Sport locations such as swimming pool. |
Style | style | Image styles such as diagram or cartoon. |
Substance use | substance_use | Smoking, drugs, medicines and alcohol. |
Threat of violence | threat_of_violence | Threat towards a person, for example, aiming with a gun. |
Traffic, traffic areas and signs | traffic | Traffic stations, parking lots and town squares, different roads, streets, paths, bridges and underpasses. Traffic lights, other signs and traffic congestion also belong to this category. |
Travel destinations | travel_destinations | Travel destinations such as famous landmarks around the world. |
Video structure | video_structure | Video structure elements such as black frame. |
Violence, injuries and threats | violence | Acts and signs of violence. |
Visual arts and crafts | visual_arts_crafts | Pieces and making of visual arts and crafts like sculptures, paintings and handicrafts. |
Category name shown in Valossa Report | Category tag in metadata | Explanation |
---|---|---|
Football | football_soccer (19 concepts) | Any actions during the sport of soccer (a.k.a. football). This tag and the related detections are available only for those customers that have purchased the Valossa soccer (football) model separately. |
Logo | logo (298 concepts) | Brand logos. |
This is a practical example of using the "categ" information: how to find all those detections that are related to Content Compliance. You can loop over all the detections (or, if so desired, only all the detections having a particular detection type such as "visual.context") and check if the "categ":"tags" array contains the tag "content_compliance".
The same logic for reading "categ" entries works, of course, also for other tags than just "content_compliance". We are using "content_compliance" as the example, because Content Compliance is a popular use scenario for Valossa Video Recognition AI.
Example pseudo-code for finding all "content_compliance" tagged detections in the Core metadata of your video:
content_compliance_dets_by_det_id = {}
foreach det_id --> det_item in metadata["detections"]:
    if exists det_item["categ"]:
        if det_item["categ"]["tags"] contains "content_compliance":
            content_compliance_dets_by_det_id[det_id] = det_item
Similar pseudo-code example but finding "content_compliance" tagged detections only from among the "visual.context" detections, not from among all detections of all types:
content_compliance_dets_by_det_id = {}
if exists metadata["detection_groupings"]["by_detection_type"]["visual.context"]:
    foreach det_id in metadata["detection_groupings"]["by_detection_type"]["visual.context"]:
        det_item = metadata["detections"][det_id]
        if exists det_item["categ"]:
            if det_item["categ"]["tags"] contains "content_compliance":
                content_compliance_dets_by_det_id[det_id] = det_item
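If your reader code is in Python, the pseudo-code above could be translated roughly as follows (a sketch, assuming the Core metadata has been downloaded to the placeholder file "your_core_metadata.json"):

import json

with open("your_core_metadata.json", "r") as jsonfile:
    metadata = json.load(jsonfile)

# Find all detections (of any type) tagged with "content_compliance"
content_compliance_dets_by_det_id = {}
for det_id, det_item in metadata["detections"].items():
    if "categ" in det_item and "content_compliance" in det_item["categ"]["tags"]:
        content_compliance_dets_by_det_id[det_id] = det_item

# The same, but only among the "visual.context" detections
visual_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.context", [])
visual_cc_dets_by_det_id = {}
for det_id in visual_ids:
    det_item = metadata["detections"][det_id]
    if "categ" in det_item and "content_compliance" in det_item["categ"]["tags"]:
        visual_cc_dets_by_det_id[det_id] = det_item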
For "audio.speech" (speech-to-text) detections, the detected sentences/words are provided as a string in the "label" field of the detection.
Information related to "human.face" detections: If and only if a face is similar to one or more faces in a face gallery, the "a" field of the specific face detection object will contain a "similar_to" field that contains an array of the closely matching faces (from a gallery), and within each item in the "similar_to" array there is a string-valued "name" field providing the name of the visually similar person and "c" that is the float-valued confidence (0.0 to 1.0) that the face is actually the named person. In "similar_to", the matches are sorted the best first. Please note that a face occurrence doesn't directly contain "c" — confidence for faces is only available in the "similar_to" items. Starting from Valossa Core metadata version 1.3.4, each "similar_to" item contains a "gallery" field and a "gallery_face" field. The "gallery" field contains an "id" field, the value of which is the face gallery ID (a UUID) of the gallery from which the matched face identity was found (see custom galleries). The "gallery_face" field contains a string-valued "name" field and an "id" field, the value of which is the face ID (a UUID), that is, a unique identifier of the face identity (person) within the specified face gallery. (Note: A "name" field exists in two places, for reader code compatibility with previous metadata formats.) For information about how to get the face coordinates (bounding boxes) of a face and how to find all occurrences of a specific gallery-matched face identity, see the separate subsection.
Information related to second-based timeslots (contained in the by_second grouping structure) for "human.face" detections: Starting from Valossa Core metadata version 1.3.12, information about the size of each face is provided per second. For each second-based timeslot of a "human.face" detection, the "a" (attributes) field contains the "sz" (size) field, which contains the "h" (height) field. The "h" field is float-valued and the value is given as relative to the height of the video frames, in other words, the value 1.0 means the full height of the video frame. While a face is typically seen in multiple frames within a particular second, the value of "h" is simply the height of this face at the moment when it is first encountered within the second. If you need complete, frame-by-frame size information (width, height, position for each face in each frame) for "human.face" detections, please see face bounding boxes in frames_faces metadata.
Example of face height data at a particular second for a particular face:
... { "d": "1", "o": ["1"], "a": { "sz": {"h": 0.188} } }, ...
The old detection type "explicit_content.nudity" has been deprecated and is not available in any jobs created after 2018-12-05. Instead of that deprecated detection type, you should use the new, more detailed and more accurate explicit content model. (Explicit content is, for example, nudity-related, violence-related or substance-use-related content.) The new model has been integrated as part of the "visual.context" detections. Those "visual.context" detections that are related to explicit content have category tags that allow them to be easily distinguished from the non-explicit detections. See more information on detection category tags above.
The detected dominant colors in the video, per second, are provided in such a way that there is only one "visual.color" detection, which covers the entire video. The colors are provided as RGB values (6 hexadecimal digits in a string) stored as attributes in the "by_second" structure, in those second-based data items where the "d" field refers to the single detection that has the detection type "visual.color". Please note that at the end of a video there might be a second where the color item is not available, so please do not write reader code that assumes that a "visual.color" detection reference exists at absolutely every second of the video. The attributes are in the "a" field of the color-related data item of the given second, and the "a" field contains an array-valued "rgb" field, where each item is an object containing information about a particular detected color. In each of those objects, the "f" field is the float-valued fraction (max. 1.0) of the image area that contains the particular color (or a close-enough approximation of that color), and the "v" field is the RGB value. The "letter digits" (a-f) in the hexadecimal values are in lowercase.
Example of color data at a particular second:
... { "d": "12", "o": [ "13" ], "a": { "rgb": [ { "f": 0.324, "v": "112c58" }, { "f": 0.301, "v": "475676" }, { "f": 0.119, "v": "9f99a3" } ] } }, ...
An example of what is found in "detections", a visual.context detection:
... "86": { "t": "visual.context", "label": "hair", "cid": "lC4vVLdd5huQ", "ext_refs": { "wikidata": { "id": "Q28472" }, "gkg": { "id": "/m/03q69" } }, "categ": { "tags": [ "human" ] }, "occs": [ { "ss": 60.227, "se": 66.191, "c_max": 0.80443, "id": "267" }, { "ss": 163.038, "se": 166.166, "c_max": 0.72411, "id": "268" } ] }, ...
Another example from "detections", a human.face detection:
... "64": { "t": "human.face", "label": "face", "a": { "gender": { "c": 0.929, "value": "female" }, "s_visible": 4.4, "similar_to": [ { "c": 0.92775, "name": "Tina Schlummeister" "gallery": { "id": "a3ead7b4-8e84-43ac-9e6b-d1727b05f189" }, "gallery_face": { "id": "f6a728c6-5991-47da-9c17-b5302bfd0aff", "name": "Tina Schlummeister" } } ] }, "occs": [ { "ss": 28.333, "se": 33.567, "id": "123" } ] }, ...
An example of an audio.context detection:
... "12": { "t": "audio.context", "label": "exciting music", "cid": "o7WLKO1GuL5r" "ext_refs": { "gkg": { "id": "/t/dd00035" } }, "occs": [ { { "ss": 15, "se": 49 "c_max": 0.979, "id": "8", } ], }, ...
An example of a visual.text_region.* (OCR, text region) detection, in this case specifically a visual.text_region.lower_third detection that happens to contain two lines (rows) of subtitles recognized in the lower-third area of the video frames, along with line-specific confidence values:
... "26": { "t": "visual.text_region.lower_third", "label": "text region", "a": { "lang": "en", "text": { "lines": [ "and the winner of the competition", "is from the outskirts of the capital" ], "lines_with_splitting": [ ["and", "the", "winner", "of", "the", "competition"], ["is", "from", "the", "outskirts", "of", "the", "capital"] ], "as_one_string": "and the winner of the competition is from the outskirts of the capital", "c": { "lines": [ 1.0, 0.921 ] } } }, "occs": [ { "id": "27", "ss": 11.3, "se": 15.02, "shs": 4, "she": 5 } ] }, ...
An example of an IAB category detection:
... "173": { "t": "topic.iab", "label": "Personal Finance", "ext_refs": { "iab": { "labels_hierarchy": [ "Personal Finance" ], "id": "IAB13" } } }, ...
An example of keyword detection:
... "132": { "t": "transcript.keyword.name.location", "label": "Chillsbury Hills", "occs": [ { "ss": 109.075, "se": 110.975, "id": "460" } ] } ...
Please note that transcript keyword occurrence timestamps are based on the input SRT timestamps. In the future, if a non-timecoded transcript is supported, transcript keywords might not have occurrences/timecoding.
When viewing your video analysis results, you may have noticed that several different "human.face" detections (under different detection IDs) may be recognized as the same named person from a face gallery. This is natural: to the AI, some face detections look different enough from each other that they are classified as separate faces (separate face detections), yet each of those detections is similar enough to a specific face in the gallery, so each of the detections has a "similar_to" item for the same gallery face. For example, there could be two "human.face" detections which are "similar_to" the gallery face "Steve Jobs", one with confidence 0.61 and the other with confidence 0.98.
Of course, a question arises: Is there an easy way to list all the detections of a specific gallery face within a given Valossa Core metadata file, for example to find all "human.face" detections that were (with some confidence) matched to the gallery face "Steve Jobs"? Yes, there is.
Under "detection_groupings":"by_detection_property", certain types of detections are grouped by their certain shared properties. Currently, the only supported property-based grouping is for "human.face" detections, and for them the only supported property-based grouping has the identifier "similar_to_face_id". As shown in the example below, all detected faces that have at least one "similar_to" item (with a gallery face ID) are listed in the structure, indexed by the gallery face ID. A gallery face ID is a UUID that uniquely identifies the specific person (more precisely: the specific face identity) within the particular face gallery.
Please note that some legacy gallery faces might not have a face ID (UUID) and thus cannot be found in the "similar_to_face_id" structure. This restriction only applies to a few customers who have a face gallery that was created before the introduction of the "similar_to_face_id" grouping structure, and of course to the analysis jobs that have been run using an old version of the system: the face IDs and the "similar_to_face_id" grouping structure were introduced in the version 1.3.4 of Valossa Core metadata.
Under each gallery face ID, there is an object that contains the fields "moccs" and "det_ids".
In the "moccs" field, there is an array of objects that are the merged occurrences of the one or more "human.face" detections that share a specific gallery face ID in their "similar_to" items. The naming "moccs" highlights the difference of the format to the "occs" format that can be found in the actual "human.face" detections.
In the "det_ids" field, there is an array of the IDs of the detections that have this specific gallery face ID in their "similar_to" items. Thus, if you want to read all the original corresponding "human.face" detections (including, among other things, the original occurrences separately for each detection in a "non-merged form") for any specific gallery face ID, it is easy.
Of course, if there is only one "human.face" detection having a "similar_to" item with a given gallery face ID, then there is only one detection ID in the "det_ids" array under that gallery face ID, and the "moccs" array of that gallery face originates solely from the occurrences of the single corresponding "human.face" detection.
The name of each face is available in the "similar_to" items of the "human.face" detections, which are referred to with their detection IDs listed in "det_ids". So, for example, by looking at the item at index "3" in the "detections" field of the metadata you would see that the face ID "cb6f580b-fa3f-4ed4-94b6-ec88c6267143" is "Steve Jobs". Naturally, an easy way for viewing the "merged" faces information is provided by the Valossa Report tool.
Example of "similar_to_face_id" detection groupings data, where the occurrences of the face detections "3" and "4" with similarity to Steve Jobs (cb6f580b-fa3f-4ed4-94b6-ec88c6267143) have been merged into one easy-to-parse "moccs" structure:
... "detection_groupings": { ... "by_detection_property": { "human.face": { "similar_to_face_id": { "cb6f580b-fa3f-4ed4-94b6-ec88c6267143": { "moccs": [ {"ss": 5.0, "se": 10.0}, {"ss": 21.0, "se": 35.0}, {"ss": 64.0, "se": 88.0}, {"ss": 93.0, "se": 98.0}, {"ss": 107.0, "se": 112.0}, {"ss": 123.0, "se": 137.0}, {"ss": 157.0, "se": 160.0}, {"ss": 196.0, "se": 203.0}, {"ss": 207.0, "se": 212.0} ], "det_ids": ["3", "4"] }, "648ec86d-4d91-42a6-928d-a25d8dc2691c": { "moccs": [ {"ss": 194.0, "se": 197.0}, {"ss": 229.0, "se": 237.0} ], "det_ids": ["19"] }, ... } } }, ... }, ...
Do you need face coordinates, that is, the bounding boxes for each detected face at a specific point in time? They are available from the Valossa Video Recognition API, but because of the considerable file size, the bounding boxes are not part of the Valossa Core metadata JSON. The face bounding box data must be downloaded as a separate JSON file from the API. The metadata type identifier of this special metadata JSON is "frames_faces" (in "version_info":"metadata_type"). When downloading the metadata, you need to specify the metadata type with the parameter "type=frames_faces" in the job_results call.
Please note that "frames_faces" metadata may not be available for your old video analysis jobs, as the feature has not always been part of the system.
The "frames_faces" metadata is easy to parse. The "faces_by_frame" field, which always exists, is an array that is indexed with the frame number so that the information for the first frame is at [0], the information for the next frame at [1] and so on. For each frame, there is an array that contains one bounding box object for each face that was detected in that frame. Of course, a frame without any detected faces is represented by an empty array.
Every bounding box object contains the fields "id", "x", "y", "w", "h". The value of "id" is the same detection ID that the corresponding "human.face" detection has in the Valossa Core metadata file of the same video analysis job. The values of "x" and "y" are the coordinates of the upper-left corner of the bounding box (the x offset from the left edge of the frame, and the y offset from the top of the frame). The values of "w" and "h" are the width and height of the bounding box, respectively. The values of "x", "y", "w", "h" are all given as float values relative to frame size, thus ranging from 0.0 to 1.0, with the following exception. Because a detected face can be partially outside the frame area, some face coordinates may be slightly less than 0.0 or more than 1.0 in the cases where the system approximates the edge of the invisible part of a bounding box. For example, the "x" coordinate of a face in such a case could be -0.027.
Example job_results request for "frames_faces" metadata, using HTTP GET:
https://api.valossa.com/core/1.0/job_results?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=167d6a67-fb99-438c-a44c-c22c98229b93&type=frames_faces
Example response in an HTTP 200 OK message:
{ "version_info": { "metadata_type": "frames_faces", "metadata_format": "...", "backend": "..." }, "job_info": { ... }, "media_info": { ... }, "faces_by_frame": [ ... [], [], [], [], [ { "id": "1", "x": 0.4453125, "y": 0.1944444477558136, "w": 0.11953125149011612, "h": 0.21388888359069824 } ], [ { "id": "1", "x": 0.4351562559604645, "y": 0.19583334028720856, "w": 0.11953125149011612, "h": 0.2152777761220932 } ], [ { "id": "1", "x": 0.42578125, "y": 0.19722221791744232, "w": 0.12187500298023224, "h": 0.22083333134651184 }, { "id": "5", "x": 0.3382812440395355, "y": 0.23888888955116272, "w": 0.20468750596046448, "h": 0.3986110985279083 } ], ... ] }
In addition to the coordinates for faces, the Valossa Video Recognition API also provides coordinates ("bounding boxes") for general visual detections. Currently, however, this feature is available only for logos.
The relevant detection type is "visual.object.localized". Thus, in order to know which visual object detections have coordinates, your code that processes your Valossa Core metadata needs to read the "visual.object.localized" detections (not "visual.context").
Much like the "frames_faces" metadata containing the bounding boxes for "human.face" detections, the bounding box coordinates for the "visual.object.localized" detections are stored in a separate JSON file. This separate JSON file contains the "seconds_objects" metadata and can be downloaded using the API by specifying the metadata type "seconds_objects" in the job_results request. Because all detections are listed by their IDs in the Valossa Core metadata file (the main metadata file), also the "visual.object.localized" detections can be found in the "detections" structure within the Valossa Core metadata, but the time-specific, changing coordinates of their bounding boxes must be read from the "seconds_objects" metadata.
Please note that "seconds_objects" metadata may not be available for your old video analysis jobs, as the feature has not always been part of the system.
Again similar to the "frames_faces" metadata, the coordinates in "seconds_objects" metadata are given as float values relative to the picture frame size. The values range from 0.0 to 1.0, except that in some corner cases the coordinates may be slightly less than 0.0 or more than 1.0, so please take these possibilities into account in your reader code. Every bounding box object contains the fields "x", "y", "w", "h", "c". The values of "x" and "y" are the coordinates of the upper-left corner of the bounding box (the x offset from the left edge of the frame, and the y offset from the top of the frame). The values of "w" and "h" are the width and height of the bounding box, respectively. The value of "c" is the confidence of the detection in that bounding box area during that second.
Whereas the "frames_faces" metadata contains faces for each frame, the "seconds_objects" metadata contains objects for each second (not frame). The "objects_by_second" array always exists in the "seconds_objects" metadata. The "objects_by_second" array is indexed with the second number so that the information for the first second is at [0], the information for the next second at [1] and so on. Each second in the "objects_by_second" array is represented by an array, which of course may be empty, if there are no detected objects with bounding boxes in that specific second. In the array of a specific second, the items are detection-specific; there might be several different detections on the same second, for example, a Visa logo and a MasterCard logo. The detection ID is in the field "d" within the detection-specific item, and naturally details of the corresponding concept can be found under that specific detection ID in the "detections" structure of the Core metadata of the same video. The "seconds_objects" metadata does not contain any information on the actual detected concepts, because all that information is already available in the single correct place: the Core metadata of the video.
In the "seconds_objects" metadata, the occurrence(s) overlapping with the particular second-based timeslot of the particular "visual.object.localized" detection are referred to with their occurrence IDs in the array-valued "o" field, should you ever need to find the corresponding occurrences of a detection in an easy way. For your convenience, the detection and occurrence reference mechanism — based on the familiar "d" and "o" fields — has been deliberately designed to be similar to the mechanism used in the generic "by_second" structure of Valossa Core metadata. And yes, the "visual.object.localized" detections have references to them also in the "by_second" structure in the Core metadata, being consistent with the Core metadata specification, but the bounding box coordinates are only available in the separate "seconds_objects" metadata JSON, which exists to limit the size of the Core metadata file.
The bounding box coordinates for a specific detection are provided in the array-valued "b" field. For each "visual.object.localized" detection (e.g. a Coca-Cola logo), there might be one or more bounding boxes during a given second; in other words, the "b" array could contain more than one bounding box. For example, if there are two Coca-Cola logos simultaneously in the picture and the detection ID of the Coca-Cola logo "visual.object.localized" detection happens to be "137" in this particular video, then there are two bounding box items for the detection ID "137". Please note that in the "by_second" section in your Core metadata, the confidence value "c" of a "visual.object.localized" detection is the highest confidence of the possibly multiple simultaneously observed bounding boxes of the same detection (e.g. two simultaneous images of the same logo) during that one-second-long timeslot; to see the confidences (and coordinates) of the possibly multiple bounding boxes, your reader code needs to examine the "seconds_objects" metadata.
Please also note the existence of the frames-based "frames_objects" metadata for "visual.object.localized" detections. This metadata type has been introduced more recently than "seconds_objects" metadata: the metadata type "frames_objects" was made available in 2020.
Example job_results request for "seconds_objects" metadata, using HTTP GET:
https://api.valossa.com/core/1.0/job_results?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=167d6a67-fb99-438c-a44c-c22c98229b93&type=seconds_objects
Example response in an HTTP 200 OK message:
{ "version_info": { "metadata_type": "seconds_objects", "metadata_format": "...", "backend": "..." }, "job_info": { ... }, "media_info": { ... }, "objects_by_second": [ ... [], [], [ { "b": [ { "x": 0.2671875, "y": 0.7472222222222222, "w": 0.08072916666666667, "h": 0.06574074074074074, "c": 0.995 }, { "x": 0.4979166666666667, "y": 0.6685185185185185, "w": 0.08072916666666667, "h": 0.06944444444444445, "c": 0.984 } ], "d": "137", "o": ["314"] } ], [ { "b": [ { "x": 0.49583333333333335, "y": 0.6685185185185185, "w": 0.08489583333333334, "h": 0.06851851851851852, "c": 0.991 }, { "x": 0.2713541666666667, "y": 0.7435185185185185, "w": 0.07760416666666667, "h": 0.06759259259259259, "c": 0.865 } ], "d": "137", "o": ["314"] }, { "b": [ { "x": 0.9083333333333335, "y": 0.23333333333333, "w": 0.0101563746437, "h": 0.467467467467467, "c": 0.991 } ], "d": "138", "o": ["315"] } ], [], ... ] }
The speech-to-text results of the video analysis are available in the SRT format from the Valossa Video Recognition API. The same SRT file can also be manually downloaded in Valossa Portal.
When downloading the speech-to-text SRT, you need to use the parameter "type=speech_to_text_srt" in the job_results call. The language of the speech-to-text operation is the one that was specified in the language selection of the new_job call that created the video analysis job.
Please note that the newlines in the generated speech-to-text SRT file are Unix-newlines (LF only, not CRLF). The same speech-to-text information is also available in the "audio.speech" detections of the Core metadata of the video. The SRT is provided for your convenience; the SRT file makes it easier to achieve interoperability with the various systems that expect time-coded speech information in the SRT format.
Example job_results request for speech-to-text results in the SRT format, using HTTP GET:
https://api.valossa.com/core/1.0/job_results?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=167d6a67-fb99-438c-a44c-c22c98229b93&type=speech_to_text_srt
Example response in an HTTP 200 OK message:
1
00:00:01,910 --> 00:00:04,420
hello James and Jolie

2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know

3
00:00:08,119 --> 00:00:13,639
I had no idea that this could work please listen to me this is so fabulous
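For example, a sketch (using the third-party requests library, and the placeholder API key and job ID from the example URL above) that downloads the speech-to-text SRT file and saves it to disk:

import requests

API_KEY = "kkkkkkkkkkkkkkkkkkkk"  # placeholder; use your own API key
JOB_ID = "167d6a67-fb99-438c-a44c-c22c98229b93"  # placeholder job ID

response = requests.get(
    "https://api.valossa.com/core/1.0/job_results",
    params={"api_key": API_KEY, "job_id": JOB_ID, "type": "speech_to_text_srt"},
)
response.raise_for_status()

# The SRT uses Unix newlines (LF only), so write in binary mode to preserve them as-is.
with open("speech_to_text.srt", "wb") as srt_file:
    srt_file.write(response.content)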
These analysis results are not available unless face & speech emotion analytics is separately activated for you before analyzing your videos. Contact sales to request activation of the feature.
There are three different kinds of sentiment and emotion related information in the metadata. How your software can read each of them is described in the following.
Valence is a form of sentiment that describes the emotional positivity or negativity of a person at a specific moment in time.
In the "by_second" structure, every "human.face"-related seconds-based item, if its valence has been detected by the AI, has a "sen" structure for the sentiment (inside an "a" field for attributes). In the "sen" structure, you will find a float-valued "val" field providing the valence of the specific face on the specific 1-second time interval. Valence ranges from -1.0 (most negative) to 1.0 (most positive), 0.0 being a neutral valence.
Example:
{ ... "a": {"sen": {"val": -0.82}, ...}, "d": "9", "o": ["51"] ... }
Several emotional states can be recognized on faces. What emotions are supported? Please see the following explanation:
Named emotions, with confidences (max. 1.0), are provided for a face at a specific moment in time. The identifier strings for the emotions are the same as the emotion names listed above (please note the V2 vs. V1 distinction regarding the available emotion identifiers).
In the "by_second" structure, every "human.face"-related seconds-based item, if its emotional state has been detected by the AI, has a "sen" structure for the sentiment (inside an "a" field for attributes). In the "sen" structure, you will find an "emo" field providing the emotions of the specific face on the specific 1-second time interval. The field "emo" is an array, because sometimes more than one emotion can be detected from the same face. In each item in the array, you will find a "value" field providing the emotion identifier string and a float-valued "c" field providing the confidence (maximum 1.0).
Example, showing also valence in addition to a named emotion for this specific face during the specific 1-second interval:
{ ... "a": {"sen": {"emo": [{"c": 0.772, "value": "disgust"}], "val": -0.796}}, "d": "1", "o": ["1"] ... }
Speech-based sentiment is currently available for English only. It contains valence information, which describes the emotional positivity or negativity of speech fragments that are heard on the video.
In the "detections" structure, each "audio.speech" detection has a "sen" structure for sentiment (inside an "a" field, for the attributes of the detection) if its valence has been detected by the AI. In the "sen" structure, you will find the valence of the speech fragment, in a float-valued "val" field. Valence ranges from -1.0 (most negative) to 1.0 (most positive), 0.0 being a neutral valence.
Example:
{ ... "t": "audio.speech", "label": "we profoundly believe that justice will win despite the looming challenges", "a": {"sen": {"val": 0.307}, ...} ... }
In "segmentations", the video is divided into time-based segments using different segmentation rules.
Currently we support automatically detected shot boundaries, hence "segmentations" contains "detected_shots". The array "detected_shots" in "segmentations" provides the shot boundaries, as an object for each detected shot, with seconds-based start and end timepoints (float-valued fields "ss", "se") and with start and end frame numbers (integer-valued fields "fs", "fe"). The shot duration in seconds is also provided (float-valued field "sdur"). Note: frame numbers are 0-based, i.e. the first frame in the video has the number 0. All the fields "ss", "se", "fs", "fe", "sdur" are found in every shot object. The ordering of the shot objects in the array "detected_shots" is the same as the ordering of the detected shots in the video.
Example data:
"segmentations": { "detected_shots": [ { "ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131 }, { "ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963 }, ... ] }
Shots are referred to in other parts of the metadata: occurrences (within concept detections) contain the 0-based indexes of the shots during which the occurrence starts ("shs") and ends ("she"). This makes your video workflow integration easier in cases where the beginning of the ongoing shot is important in relation to the detected concept, as shown in the sketch below.
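A sketch that expands each occurrence of a given detection to the boundaries of the shots it starts and ends in, using the "shs" and "she" indexes. The detection ID "86" below is just an arbitrary example, and the file name is the same placeholder as elsewhere in this document:

import json

with open("your_core_metadata.json", "r") as jsonfile:
    metadata = json.load(jsonfile)

shots = metadata["segmentations"]["detected_shots"]
detection = metadata["detections"]["86"]  # arbitrary example detection ID

for occ in detection.get("occs", []):
    if "shs" in occ and "she" in occ:
        start_shot = shots[occ["shs"]]
        end_shot = shots[occ["she"]]
        print("Occurrence " + occ["id"] + ": " + str(occ["ss"]) + "s to " + str(occ["se"]) + "s, " +
              "shot-aligned clip from " + str(start_shot["ss"]) + "s to " + str(end_shot["se"]) + "s")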
Example code snippet (in Python) that illustrates how to access the data fields in Valossa Core metadata JSON:
import json

with open("your_core_metadata.json", "r") as jsonfile:
    metadata = json.load(jsonfile)

# Loop over all detections so that they are grouped by the type
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
    print("----------")
    print("Detections of the type: " + detection_type + ", most relevant detections first:")
    print()
    for det_id in detections_of_this_type:
        print("Detection ID: " + det_id)
        detection = metadata["detections"][det_id]
        print("Label: " + detection["label"])
        print("Detection, full info:")
        print(detection)

        # Example of accessing attributes (they are detection type specific)
        if detection_type == "human.face":
            attrs = detection["a"]
            print("Gender is " + attrs["gender"]["value"] + " with confidence " + str(attrs["gender"]["c"]))
            if "similar_to" in attrs:
                for similar in attrs["similar_to"]:
                    print("Face similar to person " + similar["name"] + " with confidence " + str(similar["c"]))

        # More examples of the properties of detections:
        if detection_type == "visual.context" or detection_type == "audio.context":
            if "ext_refs" in detection:
                if "wikidata" in detection["ext_refs"]:
                    print("Concept ID in Wikidata ontology: " + detection["ext_refs"]["wikidata"]["id"])
                if "gkg" in detection["ext_refs"]:
                    print("Concept ID in GKG ontology: " + detection["ext_refs"]["gkg"]["id"])

        if "occs" in detection:
            for occ in detection["occs"]:
                print("Occurrence starts at " + str(occ["ss"]) + "s from beginning of video, and ends at " + str(occ["se"]) + "s")
                if "c_max" in occ:
                    print("Maximum confidence of detection during this occurrence is " + str(occ["c_max"]))
                    # If you need the confidence for a particular time at second-level accuracy, see the by_second grouping of detections
        print()
    print()

# Example of listing only audio (speech) based word/phrase detections:
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
    if detection_type.startswith("audio.keyword."):
        for det_id in detections_of_this_type:
            detection = metadata["detections"][det_id]
            print("Label: " + detection["label"])
            # etc... You get the idea :)
print()

# Example of listing only detections of a specific detection type:
if "human.face" in metadata["detection_groupings"]["by_detection_type"]:
    for det_id in metadata["detection_groupings"]["by_detection_type"]["human.face"]:
        detection = metadata["detections"][det_id]
        # etc...
        pass
print()

# Example of listing IAB categories detected from different modalities (visual/audio/transcript) of the video
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
    if detection_type.startswith("topic.iab"):
        for det_id in detections_of_this_type:
            detection = metadata["detections"][det_id]
            print("IAB label, simple: " + detection["label"])
            print("IAB ID: " + detection["ext_refs"]["iab"]["id"])
            print("IAB hierarchical label structure:")
            print(detection["ext_refs"]["iab"]["labels_hierarchy"])
print()

# Time-based access: Loop over time (each second of the video) and access detections of each second
for sec_index, secdata in enumerate(metadata["detection_groupings"]["by_second"]):
    print("----------")
    print("Detected at second " + str(sec_index) + ":")
    print()
    for detdata in secdata:
        det_id = detdata["d"]
        if "c" in detdata:
            print("At this second, detection has confidence " + str(detdata["c"]))
        if "o" in detdata:
            # If for some reason you need to know the corresponding occurrences (time-periods that contain this second-based detection)
            print("The detection at this second is part of one or more occurrences. The occurrence IDs, suitable for searching within the 'occs' list of the 'detection' object, are:")
            for occ_id in detdata["o"]:
                print(occ_id)
        print("Detection ID: " + det_id)
        detection = metadata["detections"][det_id]
        print("Label: " + detection["label"])
        print("Detection of the type " + detection["t"] + ", full info:")
        # Of course, also here you can access attributes, cid, occurrences etc. through the "detection" object
        # just like when you listed detections by their type. In other words, when you just know the ID
        # of the detection, it's easy to read the information about the detection by using the ID.
        print(detection)
        print()
1.4.2: audio.speech_detailed added
1.4.1: visual.text_region.full_frame_analysis, visual.text_region.lower_third, visual.text_region.middle_third, visual.text_region.upper_third added
1.4.0: some tag categories had their identifiers changed
1.3.12: face height per second added
1.3.11: by_frequency, topic.iab, topic.general added
1.3.10: shs added to occurrences
1.3.9: n_audio_channels added to technical media information
1.3.8: visual.object.localized added
1.3.7: changed representation of bitrate (bps) in technical media information
1.3.6: resolution, codecs and bitrates added to technical media information
1.3.5: visual.color added, violence-related concept categories added
1.3.4: detection grouping by_detection_property added, identifier information added for gallery faces
1.3.3: categ added to relevant visual.context and audio.context detections
1.3.2: similar_to in human.face detections supports role names
1.3.1: added metadata type field (supports distinguishing between different types of Valossa metadata in the future)
1.3.0: improved speech-to-text format
1.2.1: speech-to-text
1.2.0: field naming improved
1.1.0: more compact format
1.0.0: large changes, completely deprecated old version 0.6.1.
You can teach Valossa AI to recognize more people via the Training API.