Investing in deep tech startups can often be challenging for three major reasons: the tech solution must be superior to that of the competition, challenging tech experts in their space can be intense, and in some cases, it is essential to anticipate the needs of a market that may not even exist yet.

But every once in a while, a deep tech startup will come along that will send all the senses buzzing. It’ll have an outstanding solution that is three steps ahead of the rest, address a growing market or carve a new one, and be led by a driven, focused, deeply knowledgeable team.

For us at Viola Ventures, Datagen was such a startup.

Founded in 2019 by an exceptional group of young entrepreneurs with rich backgrounds in computer vision, Datagen creates photorealistic synthetic visual data to train machine learning models for real-world tasks in sectors such as VR/AR, retail, facial recognition, Industry 4.0, and robotics. Just a year after its market launch, the company already counts three of the top US tech giants, as well as the AI research arms of several global consumer manufacturing giants, as customers.

We were so impressed with Datagen’s superior tech, its outstanding team led by Ofir Chakon, CEO, and Gil Elbaz, CTO, and the promising traction it has gained in less than two years, that it took Viola Ventures Principal Rotem Shacham and me just a little over a week to finalize our decision to lead a $15 million Series A investment in the startup.

In this short timeframe, Ofir and Gil not only demonstrated Datagen’s wide application space and received phenomenal feedback from their existing customers who indicated their intention to expand their usage, they also nailed their performance when we coordinated meetings with potential clients in the computer vision sector, including top Israeli industry players OrCam and Lightricks.

The young startup also captured the attention of the former Head of Engineering at Scale AI, a prominent US provider of data for machine learning teams, who – following a due diligence call – decided to join the investment round. He was joined by eight global leaders in the computer vision space, who decided to put their personal money into the company.

Datagen is advancing a growing, disruptive wave of technologies that simulate real-world data, and tapping into a whole new market that could accelerate the use of AI. The demand for synthetic data – completely new data generated artificially from simulated scenarios (unlike anonymized data or real-world data) to train and test algorithms – is incredibly high, and Datagen has the potential to disrupt the first generation of training data companies such as Appen and Scale (each valued at ~$3 billion).

Automatically generated synthetic faces based on required demographics, made by Datagen

The potential here is tremendous.

Fifteen years after the phrase “data is the new oil” was first coined, it is becoming even clearer that high-quality, well-annotated data is an absolute necessity for training computer vision models. Eliminating the heavy operational hassle associated with the manual collection and annotation of data is likely to make AI applications much more like software applications in terms of time to development readiness, COGS, and required resources.

We estimate that synthetic data may surpass real data for training and testing and make up approximately 80 percent of data sets in the near future. And Datagen has a clear tech advantage.

With a mission to teach AI to “see” the world, the startup not only creates data, it builds the entire environment to help computer vision systems recognize the data by simulating three key elements – the physical parameters of the image sensor (camera), the objects in the scene, and the different types of noise (fog, lighting and so on). For example, Datagen can create training sets for facial recognition applications using an endless number of facial features with different expressions, from different angles, with different lighting scenarios, etc. – and for specific camera resolutions. It does this through 3D visual simulations that reflect the domain uniformly and consistently, with high variance to capture a wide range of interactions.

The results are highly photorealistic, variable, scalable, and efficient data sets that can be tailored across industries and applications to speed up product development.
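As a rough illustration of the three simulated elements described above (camera, scene objects, and noise), the idea of covering a domain with high variance could be sketched as random sampling over scene parameters. This is only a toy sketch: every name, category, and value range below is a hypothetical assumption made for illustration and does not describe Datagen's actual product or API.

```python
import random
from dataclasses import dataclass

# All names, categories, and ranges here are hypothetical, chosen only
# for illustration; they do not reflect Datagen's actual system.
RESOLUTIONS = [(640, 480), (1280, 720), (1920, 1080)]
EXPRESSIONS = ["neutral", "smile", "frown", "surprise"]
LIGHTING = ["daylight", "indoor", "low_light"]


@dataclass(frozen=True)
class SceneConfig:
    resolution: tuple    # camera: sensor resolution (width, height) in pixels
    expression: str      # scene object: facial expression
    yaw_degrees: float   # scene object: head rotation angle
    lighting: str        # noise: lighting condition
    fog_density: float   # noise: 0.0 (clear) to 1.0 (dense)


def sample_scene_configs(n, seed=0):
    """Draw n random scene configurations; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [
        SceneConfig(
            resolution=rng.choice(RESOLUTIONS),
            expression=rng.choice(EXPRESSIONS),
            yaw_degrees=rng.uniform(-90.0, 90.0),
            lighting=rng.choice(LIGHTING),
            fog_density=rng.uniform(0.0, 1.0),
        )
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Each configuration would be handed to a renderer (not shown here)
    # to produce one labeled synthetic image.
    for config in sample_scene_configs(3, seed=42):
        print(config)
```

Because every image would be rendered from a known configuration, its annotations come for free, which is the sense in which manual labeling drops out of the cycle.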

Datagen’s approach also significantly shortens a company’s time-to-market by saving the time and money that would otherwise go into manually preparing data. Instead of having to collect the data for every new product or every new model, annotate it and then train the algorithm, the cycle is brief and efficient, and can be tailored to specific needs.

Another key advantage is Datagen’s ability to offer rich simulation of both common and edge cases, producing data for training that may otherwise be scarce or unavailable. For instance, training a computer vision application to identify falls among older people – a serious, sometimes fatal threat to the health of the elderly population – requires capturing these incidents. Rather than waiting for natural falls to occur, simulated data can easily be generated for a range of customizable scenarios.

Maybe most importantly, Datagen’s solutions enable the democratization of AI, giving smaller companies, not just tech giants, access to proprietary, high-quality machine learning training data. They also do away with the privacy and bias concerns involving real-world data that are increasingly raising questions (and ire) about consent, use, and the over- or under-representation of demographic groups.

We at Viola Ventures are strong believers in Ofir and Gil’s vision and Datagen’s ability to become the standard that will drive the data disruption. And we’re excited to be backing them on this journey.