
Next-Generation Conversational AI for Database Access:
The Technology Behind AutoQL
How innovation in conversational AI is revolutionizing the way today's leading businesses access and leverage their data

Chapter 7
Solving for the Lack of Training Data
For machine learning models to perform optimally and deliver the outcomes they were built for, they must first be exposed to enormous volumes of high-quality data. In the context of machine learning, this data is called a “training corpus” and traditionally consists of a dataset built manually for the specific purpose of teaching, or training, the AI model.
In order for a machine to learn every way a human could ask for something and infer the optimal SQL statements that could be generated from those questions, the machine would need to “train on” far more natural language queries and database query language statements than are readily available.
Generating training data manually is a painstaking undertaking that demands an enormous amount of time and human labor. When it comes to enterprise-grade databases, however, the volume of training data the learning models require can be generated through automated processes instead. This means a high-quality, customized training corpus that enables comprehensive coverage of a given database can be produced much more quickly.
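To make the idea concrete, one common way to automate this kind of corpus generation is template expansion over a database schema: every combination of phrasing template, table, column, and sampled value is expanded into a paired natural language question and SQL statement. The sketch below is purely illustrative; the toy schema, templates, and values are assumptions for demonstration, not AutoQL's actual generation rules.

```python
# Hypothetical toy schema: table name mapped to its filterable columns.
SCHEMA = {
    "orders": ["status", "region"],
}

# Natural-language templates paired with SQL templates (illustrative only).
TEMPLATES = [
    ("how many {table} have {column} {value}",
     "SELECT COUNT(*) FROM {table} WHERE {column} = '{value}'"),
    ("show me all {table} where {column} is {value}",
     "SELECT * FROM {table} WHERE {column} = '{value}'"),
]

# Example values per column; in practice these would be sampled from the database.
VALUES = {
    "status": ["shipped", "pending"],
    "region": ["east", "west"],
}

def generate_pairs(schema, templates, values):
    """Expand every (table, column, value, template) combination into an
    (NL question, SQL statement) training pair."""
    pairs = []
    for table, columns in schema.items():
        for column in columns:
            for value in values[column]:
                for nl_tpl, sql_tpl in templates:
                    pairs.append((
                        nl_tpl.format(table=table, column=column, value=value),
                        sql_tpl.format(table=table, column=column, value=value),
                    ))
    return pairs

pairs = generate_pairs(SCHEMA, TEMPLATES, VALUES)
print(len(pairs))   # 1 table x 2 columns x 2 values x 2 templates = 8 pairs
print(pairs[0][0])  # "how many orders have status shipped"
```

Because every generated question is paired with the SQL that produced it, each pair is correctly labeled by construction, which is exactly the property a training corpus needs.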
In the next section, we’ll talk about the profound implications of automating the process of training data generation for a faster and more robust machine learning process.
The Value of Automating Training Data Generation
Humans need to be exposed to many ideas and experiences in order to know how to behave. For a machine, this learning process is very similar, except those ideas and experiences are just data points and the relationships that exist between them. And, just like a human, the machine needs high-quality data to learn from. If a child is continuously told that a red square is a blue circle, they won’t be able to properly identify a red square in the future because they were trained on bad data. Like the child, an AI system must first be exposed to good data, and lots of it, in order to learn correctly. A robust and extensive training corpus that contains a high volume of accurately labeled and correctly annotated training data is absolutely vital.
Manually generating high-quality training data consumes a great deal of time before the actual training of the machine learning models can even begin. For a business to manually produce the volume and quality of training data necessary to teach a conversational AI system to understand its unique database, it would need significant resources, trained specialists, and extensive machine learning expertise. By automating the process of generating training data, the machine learning models we build can get down to business much faster, saving substantially on both human labor and cost.
The more training data a machine learning model is exposed to, the better it can deliver under real-world circumstances. Therefore, having robust training data is absolutely essential in building a system that can truly understand and execute on users’ natural language queries. By automating the training data generation process, the models can start training faster, leading to more immediate integration with a database and decreased time-to-value for the implementer.
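A rough back-of-envelope calculation shows why automation matters at enterprise scale. All of the figures below are illustrative assumptions, not measurements of any real database or of AutoQL:

```python
# Illustrative estimate of corpus size for an enterprise schema.
tables = 50               # tables in the database (assumed)
columns_per_table = 10    # filterable columns per table (assumed)
values_per_column = 20    # sampled distinct filter values (assumed)
templates = 100           # NL/SQL phrasing templates (assumed)

# One training pair per (table, column, value, template) combination.
total_pairs = tables * columns_per_table * values_per_column * templates
print(total_pairs)        # 1,000,000 candidate training pairs

# At an assumed two minutes of human effort per hand-written pair,
# producing the same corpus manually would take tens of thousands of hours.
hours = total_pairs * 2 / 60
print(round(hours))       # roughly 33,333 person-hours
```

Even with conservative assumptions, the combinatorics put a hand-built corpus of this coverage far beyond what manual labeling can deliver, while an automated generator produces it in minutes of compute time.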