What is AI for data integration? And why does it matter?

Reading time: 7 mins

Since our founding in early 2020, Precog has had a single vision: using AI and ML to solve the problem of creating ELT connectors to any and every API. We have realized this vision, and the market is taking note. This post describes how we did it.

What is AI for data integration?

At its simplest, AI means using computers to solve problems faster, and in most cases more accurately, than a human can by reading and responding to a question. Computers are built to complete complex computations quickly and reliably. With this basic definition in hand, let's explore AI for data integration.

In common examples, AI consists of asking a question (a “prompt”) and then using LLMs trained on vast amounts of data to formulate a text-based answer. Today’s AI prompts are largely text, but sometimes they include code. In fact, AI can be applied to a wide variety of non-text-based problems. In the area of data integration, specifically ELT (Extract-Load-Transform), we can apply AI to do what human developers do today: read documentation and write code to create individual connectors to a SaaS or public API. Let’s explore this and break the problem down.

Can AI help make a connector? Yes!

The technical problem of connecting to any API, whether a SaaS application or a public API like open.fda.gov, breaks down into two parts. The first is physically connecting to the source: determining which web requests to make (GET, POST) and configuring details like authentication, pagination, and other connection parameters. This step is critical, but it is also the easier part of the problem, since there are only a finite number of ways to make these connections.

Once the connection is established, the API responds to your request. This is the second part. In ELT, the request was for data, and the response is typically an object schema, usually JSON. Occasionally it is XML or CSV, but most modern APIs return JSON. Two things are worth noting about a JSON response. First, the schema is unique to each API. Second, JSON data can vary greatly in complexity and heterogeneity, and the schema is often quite complex.

This is the difficult part of working with API data for ETL/ELT purposes, because the goal is to normalize the data into a well-defined SQL schema that can be loaded into any common data warehouse or RDBMS (and from there used in BI and ML tools). Today, this second step is done by developers reading the schema in the API documentation or inspecting the actual response data, then writing custom code to deconstruct the object schema into the relational target schema. The process is slow, brittle, and error-prone, since it depends on humans making subjective decisions about the data. This is the challenge of connecting modern APIs for ELT.
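To make the normalization problem concrete, here is a minimal Python sketch (illustrative only, not Precog's engine) that flattens one nested JSON record into parent and child rows: nested objects become dotted column names, and nested arrays become child tables keyed back to the parent row.

```python
# Minimal sketch of JSON -> relational normalization (illustrative only).
# Nested objects flatten into dotted columns; nested arrays split into
# child tables with a foreign key back to the parent row.

def normalize(record, table="root", tables=None, parent_id=None, counter=None):
    """Flatten one JSON object into rows across multiple tables."""
    if tables is None:
        tables = {}
    if counter is None:
        counter = {"id": 0}
    counter["id"] += 1
    row = {"_id": counter["id"]}
    if parent_id is not None:
        row["_parent_id"] = parent_id
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested object: flatten one level into dotted column names.
            for k2, v2 in value.items():
                row[f"{key}.{k2}"] = v2
        elif isinstance(value, list):
            # Nested array: each element becomes a row in a child table.
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                normalize(child, f"{table}_{key}", tables, row["_id"], counter)
        else:
            row[key] = value
    tables.setdefault(table, []).append(row)
    return tables

order = {
    "id": 17,
    "customer": {"name": "Acme", "country": "US"},
    "lines": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-9", "qty": 1},
    ],
}
tables = normalize(order)
# tables["root"] holds one order row; tables["root_lines"] holds two
# line rows, each carrying a _parent_id foreign key.
```

Even this toy version has to make the kind of subjective decisions the paragraph above describes (how deep to flatten, how to key child tables), which is exactly where hand-written connector code diverges and breaks.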

Manually creating connectors can’t scale

Consider that this problem applies to thousands of APIs, and you can begin to see its scope. It's important to note that all current solutions rely on this same manual approach; even the available "automated" ETL/ELT SaaS solutions depend on manual work. This is why the established vendors support only a tiny fraction of the data sources the market demands. The other, more significant challenge of the manual approach is the massive technical debt it incurs. And because this code must be actively maintained against ever-changing APIs, it's also incredibly expensive. The unit economics of the manual approach make it virtually impossible to build a long-term profitable connector business.

The sweet spot for AI in ELT

So, how can we solve this problem differently? The obvious answer is AI, or more specifically, a combination of AI and ML (think of this as "machine intelligence"). How can a computer automatically configure these connections (both the physical connection and the data normalization process) without human intervention? And how can it do so very fast, with much higher accuracy than a human? Conventional wisdom says this would be very hard, and indeed it is. But if you start by thinking about the problem differently, as a problem of automation and AI rather than an exercise in coding, you can begin to see how it can be solved. At Precog, we started with the second part of the problem, since it's the bigger technical challenge and would require the most innovation. The Precog team has deep experience working with complex object schemas and has done significant R&D into the problems they present.

This is the biggest opportunity: make a system that can take an object schema of unknown structure and complexity and normalize it into a well-defined relational schema.

The three innovations transforming ELT

Each of these innovations is significant, but taken together and made to work as a whole, the result is an engine that can do exactly what is needed to transform data integration: it can evaluate an object schema (JSON, XML) of unknown structure and complexity and normalize it automatically, without any human intervention and without writing a single line of code. And it doesn't fail as complexity increases. When I describe this to people, the response is often disbelief. That's understandable. But it works, and we do this every single day for enterprise customers.

  1. A mathematical foundation that allows us to extend the algebra needed to manipulate data in more than two dimensions. This is called Multi-dimensional Relational Algebra (MRA). While this is very important, it’s just math, so we decided to make it publicly available.
  2. An evaluation engine that can evaluate data in more than two dimensions. Conventional evaluation engines and parsers perform vector evaluation: they evaluate data in lines (rows and columns), the building blocks of normal relational data structures. What's needed is another dimension of evaluation, scalar evaluation, or the ability to evaluate data in the dimension of magnitude. We pioneered this capability in our Tectonic framework.
  3. The final piece is the ability to execute advanced structural optimizations (query planning) over the data to avoid problems with complexity and heterogeneity. This innovation lives in a library we call Scion.
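The heterogeneity problem mentioned above is easy to underestimate. A toy example (hypothetical code, not Scion) shows the shape of it: the same field can arrive as a scalar in one record, an array in another, and be missing entirely in a third, and an automated normalizer must reconcile all three before it can emit a stable relational schema.

```python
# Toy illustration of schema heterogeneity (not Scion). The "tags" field
# arrives as a scalar, an array, or not at all; unify by promoting every
# scalar to a single-element list before normalizing into a child table.

records = [
    {"id": 1, "tags": "new"},            # scalar
    {"id": 2, "tags": ["sale", "eu"]},   # array
    {"id": 3},                           # missing entirely
]

def unify_tags(record):
    value = record.get("tags", [])
    return value if isinstance(value, list) else [value]

# Child table: one row per (record id, tag) pair.
tag_rows = [
    {"record_id": r["id"], "tag": t}
    for r in records
    for t in unify_tags(r)
]
```

Hand-written connectors tend to hard-code one of these shapes and break when the API sends another; handling all of them systematically is what the structural-optimization layer is for.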

Using AI to connect to APIs

So, let’s bring this back to the original premise: AI. In this case, we use computers to do what humans now do by hand: write custom code to normalize the object schemas coming from APIs so the data can be loaded into data warehouses and databases. But we don’t stop there. What else is needed? Recall that the first part of the API connection problem was making the actual connection. It turns out this, too, can be largely automated. In Precog, this automation lives in a library called Rootstock, which works hand in hand with Scion (the botanists reading this will get it) to configure the API connection automatically. Since connections are made using a limited number of methods, you can define a well-bounded problem space, then apply “intelligence,” or AI, to produce the configuration file needed to establish the API connection. The result is a YAML file that precisely defines the connection and requires little human intervention.
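The post doesn't publish Rootstock's actual format, but a connection definition of this kind might look something like the following hypothetical YAML; every field name here is illustrative, not Precog's schema. The point is that authentication, pagination, and endpoints form a small, well-defined vocabulary that a machine can fill in.

```yaml
# Hypothetical connector definition -- illustrative only, not Rootstock's
# actual schema. Connection concerns reduce to a small set of choices.
source:
  base_url: https://api.example.com/v2
  auth:
    type: oauth2
    token_url: https://api.example.com/oauth/token
  pagination:
    style: cursor
    cursor_param: next_page
    page_size: 100
endpoints:
  - name: invoices
    path: /invoices
    method: GET
```

Because each of these keys draws from a finite menu (a handful of auth types, a handful of pagination styles), generating the file is a constrained configuration problem rather than open-ended code generation.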

How AI is delivering 1000s of connectors

We produce entire “intelligent” connectors to APIs in minutes without writing a single line of code. We recently produced a video, which we will publish soon, in which we create a new comprehensive connector to a SaaS ERP system (Odoo) in 12 minutes (that includes time to load the data into Snowflake). So, what exactly does this connector do?

  • It connects to the API, including ALL available endpoints and data sets
  • It then extracts the data
  • Then, it normalizes the data, automatically identifying primary keys (using ML)
  • It then provides incremental loading for each dataset (allowing the user to set the data to refresh on any schedule)

The end result is a fully automated data pipeline that loads the data into any data warehouse or database. Notably, this includes all custom fields, which Precog detects automatically. It also manages types, and even type changes.
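The incremental-loading step in the list above can be sketched roughly as cursor-based extraction; the function and field names below are assumptions for illustration, not Precog's API.

```python
# Rough sketch of cursor-based incremental loading (illustrative only).
# On each scheduled run, fetch only records modified since the stored
# cursor, then advance the cursor to the newest timestamp seen.

def incremental_load(fetch_since, state):
    """fetch_since(cursor) returns records carrying an 'updated_at' field."""
    cursor = state.get("cursor", "1970-01-01T00:00:00Z")
    new_records = fetch_since(cursor)
    if new_records:
        # ISO-8601 timestamps compare correctly as strings.
        state["cursor"] = max(r["updated_at"] for r in new_records)
    return new_records

# Simulated source standing in for an API endpoint.
source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]
fetch = lambda cursor: [r for r in source if r["updated_at"] > cursor]

state = {}
first = incremental_load(fetch, state)   # initial run loads both records
second = incremental_load(fetch, state)  # nothing new, so empty
```

The scheduling piece ("refresh on any schedule") then reduces to calling this with the persisted state on whatever cadence the user picks.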

Why haven’t other established ELT vendors done this?

Good question. It comes down to approaching both the problem and the solution from a different perspective. The conventional wisdom of hand-building every connector is entrenched, and most vendors have accepted it, along with all its limitations. Some don't believe an approach like Precog's is possible; I've been told this many times. But it does work, and we can prove it: we do it daily for our hundreds of customers. We create 5-10 new connectors weekly with only one dedicated resource, while the rest of our engineering team focuses on building and refining the core AI and platform.

What’s the endgame?

Our goal from day one was to make data access a “utility.” Any user should be able to connect to any API in a matter of minutes and start using the data for analytics and ML. It should not require weeks of coding, complicated workflows, and brittle processes. Precog supports more APIs than any other platform. We’re at 750, and we aren’t stopping. We will support over 1000 within the next six months. One look at our growing enterprise customer base, and it’s easy to see we are innovating in ways our competitors simply can’t. Our long-term cost-of-delivery will be 1/10 of the cost of the traditional coding approach. Join us as we continue to build a scalable, long-term business transforming ELT with AI.
