DataPelago emerged from stealth today with a new virtualization layer that it says will let users run AI, data analytics, and ETL workloads on any physical processor they want, without code changes, potentially bringing big gains in efficiency and performance to the fields of data science, data analytics, and data engineering, as well as HPC.
The advent of generative AI has sparked a scramble for high-performance processors that can handle the massive computing demands of large language models (LLMs). At the same time, companies are looking for ways to get more efficiency out of their existing IT spend for advanced analytics and big data pipelines, while facing the endless growth of structured, semi-structured and unstructured data.
The folks at DataPelago have responded to these market signals by building what they call a universal data processing engine that eliminates the need to hardwire data-intensive workloads to the underlying compute infrastructure, thereby freeing users to run big data, advanced analytics, AI, and HPC workloads on whatever public cloud or on-premises systems they have available or that meet their price/performance requirements.
“Just like Sun created the Java virtual machine or VMware invented the hypervisor, we are creating a virtualization layer that runs in software, not hardware,” says DataPelago co-founder and CEO Rajan Goyal. “It runs in software, which provides a clean abstraction for anything on top.”
DataPelago’s virtualization layer sits between the query engine, such as Spark, Trino, Flink, or plain SQL, and the underlying infrastructure composed of physical processors and storage, such as CPUs, GPUs, TPUs, and FPGAs. Users and applications can submit jobs as they normally would, and the DataPelago layer automatically routes and executes each job on the appropriate processor to meet the availability or cost/performance characteristics set by the user.
At a technical level, when a user or application executes a job, such as a data pipeline job or a query, the processing engine, such as Spark, converts it into a plan, and DataPelago then calls an open source layer, such as Apache Gluten, to convert that plan into an intermediate representation (IR) using open standards such as Substrait and Velox. The plan is sent to the worker node in the DataOS component of the DataPelago platform, while the IR is converted into an executable data flow graph (DFG) that runs in DataVM, another component of the DataPelago platform. DataVM then evaluates the DFG nodes and dynamically maps them to the right processing element, according to the company.
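DataPelago has not published its internal APIs, but the flow described above can be sketched at a conceptual level. In the Python sketch below, every class, table, and function name is hypothetical; it only illustrates the idea of lowering a plan into a data flow graph and mapping each node to a processing element.

```python
# Hypothetical sketch of the plan -> IR -> DFG -> device flow described above.
# None of these names are DataPelago APIs; they only illustrate the concept.
from dataclasses import dataclass, field

@dataclass
class IRNode:
    op: str                      # e.g. "scan", "filter", "hash_join", "agg"
    children: list = field(default_factory=list)

# A toy "sweet spot" table: which processing element suits each operator.
DEVICE_AFFINITY = {
    "scan": "CPU",          # I/O-heavy, little arithmetic
    "filter": "FPGA",       # streaming predicate evaluation
    "hash_join": "GPU",     # massively parallel probe phase
    "agg": "GPU",           # parallel reduction
}

def lower_to_dfg(plan: IRNode) -> list:
    """Flatten the IR tree into an executable data flow graph (children first)."""
    dfg = []
    for child in plan.children:
        dfg.extend(lower_to_dfg(child))
    dfg.append(plan)
    return dfg

def assign_devices(dfg: list) -> list:
    """Dynamically map each DFG node to a processing element."""
    return [(node.op, DEVICE_AFFINITY.get(node.op, "CPU")) for node in dfg]

# Example: a small aggregation query lowered and mapped.
plan = IRNode("agg", [IRNode("hash_join", [IRNode("filter", [IRNode("scan")]),
                                           IRNode("scan")])])
print(assign_devices(lower_to_dfg(plan)))
# [('scan', 'CPU'), ('filter', 'FPGA'), ('scan', 'CPU'),
#  ('hash_join', 'GPU'), ('agg', 'GPU')]
```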
Having an automated way to match workloads to the right processor should be a boon to DataPelago’s customers, who in many cases have not seen the performance gains they expected from adopting accelerated compute engines, says Goyal.
“CPUs, FPGAs, and GPUs each have their own sweet spot, just as a SQL workload or a Python workload has a variety of operators. Not all of them run efficiently on the CPU, GPU, or FPGA,” Goyal tells BigDATAwire. “We know those sweet spots. Our runtime software then maps the operators to the right processing element. You can split this massive query or workload into thousands of tasks, and some will run on the CPU, some on the GPU, and some on the FPGA. That is innovative runtime adaptive mapping to the right computing element that is missing in other frameworks.”
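Goyal’s description suggests runtime cost-based placement rather than a fixed mapping. A minimal sketch of that idea, with invented per-operator cost figures (not DataPelago benchmarks), might look like this:

```python
# Toy runtime-adaptive mapping: pick the element with the lowest estimated
# cost for each task. All numbers are illustrative placeholders.
EST_COST_MS = {                     # operator -> {device: est. ms per 1M rows}
    "regex_match":  {"CPU": 40.0, "GPU": 55.0, "FPGA": 6.0},
    "matrix_mult":  {"CPU": 90.0, "GPU": 3.0,  "FPGA": 25.0},
    "string_sort":  {"CPU": 12.0, "GPU": 9.0,  "FPGA": 30.0},
}

def schedule(tasks):
    """Assign each (operator, row_count) task to its cheapest processing element."""
    placements = []
    for op, rows in tasks:
        costs = EST_COST_MS[op]
        device = min(costs, key=costs.get)          # cheapest device wins
        placements.append((op, device, costs[device] * rows / 1e6))
    return placements

tasks = [("regex_match", 5_000_000), ("matrix_mult", 2_000_000),
         ("string_sort", 1_000_000)]
for op, device, ms in schedule(tasks):
    print(f"{op:12s} -> {device:4s} (~{ms:.1f} ms)")
```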
DataPelago obviously cannot exceed the maximum performance an application can achieve by developing natively in CUDA for Nvidia GPUs, ROCm for AMD GPUs, or LLVM for high-performance CPU jobs, Goyal says. But the company’s product can get much closer to the maximum performance available to an application at those programming layers, and do so while shielding users from the underlying complexity and without tying them and their applications to those middleware layers, he says.
“There is a huge difference between the peak performance expected from GPUs and what applications actually achieve. We are closing that gap,” he says. “You’ll be surprised that applications, even Spark workloads running on GPUs today, get less than 10% of the GPU’s maximum FLOPS.”
One reason for the performance gap is I/O bandwidth, says Goyal. GPUs have their own local memory, which means you must move data from host memory to GPU memory to use it. People often don’t factor that data movement and I/O into their performance expectations when migrating to GPUs, Goyal says, but DataPelago can eliminate the need to even worry about it.
“The virtual machine handles it in such a way that we fuse operators and run data flow graphs,” says Goyal. “Things do not bounce from one domain to another. There is no data movement. We run in a streaming fashion. We do not store and forward. As a result, I/O is reduced dramatically, and we can keep the GPUs at 80 to 90% of their peak performance. That is the beauty of this architecture.”
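Operator fusion and streaming execution are well-established techniques in vectorized engines, and a toy Python comparison shows why they cut I/O: a store-and-forward pipeline materializes a full intermediate buffer between stages, while a fused pipeline pushes each element (or chunk) through all operators at once. The sketch below is illustrative only, not DataPelago’s implementation.

```python
# Store-and-forward: each operator materializes its full output before the
# next one starts, so every stage pays another round trip through memory.
def store_and_forward(data, ops):
    for op in ops:
        data = [op(x) for x in data]   # full intermediate buffer per stage
    return data

# Fused/streaming: operators are composed and applied per element, so
# intermediates stay in registers/cache and never hit memory as buffers.
def fused(data, ops):
    for x in data:
        for op in ops:
            x = op(x)
        yield x

ops = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
assert store_and_forward(range(5), ops) == list(fused(range(5), ops))
```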
The company is targeting all types of data-intensive workloads that organizations are trying to speed up with accelerated compute engines. That includes SQL queries for ad hoc analytics using Spark, Trino, and Presto, ETL workloads built with SQL or Python, and streaming data workloads using frameworks like Flink. Generative AI workloads can benefit too, both at the LLM training stage and at inference time, thanks to DataPelago’s ability to accelerate retrieval-augmented generation (RAG), fine-tuning, and the creation of vector embeddings for a vector database, says Goyal.
“So it is a unified platform to do both classic lakehouse analytics and ETL, as well as data preprocessing for GenAI,” he says.
Customers can run DataPelago on-premises or in the cloud. When run alongside a cloud lakehouse, such as AWS EMR or Google Cloud’s DataProc, the system can do the same amount of work previously done with a 100-node cluster using just a 10-node cluster, Goyal says. While the queries themselves run 10 times faster with DataPelago, the end result is a twofold improvement in total cost of ownership once licensing and maintenance are taken into account, he says.
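The arithmetic behind that claim can be made explicit with placeholder prices. Assuming a hypothetical per-node cost and invented licensing and support fees, a 10x smaller cluster nets out to roughly a 2x TCO improvement:

```python
# Back-of-the-envelope TCO check. All dollar figures are invented
# placeholders, not DataPelago or cloud-provider pricing.
NODE_COST = 10_000                      # per node per year (hypothetical)

baseline = 100 * NODE_COST              # 100-node cluster, no extra licensing
accelerated = (10 * NODE_COST           # 10-node cluster doing the same work
               + 350_000                # hypothetical licensing
               + 50_000)                # hypothetical maintenance/support

print(f"baseline:    ${baseline:,}")     # $1,000,000
print(f"accelerated: ${accelerated:,}")  # $500,000
print(f"TCO improvement: {baseline / accelerated:.1f}x")  # 2.0x
```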
“But the most important thing is that there are no code changes,” he says. “You’re writing Airflow. You’re using Jupyter notebooks, you’re writing Python or PySpark, Spark or Trino; whatever you are running, they remain unchanged.”
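That claim is easy to picture: a typical PySpark job like the one below, with an invented path and schema, contains nothing that references the hardware underneath, which is what lets a virtualization layer retarget it without code changes.

```python
# An ordinary PySpark job -- nothing here names the hardware it runs on.
# File path and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/orders/")
daily = (orders
         .where(F.col("status") == "COMPLETE")
         .groupBy(F.to_date("created_at").alias("day"))
         .agg(F.sum("amount").alias("revenue")))
daily.write.mode("overwrite").parquet("s3://example-bucket/daily_revenue/")
```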
The company has compared its software to some of the fastest data lakehouse platforms out there. When compared to Databricks Photon, which Goyal calls “the gold standard,” DataPelago showed a 3- to 4-fold performance increase, he says.
There is no reason customers can’t use the DataPelago virtualization layer to accelerate scientific computing workloads running on HPC setups, including AI as well as simulation and modeling workloads, Goyal says.
“If you have custom code written for specific hardware, where you’re optimizing for an A100 GPU with 80 gigabytes of GPU memory, that many SMs, and that many threads, then you can optimize for that,” he says. “You are orchestrating your low-level code and kernels to maximize your FLOPS, or operations per second. What we have done is provide an abstraction layer where that is now done underneath and we can hide it, so it provides extensibility and distribution with the same principle.
“At the end of the day, it’s not like there’s magic here. There are only three things: compute, I/O, and storage,” he continues. “As long as you design your system with impedance matching of these three things, so that it’s not I/O-bound, not compute-bound, and not storage-bound, then life is good.”
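Goyal’s “impedance matching” framing corresponds to a roofline-style balance check: a stage is healthy when no single resource dominates its runtime. A small illustrative calculation, using rough ballpark bandwidth and compute figures rather than any vendor’s specs:

```python
# Roofline-style balance check: which resource bounds a stage?
# Ballpark figures for illustration only, not vendor specs.
FLOPS      = 50e12    # ~50 TFLOPS of usable compute
PCIE_BW    = 25e9     # ~25 GB/s host <-> device bandwidth
STORAGE_BW = 5e9      # ~5 GB/s object-store/NVMe read bandwidth

def bottleneck(flops_needed, bytes_over_pcie, bytes_from_storage):
    """Return the binding resource and the per-resource time estimates."""
    times = {
        "compute": flops_needed / FLOPS,
        "I/O":     bytes_over_pcie / PCIE_BW,
        "storage": bytes_from_storage / STORAGE_BW,
    }
    return max(times, key=times.get), times

# Example: a scan-heavy query -- lots of bytes moved, little math.
worst, times = bottleneck(flops_needed=1e12,
                          bytes_over_pcie=200e9,
                          bytes_from_storage=400e9)
print(worst, {k: f"{v:.2f}s" for k, v in times.items()})
# storage {'compute': '0.02s', 'I/O': '8.00s', 'storage': '80.00s'}
```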
DataPelago already has paying customers using its software, some in the pilot phase and others heading into production, Goyal says. The company plans to formally launch the software into general availability in the first quarter of 2025.
Meanwhile, the Mountain View company came out of stealth today with the announcement that it has raised $47 million in funding from Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners, and Silicon Valley Bank, a division of First Citizens Bank.
Related articles:
Nvidia seeks to accelerate GenAI adoption with NIM
Pandas on GPU runs 150 times faster, says Nvidia
Spark 3.0 to get native GPU acceleration