Python Data Science Framework Released

Written by Kay Ewbank

Monday, 12 July 2021

A data science framework for Python has been launched by researchers from Brown University. Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code.

The team says Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Tuplex is short for tuples and exceptions.

Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++. Data science pipelines are usually slow because they rely on user-defined functions (UDFs) written in Python, which have to be run by the interpreter. Because Tuplex compiles pipelines with inline Python to native code, the team reports that it runs them 5–91x faster than systems that call into a Python interpreter.

The framework has been designed to be easy to use. Tuplex works interactively in the Python toplevel, integrates with Jupyter Notebooks, and provides familiar APIs. Developers write Tuplex pipelines using a LINQ-style API similar to PySpark’s and use Python UDFs without type annotations. The team says Tuplex jobs never crash on malformed inputs because Tuplex's dual-mode execution model separates the common-case inputs from exception-case inputs (e.g., malformed data, wrong types) and reports them separately.
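To make the "never crash, report separately" idea concrete, here is a minimal pure-Python sketch. It does not use Tuplex or its API; the helper `run_resilient` and the sample rows are hypothetical names invented for illustration, and the real framework does this work in compiled code rather than in a Python loop.

```python
from collections import Counter

def run_resilient(rows, udf):
    """Apply udf to every row without ever raising.

    Returns the successful results and, separately, the rows that failed
    together with the name of the exception they caused.
    """
    results, failures = [], []
    for row in rows:
        try:
            results.append(udf(row))
        except Exception as exc:  # exception-case input: record it and carry on
            failures.append((row, type(exc).__name__))
    return results, failures

# Hypothetical input: mostly clean numeric strings plus two malformed rows.
rows = ["12", "7", "n/a", None, "30"]
results, failures = run_resilient(rows, lambda s: int(s) * 2)

# Summarize the failures by exception type, separately from the results.
summary = Counter(name for _, name in failures)
print(results)
print(dict(summary))
```

The job finishes even though two of the five rows are bad, and the caller can inspect the failed rows afterwards instead of losing the whole run to the first error.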

Tuplex's dual-mode execution model compiles an optimized fast path for data that falls into the 'normal' category, and falls back on slower exception code paths for data that fail to match the fast path's assumptions. By concentrating on common case data, Tuplex keeps the code simple enough to apply aggressive optimizations.
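The fast path and fallback split can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names (`udf`, `fast_path`, `matches_common_case`, `process`) are hypothetical, the "common case" is assumed to be a pair of integers, and in Tuplex the fast path would be generated native code rather than a Python function.

```python
def udf(row):
    # The user's original function: handles ints and numeric strings.
    return int(row[0]) * int(row[1])

def matches_common_case(row):
    # Guard for the assumed common case: a 2-tuple of plain ints.
    return (
        isinstance(row, tuple)
        and len(row) == 2
        and all(type(v) is int for v in row)
    )

def fast_path(row):
    # Specialized version of udf: valid only when the guard holds, so the
    # int() conversions can be dropped without changing the result.
    return row[0] * row[1]

def process(rows):
    results, failed = [], []
    for row in rows:
        if matches_common_case(row):
            results.append((fast_path(row), "fast"))
        else:
            try:
                # Slower general path: run the original UDF unchanged.
                results.append((udf(row), "slow"))
            except (TypeError, ValueError) as exc:
                failed.append((row, type(exc).__name__))
    return results, failed

rows = [(2, 3), ("4", 5), (7, "x"), None]
results, failed = process(rows)
print(results)
print(failed)
```

Because the fast path only ever sees rows that pass the guard, it can stay simple, which is the property the article credits for making aggressive optimization practical.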

Making dual-mode processing work in this fashion means Tuplex first has to establish what the "common case" is. Tuplex’s key idea is to sample the input, derive the common case from this sample, and infer types and expected cases across the pipeline. In addition, Tuplex’s generated native code has to match a semantically-correct Python execution in the interpreter. To guarantee this, Tuplex separates the input data into two row classes: those for which the native code’s behavior is identical to Python’s, and those for which it isn’t and which must be processed in the interpreter.
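A rough sketch of sample-based inference, again in plain Python and not taken from the Tuplex code base, might look like this. The helpers `row_signature`, `infer_common_case` and `partition` are hypothetical, the sketch assumes a non-empty input, and it uses a simple "most frequent type signature in the sample" rule that may differ from what Tuplex actually does.

```python
from collections import Counter
from itertools import islice

def row_signature(row):
    # Describe a row by the type names of its fields, e.g. ('int', 'str').
    return tuple(type(v).__name__ for v in row)

def infer_common_case(rows, sample_size=100):
    # Look only at the first sample_size rows and pick the most frequent
    # type signature as the assumed common case. Assumes rows is non-empty.
    sample = islice(rows, sample_size)
    counts = Counter(row_signature(r) for r in sample)
    return counts.most_common(1)[0][0]

def partition(rows, signature):
    # Split the full input into rows that match the common case and rows
    # that do not and would need the slower path.
    normal = [r for r in rows if row_signature(r) == signature]
    exceptional = [r for r in rows if row_signature(r) != signature]
    return normal, exceptional

rows = [(1, "a"), (2, "b"), (3, None), (4, "d"), ("5", "e")]
common = infer_common_case(rows, sample_size=4)
normal, exceptional = partition(rows, common)
print(common)
print(normal)
print(exceptional)
```

Note that the last row is outside the sample yet is still classified correctly, because the check against the inferred signature is applied to every row, not just the sampled ones.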

Tuplex is distributed as a Python package and a Docker image, along with instructions to build it from source. It is available now on PyPI and as a Docker image.

More Information

Tuplex Python package on PyPI

Tuplex Docker image

Related Articles

Spark 3 Improves Python and SQL Support

New Database For Data Scientists

Amazon Open Sources Python Library for AWS Glue

Pandas Reaches 1.0
