Harnessing the Power of the Data Explosion

By Doug Hulette
Photos by Frank Wojciechowski

“There’s a massive technology revolution coming down the pipeline,” professor Michael Freedman declares. 

Professor Michael Freedman

Indeed, as Big Data swells into Enormous Data and the Internet of Things becomes the Internet of Everything, the torrent of information—much of it from sensors that monitor conditions at intervals of fractions of a second—threatens to swamp traditional computer storage and processing systems.

But where others may worry, Freedman sees opportunity. “The question is how do we build a system, how do we do analytics, and how do we do data collection to enable this kind of revolution?” he says. “If you think about computers today, what is interesting about them is the services and applications that they enable. It’s not the computer itself, but what you can do with the computer. Thirty years ago, this was word processing. Today we think of it as everywhere around us. And the services we want from them are those that are available anywhere in the world and can serve many people 24/7—always on. 

“Ultimately, that means we need to build these services as distributed systems, a lot of individual computers that work together to serve these types of applications and services,” explains Freedman, who joined the Computer Science Department in 2007. “The capacity of one individual server is limited. Back in the 70s and 80s, people were thinking of mainframes: How do we make computers that are very large? But there’s a limit on the size you can make one computer. In the 90s, we started to ask, ‘Instead of making the hardware bigger, how do we build software that allows standard, workstation-like computers to work together—coordinate—so we can build these very large applications?’”

From left to right: Matvey Arye *16, Ajay Kulkarni, Rob Kiefer *16, Mike Freedman


Freedman came to Princeton as an assistant professor and became a full professor in 2015. He earned his doctoral degree in computer science at New York University, spending several of those years at Stanford after his advisor moved there. He has also co-founded several companies, including Illuminics Systems and, most recently, Timescale. He took time from a busy schedule to talk about the things that drive his work.


What about computer science gets you pumped up these days?

Broadly speaking, what I find exciting is that we’re effectively entering a new wave of computing that many call operational technology—OT. For the last 50 years, people talked about the IT revolution. This was about how do we change the back office from paper to digital. What’s happening now involves not only things like smart thermostats and home alarm systems but industrial applications. Manufacturing lines are changing, supply chains are changing, self-driving cars are coming. All are examples of these connected devices that have computers on them that transmit and make sense of data. So, much like computers changed how the back office worked, all of these connected devices are going to change how buildings are run, how we do farming, how we do shipping—it’s going to pervade the rest of our lives.


Ajay Kulkarni and Mike Freedman, founders of Timescale

You and a colleague, Ajay Kulkarni, founded a company a few years ago on the premise that “humanity is now living with machines and swimming in machine data.” The company, Timescale, has raised $16 million “to help developers, businesses, and society make sense of it all.”  How did this come about?

We started the company by trying to build a platform for data analysis. But we came to realize that everybody had slightly different needs: how they analyzed their data, where they needed it processed, what kind of processing they did. What they all shared was the need for a place to store that data, and none of the databases that were readily available actually satisfied it. So we realized that we could better solve the machine data problem by building a time-series database rather than a data-analysis platform.

Time-series applications measure how and why things change by analyzing serial data taken by sensors at tiny intervals, sometimes many times a second. Why is that kind of data increasingly important in the digital world?

Essentially, you use time-series data to understand the past, evaluate the present, and predict the future. You want to be able to go back in time and figure out why your computers started overheating or your machinery started breaking. You might use that information to better understand the impact of external factors: for instance, how our use of the air conditioning depends on the outside temperature at any given time, the number of people in the room, the ambient humidity, and other things. You need data to make good decisions.
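To make that concrete, here is a minimal sketch of such a look-back question, written in standard SQL, the language of relational databases like the one Timescale builds on. The table machine_temps and its time, device_id, and temperature columns are hypothetical, chosen only for illustration; the query asks when each machine started running hot over the past week.

-- Hourly average temperature per device over the last week
-- (hypothetical table machine_temps: time, device_id, temperature)
SELECT device_id,
       date_trunc('hour', time) AS hour,
       avg(temperature) AS avg_temp
FROM machine_temps
WHERE time > now() - INTERVAL '7 days'
GROUP BY device_id, hour
ORDER BY hour;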

Your company calls its main product, TimescaleDB, “the first open-source time-series database to combine the power, reliability, and ease-of-use of a relational database with the scalability typically seen in NoSQL databases.” What does that mean, in layspeak?

It’s a bit complicated to explain how this works. But picture an Excel spreadsheet with 500 billion rows of data. If we’re mostly only modifying data from the last day, we’re able to keep all that data directly in memory, and that makes things really fast. But if we’re trying to do random updates of any of our half a trillion rows, we’re jumping between all these different portions of the spreadsheet and therefore we can’t keep it all in memory; your laptop doesn’t have terabytes of memory. Every time we do an update, we have to read and write from disk, which could be a thousand times slower than writing to memory.

TimescaleDB allows users to see this as all one gigantic spreadsheet. But under the covers, it automatically creates all these little files corresponding to different time intervals, creates new ones whenever needed, and sometimes adapts the time range: should it be a day or should it be an hour? The answer depends on your data volume, and when your volume changes you might want to dynamically adjust that. All of that happens transparently to TimescaleDB’s users, who come from many industries: web and mobile companies, space agencies, more traditional manufacturers, utilities, drilling and mining companies, financial firms, telecoms, and others.
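In TimescaleDB, that partitioning is set up once and everything else behaves like an ordinary SQL table. The sketch below is a minimal example assuming a hypothetical conditions table and a one-day chunk interval; as Freedman notes, the right interval depends on your data volume.

-- An ordinary table of sensor readings (hypothetical schema)
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION
);

-- Turn it into a hypertable: TimescaleDB transparently splits the data
-- into chunks behind the scenes, here one chunk per day
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval => INTERVAL '1 day');

-- Applications keep reading and writing the one logical table
INSERT INTO conditions VALUES (now(), 'machine-42', 71.3);

Chunks holding only recent data can stay hot in memory while older chunks live on disk, which is the spreadsheet-of-little-files picture Freedman describes above.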

It’s a fascinating time to be building computer systems!


The Timescale team. Standing, left to right: Mike Freedman, Ajay Kulkarni, Tim Geisenheimer, Melanie Savoia, Matvey Arye *16, Andrew Staller. Seated, left to right: Rob Kiefer *16, Solar Olugebefola
Timescale team members not pictured: Erik Norström, Princeton University Postdoc/Research Scientist from 2010-2013 (Stockholm); Shane Ermitano (Los Angeles); David Kohn (New York); Lee Hampton (New York); Diana Hsieh (New York)
