The Internet of Things (IoT) revolution is here, providing new streams of data that enable finer visibility and control over the world around us. One of the core offerings of Aquicore, load analytics, uses IoT hardware called **networking devices** with recorders called **meters** that monitor how much electricity, water, or other resources are being used in a building, allowing us to pinpoint inefficiencies and generate recommendations for how to cut waste and save money. These meters are pretty robust, but sometimes anomalies happen. A thunderstorm causes a power surge, resetting a counter and making it seem like an impossibly high amount of electricity (measured in kilowatts, kW) was demanded over a few seconds. A networking device battery dies and a meter stops sending data. To really benefit from the full power of IoT, we need to ensure the data is accurate.

Fortunately, data quality issues usually stick out like a sore thumb. You don’t need to be an engineer to understand that if a meter usually reports values around 10 kW and it’s suddenly reporting a value of 10,000 kW, something’s wrong. It’s easy to catch a problem if you’re watching the data from a handful of meters, but what if you’re in charge of data quality on hundreds or thousands of meters? You need some way to automatically catch and report on these issues.

This is where data science comes in. At Aquicore, one of the core focuses of the data science team is making data quality assurance automatic and scalable, so no matter how many customers we have, you can expect the same top-tier level of service. To ensure data quality on thousands of meters, we write software that follows a set of rules built to catch problems. The challenge is creating rules that work equally well for utility meters monitoring electricity loads for sports stadiums and for submeters reporting on light bulbs in those stadiums’ closets.

**So how did we do it?**

The goal is to find a way to quantify what “normal data” looks like for a device, then flag patterns that don’t look like what we expect to see. One technique we use involves a simple statistical trick. Let’s say we’re trying to catch positive spikes (values much larger than usual, e.g. that 10,000 kW value from a meter normally reporting 10 kW). Below is a snapshot of normal 15-minute kW data for a utility meter and a submeter:

We see that we can’t set some hard threshold for “normal” behavior that works well for both meters. A threshold of 1,000 kW will catch errors in the utility meter but will miss most errors in the submeter. A threshold of 250 kW will catch spikes in the submeter, but it’ll trigger constantly for the utility meter.

You can clearly see this when you look at the meters’ **distributions** (right): there’s no vertical line you can draw anywhere that will work well for catching outliers for both meters.

The **means** (averages) of the two distributions are really different: the utility meter oscillates around 260 kW, while the submeter hangs out around 48 kW. The **standard deviations** (variation) are also wildly different between the meters: most of the utility meter’s values easily wander 120 kW in either direction around the mean, while most of the submeter readings tend to hug the mean more tightly, moving only about 23 kW.

So what can we do? The trick is to **scale** the data. Scaling means subtracting the mean and dividing by the standard deviation; the resulting value is often called a z-score.

Subtracting the mean centers our data on zero, turning lower-than-average values negative and higher-than-average values positive. Dividing by the standard deviation creates a common “step size” when we’re trying to identify values that are too small or too large. Is a value of 50 kW too large? Well, it depends on the meter. Is a value of 20 standard deviations too large? Yes, definitely.
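As a rough illustration of the scaling step, here is a minimal Python sketch. The function name `scale` and the readings are made up for this example (they are not real meter data or Aquicore's production code); the readings were chosen so the two meters have the same shape at very different magnitudes.

```python
from statistics import mean, pstdev


def scale(readings):
    """Re-express each reading as its distance from the mean,
    measured in standard deviations."""
    mu = mean(readings)
    sigma = pstdev(readings)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot scale a series with no variation")
    return [(x - mu) / sigma for x in readings]


# Hypothetical 15-minute kW readings, for illustration only
utility_kw = [140.0, 200.0, 260.0, 320.0, 380.0]
submeter_kw = [25.0, 36.5, 48.0, 59.5, 71.0]

utility_z = scale(utility_kw)
submeter_z = scale(submeter_kw)

for u, s in zip(utility_z, submeter_z):
    print(f"utility: {u:+.2f} sd   submeter: {s:+.2f} sd")
```

After scaling, both lists have a mean of zero and a standard deviation of one, so the two meters can be compared directly even though their raw kW values barely overlap.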

When you scale a data set, every data point becomes rephrased as “the number of standard deviations from the mean.” This means that instead of talking about raw kW, whose values are different for each meter, we can talk in a standardized unit of distance: the standard deviation. Despite our utility meter and submeter having distributions that barely overlap (above), when we scale the data, the distributions fall neatly on top of each other (below). This makes it easy to identify outliers, as we now have a common scale. With scaled data, instead of a hard threshold like 1,000 kW, we can specify a *number of standard deviations above the mean* as our detector for whether we have an anomaly.
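A detector built on this idea might look something like the sketch below. It is only an illustration under a few assumptions: the function name `flag_spikes`, the cutoff of 5 standard deviations, and all kW values are hypothetical, and the mean and standard deviation are computed from a window of known-normal history so that a huge spike cannot inflate the very statistics used to judge it.

```python
from statistics import mean, pstdev


def flag_spikes(history, new_readings, threshold=5.0):
    """Return the new readings that sit more than `threshold` standard
    deviations above the mean of `history` (positive spikes only)."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        raise ValueError("history has no variation; cannot scale")
    return [x for x in new_readings if (x - mu) / sigma > threshold]


# Hypothetical "normal" kW history for each meter, for illustration only
utility_history = [140.0, 200.0, 260.0, 320.0, 380.0]
submeter_history = [25.0, 36.5, 48.0, 59.5, 71.0]

# The same rule is applied to both meters, with no per-meter kW threshold
utility_flags = flag_spikes(utility_history, [300.0, 10000.0])
submeter_flags = flag_spikes(submeter_history, [60.0, 300.0])

print("utility spikes:", utility_flags)
print("submeter spikes:", submeter_flags)
```

In this toy data a reading of 300 kW passes unnoticed on the utility meter but is flagged on the submeter, which is the behavior a single hard kW threshold cannot provide.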

Scaling also reveals more interesting patterns in our data. When we plot the scaled time series, we see that the submeter actually has much more variable behavior than the utility meter: the peaks regularly jump to three or four standard deviations above the mean, while the utility meter consistently hovers under two standard deviations. We can fine-tune our data quality assurance even further by taking details like these into account when setting our rules for catching issues.

It’s not enough to just identify a data quality issue, though – we need to address it! At Aquicore, our priority is to have proactive data quality, the kind of seamless service you don’t have to think about because it’s being taken care of in the background. Any potential anomalies we find are addressed automatically if they’re known simple issues, like a meter that lost internet connectivity for a few hours. Uncommon anomalies are compiled and reviewed daily by our engineers, who reach out to buildings’ on-site engineering teams if something looks off. Combining the best in classical statistics, machine learning, and industry knowledge, we strive to ensure our customers’ experience meets the high expectations we set for ourselves as a company.

*This post was written by Matt Sosna, Data Scientist.*