## Introduction

One of the many reasons to love Performance Co-Pilot is that it is a fully fledged framework for performance analysis, which makes it extremely simple to extend and to build anything on top of. In this post we shall explore how simple it is to analyze your performance data using IPython and pandas.

## Setup

To start we will need some PCP archives containing metrics collected from a system. In
this post I will use the data I collect on my home firewall and will try to analyze some of
the data therein. To learn how to store performance metrics in an archive, take a look at
pmlogger and the Quickstart guide.
For this example I collected data over the course of a day at a 1 minute interval.
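As a minimal sketch of such a collection setup (file names, metric selection and durations here are illustrative, not the exact configuration used for this archive), pmlogger takes a configuration file listing the metrics to record:

```shell
# net.config tells pmlogger which metrics to record and how often.
cat > net.config <<'EOF'
log mandatory on every 1 minute {
    network.interface.in.bytes
    network.interface.out.bytes
}
EOF

# Record for 24 hours into an archive called "monday".
pmlogger -c net.config -T 24h monday
```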

## IPython and PCP

First of all you need to clone a small Python module that bridges PCP and pandas/numpy:

```
$ git clone https://github.com/mbaldessari/pcpinteractive.git
$ cd pcpinteractive
```

Now let us start our IPython console, import our Python module and load our archive:

```
$ ipython qtconsole --pylab=inline
In [1]: import pcparchive, pandas
In [2]: p = pcparchive.PcpArchive('testfiles/monday')
In [3]: p.parse()
Parsing archive: [########## 100.0%]
```

At this point the data is fully parsed in memory and we can start analyzing it, using
all the standard tools like pandas and matplotlib.
Let’s start by looking at how many metrics are present in the archive:

```
In [4]: metrics = p.get_metrics()
In [5]: len(metrics)
Out[5]: 253
```

## Pandas and PCP

Now we can get a pandas object out of a metric. Let’s take incoming and outgoing network traffic
expressed in bytes over time.

```
In [6]: net_in = p.get_pandas('network.interface.in.bytes')
In [7]: net_out = p.get_pandas('network.interface.out.bytes')
```

We can now graph the data with a simple `net_in.plot()`. We can also explore the data
with the `describe()` method, but first let's force the output into non-scientific
notation, as it is more readable for network data:

```
In [9]: pandas.set_option('display.float_format', lambda x: '%.2f' % x)
In [10]: net_in.describe()
Out[10]:
                 eth0    eth1    eth2            eth3              lo
count         1439.00 1439.00 1439.00         1439.00         1439.00
mean   23209831415.26    0.00    0.00 257865786302.88  32743773553.20
std       133904051.55    0.00    0.00   1677032669.74       887369.78
min    23074135536.00    0.00    0.00 256117467223.00  32742236187.00
25%    23077865299.00    0.00    0.00 256128768785.50  32743005086.50
50%    23160237636.00    0.00    0.00 257410715707.00  32743771068.00
75%    23319280110.50    0.00    0.00 259252325129.50  32744537961.50
max    23470418350.00    0.00    0.00 260826247120.00  32745311229.00

                 ppp0        redhat0           tun0
count         1439.00        1439.00        1439.00
mean  253307755506.55  6897358284.79 12675081585.52
std     1648648131.02   163459557.15        8568.61
min   251589569686.00  6726948179.00 12675070001.00
25%   251600158007.50  6727362445.50 12675070001.00
50%   252860366375.00  6859405494.00 12675088375.00
75%   254670529194.50  7051678675.50 12675088375.00
max   256217896469.00  7171608913.00 12675088375.00
```

## Manipulate the data

Now let's see what we can do in terms of data manipulation:

- Drop columns we do not care about:

`In [12]: del net_in['eth1']`

Or, alternatively:

`In [13]: net_in = net_in.drop('eth2', axis=1)`

- Resample the data at a coarser interval:

`In [14]: net_in.resample('10min')`

- Filter out the all-zero columns:

`In [15]: net_clean = net_in.ix[:, (net_in != 0).any(axis=0)]`

`In [16]: net_clean.ix[-1]`

- Select a smaller timeframe:

`In [17]: net_clean['2014-12-28 12:00':'2014-12-28 14:00']`

- Select a single column, either as an attribute or by key:

```
In [18]: net_in.eth0
In [19]: net_in['lo']
```

- Apply a function on the whole dataframe:
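For instance (a hypothetical transformation on made-up values, not from the original archive), `apply()` can rescale every column from bytes to megabytes:

```python
import pandas

# A tiny stand-in for the net_in dataframe (counter values in bytes).
df = pandas.DataFrame({'eth0': [1048576, 2097152], 'lo': [524288, 0]})

# Apply a function to every column: convert bytes to megabytes.
mb = df.apply(lambda col: col / 2**20)
print(mb)
#    eth0   lo
# 0   1.0  0.5
# 1   2.0  0.0
```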

- Sum all values for each column:

```
In [21]: net_in.sum()
Out[21]:
eth0        33398947406561
eth3       371068866489847
lo          47118290143065
ppp0       364509860173927
redhat0      9925298571818
tun0        18239442401565
dtype: int64
```

- Calculate the mean for each column:

`In [23]: net_in.mean()`

- Find the time of day when the max values are reached:

```
In [24]: net_in.idxmax()
In [25]: net_in.idxmax(axis=1)
```

- Select only the tun0 and eth0 devices:

`In [26]: test = net_in.ix[:, ['tun0', 'eth0']]`

## Merge and group dataframes

Now let's merge the `net_in` and the `net_out` dataframes into a single one, in order
to try and do some analysis on both traffic directions at the same time.

`In [27]: net = pandas.merge(net_in, net_out, left_index=True, right_index=True, how='left', suffixes=('_in', '_out'))`
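To see what the `suffixes` argument does, here is the same merge on a toy dataframe with made-up values: overlapping column names get the `_in`/`_out` suffixes appended so both directions can coexist in one frame:

```python
import pandas

# Toy per-interface counters (hypothetical values) for one interface.
net_in = pandas.DataFrame({'eth0': [10, 20]}, index=[0, 1])
net_out = pandas.DataFrame({'eth0': [1, 2]}, index=[0, 1])

# Index-aligned merge; suffixes disambiguate the duplicate column names.
net = pandas.merge(net_in, net_out, left_index=True, right_index=True,
                   how='left', suffixes=('_in', '_out'))
print(net.columns.tolist())  # ['eth0_in', 'eth0_out']
```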

Another very interesting aspect is the wealth of statistical functions that come for free through
the use of pandas. For example, to compute covariance and correlation matrices we can use the `cov()` and `corr()` methods:

```
In [28]: net.cov()
In [29]: net.corr()
In [30]: net.corrwith(net.eth0_in)
Out[30]:
eth0_in        1.000
eth3_in        0.993
lo_in          0.958
ppp0_in        0.993
redhat0_in     0.991
tun0_in        0.748
eth0_out       0.994
eth3_out       0.986
lo_out         0.958
ppp0_out       0.983
redhat0_out    0.978
tun0_out       0.748
```

We can also group columns together, for example by traffic direction:

```
In [32]: mapping = { 'eth0_in': 'in', 'eth3_in': 'in', 'ppp0_in': 'in', 'tun0_in': 'in',
'redhat0_in': 'in', 'lo_in': 'in', 'eth0_out': 'out', 'eth3_out': 'out',
'ppp0_out': 'out', 'tun0_out': 'out', 'redhat0_out': 'out', 'lo_out': 'out'}
In [33]: test = net.groupby(mapping, axis=1)
In [34]: test.sum()
```

Calculate the rolling mean of an interface and plot it:

`In [35]: pandas.rolling_mean(net.eth0_in, 250).plot()`

## Export data

Save the data to a CSV file or in Excel format:

```
In [36]: net.to_csv('testfiles/out.csv')
In [37]: net.to_excel('testfiles/out.xls')
```

Other output formats such as LaTeX, SQL, clipboard, HDF5 and more are supported.

## Conclusions

The versatility of PCP allows anyone to use many currently available frameworks (numpy, pandas, R, scipy) to analyze
and display the collected performance data. There is still some work to be done to make this process a bit simpler with an
out-of-the-box PCP installation.