Exploring the ExaMon database¶
This introductory notebook is a tutorial that will help you take your first steps with the examon-client, a tool that allows you to interact with the ExaMon database.
The tutorial will show you the basics of how to obtain information about the data stored in the database (metadata) and how to run actual queries and obtain a dataframe as a result.
The tutorial uses the ExaMon instance running at CINECA as an example, and the notebook can be executed directly on Google Colab; in that case it is recommended to mount your Drive account. Alternatively, you can download the notebook (top right button) and run it locally.
If you are interested in working with the CINECA ExaMon instance, please contact the following to obtain ExaMon credentials:
Please note: by using your ExaMon account you are able to access data owned by CINECA and are therefore subject to the same privacy regulations that every CINECA user is required to follow.
%matplotlib inline
# Mount Drive and install the examon-client
#
# Mounting Drive is an optional step but heavily suggested to have optimal
# performance in Google Colab
# (optional)
from google.colab import drive
drive.mount('/content/drive')
# Create and change to the Examon workspace folder (optional)
! mkdir -p /content/drive/MyDrive/examon_workdir
%cd /content/drive/MyDrive/examon_workdir
# Install (required)
! pip install https://github.com/fbeneventi/releases/releases/latest/download/examon-client.zip
Mounted at /content/drive
/content/drive/MyDrive/examon_workdir
Collecting https://github.com/fbeneventi/releases/releases/latest/download/examon-client.zip
  Downloading https://github.com/fbeneventi/releases/releases/latest/download/examon-client.zip (353 kB)
  Preparing metadata (setup.py) ... done
...
Successfully built examon-client
Installing collected packages: lz4, diskcache, examon-client
Successfully installed diskcache-5.6.3 examon-client-0.4.0b1 lz4-4.3.2
Examon setup¶
# Init steps
import os
import getpass
import numpy as np
import pandas as pd
from examon.examon import Client, ExamonQL
# Connect
USER = input('username:')
print('password:')
PWD = getpass.getpass()
ex = Client('examon.cineca.it', port='3002', user=USER, password=PWD, verbose=False, proxy=True)
print('Creating the local metadata cache (one-time task). Please wait ...')
sq = ExamonQL(ex)
username:admin password: ·········· Creating the local metadata cache (one-time task). Please wait ...
Metric list¶
To start with ExaMon, it is recommended that you first get the list of the sensors contained in the database. The initial instantiation of the ExamonQL object performs a full database scan, checking for all the metric tags. This happens only the first time, since the client uses caches where possible to save database bandwidth.
display(pd.DataFrame(sq.metric_list))
name | |
---|---|
0 | 0_0 |
1 | 12V |
2 | 1U_Stg_HDD0_Pres |
3 | 1U_Stg_HDD1_Pres |
4 | 1U_Stg_HDD2_Pres |
... | ... |
2318 | vm_pgpgin |
2319 | vm_pgpgout |
2320 | vm_vmeff |
2321 | wind_deg |
2322 | wind_speed |
2323 rows × 1 columns
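Since the database exposes more than two thousand metrics, it is often handy to filter this list. As a minimal sketch (reusing the same DataFrame built above), keep only the metric names containing a given substring, e.g. 'Temp':
# Filter the metric list for names containing 'Temp'
metrics = pd.DataFrame(sq.metric_list)
display(metrics[metrics['name'].str.contains('Temp')].head())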
Tag Keys¶
Each metric in the database comes with a set of tags (key, value) that are useful for filtering during queries. It is possible to obtain from the database all the tag keys associated with a specific metric.
df = sq.DESCRIBE() \
.execute()
display(df)
name | tag keys | |
---|---|---|
0 | 0_0 | [chnl, cluster, node, org, plugin, rack, slot,... |
1 | 1U_Stg_HDD1_Pres | [chnl, cluster, health, node, org, plugin, type] |
2 | 12V | [chnl, cluster, health, node, org, plugin, typ... |
3 | 1U_Stg_HDD0_Pres | [chnl, cluster, health, node, org, plugin, type] |
4 | 1U_Stg_HDD3_Pres | [chnl, cluster, health, node, org, plugin, type] |
... | ... | ... |
2317 | vm_pgmajfault | [chnl, cluster, gcluster, group, node, org, pl... |
2318 | state | [chnl, cluster, description, host_group, nagio... |
2319 | swap_total | [chnl, cluster, gcluster, group, node, org, pl... |
2320 | swap_free | [chnl, cluster, gcluster, group, node, org, pl... |
2321 | PS1_Temperature | [chnl, cluster, health, node, org, part, plugi... |
2322 rows × 2 columns
The database contains this number of valid metric names:
df.shape[0]
2322
To get an entry from the table:
df[df.name == 'Ambient_Temp']['tag keys'].values[0]
['chnl', 'cluster', 'health', 'node', 'org', 'part', 'plugin', 'type', 'units']
Tag values¶
It is possible to obtain all the possible values of every tag key of a given metric:
df = sq.DESCRIBE(metric='CPU_Utilization') \
.execute()
display(df)
name | tag key | tag values | |
---|---|---|---|
0 | CPU_Utilization | chnl | [data] |
1 | CPU_Utilization | cluster | [galileo, marconi] |
2 | CPU_Utilization | health | [ok] |
3 | CPU_Utilization | node | [node001, node002, node003, node004, node005, ... |
4 | CPU_Utilization | org | [cineca] |
5 | CPU_Utilization | part | [knl, skylake] |
6 | CPU_Utilization | plugin | [confluent_pub, ipmi_pub] |
7 | CPU_Utilization | type | [Other] |
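The result is a plain pandas DataFrame, so the values of a single tag key can be extracted with standard indexing. For example, to list the partitions where 'CPU_Utilization' exists (which, from the table above, should return ['knl', 'skylake']):
# Extract the values of the 'part' tag key
df[df['tag key'] == 'part']['tag values'].values[0]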
All the possible values of a given tag key¶
In this example we will search for all the plugin names currently available in the ExaMon database.
df = sq.DESCRIBE(tag_key = 'plugin') \
.execute()
display(df)
tag values | |
---|---|
0 | ipmi_pub |
1 | confluent_pub |
2 | vertiv_pub |
3 | schneider_pub |
4 | pmu_pub |
5 | logics_pub |
6 | predictive_maintenance_pub |
7 | ganglia_pub |
8 | slurm_pub |
9 | nvidia_pub |
10 | weather_pub |
11 | dstat_pub |
12 | examon-ai_pub |
13 | nagios_pub |
Metrics having a given tag value¶
Assume that we need the list of the metrics having a given (key, value) tag. In this example, we get the list of all metrics inserted into the database by the 'confluent_pub' ExaMon plugin.
df = sq.DESCRIBE(tag_key = 'plugin', tag_value='confluent_pub') \
.execute()
display(df)
name | |
---|---|
0 | 12V |
1 | 1U_Stg_HDD0_Pres |
2 | 1U_Stg_HDD1_Pres |
3 | 1U_Stg_HDD2_Pres |
4 | 1U_Stg_HDD3_Pres |
... | ... |
380 | Vcpu2 |
381 | Voltage_Fault |
382 | XCC_Corrupted |
383 | XCC_SWitchover |
384 | XCC_Switchover |
385 rows × 1 columns
Metrics valid only for Marconi skylake nodes¶
Some metrics are valid (exist) only for a subset of the monitored resources. In this example we will search for the metrics collected by the 'confluent_pub' plugin, for the 'marconi' cluster, and only for the 'skylake' partition. The JOIN command lets you intersect ('inner' join) the results of each DESCRIBE command.
df = sq.DESCRIBE(tag_key = 'plugin', tag_value='confluent_pub') \
.DESCRIBE(tag_key = 'cluster', tag_value='marconi') \
.DESCRIBE(tag_key = 'part', tag_value='skylake') \
.JOIN(how='inner') \
.execute()
display(df)
name | |
---|---|
0 | All_CPUs |
1 | All_DIMMs |
2 | All_PCI_Error |
3 | Ambient_Temp |
4 | Aux_Log |
... | ... |
150 | TPM_TCM_Lock |
151 | TXT_ACM_Module |
152 | XCC_Corrupted |
153 | XCC_SWitchover |
154 | XCC_Switchover |
155 rows × 1 columns
Metrics collected by the 'nagios_pub' plugin¶
df = sq.DESCRIBE(tag_key = 'plugin', tag_value='nagios_pub') \
.execute()
display(df)
name | |
---|---|
0 | hostscheduleddowtimecomments |
1 | plugin_output |
2 | state |
Check the tags available for the 'plugin_output' metric:
df = sq.DESCRIBE(metric='plugin_output') \
.execute()
display(df)
name | tag key | tag values | |
---|---|---|---|
0 | plugin_output | chnl | [data] |
1 | plugin_output | cluster | [galileo, marconi, marconi100] |
2 | plugin_output | description | [EFGW_cluster::status::availability, EFGW_clus... |
3 | plugin_output | host_group | [compute, compute,cincompute, containers, cumu... |
4 | plugin_output | nagiosdrained | [0, 1] |
5 | plugin_output | node | [aggregation-mgt, comlab01, deepops, dgx01, dg... |
6 | plugin_output | org | [cineca] |
7 | plugin_output | plugin | [nagios_pub] |
8 | plugin_output | rack | [201, 202, 205, 206, 207, 208, 209, 210, 211, ... |
9 | plugin_output | slot | [01, 02, 03, 04, 05, 06, 07, 08, 09, 1, 10, 11... |
10 | plugin_output | state | [0, 1, 2, 3] |
11 | plugin_output | state_type | [0, 1] |
The 'description' tag may give some hints about the services monitored by this plugin. Let's check it:
df[df['tag key'] == 'description']['tag values'].values[0]
['EFGW_cluster::status::availability', 'EFGW_cluster::status::criticality', 'EFGW_cluster::status::internal', 'GALILEO_cluster::status::availability', 'GALILEO_cluster::status::criticality', 'GALILEO_cluster::status::internal', 'afs::blocked_conn::status', 'afs::bosserver::status', 'afs::ptserver::status', 'afs::space::status', 'afs::vlserver::status', 'alive::ping', 'backup::afs::status', 'backup::eufus_gw::status', 'backup::local::status', 'backup::masters::status', 'backup::shared::status', 'batchs::JobsH', 'batchs::client', 'batchs::client::serverrespond', 'batchs::client::state', 'batchs::manager', 'batchs::manager::state', 'bmc::events', 'cluster::status::availability', 'cluster::status::criticality', 'cluster::status::internal', 'cluster::status::wattage', 'cluster::us::availability', 'cluster::us::criticality', 'container::check::health', 'container::check::internal', 'container::check::mounts', 'core::total', 'crm::resources::m100', 'crm::status::m100', 'dev::ipmi::events', 'dev::raid::status', 'dev::swc::bntfru', 'dev::swc::bnthealth', 'dev::swc::bnttemp', 'dev::swc::confcheck', 'dev::swc::confcheckself', 'dev::swc::cumulushealth', 'dev::swc::cumulussensors', 'dev::swc::isl', 'dev::swc::isleth', 'dev::swc::mlxhealth', 'dev::swc::mlxsensors', 'file::integrity', 'filesys::dres::mount', 'filesys::eurofusion::mount', 'filesys::local::avail', 'filesys::local::mount', 'filesys::shared::mount', 'firewalld::status', 'galera::status::Integrity', 'galera::status::NodeStatus', 'galera::status::ReplicaStatus', 'globus::gridftp', 'globus::gsissh', 'gss::rg::encl', 'gss::rg::pdisks', 'gss::rg::peer', 'gss::rg::vdisks', 'memory::phys::total', 'monitoring::health', 'net::ib::status', 'net::opa', 'net::opa::edge_director_links_status', 'net::opa::edge_link_err_rate', 'net::opa::edge_link_quality', 'net::opa::edge_status', 'net::opa::pciwidth', 'nfs::rpc::status', 'nvidia::configuration', 'nvidia::memory::replace', 'nvidia::memory::retirement', 'service::MedeA', 'service::cert', 'service::galera', 'service::galera::mysql', 'service::galera:arbiter', 'service::galera:mysql', 'service::ganglia', 'service::nxserver', 'service::nxserver::sessions', 'service::unicore::tsi', 'service::unicore::uftpd', 'ssh::daemon', 'sys::arcldap::status', 'sys::corosync::rings', 'sys::cpus::freq', 'sys::glusterfs::dgx', 'sys::glusterfs::examon', 'sys::glusterfs::home', 'sys::glusterfs::install', 'sys::glusterfs::install8', 'sys::glusterfs::scratch', 'sys::glusterfs::secinv', 'sys::glusterfs::slurm', 'sys::glusterfs::slurmstate', 'sys::glusterfs::status', 'sys::gpfs::status', 'sys::ldap_srv::status', 'sys::orphaned_cgroups::count', 'sys::pacemaker::crm', 'sys::rvitals', 'sys::sssd::events', 'sys::xcatpod::sync', 'unicore::tsi', 'unicore::uftpd', 'vm::virsh::state']
Let's see whether any services are in a 'critical' state (2) and which nodes are affected:
data = sq.SELECT('node','cluster','description','state') \
.FROM('plugin_output') \
.WHERE(plugin='nagios_pub', state='2') \
.TSTART(30, 'minutes') \
.execute()
display(data.df_table.head(10))
data.df_table.shape
(8548, 7)
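Now let's query a numeric time series. In this example we select the 'Sys_Power' metric for the 'marconi' cluster and the 'skylake' partition over the last 30 minutes, averaged over 1-minute intervals: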
data = sq.SELECT('cluster','part','node') \
.FROM('Sys_Power') \
.WHERE(cluster='marconi', part='skylake') \
.TSTART(30, 'minutes') \
.AGGRBY('avg', sampling_value=1, sampling_unit='minutes') \
.execute()
display(data.df_table.head())
data.df_table.shape
(46503, 6)
Check the number of nodes ('node' tag):
display(data.df_table.nunique())
timestamp 88 value 37 name 1 cluster 1 part 1 node 3121 dtype: int64
Time Series Format¶
Reshape the 'df_table' into a time series table: the first column (index) is the timestamp; the remaining columns are the per-node power vectors.
data.to_series(flat_index=True, interp='time', dropna=True, columns=['node'])
display(data.df_ts.head())
node | r129c01s01 | r129c01s02 | r129c01s03 | r129c01s04 | r129c02s01 | r129c02s02 | r129c02s03 | r129c02s04 | r129c03s01 | r129c03s02 | ... | r183c14s03 | r183c14s04 | r183c15s01 | r183c15s02 | r183c15s03 | r183c15s04 | r183c16s01 | r183c16s02 | r183c16s03 | r183c16s04 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||||||||||
2023-09-22 17:49:00.061000+02:00 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 134.996730 | 282.488555 | 262.508175 | ... | 270.001292 | 259.998708 | 319.988375 | 309.997417 | 299.997417 | 329.989667 | 290.0 | 249.998708 | 260.005167 | 329.997417 |
2023-09-22 17:50:00.027000+02:00 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 130.000625 | 265.002187 | 274.998438 | ... | 272.499865 | 257.500135 | 297.501219 | 305.000271 | 295.000271 | 310.001083 | 290.0 | 247.500135 | 269.999458 | 325.000271 |
2023-09-22 17:50:00.031000+02:00 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 130.000292 | 265.001021 | 274.999271 | ... | 272.500031 | 257.499969 | 297.499719 | 304.999938 | 294.999938 | 309.999750 | 290.0 | 247.499969 | 270.000125 | 324.999938 |
2023-09-22 17:51:00.027000+02:00 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 125.001687 | 247.505905 | 287.495782 | ... | 274.999854 | 255.000146 | 275.001313 | 300.000292 | 290.000292 | 290.001167 | 290.0 | 245.000146 | 279.999417 | 320.000292 |
2023-09-22 17:51:00.031000+02:00 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 125.001354 | 247.504739 | 287.496615 | ... | 275.000021 | 254.999979 | 274.999813 | 299.999958 | 289.999958 | 289.999833 | 290.0 | 244.999979 | 280.000083 | 319.999958 |
5 rows × 3121 columns
Skylake partition total power consumption¶
Total average power over the previous 30 minutes:
data.df_ts.mean().sum()
859025.39826763
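Assuming 'Sys_Power' is reported in watts, this corresponds to roughly 859 kW for the whole partition, i.e. about 275 W on average for each of the 3121 nodes found above.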
2) Looking for failures¶
First, look for metrics with a critical status. Using the intersection, search for the metrics collected by the 'confluent_pub' plugin, in the 'skylake' partition, that have reported a 'critical' health state:
df = sq.DESCRIBE(tag_key = 'plugin', tag_value='confluent_pub') \
.DESCRIBE(tag_key = 'health', tag_value='critical') \
.DESCRIBE(tag_key = 'part', tag_value='skylake') \
.JOIN() \
.execute()
display(df)
name | |
---|---|
0 | All_DIMMs |
1 | All_PCI_Error |
2 | Ambient_Temp |
3 | CMOS_Battery |
4 | CPU_1_DTS |
... | ... |
61 | PSU2_Failure |
62 | PSU2_IN_Failure |
63 | Power_Supply_1 |
64 | Power_Supply_2 |
65 | SysBrd_Vol_Fault |
66 rows × 1 columns
For example, let's check the CPU_1_Overtemp metric over the last year to find the affected nodes and the time periods.
# show the tags to filter
df = sq.DESCRIBE(metric='CPU_1_Overtemp') \
.execute()
display(df)
name | tag key | tag values | |
---|---|---|---|
0 | CPU_1_Overtemp | chnl | [data] |
1 | CPU_1_Overtemp | cluster | [galileo, marconi] |
2 | CPU_1_Overtemp | health | [critical, failed, ok, warning] |
3 | CPU_1_Overtemp | node | [r054c02s01, r054c02s02, r054c02s03, r054c02s0... |
4 | CPU_1_Overtemp | org | [cineca] |
5 | CPU_1_Overtemp | part | [knl, skylake] |
6 | CPU_1_Overtemp | plugin | [confluent_pub] |
7 | CPU_1_Overtemp | type | [Temperature] |
# query
data = sq.SELECT('*') \
.FROM('CPU_1_Overtemp') \
.WHERE(part='skylake', health='critical') \
.TSTART(1,'years') \
.execute()
display(data.df_table.head())
timestamp | value | name | chnl | cluster | health | node | org | part | plugin | type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-12-13 14:05:00.032000+01:00 | critical | CPU_1_Overtemp | data | marconi | critical | r135c11s01 | cineca | skylake | confluent_pub | Temperature |
1 | 2023-09-14 11:53:00.038000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | r137c11s03 | cineca | skylake | confluent_pub | Temperature |
2 | 2023-05-26 11:47:00.151000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | r138c02s02 | cineca | skylake | confluent_pub | Temperature |
3 | 2023-05-26 23:47:00.034000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | r138c02s02 | cineca | skylake | confluent_pub | Temperature |
4 | 2023-05-30 13:16:00.034000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | r138c02s02 | cineca | skylake | confluent_pub | Temperature |
Show the first value for each node (i.e. when the anomaly first appeared):
display(data \
.df_table \
.groupby('node') \
.first() \
.sort_values(by=['timestamp'],ascending=False) \
.head())
timestamp | value | name | chnl | cluster | health | org | part | plugin | type | |
---|---|---|---|---|---|---|---|---|---|---|
node | ||||||||||
r137c11s03 | 2023-09-14 11:53:00.038000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r143c04s03 | 2023-06-22 15:27:00.217000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r143c11s01 | 2023-06-13 19:40:00.031000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r138c13s02 | 2023-06-01 15:51:00.037000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r138c02s02 | 2023-05-26 11:47:00.151000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
Show the last value for each node (i.e. when the anomaly was removed/resolved):
display(data \
.df_table \
.groupby('node') \
.last() \
.sort_values(by=['timestamp'],ascending=False) \
.head())
timestamp | value | name | chnl | cluster | health | org | part | plugin | type | |
---|---|---|---|---|---|---|---|---|---|---|
node | ||||||||||
r138c13s02 | 2023-09-17 22:29:00.175000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r137c11s03 | 2023-09-14 11:53:00.038000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r143c02s04 | 2023-09-02 08:09:00.031000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r143c04s03 | 2023-06-22 15:47:00.162000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
r143c11s01 | 2023-06-14 06:02:00.027000+02:00 | critical | CPU_1_Overtemp | data | marconi | critical | cineca | skylake | confluent_pub | Temperature |
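Combining the two, we can also estimate how long each anomaly lasted. A quick sketch, assuming (as displayed above) that the 'timestamp' column of 'df_table' is parsed as datetimes:
# Time elapsed between the first and the last critical sample, per node
g = data.df_table.groupby('node')['timestamp']
display((g.last() - g.first()).sort_values(ascending=False).head())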
For example, node 'r145c10s04' showed a critical status for the CPU1 temperature from 2019-09-23 16:54 to 2019-09-26 04:27. Let's check it by plotting that range plus one hour before and after:
data = sq.SELECT('*') \
.FROM('CPU_1_Temp') \
.WHERE(node='r145c10s04') \
.TSTART('23-09-2019 15:54:00') \
.TSTOP('26-09-2019 05:27:00') \
.execute()
data.to_series(flat_index=True, interp='time', dropna=True, columns=['node']).df_ts.plot(figsize=[15,12])
<Axes: xlabel='timestamp'>
Here we can see values greater than 90 °C for CPU1.
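To confirm this numerically, the peak value can be read from the time series table built by the to_series() call above:
# Maximum CPU1 temperature observed in the plotted window
print(data.df_ts.max().max())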
Job scheduler data¶
Currently, the job scheduler data is collected as per-job records in plain Cassandra tables. The available tables in the database are:
- job_info_galileo: Galileo jobs data
- job_info_marconi: Marconi jobs data
This is a description of the data currently stored (where available) for each executed job:
Table fields | Description |
---|---|
account | charge to specified account |
accrue_time | time job is eligible for running |
admin_comment | administrator's arbitrary comment |
alloc_node | local node and system id making the resource allocation |
alloc_sid | local sid making resource alloc |
array_job_id | job_id of a job array or 0 if N/A |
array_max_tasks | Maximum number of running tasks |
array_task_id | task_id of a job array |
array_task_str | string expression of task IDs in this record |
assoc_id | association id for job |
batch_features | features required for batch script's node |
batch_flag | 1 if batch: queued job with script |
batch_host | name of host running batch script |
billable_tres | billable TRES cache. updated upon resize |
bitflags | Various job flags |
boards_per_node | boards per node required by job |
burst_buffer | burst buffer specifications |
burst_buffer_state | burst buffer state info |
command | command to be executed, built from submitted job's argv and NULL for salloc command |
comment | arbitrary comment |
contiguous | 1 if job requires contiguous nodes |
core_spec | specialized core count |
cores_per_socket | cores per socket required by job |
cpu_freq_gov | cpu frequency governor |
cpu_freq_max | Maximum cpu frequency |
cpu_freq_min | Minimum cpu frequency |
cpus_alloc_layout | map: list of cpu allocated per node |
cpus_allocated | map: number of cpu allocated per node |
cpus_per_task | number of processors required for each task |
cpus_per_tres | semicolon delimited list of TRES=# values |
dependency | synchronize job execution with other jobs |
derived_ec | highest exit code of all job steps |
eligible_time | time job is eligible for running |
end_time | time of termination, actual or expected |
exc_nodes | comma separated list of excluded nodes |
exit_code | exit code for job (status from wait call) |
features | comma separated list of required features |
group_id | group job submitted as |
job_id | job ID |
job_state | state of the job, see enum job_states |
last_sched_eval | last time job was evaluated for scheduling |
licenses | licenses required by the job |
max_cpus | maximum number of cpus usable by job |
max_nodes | maximum number of nodes usable by job |
mem_per_cpu | boolean |
mem_per_node | boolean |
mem_per_tres | semicolon delimited list of TRES=# values |
min_memory_cpu | minimum real memory required per allocated CPU |
min_memory_node | minimum real memory required per node |
name | name of the job |
network | network specification |
nice | requested priority change |
nodes | list of nodes allocated to job |
ntasks_per_board | number of tasks to invoke on each board |
ntasks_per_core | number of tasks to invoke on each core |
ntasks_per_core_str | number of tasks to invoke on each core as string |
ntasks_per_node | number of tasks to invoke on each node |
ntasks_per_socket | number of tasks to invoke on each socket |
ntasks_per_socket_str | number of tasks to invoke on each socket as string |
num_cpus | minimum number of cpus required by job |
num_nodes | minimum number of nodes required by job |
partition | name of assigned partition |
pn_min_cpus | minimum # CPUs per node, default=0 |
pn_min_memory | minimum real memory per node, default=0 |
pn_min_tmp_disk | minimum tmp disk per node, default=0 |
power_flags | power management flags, see SLURM_POWER_FLAGS_ |
pre_sus_time | time job ran prior to last suspend |
preempt_time | preemption signal time |
priority | relative priority of the job, 0=held, 1=required nodes DOWN/DRAINED |
profile | Level of acct_gather_profile {all / none} |
qos | Quality of Service |
reboot | node reboot requested before start |
req_nodes | comma separated list of required nodes |
req_switch | Minimum number of switches |
requeue | enable or disable job requeue option |
resize_time | time of latest size change |
restart_cnt | count of job restarts |
resv_name | reservation name |
run_time | job run time (seconds) |
run_time_str | job run time (seconds) as string |
sched_nodes | list of nodes scheduled to be used for job |
shared | 1 if job can share nodes with other jobs |
show_flags | conveys level of details requested |
sockets_per_board | sockets per board required by job |
sockets_per_node | sockets per node required by job |
start_time | time execution begins, actual or expected |
state_reason | reason job still pending or failed, see slurm.h:enum job_state_reason |
std_err | pathname of job's stderr file |
std_in | pathname of job's stdin file |
std_out | pathname of job's stdout file |
submit_time | time of job submission |
suspend_time | time job last suspended or resumed |
system_comment | slurmctld's arbitrary comment |
threads_per_core | threads per core required by job |
time_limit | maximum run time in minutes or INFINITE |
time_limit_str | maximum run time in minutes or INFINITE as string |
time_min | minimum run time in minutes or INFINITE |
tres_alloc_str | tres used in the job as string |
tres_bind | Task to TRES binding directives |
tres_freq | TRES frequency directives |
tres_per_job | semicolon delimited list of TRES=# values |
tres_per_node | semicolon delimited list of TRES=# values |
tres_per_socket | semicolon delimited list of TRES=# values |
tres_per_task | semicolon delimited list of TRES=# values |
tres_req_str | tres requested in the job as string |
user_id | user the job runs as |
wait4switch | Maximum time to wait for minimum switches |
wckey | wckey for job |
work_dir | pathname of working directory |
Query examples¶
Queries can be executed as usual, but pay attention to the following limitations:
- both TSTART and TSTOP statements must be specified
- dates are currently supported only in string format
- pushdown filters (executed on the datastore) are available only for a subset of table columns:
- job_id
- job_state
- account
- user_id
- node (keys of cpus_alloc_layout table column)
# Ask for all galileo jobs executed between '28-09-2019 08:09:00' and '30-09-2019 08:09:00'
import json
# Setup
sq.jc.JOB_TABLES.add('job_info_galileo')
data = sq.SELECT('*') \
.FROM('job_info_galileo') \
.TSTART('28-09-2019 08:09:00') \
.TSTOP('30-09-2019 08:09:00') \
.execute()
df = pd.DataFrame(json.loads(data))
df.head()
df.shape
# Ask for all galileo jobs executed between '28-09-2019 08:09:00' and '30-09-2019 08:09:00',
# allocated on node "r038c04s03"
data = sq.SELECT('*') \
.FROM('job_info_galileo') \
.WHERE(node='r038c04s03') \
.TSTART('28-09-2019 08:09:00') \
.TSTOP('30-09-2019 08:09:00') \
.execute()
df = pd.DataFrame(json.loads(data))
df.head()
df.shape
# Ask for all galileo jobs executed between '28-09-2019 08:09:00' and '30-09-2019 08:09:00',
# allocated on node "r038c04s03" and job_state = 'FAILED'
data = sq.SELECT('*') \
.FROM('job_info_galileo') \
.WHERE(node='r038c04s03', job_state='FAILED') \
.TSTART('28-09-2019 08:09:00') \
.TSTOP('30-09-2019 08:09:00') \
.execute()
df = pd.DataFrame(json.loads(data))
df.head()
df.shape
# Marconi100 jobs
# Setup for Marconi100
sq.jc.JOB_TABLES.add('job_info_marconi100')
data = sq.SELECT('*') \
.FROM('job_info_marconi100') \
.TSTART('28-09-2020 08:09:00') \
.TSTOP('30-09-2020 08:09:00') \
.execute()
df = pd.read_json(data)
df.head()
df.shape
(11614, 110)
Asynchronous queries¶
For big queries, it can be useful to use the asynchronous mode (available from client v0.4.0):
import time
# One month of data
tstart = '01-04-2021 00:00:00'
tstop = '30-04-2021 00:00:00'
t0 = time.time()
data = sq.SELECT('*') \
.FROM('job_info_marconi100') \
.TSTART(tstart) \
.TSTOP(tstop) \
.execute_async()
print('Elapsed Time: %f seconds' % (time.time() - t0))
df = pd.read_json(data)
df.head()
df.shape
(109608, 110)
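Note that execute_async() returns the same JSON payload as execute(), so the resulting dataframe is built in the same way with pd.read_json().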