A Degraded Service Event is generated when a significant outage occurs in the ECHO Operational system. The purpose of this document is to provide a high-level and detailed summary of the factors that contributed to the outage. Also included in this document are a detailed timeline of events, the NCRs or TTs relating to the issues, and future mitigation steps to prevent further outages of the same type.
Degraded Service Events are available for the following years: 2013, 2012, 2011, 2010, and 2009.
April 5th, 2013 Degraded Service Event
All of the operations services provided by ECHO systems experienced an outage on Friday, 04/05/13, at 17:20. The outage was noted by an external monitoring system. All Operations servers were up and online for the entire period. The systems automatically recovered four (4) minutes after the initial outage alert.
March 8th, 2013 Degraded Service Event
Intermittent outages of Operations were observed from 7:00pm EST onwards due to an issue with NetOps’ primary DNS. The outages were not sustained long enough to trigger the automatic switchover to the secondary DNS, so the switchover was performed manually at around 8:00pm, and full connectivity was restored at 9:07pm.
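As a rough illustration of the kind of check involved in such a switchover, the sketch below probes a primary resolver and falls back to probing a secondary one. It is a minimal sketch only: the hostnames, nameserver addresses, thresholds, and the use of the dnspython package are assumptions, not details of NetOps’ actual configuration.

```python
# Minimal sketch of a resolver health check that could drive a manual or
# scripted switch to the secondary DNS. Hostnames, nameserver addresses,
# and the dnspython dependency are placeholders/assumptions.
import dns.resolver
import dns.exception

PRIMARY_NS = "192.0.2.10"        # hypothetical NetOps primary DNS
SECONDARY_NS = "192.0.2.20"      # hypothetical secondary DNS
PROBE_NAME = "api.echo.nasa.gov" # hypothetical record to resolve

def resolver_healthy(nameserver, name=PROBE_NAME, timeout=2.0):
    """Return True if the given nameserver answers an A query in time."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    resolver.lifetime = timeout
    try:
        resolver.resolve(name, "A")
        return True
    except dns.exception.DNSException:
        return False

if __name__ == "__main__":
    if resolver_healthy(PRIMARY_NS):
        print("primary DNS OK")
    elif resolver_healthy(SECONDARY_NS):
        print("primary DNS failing; secondary answers - consider switching over")
    else:
        print("both resolvers failing")
```

A single failed probe like this would not have triggered an automatic switchover during intermittent failures, which is consistent with why the change had to be made manually in this event.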
March 6th, 2013 Degraded Service Event
Concurrent hits on the catalog-rest query metrics endpoint caused a performance degradation severe enough to lock up all of our catalog-rest nodes in Operations, preventing new queries from being serviced.
February 20th, 2013 Degraded Service Event
An Operational change enforcing the redirection of HTTP to HTTPS for catalog-rest endpoints caused legacy ingests to fail to propagate to catalog-rest. This failure did not result in any provider suspensions, but it remained undetected until 03/05 due to a misdiagnosis of the problem.
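The sketch below illustrates the general failure mode under a simple assumption about the legacy client: a client that posts metadata over plain HTTP and never inspects the response now receives a 301 redirect instead of having its payload accepted, so the ingest silently fails to reach catalog-rest. The host, path, and payload are hypothetical and this is not ECHO’s actual ingest code.

```python
# Minimal sketch of the failure mode, not ECHO's actual ingest client.
# A legacy client using a plain HTTP connection receives the new 301
# redirect instead of having its data accepted; if it never checks the
# status, the ingest silently fails to propagate.
import http.client

HOST = "api.echo.nasa.gov"          # hypothetical catalog-rest host
PATH = "/catalog-rest/granules"     # hypothetical ingest endpoint
PAYLOAD = "<Granule>...</Granule>"  # placeholder metadata document

conn = http.client.HTTPConnection(HOST, 80, timeout=10)
conn.request("POST", PATH, body=PAYLOAD,
             headers={"Content-Type": "application/xml"})
resp = conn.getresponse()

# After the HTTP-to-HTTPS change, this returns 301 with a Location header
# pointing at the HTTPS endpoint; the body posted over HTTP is discarded.
print(resp.status, resp.getheader("Location"))
conn.close()
```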
January 24th, 2013 Degraded Service Event
The operations elastic search node on dbrac2node3 failed due to a lack of available memory on 01/25/13 at 1:56pm EST. The memory issues were due to problems with our facet endpoint implementation. At this point, nodes 1 and 2 attempted to replicate the index to compensate for the lost node. At 2:46pm node 1 ran into memory issues. We believe that the size of the elastic search index is now large enough that it cannot be housed on only two nodes. Consequently, a decision was made to restart the entire cluster at 2:55pm, and Operations came back online around 3:00pm. The resulting re-balancing of the elastic search cluster completed at around 3:30pm.
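A minimal way to watch for this kind of node loss and re-balancing is to poll Elasticsearch's standard _cluster/health endpoint. The sketch below does so under assumed values for the cluster URL, expected node count, and polling interval; it is illustrative rather than the monitoring ECHO actually runs.

```python
# Minimal sketch of polling Elasticsearch's standard _cluster/health
# endpoint to spot a lost node or in-progress shard re-balancing. The
# cluster URL, node count, and interval are placeholders/assumptions.
import json
import time
import urllib.request

CLUSTER_URL = "http://dbrac2node1:9200"  # hypothetical cluster entry point
EXPECTED_NODES = 3                       # assumed size of the Operations cluster

def cluster_health(base_url=CLUSTER_URL):
    """Fetch the cluster health document as a dict."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    while True:
        health = cluster_health()
        if health["number_of_nodes"] < EXPECTED_NODES or health["status"] != "green":
            print("degraded: status=%s nodes=%d relocating=%d unassigned=%d" % (
                health["status"], health["number_of_nodes"],
                health["relocating_shards"], health["unassigned_shards"]))
        time.sleep(60)
```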
January 8th, 2013 Degraded Service Event
A cascading failure of elastic search (our means of providing search functionality within ECHO) in operations caused all search and indexing capabilities within operations to fail; elements not dependent on elastic search continued to function. For example, Legacy ingest, Reverb dataset search, and Reverb login were unaffected.
August 5th, 2012 Degraded Service Event
ECHO experienced an unexpected outage due to a concurrency issue in Ruby on Rails.
July 17th, 2012 Degraded Service Event
ECHO experienced an unexpected outage due to a concurrency issue in Ruby on Rails.
June 8th, 2012 Degraded Service Event
ECHO experienced a problem with its search cluster in the Operational Environment. Our search platform, ElasticSearch, had become unresponsive on all three nodes. After some initial time spent troubleshooting, the decision was made to restart ElasticSearch. Attempts to restart ElasticSearch failed due to a Linux system problem that forced the Java process to fail with a segmentation fault, resulting in intra-node communication problems. We rebooted the Linux hosts serving ElasticSearch and our Oracle Database, allowing ElasticSearch to start.
August 10, 2010 Degraded Service Event
During a planned GSFC-managed maintenance activity, a redundant Power Distribution Unit (PDU) was taken offline. After taking the redundant PDU offline, the remaining PDU was unable to handle the load, and the resulting power oversubscription caused critical GSFC-managed network components to lose power. These network components were upstream from the ECHO network, causing a loss of external connectivity to all ECHO resources. Once power was restored, the necessary network components resumed operation and ECHO became available to external users.
June 22, 2010 Degraded Service Event
Access to all externally visible ECHO systems was lost as a result of a configuration change made to the network load balancers. A new host entry was added to the ECHO load balancers to support ECHO development activities. An initially unnoticed error existed in one of the critical configuration files on one of the redundant load balancers, causing a cascading set of issues that forced a loss of external accessibility. The ECHO system administrators initially resolved the issue to restore system access. However, the initial resolution ultimately resulted in an increasingly degraded level of availability. When the ECHO team noted this, action was taken to apply a permanent fix and restore access to the ECHO systems.
May 26, 2010 Degraded Service Event
The ECHO 10.23 release promoted the new ACL and Group Management capabilities into an active state whereby access to provider objects and catalog items is affected by the new provider-configured ACLs. After the deployment of 10.23 into the Operational system, it was immediately noticed that query performance was degraded for some providers, specifically LPDAAC_ECS, NSIDC_ECS, and LARC. During this initial period of poor performance, the average query time for a granule query jumped from 12s to 212s.
The ECHO development and database teams correlated the poor query performance to long-running database searches. After investigating the root cause, the ECHO development, database, and operations teams worked to introduce three minor patches to the Operational system, returning the average query time to its original performance level. No Operational downtime was required during this work due to the recent high availability efforts.
Mar 16, 2010 Degraded Service Event
All instances of ECHO and WIST experienced an outage on the evening of Tuesday 3/16/10 starting at 4:45pm EST. Analysis has identified that a significant network event occurred which caused a loss of internal and external network connectivity. ECHO Operations was notified of the system issues by its automated monitoring tool and immediately contacted ECHO System Administrators. ECHO System Administrators were able to restore connectivity to all systems by 6:00pm EST. At that point, the Partner Test and Testbed systems were fully available. The Operational system remained unavailable due to an issue with connectivity between the Oracle RAC nodes. This problem was identified at 7:00pm EST and the ECHO System Administration and Database teams worked to identify the root cause of the problem. By 9:00pm EST the decision was made to restart each RAC node and restore them to proper working order. This took approximately 1 hour, completing at 10:00pm EST. Subsequent to this activity, the Operational kernels were restarted, restoring ECHO and WIST search and order capabilities. Operational Ingest was restarted at 10:45pm EST. No data loss or corruption occurred as a result of this outage.
Oct 24, 2009 Degraded Service Event
All instances of WIST experienced a 90 minute outage on the morning of Saturday 10/24/09, from 9:00am EST to 10:30am EST. There were no other system impacts. The outage was promptly detected and reported to the ECHO Operations, DBA, and SA teams. The outage was self-correcting, and its cause has not been determined after investigation. There does not appear to have been any unexpected activity on the WIST host during that time, but it is noted that routine processes took longer than normal to complete. ECHO has concluded that these long-running processes were a symptom of a system issue rather than the actual cause.
Sep 17, 2009 Degraded Service Event
On Thursday, September 17th, at 9:25am (EDT), ECHO experienced a power outage that affected operational ECHO services. This outage was the result of two separate events that combined:
1. GSFC Facilities Management Department (FMD) had scheduled a repair to a GSFC power system component. The ECHO team did not receive notification of this planned outage and was not prepared for it.
2. In conjunction with this power outage, a component of ECHO's redundant power system had failed on 9/11/09. Although a service appointment from the vendor had been scheduled for Wednesday 9/16, other administrative issues prevented the technician from accessing the site. The technician returned on Thursday 9/17 and was repairing the failed component at the time of the outage. Had this repair been performed on Wednesday, there would have been no ECHO outage. The combination of these factors resulted in a power loss to a critical storage component for the Operational database.
Aug 26, 2009 Degraded Service Event
On August 26, at approximately 2 pm, ECHO experienced slow responses while submitting orders to the LPDAAC order fulfillment service. After the configured internal timeout, ECHO correctly moved orders to the retry queue and processed subsequent orders. Due to the large number of orders being serviced for LPDAAC, ECHO was in a state where non-LPDAAC orders were not being serviced in a timely manner. A similar situation occurred around June 21st - 24th with LPDAAC order fulfillment connectivity. An NCR was written at that time, procedures were put in place to detect the situation, and NCR 11004606 was written to track an ECHO enhancement to increase order dispatching fault tolerance.
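The enhancement tracked by NCR 11004606 amounts to isolating a slow provider so its backlog cannot starve other providers' orders. The sketch below shows one way to structure that with per-provider queues and a retry queue; the provider names, timeout value, and submit_order() helper are hypothetical and are not ECHO's actual implementation.

```python
# Minimal sketch of per-provider order dispatch: a slow or unreachable
# fulfillment service (LPDAAC here) delays only its own queue. Names,
# the timeout, and submit_order() are hypothetical placeholders.
import queue
import threading

SUBMIT_TIMEOUT = 30  # seconds; placeholder for ECHO's configured timeout

def submit_order(provider, order):
    """Placeholder for the call to a provider's order fulfillment service."""
    raise NotImplementedError

def dispatcher(provider, orders, retries):
    """Drain one provider's queue; failed or timed-out orders go to its retry queue."""
    while True:
        order = orders.get()
        try:
            submit_order(provider, order)   # would enforce SUBMIT_TIMEOUT
        except Exception:
            retries.put(order)              # retry later instead of blocking
        finally:
            orders.task_done()

# One dispatcher thread and one retry queue per provider, so a backlog of
# LPDAAC orders cannot starve orders destined for other providers.
providers = ["LPDAAC", "NSIDC", "LARC"]
order_queues = {p: queue.Queue() for p in providers}
retry_queues = {p: queue.Queue() for p in providers}
for p in providers:
    threading.Thread(target=dispatcher,
                     args=(p, order_queues[p], retry_queues[p]),
                     daemon=True).start()
```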