In recent years, system operations have become ever more complicated. In this interview, we talk about two functions of a product called "Job Management Partner 1": an operation-automation function called "Job Management Partner 1/Automatic Operation" and a service-level monitoring function called "Job Management Partner 1/IT Service Level Management." Both were born from developers' wishes such as "I want to support operations managers who struggle 24 hours a day, 365 days a year to keep systems running stably" and "I want to satisfy the needs voiced at operations-management sites."
IIZUKA: To take a familiar example, operations management involves keeping sites such as online shops running stably 24 hours a day, 365 days a year. Within a company, the main work of operations management is to run important systems for sales management, material ordering, and so on according to plan and without problems. Operations-management work can be split into two kinds: routine tasks and non-routine tasks. Routine tasks include publication of new services, system backup and operational monitoring, and monitoring of log files. Non-routine tasks include handling unforeseeable circumstances such as service stoppages due to hardware breakdown or application malfunction. With Job Management Partner 1 Version 10, the products "Job Management Partner 1/IT Service Level Management" and "Job Management Partner 1/Automatic Operation" have been released. The operation-automation function of the latter is mainly used for routine tasks; it enables operations executed on a daily basis to be run rapidly and reliably. Meanwhile, the service-level-monitoring function of Job Management Partner 1/IT Service Level Management is mainly used for non-routine tasks. It helps to pre-empt faults or, if a fault does occur, to fix it swiftly.
IIZUKA: Through interviews at operations-management sites, and through Ms. WADA actually working alongside operations managers, we listened to a variety of on-site opinions. In light of these opinions, we understood that operations management should be performed effortlessly, without mistakes and without causing faults, and that operations-management functions should be able to promptly provide solutions even to the most unlikely problems. We took developing functions in response to those on-site voices as our mission.
WADA: Yes, I did. Up until now, even when we developed functions that we considered useful, they tended not to be used in actual system operations. Given those circumstances, to find out what was happening and what was needed at operations-management sites, I worked alongside operations managers for two years, performing operational design and actual operations. Until then, we had thought that operations managers would be grateful if we gave them amazing functions that were hitherto unavailable. On the contrary, in the field I found that a function for simply confirming what is currently happening, day to day, was needed far more than an "amazing" function for handling errors that rarely occur. Operations-management products generally have many functions, so we heard opinions like "I have to set up various things and spend a lot of time installing them" and "It is a problem if only skilled people can use it." Accordingly, we made sure that Job Management Partner 1/IT Service Level Management can be used by unskilled people to simply confirm the status of operations. The function we developed for this is called "service-level monitoring."
MASUDA: Service level expresses how soundly a system is operating. In particular, it expresses how comfortably users can use the system and whether operations like backup are being done without trouble. The service-level-monitoring function of Job Management Partner 1/IT Service Level Management measures service level based on performance. It measures indicators such as how continuously and stably the system is operating and the response time of services, and on the basis of those indicators it judges whether the service level of the system is "OK" (good) or "not OK" (no good). For example, if the aggregate result indicates that a service is responding within three seconds for 99% of accesses, it evaluates whether a response time of three seconds is proper for that service.
Figure 1: Service-level monitoring function
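One possible reading of the "OK" / "not OK" judgment described above can be sketched in a few lines of Python. This is only an illustration, not the product's actual logic: the names `ServiceLevelTarget` and `judge_service_level` are hypothetical, and the target of "99% of accesses within three seconds" is borrowed from the example in the interview.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevelTarget:
    """Hypothetical target: a share of accesses must respond within a time limit."""
    max_response_s: float = 3.0   # "within three seconds"
    required_ratio: float = 0.99  # "for 99% of accesses"


def judge_service_level(response_times_s, target):
    """Return (achieved_ratio, "OK" or "not OK") for a list of measured response times."""
    if not response_times_s:
        raise ValueError("no measurements to judge")
    within = sum(1 for t in response_times_s if t <= target.max_response_s)
    ratio = within / len(response_times_s)
    return ratio, ("OK" if ratio >= target.required_ratio else "not OK")


# 200 accesses, of which 3 took longer than three seconds.
samples = [1.2] * 197 + [4.5, 5.0, 6.1]
ratio, verdict = judge_service_level(samples, ServiceLevelTarget())
print(f"{ratio:.1%} of accesses within the limit: {verdict}")
```

With three slow accesses out of 200, the achieved ratio falls just short of the assumed 99% target, so the sketch reports "not OK".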
WADA: The function monitors and rates whether the service level has moved away from its "steady" status. For example, it determines whether response time has become longer even though the access count has not increased, and whether resource utilization is "steady" or not. This steady status is called the "baseline." The service-level monitoring function informs operations managers when it detects that the current status exceeds the baseline. The baseline threshold is set automatically depending on the situation; for example, the response-time threshold is set to about two seconds when 100 people are accessing a service and to about three seconds when 500 people are accessing it. Setting and monitoring the threshold automatically in this manner is the baseline-monitoring function of Job Management Partner 1/IT Service Level Management.
Figure 2: Overview of baseline monitoring
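The idea of a threshold that adapts to the current load can be illustrated with a toy model. The interview does not say how the product learns its baseline, so everything here is an assumption: the `BaselineMonitor` class, the bucketing of access counts in steps of 100, and the rule "mean plus three standard deviations" are all made up for the sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


class BaselineMonitor:
    """Toy baseline: per load bucket, threshold = mean + k * standard deviation."""

    def __init__(self, bucket_size=100, k=3.0):
        self.bucket_size = bucket_size  # assumed width of an access-count bucket
        self.k = k                      # assumed tolerance around the steady state
        self.history = defaultdict(list)

    def _bucket(self, access_count):
        return access_count // self.bucket_size

    def learn(self, access_count, response_s):
        """Record one observation taken while the system was in its steady state."""
        self.history[self._bucket(access_count)].append(response_s)

    def threshold(self, access_count):
        """Threshold for the current load, or None if that load was never observed."""
        values = self.history.get(self._bucket(access_count))
        if not values:
            return None
        return mean(values) + self.k * pstdev(values)

    def exceeds_baseline(self, access_count, response_s):
        limit = self.threshold(access_count)
        return limit is not None and response_s > limit


monitor = BaselineMonitor()
for rt in (1.6, 1.8, 1.7, 1.9, 1.8):
    monitor.learn(100, rt)   # steady state with about 100 users
for rt in (2.6, 2.8, 2.7, 2.9, 2.8):
    monitor.learn(500, rt)   # steady state with about 500 users

print(monitor.exceeds_baseline(100, 2.8))  # 2.8 s under light load
print(monitor.exceeds_baseline(500, 2.8))  # 2.8 s under heavy load
```

With this sample data the learned thresholds come out at roughly two seconds for 100 users and roughly three seconds for 500 users, so the same 2.8-second response is flagged under light load but accepted under heavy load.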
MASUDA: Baseline monitoring resembles the way doctors view x-ray pictures. A doctor keeps a mental image of a healthy person's x-ray in his or her head, and by comparing that mental image with the x-ray of a patient to be diagnosed, he or she can discern the diseased parts in the image. The baseline is like the healthy person's x-ray image. Job Management Partner 1/IT Service Level Management automatically learns the correct operation of a system, constantly compares the present operation of that system with the learnt correct operation, and informs an operations manager where there are problems.
IIZUKA: Even if an operations manager isn't a specialist like a doctor, he or she can see where the operation of a system differs from the "steady" status. If the threshold value is set too low, many warnings are raised. And if the threshold value is set too high, most performance problems go undetected. In other words, setting the threshold used to require knowledge and experience. The baseline-monitoring function now plays the part of that experienced person.
MASUDA: With Job Management Partner 1/IT Service Level Management, problems are promptly solved by so-called "drill down." Drill down, which aims to identify the cause and location of a malfunction, is an operation whereby the components that make up a system are investigated in depth, step by step, so that the OS, server, and middleware are systematically checked in turn to isolate the problem. Conventionally, a person with some skill could solve a problem, but only after taking some time. When a problem occurs, however, it is important to determine its location easily and quickly. To satisfy that requirement, we prepared a "drill down" screen, which displays system components in a tree format.
Figure 3: Drill down screen
WADA: The "drill down" screen of Job Management Partner 1/IT Service Level Management drills down by relying on so-called "marks." A mark indicates that a lowest-level result has violated the baseline, and it is propagated to the upper levels. By following the marks, it is possible to drill down and identify the hosts that make up a service, the monitoring items of each host, and so on. Although there could be more than ten hosts per service and one hundred monitoring items per host, adding up to a huge number of items, simply clicking along the marks leads to the location of the problem.
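The way a mark propagates upward and then guides the drill-down can be sketched as a small tree. This is a minimal illustration under assumed names (`Node`, `drill_down`, and the host and item names are all hypothetical), not a description of how the product stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    violates_baseline: bool = False  # set on leaf monitoring items only

    @property
    def marked(self):
        """A node carries a mark if it, or anything beneath it, violates the baseline."""
        return self.violates_baseline or any(c.marked for c in self.children)


def drill_down(node, path=()):
    """Yield the path to every violating leaf, descending only through marked nodes."""
    if not node.marked:
        return
    path = path + (node.name,)
    if not node.children:
        yield " > ".join(path)
    for child in node.children:
        yield from drill_down(child, path)


service = Node("online-shop", [
    Node("web01", [Node("cpu"), Node("memory")]),
    Node("db01", [Node("cpu", violates_baseline=True), Node("disk-io")]),
])
print(list(drill_down(service)))
```

Because unmarked branches such as `web01` are skipped entirely, the number of nodes visited stays small even when a service has many hosts and monitoring items.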
WADA: Yes, we configured the screen so that graphs representing the status of various items (like services and systems) can be arranged on the same time axis. The key point here is that the same time axis is used. In this manner, for example, it can be seen at a glance that when the mean response time becomes longer than usual, the item most affected is CPU usage. Conventionally at operations-management sites, data was exported from performance-monitoring products, imported into spreadsheet software, processed and graphed there, and the operations manager then arranged the graphs manually. This procedure has now been reduced to a single click of a mouse button. Because the problem can be isolated quickly and efficiently, when a fault occurs or a sign of one is noticed, it can be handled calmly.
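Putting metrics on one time axis makes it possible to ask which of them looks most unusual at the moment the response time degrades. The following sketch does that comparison numerically instead of visually. The scoring rule (distance from the rest of the series, in standard deviations) and the names `deviation_at` and `rank_metrics` are assumptions made for illustration; the product itself presents this as aligned graphs.

```python
from statistics import mean, pstdev


def deviation_at(values, index):
    """How many standard deviations values[index] lies from the rest of the series."""
    others = values[:index] + values[index + 1:]
    spread = pstdev(others)
    if spread == 0:
        # A flat series that suddenly moves is treated as maximally unusual.
        return 0.0 if values[index] == mean(others) else float("inf")
    return (values[index] - mean(others)) / spread


def rank_metrics(series_by_metric, index):
    """Sort metric names by how unusual each one is at the given time index."""
    scores = {name: abs(deviation_at(vals, index))
              for name, vals in series_by_metric.items()}
    return sorted(scores, key=scores.get, reverse=True)


# All series share the same six sampling times; index 4 is the slow moment.
timeline = {
    "cpu_usage_pct": [30, 32, 31, 29, 95, 30],
    "memory_pct":    [60, 61, 60, 62, 63, 61],
    "disk_io_mb_s":  [12, 11, 13, 12, 12, 11],
}
print(rank_metrics(timeline, 4))
```

In this made-up data CPU usage stands out far more than memory or disk I/O at the slow moment, which is the kind of relationship the aligned graphs are meant to reveal.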
IIZUKA: Mechanisms for automating business procedures themselves have existed for some time. For example, it has long been possible to automate business procedures like payroll accounting and materials ordering. However, operations-management work such as backup and publication of new systems has, until now, been done manually by a skilled person or by preparing scripts. Naturally, it has been a serious problem that if that person leaves the site (because of a job transfer or some other reason), operations stall. In response to calls to eliminate such dependence on individuals as much as possible and to enable anyone acquainted with the basic work to perform operations management, an operations-management technology called "Run Book Automation" (RBA) has become popular recently. The operation-automation function of Job Management Partner 1/Automatic Operation has adopted this RBA technology.
IIZUKA: Conventionally, with manual operation, operations managers consult work procedures written in documents and then either operate screens by hand or execute scripts that describe the operations. In contrast, with RBA, the order of operations is set by arranging "icons" (components) on the screen. In other words, automation is achieved by choosing components that express operations or flows, arranging them on the screen, setting their parameters and applications, and running them. Moreover, with RBA, it is clear during execution which routes have been run. Keeping an instruction in a manual such as "If an abnormal end takes this path, this cause can be considered" therefore makes troubleshooting smoother. The need to use RBA products to make operations management easier has grown.
Figure 4: Example configuration by RBA product
(Job Management Partner 1/Automatic Operation)
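The RBA idea of arranging components into a flow and then seeing which route was actually run can be mimicked in plain Python. This is a rough sketch only: in the product the components are icons arranged on a screen, whereas here each component is a hypothetical function (`stop_service`, `backup`, and so on) that returns the name of the next step.

```python
def run_flow(steps, start, context):
    """Run named steps until one returns None; return the route that was taken.

    `steps` maps a step name to a function that receives the shared context
    and returns the name of the next step (or None to finish).
    """
    route, current = [], start
    while current is not None:
        if current in route:
            raise RuntimeError(f"loop detected at step {current!r}")
        route.append(current)
        current = steps[current](context)
    return route


def stop_service(ctx):
    ctx["service_running"] = False
    return "backup"


def backup(ctx):
    # Assumed rule for the sketch: the backup needs at least 10 GB of free disk.
    ctx["backup_ok"] = ctx.get("disk_free_gb", 0) >= 10
    return "start_service" if ctx["backup_ok"] else "notify_failure"


def notify_failure(ctx):
    ctx["notified"] = True
    return "start_service"


def start_service(ctx):
    ctx["service_running"] = True
    return None


steps = {
    "stop_service": stop_service,
    "backup": backup,
    "notify_failure": notify_failure,
    "start_service": start_service,
}
print(run_flow(steps, "stop_service", {"disk_free_gb": 4}))
```

The returned route plays the role of the highlighted path on the RBA screen: if `notify_failure` appears in it, the operator knows the backup branch failed and can look up the likely cause for that path in the manual.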
IIZUKA: At first, we thought that it would be best to completely automate operations. But on listening to voices from the field, we were told complete automation would be a bit "scary." In other words, it was feared that if points that are usually confirmed by people were completely entrusted to the system, then in the unlikely event that the system malfunctioned, it might continue to run uncontrollably in that state. Accordingly, we understood that operations managers want to verify important things themselves and that they feel uneasy if they can't do so. With that in mind, we provided a component in Job Management Partner 1/Automatic Operation that stops operations to ask the question: "This result has been output; what do you want to do?" After first trying operations with this confirmation component, if operations managers find that a point does not need to be confirmed, it can then be automated. In this way, operations can be automated step by step while preserving the operations managers' peace of mind.
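The stepwise approach described here, confirming by hand at first and removing the confirmation once it is trusted, can be sketched as follows. The function names and the `ask` callback are hypothetical; in a real console `ask` might wrap a prompt to the operator, but here it is a simple stand-in so the example runs unattended.

```python
def confirm_step(result, ask, needs_confirmation=True):
    """Pause on `result` and let the operator decide, unless confirmation is switched off.

    `ask` is any callable that shows the message and returns True to continue.
    """
    if not needs_confirmation:
        return True
    return bool(ask(f"This result has been output: {result!r}. Continue?"))


def run_with_confirmation(tasks, ask, confirm=frozenset()):
    """Run (name, function) tasks in order; stop if the operator rejects a result."""
    done = []
    for name, task in tasks:
        result = task()
        done.append(name)
        if not confirm_step(result, ask, needs_confirmation=name in confirm):
            break
    return done


tasks = [
    ("backup", lambda: "backup finished, 0 errors"),
    ("publish", lambda: "new release deployed"),
]

# Early on: the operator reviews the backup result and, in this run, declines to go on.
print(run_with_confirmation(tasks, ask=lambda msg: False, confirm={"backup"}))

# Later: the backup step is trusted, so the same flow runs without stopping.
print(run_with_confirmation(tasks, ask=None))
```

Keeping the set of confirmed steps as a parameter means the flow itself does not change as trust grows; only the list of points that still need a human decision shrinks.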
IIZUKA: After joining Hitachi, Mr. MASUDA and I researched effective management technology using operations-management software for two or three years, while participating in a national project on "business-grid computing." As part of this project, the first topic of our research was "autonomous system operation." This is a technology that provides functions like automatically switching processing to another server if the current server malfunctions, and automatically distributing the processing load by adding extra servers if accesses surge. From now on, we are thinking of building a mechanism that can perform these functions easily.
MASUDA: The whole industry concerned with operations-management technology is aiming at the ultimate goal of autonomous system operation. With that goal in mind, as part of the research department working on operations management, we too are developing technologies one after another. And we want to push our research and development ahead from various viewpoints.
WADA: That's right. When I was working at an operations-management site, I experienced the mental pressure involved in operations management. That pressure comes from working to keep customers' systems running safely 24 hours a day, 365 days a year. So that customers can feel safe, operations managers have to be ready to make calm judgments when abnormal situations arise. From that experience, I have come to understand the feelings of those involved in operations management. On the basis of this experience, we developed Job Management Partner 1/IT Service Level Management while discussing with the design department how new products should be configured.
From now on, while continuing to listen to voices from operations-management sites, and without ever taking the wrong course of action, we must aim to provide hit products that are accepted by the market. In other words, I think we must continue to provide products that don't simply offer functions but actually meet the needs of people working on-site.
(Publication: July 18, 2013)