Application of an Annealing/Condensing Algorithm to Abstract Data

the adaptation of numerical mathematical techniques for physical modeling to the development of a model for abstract data, and subsequent modeling thereof

"an idle mind is the devil's playground" -- anon
"Holy singularity, Batman! An UNLICENSED mathematician!" "Yes! And the 'lingering essence of a departed quantity' isn't the only proof of a higher power! Bwahaha!"
a remarkably pleasant experience with Perl and Tk; recycling in action; dinosaurs roar; people conspire; activities of a nebulous nature are graphically characterized as visibly nebular


An Article written for no particular reason by Fred Morris of Seattle Washington in July 2003 based on his Recent Experiences with the Subjects Declared

copyright (c) 2003 by Fred Morris
6739 3rd NW Seattle Washington USA 98117
m3047@inwa.net 206.297.6344
may not be reprinted, copied or sold without the permission of the author, except:
may be cited, excerpted or quoted for academic and scholarly purposes in accordance with accepted academic practices without prior permission


Abstract

A data model for relationships between points is defined. Two sets of abstract data:

are described in the context of this data model.

An set of equations describing attraction and repulsion between the points is described. An iterative simulation based on the equations is then performed. Focal points in the datasets become apparent.

Utility is discussed as is the experience with coding the simulator using Perl and Tk. The actual Perl code is available on request to the author.

Have you ever tried to get a handle on a large amount of meaningless data by noting each data point on a scrap of paper, and then moving the scraps into piles? It's a lot less work if you let the computer do the heavy lifting.

Motivation and Objectives

Iterative techniques are regularly used for simulation and modeling in the physical sciences, including chemistry, astronomy and physics. Well-characterized and provable algorithms exist. At a time when computational power was expensive, relatively speaking, the efficiency of the algorithm was important. Also, the physical sciences are, by and large, exact sciences.

Abstract data sets, meaning domains of data points without correlation to the physical world, are inexact (in the sense that 1/0 is inexact). Oftentimes the expected results are inexact as well: simply providing an intuitively accessible visualization of the dataset may be sufficient. I realize now that I had an incompletely formed notion that simply watching the annealing/condensation process might provide insight, and this was borne out.

It's also important that the algorithm be robust, because little is known about abstract datasets before the fact; and just as computational power was once expensive, human (brain) power still is: reworking an algorithm or problem as a theoretical understanding is developed might be the scientific way to go about it, but a lot of the applicability of this class of tool is to nonreproducible or highly variant data. Furthermore and as previously noted, simply watching the annealing process may provide the desired insight (or at least part of it): in an "efficient" and therefore highly tailored algorithm, a lot of the abstract knowledge about the dataset has been removed from the simulation and embodied in the algorithm.

Prior Art

As noted, these sorts of simulations are common in the physical sciences, in spite of the presence of theoretically defensible algorithms which might provide much faster convergence to a solution than what is employed in the simulation. Even the physics of 3D gaming engines qualifies in some respects. One oft-cited rationale for the selection of sub-optimal algorithms is stability; additionally, a suboptimal algorithm may be tractable by particular nuances of the hardware on which the computations are performed: it may be quicker to perform more "cheap" computations.

It's interesting to note that man has always willed a correlation between ideas and the physical world: astrology, saying "I see" when we think we understand what is being said to us, "color" and "charm" in subatomic physics. The list is endless and quite poetic.

So it's not surprising that someone would apply iterative simulation techniques to abstract data. I first saw it done in the early 1980s to correlate color patterns in high-altitude photographs to types of vegetation. Color signatures (variegated) were noted for known types of vegetation, and then this information was iterated until meaningful clusters (variegated) with a high correlation were formed; of course they didn't have Tk, they just got big piles of paper (at the end of the day) and if the clusters weren't tight enough then they submitted the data to a second run (and maybe tweaked either the data or the algorithm). When all was said and done, the results were quite good and they made some really nice (and accurate) vegetation maps. These days they'd probably apply a neural net to the problem, and that might be a better approach (although not necessarily computationally less expensive).

Results First

The algorithm turns out to be quite robust for reasonably conditioned data. It tends to go through several phases:

  1. Big Bang. Everything comes together and then flies apart. I didn't note anything particularly meaningful in this phase.
  2. Oscillation. "Orbits" form in spite of the massless nature of the points. Clusters tend to bounce together and then apart. This sort of behavior gives an early indication of the strength of attraction of clusters of points.
  3. Entropy. Things slowly approach a steady state.

Two examples follow. The algorithm was run without changes on the two datasets.

Contact Correlations for Joe Smith (fictional)

I did run this on my own contacts/events data from the same database. There is much more data, but the results were similar overall. It didn't tell me anything that I didn't already know, but it did allow me to visualize it; it would be useful to illustrate that knowledge to someone else.

The story is: Joe Smith and Paul Jones both work for Acme Corporation. They are both involved in a variety of civic activities.

 

initial state

big bang

oscillation

entropy

Extracts From The Intrusion Log

This is actual data from a 7 day period, without the chronic activity on ports 137 and 445. I suppose I should make the disclaimer that I am not asserting that these people are up to no good; still, there's no good reason for them to be doing what they're doing.

There is a lot more data in this set. The oscillation phase is more pronounced. The plotting mechanism makes a half-hearted attempt to suppress display of legends which overlap. It's interesting to watch them blink on and off. The clusters are pronounced; some clusters contain the same ip address, and others contain multiple addresses: clusters form around ip addresses, types of events, and points in time.

The nebula in the middle is interesting. I hadn't spotted this guy from my regular perusal of the logs: he comes back every few days and tries a couple of oddball ports. Although time is on the x axis, it is not precise: the nebula still forms, although the activity is continuous over the entire period.

 

 

initial state

big bang

oscillation

entropy

 

Adaptation of Datasets

Adapting the datasets proved to be the greatest creative challenge: without knowing before the fact, describe the important relationship(s) between seemingly unrelated datapoints. Well, we all have ideas and notions, and mine may not be the best; but they worked and, in both cases, on the first try.

Appropriate choices of functions for calculating attraction and repulsion between points is sufficient to produce self-organizing behavior, but this still leaves several attributes which can be tuned to provide additional informational content: "weight" which becomes size, and positioning on the X and Y axes.

The Massless Nature of Ideas

Attraction and repulsion as described in this section are demonstrably sufficient to allow self-organizing behavior to occur.

repulsion

One of the challenges in adapting a physical model to an abstract problem is that when we say ideas have weight, momentum, etc., the analogy only goes so far. Indeed, there's no compelling reason or need to model the "particles" as having weight, momentum, etc.: they can be massless, a simple mutual repulsion demonstrably serves to provide sufficient counterbalancing (dis)equilibrium for self-organizing behavior to occur. (Prigogine et al.)

Therefore, a fundamental presupposition is that points repel each other inversely with distance. Second, although assumptions are made that the [x,y] variance of the data is relatively small, there is no assumption that it is bounded (although boundaries could have been computed). Given that the computational cost of one mathematical division is the minimum effort required to create this effect given the constraints, a function was chosen which uses one mathematical division.

attraction

Points are deemed attracted to each other based on an identifiable and quantifiable relationship. The amount of attraction may vary depending on the "strength" of this relationship; the strength of this attraction is constant over distance, as there is no compelling rationale for it to be otherwise, the previously described repulsion which varies with distance is demonstrated to be sufficient for self-organizing behavior to occur.

The relationships between points in the datasets may be imagined as a multidimensional net, and attraction is largely (but need not be solely) a function of connectedness. There is a presupposition that each point need not be connected to each other point: points can be indifferent to each other. This is important computationally, as while computational effort is still on the order of N2, the effect of points which are indifferent to each other can be ignored and this substantially reduces the scalar multiplier of effort. Indeed it is computationally desirable that the connectedness between points should be minimized.

Weight == Visual Prominence

The size at which a point is drawn can be used to indicate the "weight" or importance of a point. The intuitive definition of this attribute is dependent on the nature of the dataset, and with large numbers of points it may become less useful as the clustering itself may manifest the desired visual cue.

Attraction Still Leaves [X,Y]

Borrowing from biochemistry now, my original conception of this was that it's nice to have clusters organize themselves but that it would be useful to assign some additional valence so that a phoresis, in the sense of a mobility which tends to carry a point to a certain location, could take place.

Suitable valences need not be chosen; but if they are, points will tend to migrate to the position in the graph which represents their valence in the absence of attractive/repulsive forces.

An Example from the intrusion logs

Data is extracted from the logs and placed in a table in an SQL database. The table has the following structure:

CREATE TABLE EventId (
  id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ip1 tinyint(4) unsigned NOT NULL,
  ip2 tinyint(4) unsigned NOT NULL,
  ip3 tinyint(4) unsigned NOT NULL,
  ip4 tinyint(4) unsigned NOT NULL,
  eventtime datetime NOT NULL,
  type VARCHAR(16) NOT NULL,
  score integer not null default 0,
  KEY ix_EventId_iptime (ip1, ip2, ip3, ip3, eventtime),
  KEY ix_EventId_type (type)
);
id

id is a unique identifier for the event. It can be sparse, although in this example it is not.

ip1..4

These are the octets from the ip address associated with the event.

eventtime

This is the date/time at which the event was logged.

type and score

type is a description of the type of event. score is assigned based on type.

The program requires two datasets as input, one describing the points themselves, and the other describing attractions between points.

The data describing the points is extracted from the log as follows:

SELECT oE.id, COUNT(*),
       CONCAT( oE.ip1, '.', oE.ip2, '.', oE.ip3, '.', oE.ip4 )
INTO OUTFILE 'intrusion-points.txt'
FROM EventId AS oE, EventId AS xE
WHERE oE.ip1 = xE.ip1
  AND ( oE.ip2 = xE.ip2 OR oE.ip3 = xE.ip3 OR oE.ip4 = xE.ip4
     OR ABS(UNIX_TIMESTAMP(oE.eventtime) - UNIX_TIMESTAMP(xE.eventtime)) < 43200
     OR oE.type = xE.type
      )
GROUP by oE.id
ORDER BY oE.id;

The id field will be used to join records in this points dataset to records in the affinity dataset. The count of records (which will coincidentally be the same as the count of records in the affinity dataset which are associated with the point) is used to modify the size of points. The program does require that the records be sorted in ascending order by id, although as previously noted id can be sparse. Note that only records which have some discernible relationship to another record are selected.

The records in the affinity dataset are extracted as follows:

SELECT oE.id, xE.id,
       1 + IF( oE.ip2 = xE.ip2, 1, 0 ) + IF( oE.ip3 = xE.ip3, 1, 0 )
         + IF( oE.ip4 = xE.ip4, 1, 0 )
         + IF( oE.type = xE.type, 1, 0 )
         + IF(ABS(UNIX_TIMESTAMP(oE.eventtime) - UNIX_TIMESTAMP(xE.eventtime)) < 43200, 1, 0 )
       AS aff,
       TO_DAYS( oE.eventtime ) - 725000 AS x,
       oE.score AS y
INTO OUTFILE 'intrusion-affinity.txt'
FROM EventId AS oE, EventId AS xE
WHERE oE.ip1 = xE.ip1
  AND ( oE.ip2 = xE.ip2 OR oE.ip3 = xE.ip3 OR oE.ip4 = xE.ip4
     OR ABS(UNIX_TIMESTAMP(oE.eventtime) - UNIX_TIMESTAMP(xE.eventtime)) < 43200
     OR oE.type = xE.type
      )
ORDER BY oE.id, xE.id;

As can be seen, attraction is based on:

The program does require that the records be sorted in ascending order by the two id values, representing the point with which the record is associated and the point with which there is an affinity. (Therefore there will be two records for each mutual attraction, one for each of the points.)

Further Explication of the Algorithm

Initial Placement of Points

Points are initially placed near the [x,y] coordinate which would represent their valence under phoresis alone. For the stability of the algorithm singularities, defined as two points sharing the same [x,y] coordinate are undesirable as initial values. Therefore a spiral graph walk is used to place points which would represent singularities "near" their optimal coordinates.

Maximum Attraction

A maximum attraction M is determined by finding the maximum value of aff in the affinity dataset.

Fundamental repulsion

Given [x,y] as the vector expressing it, distance is computed as

d = sqrt( x2 + y2 )

and repulsion is calculated as

r = M / ( d + 0.5 )

Ignoring Fastmovers

Fastmovers are undesirable for the stability of the algorithm. It is possible, especially in the early iterations, that points which are indifferent to each other may approach singularity. The effect of fastmovers is minimized with the following computation.

Given the distance that a point moved at the last iteration is denoted by vt-1:

r = r / (1 + vt-1 )

(You can view this as a tacit acknowledgement of Niels Bohr et al.)

"Bigness"

The program allows the plotted size of points to be calculated as either the diameter or area of weight based on a command line switch.

Attraction

Since two points which are attracted to each other will both move towards each other, the attraction of one point for another is calculated as

a = -A / 2

Spin

It is possible that two points approaching each other directly may have ultimate destinations beyond each other. These sorts of local equilibriums will tend to be unstable, but to speed things along, repulsion is rotated clockwise by 90 degrees and reduced by a factor of 16 and this is added to the original repulsion component.

X and Y

At each iteration a value of -0.5 is added to the sum of the foregoing in the direction of the point's [x,y] valence.

Vanities in the name of Efficiency

No trigonometric calculations were used. Square roots are calculated to determine distance and ratios between x and y components. Given that P represents the number of points and A represents the number of attractors for a point, the number of square roots calculated for each point is:

1 + 3 ( P - 1 )2 + A

Why doesn't Tk::Dialog work?

I've done quite a bit with Perl, but I'd never used the Perl::Tk interface. The choice of this as the interface was largely cemented by the discovery that the Tk::Canvas has a method which produces direct output of what is displayed as PostScript.

All in all, the interface was quite simple to work with. The biggest hurdle was probably grasping that all that is really needed is to use Tk; instantiate a main window and place the desired widgets on it, and then call MainLoop();. Widgets which need to respond to user actions are given anonymous subroutines as handlers and these are invoked automagically. Furthermore you can more or less throw the widgets at the Canvas and say "you figure it out" and the placement is pretty good. (Egads, an OO lightbulb gag that works!)

When all was said and done I found that MainLoop lacked the flexibility which I needed and I wrote my own main loop which looks like this:

while (1) {
    if ($Quit) {
        $MW->destroy;
	last;
    }
    DoOneEvent($Running ? Tk::DONT_WAIT : Tk::ALL_EVENTS);   
    iterate()           if ($Running);
    save_is_active( 1 ) if (!$Running && !$SaveButtonActive);
}

DoOneEvent( Tk::DONT_WAIT ) is also called periodically during the computational loop.

save_is_active enables and disables the button which allows the output to be saved. Note the additional flag, $SaveButtonActive. It turns out that repeatedly enabling an enabled button does not allow Tk to reach a quiescent state (the program runs ok, but consumes all available idle cpu cycles). I find that sort of behavior extremely uncouth, and therefore I added an extra flag so that the button is only enabled once.

The only thing I couldn't get to work was throwing up a dialog if there was an error while trying to write the PostScript output file. This is doubly odd because invocation of $MW->getSaveFile to allow the user to browse for the location and enter the name of the file works fine and it will throw up a warning dialog if the entered name already exists. Not considering this to be of critical importance, I opted to print a message on the console and forget about it.

Other Items Available On Request

The following items are available on request. If you make a request, state your contact information, affiliation, and the purpose for which you intend to use the items. You must also state that you will not redistribute the items, or else state the conditions under which you intend to redistribute them.

scatter.plx

This is the program which does all of the heavy lifting.

Dotter.pm

This is a module which I wrote to plot the points on a Tk::Canvas.

JoeSmith data files

This comprises the two datasets which were used for the Joe Smith example above.


An improved version of this algorithm has been applied to the precinct-level voting canvass for King County, Washington. -- FWM, 18-Aug-2005


Fred Morris Consulting, Licensed in Seattle, WA, USA. since 1984

Document/Collaboration/Content Management Tools and Solutions

Better, Cheaper, Highly Adaptable, Less Hassles

Custom and Extraordinary Needs Data Processing Services

What else is on this web site?

An Internet Plumber... not a web cowboy

telephone: 206.297.6344
e-mail: x0xm3047x0xatx0xinwa.net