Wednesday 29 February 2012

Requirements for an FDR System

Usually, a Facial Detection and Recognition (FDR) system has the following
capabilities:

1. Interfacing with a video source for grabbing facial images. This implies the
ability to start/stop a video stream and to capture frames from it;

2. Automatic detection or manual selection of the human faces found within
the scene;

3. Manipulation (creation, addition, deletion) of a database of faces. Visual
browsing of the database is also recommended;

4. Launching the recognition process by comparing the previously detected face
with the faces in the database. The classification step is usually preceded by
image preprocessing and feature extraction, as sketched below.
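As a rough illustration only, the four capabilities can be pictured as a small
pipeline. Everything here is a hypothetical sketch: the helper names, the flat
pixel-vector representation of a face, and the distance threshold are assumptions,
not part of any particular library.

import math

def extract_features(face_image):
    # Placeholder preprocessing/feature extraction: a real system would compute
    # e.g. Eigenfaces or LBP descriptors instead of using raw pixel values.
    return [float(p) for p in face_image]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(face_image, database, threshold=100.0):
    # Step 4: compare the detected face against every enrolled face.
    query = extract_features(face_image)
    best_name, best_dist = None, float("inf")
    for name, features in database.items():   # step 3: the face database
        d = euclidean_distance(query, features)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < threshold else None

In a real system, steps 1 and 2 (video capture and face detection) would come from a
computer vision library, and the features would be far richer than raw pixels.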


Figure 3 summarizes the basic operations required by a facial detection and
recognition system.



Fig. 3. Operations required by a facial detection and recognition system.

Monday 27 February 2012

Cloud Storage Systems

Cloud storage systems are a subset, but often also the foundation, of cloud computing
systems and therefore share the properties listed above. There is a plethora of
different cloud storage services around, and here again it is very hard to give an exact
definition. On one side there are services which enable the end user to store simple
files as a backup solution. On the other side there are a lot of new database systems
which are commonly called key/value stores. The expression "key/value store" is an
umbrella term for services which are also described as document-oriented,
attribute-oriented, distributed hash tables, and key/value databases. All of these names
are variations on the same theme, each emphasizing different aspects of these systems.
Because of the frequent changes in this field, and because every few weeks a new
system appears on the horizon, we refer the reader to online lists for a comparison of
such systems.

One of the key distinguishing characteristics of cloud storage systems compared to
traditional RDBMS is their different, diametrically opposed emphasis on consistency
and availability. We have already mentioned the CAP theorem in the introduction.
Older traditional RDBMS are built on the historical premise that consistency is the
main goal of a database system. As a result, they often struggle to scale beyond a
small number of server nodes: with many servers the failure rate increases, and by
enforcing consistency the availability decreases.

Cloud storage systems, on the other hand, are built around the premise of being almost
constantly available and of scaling out to a large number of nodes. This is necessary to
serve the needs of high-demand web services which operate on a worldwide scale.
Yet high availability can only be achieved by cutting back on the consistency guarantee.
Instead of the full ACID guarantees of traditional RDBMS, most cloud storage systems
support BASE properties. BASE stands for basically available, soft state, and eventual
consistency.
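The trade-off can be made concrete with a toy example. The sketch below is not any
real cloud store; under its stated simplifications it only illustrates how a write that
is acknowledged by a single replica (availability) leaves the other replicas
temporarily stale until a background synchronization step makes them converge
(eventual consistency).

class Replica:
    def __init__(self):
        self.data = {}           # key -> (value, version)

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []        # updates not yet pushed to all replicas

    def put(self, key, value, version):
        # "Basically available": acknowledge after updating a single replica.
        self.replicas[0].data[key] = (value, version)
        self.pending.append((key, value, version))

    def get(self, key, replica_index):
        # A read from a lagging replica can return an older (or no) version.
        return self.replicas[replica_index].data.get(key)

    def anti_entropy(self):
        # Background synchronization: replicas converge eventually.
        for key, value, version in self.pending:
            for r in self.replicas:
                current = r.data.get(key)
                if current is None or current[1] < version:
                    r.data[key] = (value, version)
        self.pending.clear()

store = EventuallyConsistentStore()
store.put("user:42", "Alice", version=1)
print(store.get("user:42", replica_index=2))   # None: replica 2 is still stale
store.anti_entropy()
print(store.get("user:42", replica_index=2))   # ('Alice', 1) after convergence

Real systems replace the single pending list with mechanisms such as version vectors,
hinted handoff, or gossip-based anti-entropy, but the basic behaviour is the same.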

Saturday 25 February 2012

How Does DISCOVIR Perform Content-Based Image Retrieval?

We briefly outline the process of sharing and retrieving images in DISCOVIR
(DIStributed COntent-based Visual Information Retrieval). First, each peer is
responsible for performing feature extraction (for example, color, texture, and
shape) on its shared images using the DISCOVIR client program.
With this program, each peer maintains its local index of feature vectors of its
image collection. When a peer, the requester, initiates a query by giving an example
image and a particular feature extraction method, it performs feature extraction on
the example image and sends the feature vector, contained in a query message, to
all its connecting peers. Consequently, the other peers compare this query against
their feature vector indexes. Based on a distance measure, they find a set of similar
images and return the results to the requester. Likewise, these peers propagate the
query to their own connecting peers, and the process continues until a large portion
of the network has been queried.
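The peer-side steps can be sketched as follows. This is only an illustration of the
idea, not DISCOVIR code: real DISCOVIR plug-ins are compiled Java bytecode, and the
histogram feature, distance measure, and index layout below are simplifying
assumptions.

import math

def color_histogram(pixels, bins=8):
    # pixels: list of (r, g, b) tuples with components in 0..255.
    hist = [0] * (bins * 3)
    for r, g, b in pixels:
        hist[r * bins // 256] += 1
        hist[bins + g * bins // 256] += 1
        hist[2 * bins + b * bins // 256] += 1
    total = max(len(pixels), 1)
    return [h / total for h in hist]             # normalize by image size

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def answer_query(query_vector, local_index, k=5):
    # local_index: {filename: feature_vector}; return the k most similar images.
    ranked = sorted(local_index.items(),
                    key=lambda item: distance(query_vector, item[1]))
    return ranked[:k]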

DISCOVIR uses a plug-in architecture to support different feature extraction methods.
Users may choose to download a plug-in, in the form of compiled Java bytecode, if
they want to perform CBIR based on that particular feature extraction method. For a
screen capture of the DISCOVIR client program, see the [MIPLAB] site, from which
the program can also be downloaded. In the following, we describe the overall
architecture of DISCOVIR, which operates on the current Gnutella network, and the
modification of query messages.

Gnutella Message Modification: The DISCOVIR system is compatible with the
Gnutella (v0.4) protocol [Gnutella]. In order to support the image query functionalities
mentioned, two types of messages are added. They are:

--ImageQuery—Special type of Query message. It carries the name of the feature
extraction method and the feature vector of the query image.

--ImageQueryHit—Special type of QueryHit message. It responds to the ImageQuery
message. It contains the location, file name, and size of the similar images retrieved,
and their similarity measure to the query. In addition, the location information of the
corresponding thumbnails is added so that the result set can be previewed more
quickly (both message types are sketched below).
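For illustration, the two message payloads might be represented as follows. The field
names mirror the description above; the actual Gnutella descriptor headers and
byte-level wire format are not reproduced, and the class and field names are
assumptions.

from dataclasses import dataclass
from typing import List

@dataclass
class ImageQuery:                       # special type of Query message
    feature_method: str                 # name of the feature extraction plug-in
    feature_vector: List[float]         # feature vector of the query image

@dataclass
class ImageHit:
    location: str                       # where the similar image can be fetched
    file_name: str
    file_size: int
    similarity: float                   # distance/similarity to the query
    thumbnail_location: str             # for fast previewing of the result set

@dataclass
class ImageQueryHit:                    # special type of QueryHit message
    hits: List[ImageHit]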

Thursday 23 February 2012

Signals and Systems Using MATLAB

SSUM (Signals and Systems Using MATLAB) is a continually expanding suite
of exploratory demonstrations and applications. It demonstrates essential principles
and concepts of MSP without requiring advanced mathematics. To use SSUM,
MATLAB (matrix laboratory) must be installed, along with the Signal Processing
Toolbox.

SSUM currently has 37 programs illustrating concepts of waveforms, modulation,
sampling and interpolation, aliasing, the frequency domain, convolution and
filtering, pole-zero diagrams, analysis and synthesis, signal features, and many
others. Many of these are applicable to sounds and images. There are also
demonstrations of curious topics such as sound cross synthesis, reverberation
using all-pass filters, additive synthesis of birdsong, sine wave speech synthesis,
and modeling of musical instruments.

SSUM is perfect for use in lectures, labs, and assignments. All SSUM programs
are wrapped in GUIs, so there is no need for typing unwieldy commands at the
prompt. Many of the applications are integrated as well. For instance, if one is
creating a waveform in an application, it can be sent to another application for
filtering, and to another to see its frequency content.

SSUM uses and extends the programming style used in the excellent “MATLAB
Auditory Demonstrations” application. Modularizing the code and keeping the
GUI separate from the functionality makes SSUM much more manageable.
When a new application is desired, it is quite easy to copy and reuse the functionality.

Tuesday 21 February 2012

Web Server Caching vs Proxy Caching

There is a significant difference between document caching in primary and
proxy Web servers. In proxy servers, caching is done on a proxy server’s
disk (main memory is primarily used for meta-data and network buffers).
The set of requested documents is huge (potentially the whole World Wide
Web may need to be cached). Primary Web servers, in contrast, need
to cache a relatively small set of local documents in main memory. Also,
primary Web servers have more skewed access patterns (popular documents
are requested much more often than unpopular ones). This allows for better
cache utilization. Direct application of the proxy caching policies to primary
Web servers may not result in the best performance.

In Web proxies, CPU overhead is not very important because almost all
cached documents are stored on disk and the time to maintain auxiliary cache
data structures is much smaller than the time to read a file from disk. In
primary Web servers, documents are cached in memory. Hence, the internal
complexity of the caching algorithm becomes an important issue.
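To make the point concrete, here is a minimal in-memory cache sketch. LRU is used only
as a familiar example policy, not as the policy advocated here; the point is that when
documents live in main memory, the cache's own bookkeeping cost becomes visible, so
constant-time lookup and eviction matter.

from collections import OrderedDict

class LRUDocumentCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.docs = OrderedDict()        # path -> document bytes, oldest first

    def get(self, path):
        if path not in self.docs:
            return None                  # miss: caller reads the file from disk
        self.docs.move_to_end(path)      # mark as most recently used
        return self.docs[path]

    def put(self, path, body):
        if path in self.docs:
            self.used -= len(self.docs.pop(path))
        while self.used + len(body) > self.capacity and self.docs:
            _, evicted = self.docs.popitem(last=False)   # evict least recently used
            self.used -= len(evicted)
        self.docs[path] = body
        self.used += len(body)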

Monday 20 February 2012

Object pointer compression

The prior work on the subject by Adl-Tabatabai et al. proposes a straightforward
compression scheme for addressing the memory usage of 64-bit Java virtual
machines. They represent 64-bit pointers as 32-bit offsets from the base
address of a contiguous memory region. Dereferencing, or decompressing, a
pointer then involves adding the 32-bit offset to the base address, yielding a
64-bit virtual address. Conversely, compressing a 64-bit virtual address into
a 32-bit offset requires subtracting the base address from the 64-bit address;
the lower 32 bits are then stored. A similar approach was
proposed by Lattner and Adve for compressing pointers in linked data
structures.

The fact that 64-bit virtual addresses are represented as 32-bit offsets
from a base address implies that this pointer compression technique is
limited to Java programs that consume less than 4GB of storage. If a
Java program allocates more than 4GB of memory, the virtual machine
has to revert to the 64-bit pointer representation. This could for example
be done by setting the maximum heap size through a command line
option: if the maximum heap size is larger than 4GB, uncompressed
pointers are used; if smaller than 4GB, compressed pointers are used.
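The arithmetic is simple enough to sketch directly. The base address below is
illustrative, and a production virtual machine adds further tricks (for example,
scaling offsets by object alignment) that are ignored here.

# Sketch of the 32-bit offset scheme described above (hypothetical constants).
HEAP_BASE = 0x0000_7F00_0000_0000        # example 64-bit base address

def compress(address_64):
    offset = address_64 - HEAP_BASE
    assert 0 <= offset < 2**32, "object lies outside the 4GB compressible region"
    return offset & 0xFFFF_FFFF          # store only the low 32 bits

def decompress(offset_32):
    return HEAP_BASE + offset_32         # yields the original 64-bit address

obj = HEAP_BASE + 0x1234_5678
assert decompress(compress(obj)) == obj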



Fig. 1. Illustrating the basic idea of object-relative addressing (on the right)
compared against the traditional 64-bit addressing (on the left).

Adl-Tabatabai et al. apply their pointer compression method to both vtable
pointers and pointers to other Java objects, so-called object references.
The 32-bit object references are then relative offsets to the heap’s base
address; the 32-bit vtable pointers are relative offsets to the vtable
space’s base address.

We focus on compressing object references and do not address vtable
pointer compression. The reason is that vtable pointers are not that big
of an issue when it comes to pointer compression. The 32-bit vtable
pointer offsets are highly likely to be sufficient even for programs that
allocate very large amounts of memory; it is highly unlikely to require
more than 4GB of memory for allocating vtables. In other words, the
pointer compression method by Adl-Tabatabai et al. is likely to work
properly when applied to vtable pointers. Moreover, recent work by
Venstermans has proposed a technique that completely eliminates the
vtable pointer from the object header through typed virtual addressing.
We also refer to the related work section of this paper for a discussion
on object header reduction techniques.

Saturday 18 February 2012

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

We have observed various types of data stores and database systems on which
data mining can be performed. Let us now examine the kinds of data patterns that
can be mined. Data mining functionalities are used to specify the kind of patterns
to be found in data mining tasks. In general, data mining tasks can be classified into
two categories: descriptive and predictive. Descriptive mining tasks characterize
the general properties of the data in the database. Predictive mining tasks perform
inference on the current data in order to make predictions.

In some cases, users may have no idea of which kinds of patterns in their data may
be interesting, and hence may like to search for several different kinds of patterns in
parallel. Thus it is important to have a data mining system that can mine multiple
kinds of patterns to accommodate different user expectations or applications.
Furthermore, data mining systems should be able to discover patterns at various
granularities (i.e., different levels of abstraction). To encourage interactive and
exploratory mining, users should be able to easily "play" with the output patterns,
such as by mouse clicking. Operations that can be specified by simple mouse clicks
include adding or dropping a dimension (or an attribute), swapping rows and columns
(pivoting, or axis rotation), changing dimension representations (e.g., from a 3-D cube
to a sequence of 2-D cross tabulations, or crosstabs), or using OLAP roll-up or drill
down operations along dimensions. Such operations allow data patterns to be
expressed from different angles of view and at multiple levels of abstraction.
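As a toy illustration of one such operation, the sketch below rolls fact tuples up a
concept hierarchy (city to region) and prints the result as a 2-D crosstab. The data,
hierarchy, and function names are invented for the example.

from collections import defaultdict

sales = [  # (city, quarter, amount)
    ("Vancouver", "Q1", 100), ("Vancouver", "Q2", 120),
    ("Toronto",   "Q1", 150), ("Toronto",   "Q2", 130),
    ("New York",  "Q1", 200), ("New York",  "Q2", 210),
]
city_to_region = {"Vancouver": "Canada", "Toronto": "Canada", "New York": "USA"}

def roll_up(rows, hierarchy):
    # Aggregate amounts from the city level up to the region level.
    cube = defaultdict(int)
    for city, quarter, amount in rows:
        cube[(hierarchy[city], quarter)] += amount
    return cube

crosstab = roll_up(sales, city_to_region)
for (region, quarter), total in sorted(crosstab.items()):
    print(region, quarter, total)
# Canada Q1 250, Canada Q2 250, USA Q1 200, USA Q2 210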

Data mining systems should also allow users to specify hints to guide or focus the
search for interesting patterns. Since some patterns may not hold for all of the data in
the database, a measure of certainty or "trustworthiness" is usually associated with
each discovered pattern.

The data mining functionalities, and the kinds of patterns they can discover, are:

1. Concept/class description
2. Association analysis
3. Classification and prediction
4. Clustering analysis
5. Evolution and deviation analysis

Thursday 16 February 2012

Spatial Databases and Spatiotemporal Databases

Spatial databases contain spatial-related information. Examples include geographic
(map) databases, very large-scale integration (VLSI) or computer-aided design
databases, and medical and satellite image databases. Spatial data may be
represented in raster format, consisting of n-dimensional bit maps or pixel maps.
For example, a 2-D satellite image may be represented as raster data, where each
pixel registers the rainfall in a given area. Maps can be represented in vector
format, where roads, bridges, buildings, and lakes are represented as unions or
overlays of basic geometric constructs, such as points, lines, polygons, and the
partitions and networks formed by these components.
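A small, purely illustrative sketch of the two representations (all values invented):

# Raster: a 3x3 grid where each cell registers rainfall (mm) in its area.
rainfall_raster = [
    [12.0,  9.5,  0.0],
    [ 7.2, 15.1,  3.3],
    [ 0.0,  4.8, 11.9],
]

# Vector: map objects built from basic geometric constructs with coordinates.
lake = {"type": "polygon",
        "points": [(0.0, 0.0), (2.0, 0.0), (2.0, 1.5), (0.0, 1.5)]}
road = {"type": "line",
        "points": [(0.0, 2.0), (1.0, 2.2), (3.0, 2.5)]}
hospital = {"type": "point", "coordinates": (1.2, 0.7)}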

Geographic databases have numerous applications, ranging from forestry and ecology
planning to providing public service information regarding the location of telephone
and electric cables, pipes, and sewage systems. In addition, geographic databases
are commonly used in vehicle navigation and dispatching systems. An example of
such a system for taxis would store a city map with information regarding one-way
streets, suggested routes for moving from region A to region B during rush hour, and
the location of restaurants and hospitals, as well as the current location of each driver.

“What kind of data mining can be performed on spatial databases?” you may ask.
Data mining may uncover patterns describing the characteristics of houses located
near a specified kind of location, such as a park, for instance. Other patterns
may describe the climate of mountainous areas located at various altitudes, or
describe the change in trend of metropolitan poverty rates based on city distances
from major highways. The relationships among a set of spatial objects can be
examined in order to discover which subsets of objects are spatially
autocorrelated or associated. Clusters and outliers can be identified by spatial
cluster analysis. Moreover, spatial classification can be performed to construct
models for prediction based on the relevant set of features of the spatial objects.
Furthermore, “spatial data cubes” may be constructed to organize data into
multidimensional structures and hierarchies, on which OLAP operations
(such as drill-down and roll-up) can be performed.

A spatial database that stores spatial objects that change with time is called a
spatiotemporal database, from which interesting information can be mined. For
example, we may be able to group the trends of moving objects and identify
some strangely moving vehicles, or distinguish a bioterrorist attack from a
normal outbreak of the flu based on the geographic spread of a disease with time.

Wednesday 15 February 2012

CONGESTION-DISTORTION OPTIMIZED SCHEDULING

CoDiO determines which packets to send, and when, by comparing different possible
schedules in terms of their expected impact on video distortion and network
congestion. The Lagrangian of expected end-to-end delay (the metric we choose for
congestion) and distortion (i.e., mean squared error) is better suited than the
traditional rate-distortion metric proposed in seminal prior work for evaluating the
performance of a sender operating over a throughput-limited network.
In particular, end-to-end delay depends on the capacity of the network path, reflects
time varying network conditions, and increases without bound as the sending rate
approaches capacity.

To find an optimized schedule, CoDiO selects the most important packets in terms of
video distortion reduction, and transmits them in an order which minimizes the
congestion created on the network. For example, I or SI frames are sent with priority,
whereas some B frames might not be transmitted at all. In addition, CoDiO avoids
sending packets in large bursts, as this has the worst effect on queuing delay.
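A highly simplified greedy sketch of the underlying trade-off is shown below; it is
not the CoDiO algorithm itself. It scores each candidate packet by the Lagrangian
cost J = -(distortion reduction) + lambda * (expected added delay) and picks the
cheapest one. The frame names, sizes, distortion values, and capacity figure are
made up.

def expected_delay(packet_bits, link_capacity_bps, queue_bits):
    # Transmission plus queuing delay contributed on the bottleneck link.
    return (queue_bits + packet_bits) / link_capacity_bps

def next_packet(candidates, link_capacity_bps, queue_bits, lam):
    # candidates: list of dicts with 'distortion_reduction' and 'bits'.
    def cost(p):
        return -p["distortion_reduction"] + lam * expected_delay(
            p["bits"], link_capacity_bps, queue_bits)
    return min(candidates, key=cost)

frames = [
    {"name": "I1", "distortion_reduction": 40.0, "bits": 60000},
    {"name": "P2", "distortion_reduction": 15.0, "bits": 20000},
    {"name": "B3", "distortion_reduction":  4.0, "bits":  8000},
]
# The I frame wins despite being largest, matching the priority described above.
print(next_packet(frames, link_capacity_bps=400000, queue_bits=0, lam=50.0)["name"])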


Fig. 1. Video quality performance of different schedulers under varying latency constraints.

Figure 1 illustrates the benefits of CoDiO scheduling compared to a traditional ARQ
scheduler which sends packets sequentially and re-transmits lost packets based on
out-of-order acknowledgements, as long as they are not past their playout deadline. The
performance of a scheduler denoted “CoDiO Light” is also represented. This is a
low complexity CoDiO scheduler which determines sequentially which frame should
be transmitted to minimize the expected Lagrangian cost of congestion and
distortion, and spaces transmissions based on the time needed for the last frame to
traverse the bottleneck link. In the experiment, the video sequence Mother and
Daughter, encoded with H.264 at 270 kb/s, is sent over a 400 kb/s capacity path;
the one-way propagation delay is 50ms and the packet loss rate is 2%. In this
simulation, network paths are simulated in the network simulator NS-2.

Video quality is measured as the average PSNR in decibels, and the
performance is shown for different maximum tolerated latencies, that is, the time
between when a video frame becomes available at the server and when it is
played out at the receiver. As shown in Fig. 1, the video quality of the three
schedulers is comparable when the latency constraint is relaxed; however, for
tight delay constraints, both optimized schedulers outperform the ARQ scheduler
by a significant margin. The results also indicate that the low-complexity
CoDiO scheduler achieves most of the gain. Hence, in the following, it is this
type of scheduler which will be extended to the P2P scenario, to reduce
computational complexity at each peer.

Monday 13 February 2012

SECURITY REQUIREMENTS OF EMBEDDED SYSTEMS

Embedded systems often provide critical functions that could be sabotaged
by malicious entities. Before discussing the common security requirements
of embedded systems, it is important to note that there are many entities
involved in a typical embedded system design, manufacturing, and usage
chain. Security requirements vary depending on whose perspective we consider.

For example, let us consider a state-of-the-art cellular handset that is capable
of wireless voice, multimedia, and data communications. Figure 2 illustrates
security requirements from the viewpoint of the provider of HW/SW components
inside the cell phone (e.g., baseband processor, operating system), the
cell phone manufacturer, the cellular service provider, the application service
provider (e.g., mobile banking service), the content provider (e.g., music or
video), and the end user of the cell phone.

Fig. 2. Security requirements for a cell phone

Fig. 3. Common security requirements of embedded systems

The end user’s primary concerns may include the security of personal data
stored and communicated by the cell phone, while the content provider’s
primary concern may be copy protection of the multimedia content delivered to
the cell phone, and the cell phone manufacturer might additionally be concerned
with the secrecy of proprietary firmware that resides within the cell phone. For
each of these cases, the set of untrusted (potentially malicious) entities can also
vary. For example, from the perspective of the content provider, the end user of
the cell phone may be an untrusted entity. While this section outlines broad
security requirements typical of embedded systems, the security model for each
embedded system will dictate the combination of requirements that apply.

Figure 3 lists the typical security requirements seen across a wide range of
embedded systems, which are described as follows:

1. User identification refers to the process of validating users before allowing them to use the system.
2. Secure network access provides a network connection or service access only if the device is authorized.
3. Secure communications functions include authenticating communicating peers, ensuring confidentiality and integrity of communicated data, preventing repudiation of a communication transaction, and protecting the identity of communicating entities.
4. Secure storage mandates confidentiality and integrity of sensitive information stored in the system.
5. Content security enforces the usage restrictions of the digital content stored or accessed by the system.
6. Availability ensures that the system can perform its intended function and service legitimate users at all times, without being disrupted by denial-of-service attacks.

Friday 10 February 2012

Peer to Peer Networks


Recently there has been a substantial amount of research in P2P networks. For
example, P2P network topology has been an area of much interest. Basic peer
networks include random coupling of peers over a transport network, such as
Gnutella (http://www.gnutella.com), and centralized-server networks, such as the
Napster (http://www.napster.com) architecture. These networks suffer from drawbacks
such as poor scalability, lack of search guarantees, and bottlenecks. Yang and
Garcia-Molina discussed super-peer networks, which introduce hierarchy into the
network: super-peers have additional capabilities and duties that may include
indexing the content of other peers. Queries are broadcast among super-peers and
then forwarded to leaf peers. Another approach maintains a deterministic topology
that is known to all nodes in the network, so nodes at least have an idea of what
the network beyond their scope looks like.
They can use this globally available information to reach locally optimal decisions
while routing and broadcasting search messages. Content addressable networks
(CAN) have provided significant improvements for keyword search. If
meta-information on a peer’s content is available, this information can be used
to organize the network in order to route queries more accurately and for more
efficient searching. Similarly, ontologies can be used to bootstrap the P2P
network organization: peers and the content that they provide can be classified
by relating their content to concepts in an ontology or concept hierarchy. The
classification determines, to a certain extent, a peer’s location in the network.
Peers routing queries can use their knowledge of this scheme to route and
broadcast queries efficiently.


Peer network layouts have also combined multiple ideas briefly mentioned
here. In addition, Nejdl proposed a super-peer-based layout for
RDF-based P2P networks. Similar to content addressable networks,
super-peers index the metadata context that the leaf peers have. Efficient
searching in P2P networks is very important as well. Typically, a P2P
node broadcasts a search request to its neighboring peers who propagate the
request to their peers and so on. However, this can be dramatically improved.
For example, Yang and Garcia-Molina have described techniques to increase
search effectiveness. These include iterative deepening, directed breadth-first
search, and local indices over the data contained within r hops of a node.
Ramanathan, Kalogeraki, and Pruyne proposed a mechanism in which peers
monitor which other peers frequently respond successfully to their requests for
information. When a peer is known to frequently provide good results, other
peers attempt to move closer to it in the network by creating a new connection
with that peer. This leads to clusters of peers with similar interests, which limits
the depth of the searches required to find good results. Nejdl proposed using the
semantic indices contained in super-peers to forward queries more efficiently. Yu
and Singh proposed a vector-reputation scheme for query forwarding and
reorganization of the network. Tang, Xu and Dwarkadas made use of data
semantics in the pSearch project. In order to achieve efficient search, they rely on a
distributed hash table to extend LSI and VSM algorithms for their use in P2P
networks.
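As a toy sketch of two of the ideas above, the code below floods a query over a small
peer graph with a hop limit (TTL) and re-issues it with a growing TTL, iterative-
deepening style, until enough results are found. The graph, matching rule, and
parameters are invented, and the actual techniques in the cited works are
considerably more refined.

def flood(graph, start, query, ttl):
    # Breadth-first flood of `query` up to `ttl` hops from `start`.
    results, frontier, seen = [], [start], {start}
    for _ in range(ttl + 1):
        next_frontier = []
        for peer in frontier:
            results.extend(f for f in graph[peer]["files"] if query in f)
            for neighbor in graph[peer]["neighbors"]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return results

def iterative_deepening(graph, start, query, max_ttl=5, wanted=3):
    for ttl in range(1, max_ttl + 1):        # widen the search only if needed
        results = flood(graph, start, query, ttl)
        if len(results) >= wanted:
            return ttl, results
    return max_ttl, results

graph = {
    "A": {"neighbors": ["B", "C"], "files": []},
    "B": {"neighbors": ["A", "D"], "files": ["cat.jpg"]},
    "C": {"neighbors": ["A"],      "files": ["cats_playing.jpg"]},
    "D": {"neighbors": ["B"],      "files": ["cat_photo.jpg", "dog.jpg"]},
}
print(iterative_deepening(graph, "A", "cat"))   # finds 3 matches at TTL 2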

Thursday 9 February 2012

BOINC Server Complex and Scheduling

The Berkeley Open Infrastructure for Network Computing provides a generic
framework for implementing distributed computation applications within a
heterogeneous environment. This system is currently deployed in a beta test
for the SETI@home and Astropulse  distributed applications. The system is
designed as a software platform utilizing computing resources from volunteer
computers. Computing resources are allocated to the computational problem
by the assignment of work units. A work unit is a subset of the entire
computational problem. Work units for the MD4 hashing problem were
chosen to be small enough to distribute work evenly among
the participating nodes and large enough to minimize start-up and reporting
times. For the four participating nodes in the project, the entire password
space of all five-character passwords was divided into 47 even work units.
This allowed for an even division into work units containing 94^2 × 2 possible
passwords each. Work unit input files contained three fields. The first field contained
the first password to check. The second field contained the last password to
check. The final field contained the matching hash target.
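The split can be sketched as follows. This is an illustration only: the character set,
the index-to-password mapping, the helper names, and the simple even split are
assumptions and do not reproduce the project's exact work-unit size quoted above; MD5
is used in the example hash because hashlib's MD4 support depends on the local OpenSSL
build.

import hashlib

CHARSET = [chr(c) for c in range(33, 127)]   # 94 printable ASCII characters
LENGTH = 5

def index_to_password(i):
    # Map an integer in [0, 94**5) to a 5-character password.
    chars = []
    for _ in range(LENGTH):
        i, r = divmod(i, len(CHARSET))
        chars.append(CHARSET[r])
    return "".join(reversed(chars))

def make_work_units(target_hash, n_units=47):
    # Each work unit carries the three fields described above:
    # first password, last password, and the target hash.
    total = len(CHARSET) ** LENGTH
    per_unit = total // n_units
    units = []
    for k in range(n_units):
        first = k * per_unit
        last = total - 1 if k == n_units - 1 else (k + 1) * per_unit - 1
        units.append((index_to_password(first), index_to_password(last), target_hash))
    return units

units = make_work_units(hashlib.md5(b"zzzzz").hexdigest())
print(len(units), units[0])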

The BOINC architecture consists of a server complex handling scheduling,
central result processing, and participant account management, a core software
client running on participant nodes, and project-specific client software running
on participant nodes. The server complex consists of a database, a web server,
and five BOINC-specific processes. Communication between participant
nodes and the server complex is handled via standard HTTP. Database
functions in BOINC are implemented for MySQL; however, few, if any,
MySQL-specific features are used, so it could easily be replaced by a comparable
database. A straightforward schema is used to store information on application
versions, participant hosts, user accounts, work units, and results. Additional
tables are included for tracking user participation level and managing teams.

BOINC uses HTTP for both interaction with the user for account management and
RPC communication between client nodes and the server complex. User account
management is accessed using a standard web browser to interact with a web site
implemented mainly in PHP. This web site allows users to manage preferences for job
execution of BOINC projects. RPC communication between the BOINC core client
and the server complex is handled by two cgi programs. The first cgi program, named
cgi, handles work scheduling based on client requests for work. The second program,
named file_upload_handler, receives completed work results from clients for
processing by the server complex. The MD4 password hashing project used the
standard web interface, cgi, and file_upload_handler from a standard BOINC
installation. The first BOINC process is named feeder. Feeder is responsible for
accessing the database and selecting candidate work units to be processed by
clients. The next BOINC process is the assimilator. Assimilator processes work unit
results and integrates them into the project. As valid results become available, the
assimilator does any project-specific processing, including storage of results in the
project database. Transitioner is the third BOINC process. This process handles
state transitions of work units. The fourth BOINC process is the validator.
Validator handles testing the validity of work unit results. Result validation comes
into play when redundant computing is used to compare the results of the same
work unit returned by two or more clients. The final BOINC process is the
file_deleter. File_deleter examines the state of work units to determine which
input and output files are no longer needed.

Scheduling work units is handled by the cgi interfaces and the BOINC server
processes. Participating nodes contact the scheduling server through the web server
interface named cgi. The participant node identifies itself and the amount of work
it can accept, based on high and low water marks for processing time, available
memory, and disk space. Cgi then selects viable work units from the shared
memory segment filled by the feeder process. Cgi bases work unit selection
on client capabilities, also favoring work units that have been
evaluated as infeasible by other participating nodes. The available work units
returned by the feeder will be affected by work unit configurations requiring
redundant computing and result validation. Work units that have already been
processed may need additional results from other participating nodes to arrive
at a quorum of identical results before a result is accepted. The
MD4 password hashing project did not implement redundant computing.
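A rough sketch of this request/selection loop appears below. All names, numbers, and
the single client-speed parameter are illustrative; real BOINC scheduling also weighs
disk and memory limits, redundancy requirements, and deadlines.

def seconds_of_work_needed(queued_seconds, low_water, high_water):
    # Client side: ask for work only when the queue drops below the low mark.
    return high_water - queued_seconds if queued_seconds < low_water else 0

def select_work_units(candidates, request_seconds, client_speed_flops):
    # Server side (the 'cgi' scheduler): fill the request from the feeder's
    # candidate list, honoring each unit's estimated cost on this client.
    chosen, granted = [], 0.0
    for wu in candidates:
        est_seconds = wu["estimated_flops"] / client_speed_flops
        if granted + est_seconds > request_seconds:
            break
        chosen.append(wu["name"])
        granted += est_seconds
    return chosen, granted

candidates = [{"name": f"wu_{i}", "estimated_flops": 1e12} for i in range(10)]
need = seconds_of_work_needed(queued_seconds=600, low_water=3600, high_water=86400)
print(select_work_units(candidates, need, client_speed_flops=1e9))

This also makes the workaround in the next paragraph easy to see: inflating a work
unit's estimated cost makes each request fill up sooner, so no single node drains the
whole queue.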

Configuration of the MD4 password hashing project manipulated the estimated
running time of the individual workunits in order to accommodate the scheduling
algorithm. Since the entire MD4 password search required a running time
shorter than the default low water mark for the default BOINC project client, the
first node requesting work would consume all available work units. To circumvent
this scheduling problem, the work units were configured with estimated running
times far greater than their actual running times. This caused the scheduling algorithm
to distribute the work units more evenly across the participating nodes.

Wednesday 8 February 2012

Preventing SSH Dictionary Attacks


A first line of defense in preventing a computer attack is the firewall. Firewalls, either
network based or host based, are configured to allow or deny connections into or out
of their perimeter based on the protected entity’s security policy. The policy states what
types of connections are allowed into or out of the entity. However, when a network
service such as SSH must be offered, the firewall is configured to allow connections
to that service. The solution needed is an application that can detect a dictionary
attack and dynamically apply firewall blocking rules against the source of the
attack. This solution already exists. The following is a short list
of available applications that offer protection against SSH dictionary attacks:

  • Snort Intrusion Detection System
  • DenyHosts application program
  • OpenSSH timelox software patch
  • SSHD_Sentry application program
These applications detect SSH dictionary attacks against a local host and
can be configured and/or modified to automatically apply firewall rules on the
local host that block the source of the attack (a minimal sketch of this idea follows).
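The sketch below is not taken from any of the tools listed above. The log path,
regular expression, threshold, and use of iptables are assumptions that would need to
be adapted to the local system, and the script must run with sufficient privileges.

import re
import subprocess
from collections import Counter

LOG = "/var/log/auth.log"
FAILED = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")
THRESHOLD = 10

def offending_sources(log_path=LOG):
    # Count failed SSH logins per source address in the auth log.
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = FAILED.search(line)
            if match:
                counts[match.group(1)] += 1
    return [ip for ip, n in counts.items() if n >= THRESHOLD]

def block(ip):
    # Drop all further traffic from the attacking source on the local host.
    subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"], check=True)

if __name__ == "__main__":
    for ip in offending_sources():
        block(ip)

In practice the rule set should also expire old blocks and whitelist trusted hosts,
which the tools listed above handle more carefully.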

Classically, firewalls and access control mechanisms are implemented as 
static protection mechanisms. The rules are configured based on the security 
policy for the host or network that these devices protect, and  the rules 
remain constant unless there is a security policy change. Intrusion detection
is technology designed to monitor hosts and/or networks. These systems 
monitor a host or network based on configurable rules and specifications.
Once abnormal activity is observed, the system will send an alert to the
system administrator. It is the responsibility of the system administrator to act 
on the alerts they receive from the detection system. Unfortunately, the
response time of human administrators is too slow compared to the speed of
modern-day attacks. As a result, research in the field of intrusion detection
has begun to focus on the concept of active response: detection systems
dynamically responding to real-time attacks without the need for human
intervention. Indeed, the detection techniques listed in this section are simple
mechanisms that offer active response by dynamically blocking access to
servers from the sources of SSH dictionary attacks.

However, a distributed solution that disseminates the detection information and
security policy to participating neighbors can offer greater security by way of
preemptive and proactive protection.