DOM Utility Functions
You can download the source code here.
Brief documentation
I developed this set of functions while working on a crawler intended to gather
particular pieces of information (for example, weather forecasts) from various
websites. We had no contact with these websites, and thus could not influence the
way they present their data. None of them offered the information we needed in a
convenient format such as RSS. The only thing we had was the raw HTML their servers
send out. As most of you probably know, this HTML usually is not well-formed and
does not comply with any schema.
We needed a set of tools that would let us quickly perform sequences of
operations such as:
1. load the page
2. find the span with the text "Today's forecast"
3. go to the table row containing this span
4. go three rows down
5. go to the fourth cell, and in it, to the third div
6. fetch the text from that div, but without the text of its child nodes (if there
are any, such as <span> or <div>).
The attached set of tools does exactly that. You may modify and redistribute
it as you wish. Enjoy.
Here is a brief explanation of how to use it:
Let's start from the first operation, which is usually loading the page:
$doc = new DOMDocument();
ob_start(); // loading generates many warnings that would otherwise clutter the output
$doc->loadHTMLFile(some-URL);
ob_end_clean();
The ob_start() and ob_end_clean() calls are needed because, as stated earlier, the average
page you load from the internet is so far from well-formed that the operation
$doc->loadHTMLFile(some-URL) generates a huge number of warnings. The information in
these warnings may be interesting, but it is practically useless unless you can
influence the way the server generates the page.
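As an alternative to output buffering, PHP's libxml extension can collect parse warnings internally instead of printing them. The following is a minimal sketch, not part of the attached source. It uses an inline, deliberately malformed HTML string (an assumption, so that the example runs without network access):

```php
<?php
// Sample input: the <span> is never closed, as on many real-world pages.
$html = '<html><body><div id="a"><span>Today\'s forecast</div></body></html>';

// Ask libxml to buffer warnings instead of emitting them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// The buffered warnings are available for inspection if you want them.
$warnings = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling was active before.
libxml_use_internal_errors($previous);

// The parser recovers from the malformed markup and still finds the span.
$spanCount = $doc->getElementsByTagName('span')->length;
```

The same pattern should work with loadHTMLFile() in place of loadHTML().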
In case the server doesn't like crawlers, you can present yourself as an ordinary browser
with the help of the function get_include_contents(some-URL) (see the function
documentation in the source code itself). Please also make sure that what you do is legal.
$doc = new DOMDocument();
ob_start();
$doc->loadHTML(get_include_contents(some-URL));
ob_end_clean();
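The internals of get_include_contents() are documented in the source code and not reproduced here. As an illustration only, one common way to send a browser-like User-Agent header from PHP is a stream context. The helper names buildBrowserContext() and fetchAsBrowser() and the User-Agent string below are hypothetical:

```php
<?php
// Hypothetical helpers, not part of the attached source.

// Build a stream context that sends the given User-Agent with HTTP requests.
function buildBrowserContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds
        ],
    ]);
}

// Fetch a URL using that context; returns the body, or false on failure.
function fetchAsBrowser(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleBrowser/1.0)')
{
    return @file_get_contents($url, false, buildBrowserContext($userAgent));
}
```

The returned string can then be passed to $doc->loadHTML() in the same way as the result of get_include_contents().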
Now, after loading the page, you might want to start walking through it and collecting the
information you want.
The first function you might want to use is:
findElementWithTagAttrValue($DOMDoc, $tag, $attr, $val, $fullmatch=true)
The first parameter is the DOMDocument you created earlier,
the second parameter is the HTML tag you are looking for (e.g., 'td', 'span', 'div'),
the third parameter is the attribute the requested element must have (for example: 'id'),
the fourth parameter is the value of that attribute (for example: 'MainContent'),
the fifth, optional parameter indicates whether an exact match is required (true, the default)
or a substring match is enough (false).
example of usage:
findElementWithTagAttrValue($doc, 'div', 'id', 'MainContent') will find the div element (DOMNode) with
the attribute id="MainContent"
findElementWithTagAttrValue($doc, 'div', 'id', 'MainContent',false) will find the div element with
the attribute id, whose value has "MainContent" as a substring (for example "MainContent1").
This function can also receive a DOMNode as a first parameter, meaning it can operate also on a part
of a document.
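To make the matching rules concrete, here is a rough sketch of how such a lookup could be written with the standard DOM API. It is not the code from the attached source; the name sketchFindByTagAttr() is hypothetical, and it assumes the first argument is a DOMDocument or DOMElement (both provide getElementsByTagName()):

```php
<?php
// Sketch only: return the first $tag element whose $attr matches $val.
function sketchFindByTagAttr($root, string $tag, string $attr, string $val, bool $fullMatch = true): ?DOMElement
{
    foreach ($root->getElementsByTagName($tag) as $element) {
        if (!$element->hasAttribute($attr)) {
            continue;
        }
        $actual = $element->getAttribute($attr);
        // Exact comparison, or substring search when $fullMatch is false.
        $matches = $fullMatch ? ($actual === $val) : (strpos($actual, $val) !== false);
        if ($matches) {
            return $element;
        }
    }
    return null; // nothing matched
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="Sidebar">s</div><div id="MainContent1">m</div></body></html>');

$exact   = sketchFindByTagAttr($doc, 'div', 'id', 'MainContent');        // no exact match in this sample
$partial = sketchFindByTagAttr($doc, 'div', 'id', 'MainContent', false); // substring match
```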
The function:
findElementsArrayWithTagAttrValue($DOMDoc, $tag, $attr, $val, $fullmatch=true)
does exactly the same as the previous function, except that it returns an array of all nodes
that meet the requested criteria. This is useful when there is more than one element
in the document that meets them.
The function:
findElementWithTagTextContent($DOMDoc, $tag, $textContent, $fullmatch=true)
finds the HTML element of type $tag with text content $textContent.
Example of usage:
findElementWithTagTextContent($doc, 'td', 'Forecast for tomorrow')
will return the td element (DOMNode) whose text content is 'Forecast for tomorrow'.
Please note that the text examined is only the text of the node itself, and not of any subnodes
(unlike DOMNode->textContent).
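The difference between a node's own text and DOMNode->textContent can be seen in a small sketch. The function sketchOwnText() is hypothetical and only approximates the idea; in particular, trimming the result is an assumption and may not match the attached source:

```php
<?php
// Sketch only: concatenate the direct text children of a node, skipping
// the text that belongs to child elements.
function sketchOwnText(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return trim($text); // trimming is an assumption made for this sketch
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tr><td>Forecast for tomorrow<span> (updated)</span></td></tr></table></body></html>');
$td = $doc->getElementsByTagName('td')->item(0);

$own = sketchOwnText($td);  // text of the td itself only
$all = $td->textContent;    // also includes the text inside the <span>
```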
The function:
findElementsArrayWithTagTextContent($DOMDoc, $tag, $textContent, $fullmatch=true)
does the same as the previous function, only that it returns an array of elements that meet
the requested criteria.
The functions:
findElementWithTagAttValueTextContent($DOMDoc, $tag, $attr, $val, $attrfullmatch, $textContent, $textfullmatch)
findElementsArrayWithTagAttValueTextContent($DOMDoc, $tag, $attr, $val, $attrfullmatch, $textContent, $textfullmatch)
combine the attribute-value and text-content requirements of the functions above.
The function:
getTextOfNode($node) returns the text of the node itself without its subnodes, unlike DOMNode->textContent.
The function:
getAttributesArray($domNode) returns the attributes of the node as an associative array.
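For illustration, an associative array of attributes can be built from the standard DOMNode->attributes map roughly as follows. This is a sketch with a hypothetical name, sketchAttributesArray(), and not the attached implementation:

```php
<?php
// Sketch only: collect a node's attributes as name => value pairs.
function sketchAttributesArray(DOMNode $node): array
{
    $result = [];
    if ($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            $result[$attribute->nodeName] = $attribute->nodeValue;
        }
    }
    return $result;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="MainContent" class="wide">x</div></body></html>');
$attrs = sketchAttributesArray($doc->getElementsByTagName('div')->item(0));
```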
The function:
checkAttValue($node,$strAtt) returns the value of the attribute $strAtt of the node.
The function:
getDeepNode($domNode,$strHierarchy)
"digs" into the hierarchy of the given node and fetches the node according to the given spec.
$strHierarchy spec: childNumber:childType;childNumber:childType;...
child number: starts from 1
child type: for example a, td, tr, or * (for any type); this is case-insensitive
Example of usage:
getDeepNode($node,'1:table;4:tr;2:td;3:div') will return the third div of the second cell
of the fourth row of the first table in $node.
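The hierarchy spec can be interpreted with a short loop. The sketch below is one possible reading of the spec, not the attached implementation: the name sketchDeepNode() is hypothetical, and it assumes that each step selects among the direct child elements of the current node.

```php
<?php
// Sketch only: follow a "number:type;number:type;..." path through direct children.
function sketchDeepNode(DOMNode $node, string $hierarchy): ?DOMNode
{
    foreach (explode(';', $hierarchy) as $step) {
        $parts = explode(':', trim($step), 2);
        if (count($parts) !== 2) {
            return null; // malformed step
        }
        $wanted = (int) $parts[0];             // 1-based child number
        $type   = strtolower(trim($parts[1])); // tag name or '*'
        $seen   = 0;
        $next   = null;
        foreach ($node->childNodes as $child) {
            if ($child->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip text and comment nodes
            }
            if ($type !== '*' && strtolower($child->nodeName) !== $type) {
                continue;
            }
            if (++$seen === $wanted) {
                $next = $child;
                break;
            }
        }
        if ($next === null) {
            return null; // path does not exist
        }
        $node = $next;
    }
    return $node;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><table><tr><td>a</td><td><div>1</div><div>2</div><div>3</div></td></tr></table></div></body></html>');
$root  = $doc->getElementsByTagName('div')->item(0);
$found = sketchDeepNode($root, '1:table;1:tr;2:td;3:DIV'); // the type is matched case-insensitively
```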
The functions:
findRelativeSiblingOfType($refNode,$destinationNodeType,$siblingNumber, $attrParams = array())
findSiblingOfTypeFromAbove($refNode,$destinationNodeType,$siblingNumber=1, $attrParams = array())
are useful for going forwards, backwards, up and down in the document hierarchy. See
documentation in the code itself.
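The exact semantics of these two functions are described only in the source code. As a rough illustration of the kind of navigation involved, the sketch below walks up to an enclosing element of a given type and then steps to a later sibling of a given type. Both function names are hypothetical, and their behaviour is an assumption that may differ from the attached functions:

```php
<?php
// Sketch only: nearest ancestor element with the given tag name.
function sketchAncestorOfType(DOMNode $node, string $type): ?DOMNode
{
    $type = strtolower($type);
    for ($current = $node->parentNode; $current !== null; $current = $current->parentNode) {
        if ($current->nodeType === XML_ELEMENT_NODE && strtolower($current->nodeName) === $type) {
            return $current;
        }
    }
    return null;
}

// Sketch only: the $steps-th following sibling element with the given tag name.
function sketchFollowingSiblingOfType(DOMNode $node, string $type, int $steps): ?DOMNode
{
    $type = strtolower($type);
    for ($current = $node->nextSibling; $current !== null; $current = $current->nextSibling) {
        if ($current->nodeType === XML_ELEMENT_NODE && strtolower($current->nodeName) === $type) {
            if (--$steps === 0) {
                return $current;
            }
        }
    }
    return null;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><table><tr><td><span>Today\'s forecast</span></td></tr><tr><td>r2</td></tr><tr><td>r3</td></tr><tr><td>r4</td></tr></table></body></html>');
$span = $doc->getElementsByTagName('span')->item(0);

$row  = sketchAncestorOfType($span, 'tr');            // the row containing the span
$down = sketchFollowingSiblingOfType($row, 'tr', 3);  // three rows further down
```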
Back to the example I started with, the following sequence of code does the job:
$doc = new DOMDocument();
ob_start();
$doc->loadHTMLFile(some-URL);
ob_end_clean();
$span1 = findElementWithTagTextContent($doc, 'span', "Today's forecast");
$containingRow = findSiblingOfTypeFromAbove($span1,'tr');
$threeRowsDown = findRelativeSiblingOfType($containingRow,'tr',4);
$thirdDivOfFourthCell = getDeepNode($threeRowsDown,'4:td;3:div');
$requiredInfo = getTextOfNode($thirdDivOfFourthCell);
That's it! Simple, isn't it?
Cheers.